That's very interesting. Did you use the Mime4J library to do the
heavy lifting, or did you parse the whole message yourself?
I used javax.mail, starting from a good mail-parsing example included
with it. I parsed the HTML with javax.swing.text.html.HTMLEditorKit.
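For anyone curious, here is a minimal sketch of that HTMLEditorKit approach: strip the tags from an HTML body and keep only the text, ready for tokenizing. The class name HtmlText is mine, not from any existing code: -

```java
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Sketch: pull the plain text out of an HTML mail body with HTMLEditorKit. */
public class HtmlText {
    public static String extract(String html) throws Exception {
        final StringBuilder sb = new StringBuilder();
        // The callback receives parser events; we only care about text runs.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                sb.append(data).append(' ');
            }
        };
        // 'true' tells the parser to ignore any charset declaration in the HTML.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return sb.toString().trim();
    }
}
```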
Thanks for the hint. I'll take a look.
Not so sure about ignoring numbers, though. We certainly need to
capture IP addresses, HTML and CSS colour settings, and also domain
names. I can see there will be a lot of tweaking involved.
The catch with numbers is that I received some CSV files containing
database table dumps: hundreds of thousands of lines, each containing
unique codes.
I understand where you are coming from now.
Ok, so the problem as I see it at the moment is that James isn't feeding
the Bayes algorithm with quality tokens upon which it can work its
statistical magic effectively.
I have to break any email down into its constituent parts (e.g. headers,
body, attachments) and then intelligently extract whatever useful
metadata (or, in the case of the email body, its actual data) I can get.
So when I talk about capturing 'numbers' I'm talking in the context of
one of these constituent email parts and not necessarily the email as a
whole. I can see that it might even be beneficial in the future to have
plugins that specialize in extracting tokens from particular mime
types... but not just yet!
When I say a 'token' I'm thinking about an object which not only
represents a string and how often that string has been seen in a ham or
spam corpus but also a context and a timestamp. The context records
which part of the email we are talking about and the timestamp records
the date and time of the last recorded occurrence of this token in an email.
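A rough sketch of what such a token object might look like, including the 6-month expiry rule mentioned below (the class and method names are my own invention, purely illustrative): -

```java
import java.time.Duration;
import java.time.Instant;

/** A token: a string value plus its context, ham/spam counts and last-seen time. */
public class Token {
    private final String value;     // e.g. "Free!"
    private final String context;   // which part of the email, e.g. "HEADER-SUBJECT"
    private int hamCount;
    private int spamCount;
    private Instant lastSeen;       // timestamp of the last recorded occurrence

    public Token(String value, String context) {
        this.value = value;
        this.context = context;
    }

    /** Record one more occurrence in a ham or spam message. */
    public void record(boolean spam) {
        if (spam) spamCount++; else hamCount++;
        lastSeen = Instant.now();
    }

    /** Prune rule: counts never rose above 2 and not seen for ~6 months. */
    public boolean isExpired(Instant now) {
        return hamCount + spamCount <= 2
            && lastSeen != null
            && lastSeen.isBefore(now.minus(Duration.ofDays(182)));
    }

    /** Context + value together identify the token in the corpus. */
    public String key() {
        return context + ":" + value;
    }
}
```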
I think the context is important because it lets the Bayes algorithm
learn for example that 'Free!' seen in the Subject: header is more
spammy than the same string seen in the body of an email.
The timestamp will enable the otherwise ever increasing spam and ham
corpus to be kept in check by deleting those tokens whose counts haven't
risen above say 2 in 6 months. I got this idea from Spamprobe [1].
IP addresses and domain names? I don't think so.
Suppose you use the dot as a delimiter. Then each byte of the IP
address becomes a token and gets its own weight. Much the same with
domains. Bayes should take care of the rest.
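In other words, the suggestion is simply: -

```java
public class DotSplit {
    // Split "89.16.177.117" on dots so each octet becomes its own token.
    public static String[] split(String ip) {
        return ip.split("\\.");
    }
}
```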
So following on from what I said above I'm talking about IP addresses
and domain names as seen in the context of headers. Here's an example
from a recent spam: -
Received: from tedwoodsports.dh.bytemark.co.uk (HELO User) (89.16.177.117)
by banddtruckparts.com with ESMTPA; 16 Oct 2012 21:01:11 -0400
In this example I would extract a token with context HEADER-RECEIVED-IP
and value '89.16.177.117'. In fact, knowing how IP addresses are
constructed, I could also record a similar token with value '89.16.177',
because it may be statistically significant that any of the 256
addresses that fall into this range are either spam or maybe even ham.
I don't know that, but at least I'm giving the Bayes algorithm extra info
that it may find statistically significant. If it isn't, that token
will be deleted after a while anyway.
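That IP extraction could be sketched like this: emit the full address plus the 256-address prefix (the helper name is hypothetical): -

```java
import java.util.ArrayList;
import java.util.List;

public class IpTokens {
    /** "89.16.177.117" -> the full address plus its first three octets. */
    public static List<String> tokens(String ip) {
        List<String> out = new ArrayList<String>();
        out.add(ip);                          // e.g. "89.16.177.117"
        int lastDot = ip.lastIndexOf('.');
        if (lastDot > 0) {
            out.add(ip.substring(0, lastDot)); // e.g. "89.16.177"
        }
        return out;
    }
}
```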
Similarly I would also create tokens with the context:
HEADER-RECEIVED-FROM-DOMAIN and the following values: -
tedwoodsports.dh.bytemark.co.uk
dh.bytemark.co.uk
bytemark.co.uk
co.uk
uk
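Generating that list of domain suffixes is just a matter of repeatedly chopping off the leftmost label (again, the class name is mine): -

```java
import java.util.ArrayList;
import java.util.List;

public class DomainTokens {
    /** "dh.bytemark.co.uk" -> ["dh.bytemark.co.uk", "bytemark.co.uk", "co.uk", "uk"]. */
    public static List<String> tokens(String domain) {
        List<String> out = new ArrayList<String>();
        String d = domain;
        while (true) {
            out.add(d);
            int dot = d.indexOf('.');
            if (dot < 0) {
                break;             // no labels left to strip
            }
            d = d.substring(dot + 1); // drop the leftmost label
        }
        return out;
    }
}
```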
I'm keen to capture phrases (i.e. capturing two or more sequential
words), as I've heard they improve detection at the expense of a
larger token database.
Any pointers?
That's pretty straightforward actually. Suppose you have a sentence
"Mary had a little lamb" then you would generate the following token
values in addition to the single word tokens if you were capturing a
phrase size of 2: -
Mary had
had a
a little
little lamb
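As a sketch, a sliding window over the word list does the job; I've joined each pair with a space here, but the separator (or none at all) is an arbitrary choice: -

```java
import java.util.ArrayList;
import java.util.List;

public class PhraseTokens {
    /** Emit every run of n consecutive words as a single phrase token. */
    public static List<String> phrases(String text, int n) {
        String[] words = text.trim().split("\\s+");
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + n <= words.length; i++) {
            StringBuilder sb = new StringBuilder(words[i]);
            for (int j = 1; j < n; j++) {
                sb.append(' ').append(words[i + j]);
            }
            out.add(sb.toString());
        }
        return out;
    }
}
```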
I recommend you read Paul Graham's 'Better Bayesian Filtering' [2]
(especially the bit titled 'Tokens'). It's fascinating stuff... or
maybe I'm getting too old and geeky :-)
Image info needs extracting too. So things like the width, height,
bit depth, type of encoding, Exif data and any tags should all be
captured.
...what would you use to extract image info?
I haven't used any graphics libraries recently, but a quick scan suggests
'Commons Sanselan' [3], which happily is an Apache project now. When it
comes to extracting metadata from MS documents, I think Apache POI [4]
is still a good choice.
David.
[1] http://spamprobe.sourceforge.net/
[2] http://www.paulgraham.com/better.html
[3] http://commons.apache.org/imaging/
[4] http://poi.apache.org/