That's very interesting. Did you use the Mime4J library to do the heavy lifting or did you parse the messages yourself?

I used javax.mail, starting from a good mail-parsing example included with it.
Parsed html with javax.swing.text.html.HTMLEditorKit.
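For what it's worth, here is a minimal sketch of the HTMLEditorKit approach: the parser callback collects the text runs and discards the markup. The class and method names are mine, not from the actual implementation being discussed.

```java
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class HtmlTextExtractor {
    // Collects the plain text runs from an HTML body part, dropping all tags.
    public static String extractText(String html) throws Exception {
        StringBuilder sb = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                sb.append(data).append(' ');
            }
        };
        // 'true' tells the parser to ignore any charset declaration in the HTML.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return sb.toString().trim();
    }
}
```

The extracted text can then be fed to the tokenizer like any other plain-text body part.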

Thanks for the hint.  I'll take a look.


Not so sure about ignoring numbers though. We certainly need to capture IP addresses, HTML and CSS colour settings, and domain names. I can see there will be a lot of tweaking involved.

The catch with numbers is that I received some CSV files containing database table dumps: hundreds of thousands of lines, each containing unique codes.

I understand where you are coming from now.

Ok, so the problem as I see it at the moment is that James isn't feeding the Bayes algorithm with quality tokens upon which it can work its statistical magic effectively.

I have to break any email down into its constituent parts (e.g. headers, body, attachments) and then intelligently extract whatever useful metadata (or, in the case of the email body, its actual data) I can get. So when I talk about capturing 'numbers' I'm talking in the context of one of these constituent email parts and not necessarily the email as a whole. I can see that it might even be beneficial in the future to have plugins that specialize in extracting tokens from particular MIME types... but not just yet!

When I say a 'token' I'm thinking about an object that represents not only a string and how often that string has been seen in a ham or spam corpus, but also a context and a timestamp. The context records which part of the email we are talking about and the timestamp records the date and time of the last recorded occurrence of this token in an email.
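Something like the following sketch, perhaps. All the names here are hypothetical; it just captures the four pieces of state described above.

```java
import java.time.Instant;

// A hypothetical token: the string value, the context it was seen in,
// per-corpus occurrence counts, and when it was last observed.
class Token {
    final String value;
    final String context;   // e.g. "HEADER-SUBJECT", "BODY"
    int hamCount;
    int spamCount;
    Instant lastSeen = Instant.EPOCH;

    Token(String value, String context) {
        this.value = value;
        this.context = context;
    }

    // Record one occurrence of this token in a ham or spam message.
    void observe(boolean spam, Instant when) {
        if (spam) spamCount++; else hamCount++;
        if (when.isAfter(lastSeen)) lastSeen = when;
    }
}
```

The (value, context) pair would form the key in the token database, so 'Free!' in the subject and 'Free!' in the body are counted separately.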

I think the context is important because it lets the Bayes algorithm learn for example that 'Free!' seen in the Subject: header is more spammy than the same string seen in the body of an email.

The timestamp will enable the otherwise ever-increasing spam and ham corpus to be kept in check by deleting those tokens whose counts haven't risen above, say, 2 in 6 months. I got this idea from Spamprobe [1].
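The pruning rule itself is a one-liner; a sketch, with the threshold and retention window as parameters rather than the hard-coded "2 in 6 months":

```java
import java.time.Duration;
import java.time.Instant;

class TokenPruner {
    // A token is stale if its total count never rose above maxCount AND it
    // hasn't been observed within the retention window (e.g. 2 and 6 months).
    public static boolean shouldPrune(int totalCount, Instant lastSeen,
                                      Instant now, int maxCount, Duration retention) {
        return totalCount <= maxCount && lastSeen.isBefore(now.minus(retention));
    }
}
```

A periodic sweep over the token database applying this predicate keeps the corpus bounded without ever touching frequently-seen tokens.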

IP and domain names, I don't think so.
Suppose you use the dot as a delimiter. Then each byte of the IP address becomes a token and gets its own weight. Much the same with domains.
Bayes should take care of the rest.
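That suggestion amounts to nothing more than a split on dots; a trivial sketch for comparison with the context-aware approach below:

```java
import java.util.Arrays;
import java.util.List;

class DotTokenizer {
    // Naive dot-delimited split: "89.16.177.117" -> ["89", "16", "177", "117"].
    // Each octet (or domain label) becomes an independent token.
    public static List<String> tokens(String s) {
        return Arrays.asList(s.split("\\."));
    }
}
```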

So following on from what I said above I'm talking about IP addresses and domain names as seen in the context of headers. Here's an example from a recent spam: -

  Received: from tedwoodsports.dh.bytemark.co.uk (HELO User) (89.16.177.117)
    by banddtruckparts.com with ESMTPA; 16 Oct 2012 21:01:11 -0400


In this example I would extract a token with context: HEADER-RECEIVED-IP and value: '89.16.177.117'. In fact, knowing how IP addresses are constructed, I could also record a similar token with value: 89.16.177, because it may be statistically significant that any of the 256 addresses that fall into this range are either spam or maybe even ham. I don't know that, but at least I'm giving the Bayes algorithm extra info that it may find statistically significant. If it isn't, then that token will be deleted after a while anyway.
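Generating the two values is simple; a sketch (the method name is mine):

```java
import java.util.ArrayList;
import java.util.List;

class IpTokens {
    // From "89.16.177.117" produce the full address plus its /24 prefix:
    // ["89.16.177.117", "89.16.177"]. Both would be stored under the
    // HEADER-RECEIVED-IP context.
    public static List<String> ipTokens(String ip) {
        List<String> out = new ArrayList<>();
        out.add(ip);
        int lastDot = ip.lastIndexOf('.');
        if (lastDot > 0) {
            out.add(ip.substring(0, lastDot));
        }
        return out;
    }
}
```

One could go further and emit the /16 and /8 prefixes too, on the same "let Bayes decide" reasoning.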

Similarly I would also create tokens with the context: HEADER-RECEIVED-FROM-DOMAIN and the following values: -

  tedwoodsports.dh.bytemark.co.uk
  dh.bytemark.co.uk
  bytemark.co.uk
  co.uk
  uk
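The suffix list above is just every trailing portion of the host name; a sketch of how it could be generated (again, hypothetical names):

```java
import java.util.ArrayList;
import java.util.List;

class DomainTokens {
    // "a.b.c" -> ["a.b.c", "b.c", "c"]: the name plus every parent domain,
    // each stored under the HEADER-RECEIVED-FROM-DOMAIN context.
    public static List<String> suffixes(String domain) {
        List<String> out = new ArrayList<>();
        String rest = domain;
        while (true) {
            out.add(rest);
            int dot = rest.indexOf('.');
            if (dot < 0) break;           // single label left: we're done
            rest = rest.substring(dot + 1);
        }
        return out;
    }
}
```

As with the IP prefixes, the broad suffixes ('co.uk', 'uk') will almost certainly be statistically neutral and get pruned; the interesting signal lives in the middle of the list.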

I'm keen to capture phrases (i.e. capturing two or more sequential words) as I've heard they improve detection at the expense of a larger token database.

Any pointers?


That's pretty straightforward, actually. Suppose you have the sentence "Mary had a little lamb"; then, capturing a phrase size of 2, you would generate the following token values in addition to the single-word tokens: -

  Maryhad
  hada
  alittle
  littlelamb
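A sketch of the sliding-window generation, joining the words with no separator to match the "Maryhad" style above (names are mine):

```java
import java.util.ArrayList;
import java.util.List;

class PhraseTokens {
    // Slide a window of n adjacent words over the text and concatenate each
    // window into a single phrase token.
    public static List<String> phrases(String text, int n) {
        String[] words = text.split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < n; j++) {
                sb.append(words[i + j]);
            }
            out.add(sb.toString());
        }
        return out;
    }
}
```

A sentence of w words yields w - n + 1 phrase tokens, which is where the larger token database comes from.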

I recommend you read Paul Graham's 'Better Bayesian Filtering' [2] (especially the bit titled 'Tokens'). It's fascinating stuff... or maybe I'm getting too old and geeky :-)

Image info needs extracting too. Things like the width, height, bit depth, type of encoding, Exif data, and any tags should all be captured.

...what would you use to extract image info?

I haven't used any graphics libraries recently, but a quick scan suggests 'Commons Sanselan' [3], which happily is an Apache project now. When it comes to extracting metadata from MS documents, I think Apache POI [4] is still a good choice.

David.

[1] http://spamprobe.sourceforge.net/
[2] http://www.paulgraham.com/better.html
[3] http://commons.apache.org/imaging/
[4] http://poi.apache.org/

