That's very interesting. Did you use the Mime4J library to do the
heavy lifting, or did you parse the whole message yourself?
I used javax.mail, starting from a good mail-parsing example included
with it. I parsed the HTML with javax.swing.text.html.HTMLEditorKit.
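For anyone curious, here is a minimal sketch of that HTMLEditorKit approach: strip the tags from an HTML body and keep only the text, ready for tokenizing. The class name HtmlText is mine, not from any existing code: -

```java
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Sketch: pull the plain text out of an HTML mail body with HTMLEditorKit. */
public class HtmlText {
    public static String extract(String html) throws Exception {
        final StringBuilder sb = new StringBuilder();
        // The callback receives parser events; we only care about text runs.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                sb.append(data).append(' ');
            }
        };
        // 'true' tells the parser to ignore any charset declaration in the HTML.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return sb.toString().trim();
    }
}
```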
Thanks for the hint. I'll take a look.
Not so sure about ignoring numbers, though. We certainly need to
capture IP addresses, HTML and CSS colour settings, and also domain
names. I can see there will be a lot of tweaking involved.
The catch with numbers is that I received some CSV files containing
database table dumps: hundreds of thousands of lines, each containing
unique codes.
I understand where you are coming from now.
Ok, so the problem as I see it at the moment is that James isn't feeding
the Bayes algorithm with quality tokens upon which it can work its
statistical magic effectively.
I have to break any email down into its constituent parts (e.g. headers,
body, attachments) and then intelligently extract whatever useful
metadata (or, in the case of the email body, its actual data) I can get.
So when I talk about capturing 'numbers' I'm talking in the context of
one of these constituent email parts and not necessarily the email as a
whole. I can see that it might even be beneficial in the future to have
plugins that specialize in extracting tokens from particular mime
types... but not just yet!
When I say a 'token' I'm thinking about an object which not only
represents a string and how often that string has been seen in a ham or
spam corpus but also a context and a timestamp. The context records
which part of the email we are talking about and the timestamp records
the date and time of the last recorded occurrence of this token in an email.
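A rough sketch of what such a token object might look like, including the 6-month expiry rule mentioned below (the class and method names are my own invention, purely illustrative): -

```java
import java.time.Duration;
import java.time.Instant;

/** A token: a string value plus its context, ham/spam counts and last-seen time. */
public class Token {
    private final String value;     // e.g. "Free!"
    private final String context;   // which part of the email, e.g. "HEADER-SUBJECT"
    private int hamCount;
    private int spamCount;
    private Instant lastSeen;       // timestamp of the last recorded occurrence

    public Token(String value, String context) {
        this.value = value;
        this.context = context;
    }

    /** Record one more occurrence in a ham or spam message. */
    public void record(boolean spam) {
        if (spam) spamCount++; else hamCount++;
        lastSeen = Instant.now();
    }

    /** Prune rule: counts never rose above 2 and not seen for ~6 months. */
    public boolean isExpired(Instant now) {
        return hamCount + spamCount <= 2
            && lastSeen != null
            && lastSeen.isBefore(now.minus(Duration.ofDays(182)));
    }

    /** Context + value together identify the token in the corpus. */
    public String key() {
        return context + ":" + value;
    }
}
```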
I think the context is important because it lets the Bayes algorithm
learn for example that 'Free!' seen in the Subject: header is more
spammy than the same string seen in the body of an email.
The timestamp will enable the otherwise ever increasing spam and ham
corpus to be kept in check by deleting those tokens whose counts haven't
risen above say 2 in 6 months. I got this idea from Spamprobe [1].
IP addresses and domain names? I don't think so.
Suppose you use the dot as a delimiter. Then each byte of the IP
address becomes a token and gets its own weight. Much the same with
domains. Bayes should take care of the rest.
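In other words, the suggestion is simply: -

```java
public class DotSplit {
    // Split "89.16.177.117" on dots so each octet becomes its own token.
    public static String[] split(String ip) {
        return ip.split("\\.");
    }
}
```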
So following on from what I said above I'm talking about IP addresses
and domain names as seen in the context of headers. Here's an example
from a recent spam: -
Received: from tedwoodsports.dh.bytemark.co.uk (HELO User) (89.16.177.117)
by banddtruckparts.com with ESMTPA; 16 Oct 2012 21:01:11 -0400
In this example I would extract a token with context HEADER-RECEIVED-IP
and value '89.16.177.117'. In fact, knowing how IP addresses are
constructed, I could also record a similar token with value '89.16.177',
because it may be statistically significant that any of the 256
addresses that fall into this range are either spam or maybe even ham.
I don't know that, but at least I'm giving the Bayes algorithm extra info
that it may find statistically significant. If it isn't, that token
will be deleted after a while anyway.
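That IP extraction could be sketched like this: emit the full address plus the 256-address prefix (the helper name is hypothetical): -

```java
import java.util.ArrayList;
import java.util.List;

public class IpTokens {
    /** "89.16.177.117" -> the full address plus its first three octets. */
    public static List<String> tokens(String ip) {
        List<String> out = new ArrayList<String>();
        out.add(ip);                          // e.g. "89.16.177.117"
        int lastDot = ip.lastIndexOf('.');
        if (lastDot > 0) {
            out.add(ip.substring(0, lastDot)); // e.g. "89.16.177"
        }
        return out;
    }
}
```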
Similarly I would also create tokens with the context:
HEADER-RECEIVED-FROM-DOMAIN and the following values: -
tedwoodsports.dh.bytemark.co.uk
dh.bytemark.co.uk
bytemark.co.uk
co.uk
uk
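Generating that list of domain suffixes is just a matter of repeatedly chopping off the leftmost label (again, the class name is mine): -

```java
import java.util.ArrayList;
import java.util.List;

public class DomainTokens {
    /** "dh.bytemark.co.uk" -> ["dh.bytemark.co.uk", "bytemark.co.uk", "co.uk", "uk"]. */
    public static List<String> tokens(String domain) {
        List<String> out = new ArrayList<String>();
        String d = domain;
        while (true) {
            out.add(d);
            int dot = d.indexOf('.');
            if (dot < 0) {
                break;             // no labels left to strip
            }
            d = d.substring(dot + 1); // drop the leftmost label
        }
        return out;
    }
}
```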
I'm keen to capture phrases (i.e. capturing two or more sequential
words), as I've heard they improve detection at the expense of a
larger token database.
Any pointers?
That's pretty straightforward actually. Suppose you have a sentence
"Mary had a little lamb" then you would generate the following token
values in addition to the single word tokens if you were capturing a
phrase size of 2: -
Mary had
had a
a little
little lamb
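As a sketch, a sliding window over the word list does the job; I've joined each pair with a space here, but the separator (or none at all) is an arbitrary choice: -

```java
import java.util.ArrayList;
import java.util.List;

public class PhraseTokens {
    /** Emit every run of n consecutive words as a single phrase token. */
    public static List<String> phrases(String text, int n) {
        String[] words = text.trim().split("\\s+");
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + n <= words.length; i++) {
            StringBuilder sb = new StringBuilder(words[i]);
            for (int j = 1; j < n; j++) {
                sb.append(' ').append(words[i + j]);
            }
            out.add(sb.toString());
        }
        return out;
    }
}
```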
I recommend you read Paul Graham's 'Better Bayesian Filtering' [2]
(especially the bit titled 'Tokens'). It's fascinating stuff... or
maybe I'm getting too old and geeky :-)
Image info needs extracting too. So things like the width, height,
bit depth, type of encoding, Exif data and any tags should all be
captured.
...what would you use to extract image info?
I haven't used any graphics libraries recently, but a quick scan suggests
'Commons Sanselan' [3], which happily is an Apache project now. When it
comes to extracting metadata from MS documents, I think Apache POI [4]
is still a good choice.
David.
[1] http://spamprobe.sourceforge.net/
[2] http://www.paulgraham.com/better.html
[3] http://commons.apache.org/imaging/
[4] http://poi.apache.org/