Hi Josip,

Thanks for your comments.

On 24/10/12 15:42, Josip Almasi wrote:

I think I'll wait till it works with Java 7. (The workaround didn't work for me.)

I didn't know that. I'm OK with Java 6 for the moment, as that is the default with Ubuntu 12.04. I'm still not quite comfortable with this IcedTea business though... I prefer 100% Java beans :-)

So my first plan is to make the tokenizer more intelligent. It should carefully extract far more meta-data from the email.

I wrote some mail parsing code. It parses plain text and HTML and ignores other MIME types. For the others, I guess only headers should be taken into account. Malformed MIMEs are a real issue there, so I used heuristics to avoid them: number of tokens and size of tokens.
Also, it is better to ignore numbers, or use them as delimiters.
Of course, all message parts need to be processed. That's not cheap, and should be limited by max allowed time and/or number of tokens.

That's very interesting. Did you use the Mime4J library to do the heavy lifting or did you parse all the message yourself?

That's a good point about malformed MIMEs. Even with the relatively small number of spams I've collected I noticed a number of deviant practices.
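
A minimal sketch of the token-count and token-size heuristics mentioned above, with digits treated as delimiters. The class name BoundedTokenizer and the limit values are hypothetical, and this is not the code Josip wrote; a real version would probably add a time budget as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: caps both the number of tokens and the size of each token.
class BoundedTokenizer {
    private final int maxTokens;
    private final int maxTokenLength;

    BoundedTokenizer(int maxTokens, int maxTokenLength) {
        this.maxTokens = maxTokens;
        this.maxTokenLength = maxTokenLength;
    }

    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        // Split on every run of non-letters, so digits and punctuation act as delimiters.
        for (String candidate : text.split("[^\\p{L}]+")) {
            if (tokens.size() >= maxTokens) {
                break; // stop early: the message has produced enough tokens
            }
            if (candidate.isEmpty() || candidate.length() > maxTokenLength) {
                continue; // skip empty pieces and oversized (probably garbage) tokens
            }
            tokens.add(candidate.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }
}
```

For example, new BoundedTokenizer(3, 10).tokenize("Buy 2 cheap pills now today") stops after the first three tokens.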

I'm not so sure about ignoring numbers though. We certainly need to capture IP addresses, HTML and CSS colour settings, and also domain names. I can see there will be a lot of tweaking involved.
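
As a rough illustration of capturing those number-bearing items before any digit stripping happens, here is a sketch using regular expressions. The class name StructuredTokens, the token prefixes and the patterns themselves are assumptions for illustration only; the IPv4 pattern does not check that each octet is at most 255, and the colour pattern only handles hex notation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls structured tokens out of raw text or markup.
class StructuredTokens {
    // Four dot-separated groups of 1-3 digits (octet range is not validated).
    private static final Pattern IPV4 =
        Pattern.compile("\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b");
    // #rrggbb or #rgb hex colours as used in HTML attributes and CSS.
    private static final Pattern HEX_COLOUR =
        Pattern.compile("#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\\b");
    // One or more labels followed by an alphabetic top-level domain.
    private static final Pattern DOMAIN = Pattern.compile(
        "\\b(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z]{2,}\\b",
        Pattern.CASE_INSENSITIVE);

    static List<String> extract(String text) {
        List<String> tokens = new ArrayList<String>();
        collect(IPV4, "ip:", text, tokens);
        collect(HEX_COLOUR, "colour:", text, tokens);
        collect(DOMAIN, "domain:", text, tokens);
        return tokens;
    }

    // Adds every match of the pattern, lower-cased and tagged with its prefix.
    private static void collect(Pattern pattern, String prefix, String text, List<String> out) {
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            out.add(prefix + matcher.group().toLowerCase(Locale.ROOT));
        }
    }
}
```

Prefixing each token with its type keeps, say, an IP address distinct from the same digits appearing in body text.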

I'm keen to capture phrases (i.e. two or more sequential words), as I've heard they improve detection at the expense of a larger token database.
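
A small sketch of what two-word phrase capture might look like, assuming the words have already been tokenized. The class name PhraseTokens and the choice to join words with a space are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: adds two-word phrases (bigrams) to a list of single-word tokens.
class PhraseTokens {
    static List<String> withBigrams(List<String> words) {
        // Keep the single words, then append each adjacent pair as one token.
        List<String> tokens = new ArrayList<String>(words);
        for (int i = 0; i + 1 < words.size(); i++) {
            tokens.add(words.get(i) + " " + words.get(i + 1));
        }
        return tokens;
    }
}
```

A message of n words yields n - 1 extra tokens, which is where the larger token database comes from.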

Image info needs extracting too. So things like the width, height, bit depth, type of encoding, Exif data and any tags should all be captured. I quite often get large (several megabyte) emails from China containing pictures of products for me and the current James setup gives up with messages of that size. Or rather it creates thousands of random tokens full of base64 segments!
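
One possible way to turn an image attachment into a handful of tokens instead of thousands of base64 fragments is to read only its header with the standard javax.imageio API. This is a sketch under assumptions: the class name ImageTokens and the token names are hypothetical, the attachment is assumed to be already base64-decoded into a byte array, and bit depth, Exif data and tags are left out.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;

// Hypothetical helper: describes a decoded image attachment as a few tokens.
class ImageTokens {
    static List<String> describe(byte[] imageBytes) throws IOException {
        List<String> tokens = new ArrayList<String>();
        ImageInputStream in =
            new MemoryCacheImageInputStream(new ByteArrayInputStream(imageBytes));
        try {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                // No installed reader recognises the data; record that fact as a token.
                tokens.add("image:unreadable");
                return tokens;
            }
            ImageReader reader = readers.next();
            try {
                // Ignore metadata; width and height come from the header, not the pixels.
                reader.setInput(in, true, true);
                tokens.add("image-format:" + reader.getFormatName().toLowerCase(Locale.ROOT));
                tokens.add("image-width:" + reader.getWidth(0));
                tokens.add("image-height:" + reader.getHeight(0));
            } finally {
                reader.dispose();
            }
        } finally {
            in.close();
        }
        return tokens;
    }
}
```

Because the pixel data is never decoded, even a several-megabyte picture produces only three tokens.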


I worry how big the spam folder may get if I'm not deleting spam messages.

Well, I'm not deleting any spam:) You never know when you may need some;)
Right now I have 143286 unread in my junk folder, total is 250k+, all correctly marked as 100% spam, 850MB.


I'm envious... erm.... I think! No seriously, that's got to be useful to you someday. Maybe I should start collecting them instead of deleting them too. I wonder how many of those are addressed to 'johnsmithsvt' :-)

Happy tokenizing!
David.
