Hi Josip,
Thanks for your comments.
On 24/10/12 15:42, Josip Almasi wrote:
I think I'll wait till it works with Java 7. (The workaround didn't work
for me.)
I didn't know that. I'm OK with Java 6 for the moment as that is the
default with Ubuntu 12.04. Still not quite comfortable with this iced
tea business though... I prefer 100% Java beans :-)
So my first plan is to make the tokenizer more intelligent. It
should carefully extract far more meta-data from the email.
Wrote some mail parsing code, parses plain text and html, ignores
other MIME types. For others, I guess only headers should be taken
into account.
Malformed MIMEs are a real issue there. So I used heuristics to avoid
them - number of tokens and size of tokens.
Also, it is better to ignore numbers, or use them as delimiters.
Of course, all message parts need to be processed. That's not cheap,
and should be limited, by max allowed time and/or number of tokens.
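
A rough sketch of the limits described above (token count, token size,
time budget, digits as delimiters). The class name BoundedTokenizer and
all limit values are hypothetical, chosen only for illustration; this is
not existing James code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits text into lower-case word tokens under hard limits. */
final class BoundedTokenizer {
    private final int maxTokens;      // stop once this many tokens were collected
    private final int maxTokenLength; // longer runs are dropped as likely debris
    private final long maxNanos;      // time budget for one message part

    BoundedTokenizer(int maxTokens, int maxTokenLength, long maxMillis) {
        this.maxTokens = maxTokens;
        this.maxTokenLength = maxTokenLength;
        this.maxNanos = maxMillis * 1000000L;
    }

    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        long deadline = System.nanoTime() + maxNanos;
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                // Keep at most maxTokenLength + 1 chars so oversized runs stay cheap.
                if (current.length() <= maxTokenLength) {
                    current.append(Character.toLowerCase(c));
                }
                continue;
            }
            // Anything that is not a letter, digits included, ends the current token.
            flush(current, tokens);
            if (tokens.size() >= maxTokens || System.nanoTime() - deadline > 0) {
                break; // token or time budget exhausted
            }
        }
        flush(current, tokens);
        return tokens;
    }

    private void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0 && current.length() <= maxTokenLength
                && tokens.size() < maxTokens) {
            tokens.add(current.toString());
        }
        current.setLength(0);
    }
}
```

The time check only runs at delimiters, so a real implementation would
probably also want to check it inside very long runs.
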
That's very interesting. Did you use the Mime4J library to do the heavy
lifting or did you parse all the message yourself?
That's a good point about malformed MIMEs. Even with the relatively
small number of spams I've collected I noticed a number of deviant
practices.
Not so sure about ignoring numbers though. I certainly need to capture
IP addresses, HTML and CSS colour settings and also domain names. I can
see there will be a lot of tweaking involved.
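
One way to keep those numeric features while still treating other digits
as noise might be to pull them out with regular expressions before the
word tokenizer runs. A minimal sketch follows; the class name, the token
prefixes and the patterns themselves are assumptions, and the patterns
are deliberately loose (for example, IPv4 octets are not range-checked).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts IP, colour and domain tokens from raw text. */
final class StructuredTokens {
    // Dotted quad; does not verify that each octet is in 0..255.
    private static final Pattern IPV4 =
            Pattern.compile("\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b");
    // #rgb or #rrggbb as used in HTML attributes and CSS.
    private static final Pattern HEX_COLOUR =
            Pattern.compile("#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\\b");
    // One or more labels followed by an alphabetic top-level label.
    private static final Pattern DOMAIN =
            Pattern.compile("\\b(?:[a-zA-Z0-9-]+\\.)+[a-zA-Z]{2,}\\b");

    static List<String> extract(String text) {
        List<String> tokens = new ArrayList<String>();
        collect(IPV4, "ip:", text, tokens);
        collect(HEX_COLOUR, "colour:", text, tokens);
        collect(DOMAIN, "domain:", text, tokens);
        return tokens;
    }

    private static void collect(Pattern p, String prefix, String text, List<String> out) {
        Matcher m = p.matcher(text);
        while (m.find()) {
            // Prefix keeps these tokens distinct from ordinary words in the database.
            out.add(prefix + m.group().toLowerCase(Locale.ENGLISH));
        }
    }
}
```
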
I'm keen to capture phrases (i.e. capturing two or more sequential words)
as I've heard they improve detection at the expense of a larger token
database.
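
As an illustration of the phrase idea, here is a small sketch that emits
every run of up to maxWords consecutive words as an extra token. The
class and method names are hypothetical. It also shows where the
database growth comes from: each word contributes up to maxWords tokens
instead of one.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: adds word n-grams ("phrases") to single-word tokens. */
final class PhraseTokens {
    static List<String> withPhrases(List<String> words, int maxWords) {
        List<String> out = new ArrayList<String>();
        for (int start = 0; start < words.size(); start++) {
            StringBuilder phrase = new StringBuilder();
            // Grow the phrase one word at a time, emitting each length.
            for (int n = 0; n < maxWords && start + n < words.size(); n++) {
                if (n > 0) {
                    phrase.append(' ');
                }
                phrase.append(words.get(start + n));
                out.add(phrase.toString());
            }
        }
        return out;
    }
}
```

With the words "buy", "cheap", "pills" and maxWords set to 2, this
yields "buy", "buy cheap", "cheap", "cheap pills" and "pills".
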
Image info needs extracting too. So things like the width, height, bit
depth, type of encoding, Exif data and any tags should all be captured.
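
For the basic image properties, the standard javax.imageio API can read
the format and dimensions from the header without decoding the pixels.
A sketch, assuming the attachment has already been base64-decoded into a
stream; the class name and token prefixes are made up, and bit depth,
Exif data and tags are not covered here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;

/** Hypothetical helper: turns basic image properties into tokens. */
final class ImageTokens {
    static List<String> describe(InputStream decodedImage) throws IOException {
        List<String> tokens = new ArrayList<String>();
        ImageInputStream in = new MemoryCacheImageInputStream(decodedImage);
        try {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                // No installed reader recognises the data.
                tokens.add("img-format:unknown");
                return tokens;
            }
            ImageReader reader = readers.next();
            try {
                // seekForwardOnly = true, ignoreMetadata = true: header info only.
                reader.setInput(in, true, true);
                tokens.add("img-format:"
                        + reader.getFormatName().toLowerCase(Locale.ENGLISH));
                tokens.add("img-width:" + reader.getWidth(0));
                tokens.add("img-height:" + reader.getHeight(0));
            } finally {
                reader.dispose();
            }
        } finally {
            in.close();
        }
        return tokens;
    }
}
```
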
I quite often get large (several megabyte) emails from China containing
pictures of products for me and the current James setup gives up with
messages of that size. Or rather it creates thousands of random tokens
full of base64 segments!
I worry how big the spam folder may get if I'm not deleting spam
messages.
Well, I'm not deleting any spam:) You never know when you may need some;)
Right now I have 143286 unread in my junk folder, total is 250k+, all
correctly marked as 100% spam, 850MB.
I'm envious... erm.... I think! No seriously, that's got to be useful
to you someday. Maybe I should start collecting them instead of
deleting them too. I wonder how many of those are addressed to
'johnsmithsvt' :-)
Happy tokenizing!
David.