On Wed, 2003-03-12 at 23:19, Gordon Schumacher wrote:
> Here's one I've been seeing lately... subject lines and message bodies that 
> look like this:
> "S'end you'r Ad's to 3.5 M'illio'n De'sktops E'very'day."
> 
> I've done a fair amount of natural language processing work.  Let me think 
> about some clever ways to check for this sort of thing... perhaps a 
> dictionary lookup against punctuation-stripped words?  Would a lookup via 
> ispell be too much overhead?

It probably would be too much overhead if applied to the entire message,
but probably not if it only runs in the DATA.HEADER.SUBJECT context.
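
As a rough sketch of the subject-only lookup (Python here just for
illustration; the tiny KNOWN_WORDS set and the obfuscated_words name are
made up, and a real check would consult ispell or a system word list
instead):

```python
import string

# Hypothetical stand-in word list; a real setup would query ispell
# or load a system dictionary instead.
KNOWN_WORDS = {"send", "your", "ads", "to", "million", "desktops", "everyday"}

def obfuscated_words(subject, known_words=KNOWN_WORDS):
    """Return subject words that only become dictionary words
    once their embedded punctuation is stripped."""
    hits = []
    for raw in subject.split():
        # Trim punctuation at the edges so a trailing period is ignored.
        core = raw.strip(string.punctuation)
        stripped = "".join(ch for ch in core if ch not in string.punctuation)
        # Flag only words that had punctuation inside them and are
        # recognizable without it.
        if stripped != core and stripped.lower() in known_words:
            hits.append(raw)
    return hits

subject = "S'end you'r Ad's to 3.5 M'illio'n De'sktops E'very'day."
print(obfuscated_words(subject))
```

A contraction like "don't" is not flagged because "dont" is not a
dictionary word, but "it's" would be (it strips to "its"), so a single
hit should probably not count for much on its own.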

Another way might be to check the subject against a list of suspicious
characters (maybe others are used besides apostrophes?). If more than
some limit appear, a META.SUSPICIOUS_SUBJECT_CHARS=Y token could be
generated.
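
Something along these lines, as a sketch only: the character list and
the limit of 3 are guesses that would need tuning, and subject_tokens is
a made-up name rather than anything in TarProxy.

```python
# Assumed character list and threshold; both are placeholders to tune.
SUSPICIOUS_CHARS = set("'`*_~^|")
LIMIT = 3

def subject_tokens(subject, chars=SUSPICIOUS_CHARS, limit=LIMIT):
    """Emit the META token when the subject holds more than `limit`
    characters from the suspicious list."""
    count = sum(1 for ch in subject if ch in chars)
    if count > limit:
        return ["META.SUSPICIOUS_SUBJECT_CHARS=Y"]
    return []

print(subject_tokens("S'end you'r Ad's to 3.5 M'illio'n De'sktops E'very'day."))
print(subject_tokens("Don't forget lunch on Friday"))
```

This is much cheaper than a dictionary lookup since it is a single pass
over the subject with no external calls.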

I've seen other cases where words are broken not by single chars but by
HTML comments.  I think POPFile catches these, but I'm not certain.  T h
e n   t h e r e   a r e   m e s s a g e s   l i k e   t h i s.  I think
POPFile also catches these, labelling them "spacedout" or something like
that, so it might be helpful to peek at that code to build a Tokenizer
that will work the same if POPFile itself is not used as the Tokenizer.
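
I haven't looked at how POPFile actually does it, so the following is
not its implementation, just a sketch of the two checks a standalone
Tokenizer might make. The function names and the "five or more single
characters" threshold are assumptions.

```python
import re

# Matches HTML comments, including ones that span several lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

# Five or more single characters separated by single spaces,
# e.g. "t h e r e". The threshold of five is an assumption.
SPACED_RUN = re.compile(r"(?<!\S)(?:\S ){4,}\S(?!\S)")

def strip_html_comments(text):
    """Remove HTML comments so words split by them rejoin."""
    return HTML_COMMENT.sub("", text)

def has_spaced_out_text(text):
    """Return True if the text contains a run of spaced-out characters."""
    return SPACED_RUN.search(text) is not None

print(strip_html_comments("mort<!-- xyz -->gage"))
print(has_spaced_out_text("T h e n   t h e r e   a r e   m e s s a g e s"))
```

Stripping the comments before the normal tokenizing pass means the
rejoined word is seen as usual. The spaced-out check would also fire on
something like a run of single digits, so it is probably better used to
generate a token than to reject a message outright.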

- Marty

-- 
Marty Lamb
Martian Software
<mlamb at martiansoftware dot com>

----
: The tarproxy-list mailing list is archived at
:   http://www.mail-archive.com/tarproxy-list%40martiansoftware.com/
:
: To unsubscribe from this list, follow the instructions at
:   http://www.martiansoftware.com/contact.html
:
: TarProxy's project page can be found at
:   http://www.martiansoftware.com/tarproxy
