Is there a way to improve the way that ASSP parses certain special,
non-printing, characters?  I'm having trouble with spam emails that have
their body heavily obfuscated with "soft hyphens" slipping through.  They
all seem to have multipart bodies, first with an iso-8859-1 text part with
*=AD* interspersed in words and then an html part with *&shy;* all over
the place.  These are the "soft hyphen," a hyphen that only prints if it is
needed to break the word to the next line.  It's clever.  The user doesn't
see the character, but ASSP thinks it's a word boundary.

The first part

Content-Type: text/plain; charset="*iso-8859-1*"
Content-Transfer-Encoding: quoted-printable

will be plain text, and have spammy words with *=AD* inserted in the
middle of them, for example, "This is a sentence with spammy phrase." could
be written something like

This is a sentence with sp=ADammy p=ADhr=ADase.
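
To show what the decoder ends up with, here is a rough Python sketch (not
ASSP code, just an illustration using the sample line above).  It assumes
the part really is quoted-printable in iso-8859-1, as the headers claim.

```python
import quopri

# Sample body line from the example above (quoted-printable, iso-8859-1).
raw = b"This is a sentence with sp=ADammy p=ADhr=ADase."

# quopri turns each =AD into the single byte 0xAD; in iso-8859-1 that byte
# is U+00AD SOFT HYPHEN.
decoded = quopri.decodestring(raw).decode("iso-8859-1")

# Dropping the soft hyphens restores the words a reader actually sees.
cleaned = decoded.replace("\u00ad", "")
print(cleaned)
```

So after decoding, the plain-text part holds the same U+00AD character that
the html part spells as an entity.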


The next mime part is the html, which does the same thing, but uses &shy;
(html for soft hyphen) mid-word.  So, something like:

<p>This is a sentence with sp&shy;ammy p&shy;hr&shy;ase in it</p>
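
The same idea for the html part, again only as a Python sketch.  The regex
tag stripping here is a deliberately crude stand-in for a real HTML
stripper and is only meant to work on this one sample line.

```python
import html
import re

html_part = "<p>This is a sentence with sp&shy;ammy p&shy;hr&shy;ase in it</p>"

# Crude tag removal for the demo only; a real stripper handles far more.
text = re.sub(r"<[^>]+>", " ", html_part)

# html.unescape maps the named entity &shy; to U+00AD SOFT HYPHEN.
text = html.unescape(text)

# Remove the soft hyphens instead of treating them as separators.
cleaned = text.replace("\u00ad", "").strip()
print(cleaned)
```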


The whole body of the message is filled with these soft hyphens anywhere
that there's spammy words/phrases, and in many cases, there are soft
hyphens every couple of letters across the entire body.  When I do an
analysis, it appears that the soft hyphen tricks ASSP into thinking that
each part of the word is a separate word, so for sp&shy;ammy
p&shy;hr&shy;ase, it thinks the words are

sp ammy p hr ase
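
I can reproduce the same split outside ASSP with a simple word tokenizer.
This Python sketch is an assumption about the general mechanism, not a
description of ASSP's actual tokenizer: any tokenizer that treats U+00AD as
a non-word character will break the words the same way.

```python
import re

SOFT_HYPHEN = "\u00ad"

def tokens(text):
    # Stand-in tokenizer: runs of word characters.  U+00AD is a format
    # character, not a word character, so it ends the current token.
    return re.findall(r"\w+", text)

obfuscated = "sp\u00adammy p\u00adhr\u00adase"

split_tokens = tokens(obfuscated)
joined_tokens = tokens(obfuscated.replace(SOFT_HYPHEN, ""))

print(split_tokens)   # fragments, as in the analysis output above
print(joined_tokens)  # whole words once the soft hyphens are removed
```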


I am using HTML::Strip.  Would HTML::TreeBuilder work better?  I'm concerned
about performance there.

Is there a way (and is it a good idea) to somehow instruct ASSP to treat
certain html special characters as ones to ignore, and others to be treated
as a word separator?  My thinking is that if it doesn't display, then it
should be ignored when doing Bayesian / HMM evaluation.
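
What I have in mind is roughly the following, sketched in Python.  The two
lists are hypothetical and only there to show the shape of the idea; which
characters belong in which list is exactly what I'm asking about.

```python
# Hypothetical policy lists (my guesses, not anything ASSP defines).
# Characters that never display on their own: drop them.
IGNORE = "\u00ad\u200b\u200c\u200d\u2060\ufeff"
# Characters that display as a gap: treat them as a word separator.
SEPARATE = "\u00a0\u2002\u2003\u2009"

# str.translate deletes code points mapped to None and replaces the rest.
TABLE = {ord(c): None for c in IGNORE}
TABLE.update({ord(c): " " for c in SEPARATE})

def normalize(text):
    """Apply the ignore/separate policy before tokenizing."""
    return text.translate(TABLE)

print(normalize("sp\u00adam\u200bmy\u00a0offer"))
```

One caveat: U+200C and U+200D (zero-width non-joiner and joiner) are
meaningful in some scripts, so blindly dropping them may not be right for
mail in those languages.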

https://cs.stanford.edu/people/miles/iso8859.html has a bunch of Control
Characters and Special Characters that don't print - or in the case of the
soft hyphen, only print when the contained word is at the end of a line.  I
suspect that other characters will be abused in the same way.
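
As a starting point for finding candidates, the Unicode general categories
can be used to list the non-printing code points in the iso-8859-1 range.
This is only an inventory sketch in Python; it does not decide how each
character should be treated.

```python
import unicodedata

# Cc = control, Cf = format (invisible), Zs = space separator.
by_category = {"Cc": [], "Cf": [], "Zs": []}

# Code points 0-255 correspond one to one to iso-8859-1 bytes.
for code in range(256):
    cat = unicodedata.category(chr(code))
    if cat in by_category:
        by_category[cat].append(code)

for cat, codes in by_category.items():
    print(cat, [hex(c) for c in codes])
```

In this range the soft hyphen (0xAD) is the only format character, and the
no-break space (0xA0) is the only space besides the ordinary one.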

This kind of obfuscation goes hand in hand with my previous questions about
considering some non-Latin characters that look like Latin characters as
those Latin alphabet characters.

Thanks
_______________________________________________
Assp-test mailing list
Assp-test@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/assp-test
