Thanks again for the explanation.  Looking forward to a future release when
soft-hyphens (and additional control characters?) are essentially ignored.

On Wed, Sep 7, 2022 at 9:14 AM Thomas Eckardt <thomas.ecka...@thockar.com>
wrote:

> If Unicode normalization (NFKC) doesn't meet your requirement, you may
> enable 'DoTransliterate', at the cost of some performance.
>
> "Unicode Technical Standard #39" (http://www.unicode.org/reports/tr39/)
> gives more information, and
> https://www.unicode.org/Public/security/revision-05/intentional.txt shows
> a nice table for Cyrillic and Greek.
> If someone expects ASCII mail, those transliterations may help somewhat. But
> in all other cases (100% Cyrillic/Greek/...), such a character replacement
> is counterproductive (for example, not all Cyrillic letters have a valid
> Latin replacement).
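To illustrate the difference between NFKC and a look-alike transliteration, here is a minimal Python sketch. It is not ASSP's code (ASSP is written in Perl) and says nothing about how 'DoTransliterate' is implemented. The `LOOKALIKES` table and the function names are hypothetical, and the table is a tiny hand-picked subset, not the Unicode confusables data.

```python
import unicodedata

# Hypothetical, deliberately tiny look-alike table (Cyrillic -> Latin).
# A real table would be derived from the Unicode security data files.
LOOKALIKES = str.maketrans({
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0455": "s",  # CYRILLIC SMALL LETTER DZE
})


def nfkc(text: str) -> str:
    """Apply Unicode compatibility normalization only."""
    return unicodedata.normalize("NFKC", text)


def fold_lookalikes(text: str) -> str:
    """Normalize, then map the listed Cyrillic look-alikes to Latin."""
    return nfkc(text).translate(LOOKALIKES)


fullwidth = "\uff53\uff50\uff41\uff4d"  # "spam" in fullwidth Latin letters
mixed = "\u0455\u0440\u0430m"           # Cyrillic dze, er, a + Latin "m"

nfkc_fullwidth = nfkc(fullwidth)        # compatibility forms are folded
nfkc_mixed = nfkc(mixed)                # Cyrillic letters are left untouched
folded_mixed = fold_lookalikes(mixed)   # look-alikes mapped to Latin
```

NFKC folds compatibility forms such as fullwidth letters, but it does not touch Cyrillic letters, because they are distinct characters and not compatibility variants. The extra mapping step is lossy on text that is genuinely Cyrillic, which is the caveat raised above.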
>
> > potentially treat look-alike characters as the Latin character for
> > Bayesian purposes
>
> The HMM and Bayesian engines use heuristic mechanisms. Trying to
> treat single characters as Latin (or anything else) is not worth the
> effort. Over a short period of time, both engines will also have learned
> the obscured words (and word combinations).
>
>
> Thomas
>
>
>
>
> From:        "K Post" <nntp.p...@gmail.com>
> To:          "ASSP development mailing list" <assp-test@lists.sourceforge.net>
> Date:        06.09.2022 21:31
> Subject:     Re: [Assp-test] soft hyphen fooling Bayesian analysis
> ------------------------------
>
>
>
> Eager to see what you come up with in terms of ignoring the soft hyphen.
>
> Your <<<\P{Cyrillic}\p{Cyrillic}+\P{Cyrillic}>>> regex is clear, and I
> understand using that for scoring purposes, but I'm looking for a way to
> potentially treat look-alike characters as the Latin character for Bayesian
> purposes and/or to catch commonly obscured words (like GeekSquad).  Is it
> okay if I reply further in my August 1 post here to keep that in the same
> thread?
>
> On Tue, Sep 6, 2022 at 2:06 PM Thomas Eckardt <thomas.ecka...@thockar.com>
> wrote:
> >HTML::strip
>
> HTML parsing to extract the text parts has nothing to do with HTML de(en)coding.
>
>
> >iso-8859-1
> ASSP processes all content as UTF-8
>
>
> >&shy;
> ASSP is aware of this: it replaces soft hyphens with hard hyphens,
> and multiple consecutive hard hyphens with a single one.
> However, the option to remove the soft hyphens instead sounds somewhat
> better. Tests are still running.
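The two strategies described here can be compared with a short Python sketch. This is an illustration only, not ASSP's actual implementation; the function names are hypothetical.

```python
import re

SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN


def replace_with_hard_hyphen(text: str) -> str:
    """Turn soft hyphens into '-' and collapse runs of '-' into one."""
    text = text.replace(SOFT_HYPHEN, "-")
    return re.sub(r"-{2,}", "-", text)


def remove_soft_hyphen(text: str) -> str:
    """Drop soft hyphens entirely so the word is rejoined."""
    return text.replace(SOFT_HYPHEN, "")


# Obfuscated sample; the second word starts with two soft hyphens in a row.
sample = "sp\u00adammy p\u00ad\u00adhr\u00adase"

replaced = replace_with_hard_hyphen(sample)
removed = remove_soft_hyphen(sample)
```

With replacement the word fragments stay separated by a visible hyphen, while removal restores the words a reader would actually see.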
>
> >My thinking is that if it doesn't display.....
> ASSP doesn't know whether something is displayed or not (and never will).
>
>
> >I suspect that other characters will be abused in the same way
> &nbsp; as well as several BIG5, numerical and other Unicode characters are
> already handled specially by ASSP. Other control characters are ignored by
> ASSP. Everything is converted to UTF-8, Unicode normalized (including
> grapheme clusters), stemmed and simplified.
>
>
> >This kind of obfuscation goes hand in hand with my previous questions
> about considering some non-Latin characters that look like Latin characters
> as those Latin alphabet characters.
>
> With some Unicode knowledge, some help from the analyzer and some regex
> knowledge, such things are easy to find.
> For example: <<<\P{Cyrillic}\p{Cyrillic}+\P{Cyrillic}>>>
> finds a sequence where Cyrillic letters (a p b ....) are used in words,
> which is commonly done by spammers.
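For readers who want to experiment outside Perl, here is a rough Python equivalent of that pattern. Python's standard `re` module has no `\p{Cyrillic}` script property, so this sketch assumes the basic Cyrillic block U+0400..U+04FF as a stand-in; that range is narrower than the full script. The helper name is hypothetical.

```python
import re

# Assumption: the basic Cyrillic block approximates Perl's \p{Cyrillic}.
CYR = "\u0400-\u04ff"

# One non-Cyrillic character, a run of Cyrillic, one non-Cyrillic character.
EMBEDDED_CYRILLIC = re.compile(f"[^{CYR}][{CYR}]+[^{CYR}]")


def find_embedded_cyrillic(text: str):
    """Return the first Cyrillic run with its two neighbours, or None."""
    match = EMBEDDED_CYRILLIC.search(text)
    return match.group(0) if match else None


spoofed = "Ge\u0435kSquad"  # third letter is CYRILLIC SMALL LETTER IE
plain = "GeekSquad"         # all Latin

hit = find_embedded_cyrillic(spoofed)
miss = find_embedded_cyrillic(plain)
```

Note that whitespace and punctuation also count as non-Cyrillic, so the pattern matches ordinary Cyrillic words inside mixed-language text too. That seems to fit its use for scoring, not as a hard block.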
>
> Thomas
>
>
>
> From:        "K Post" <nntp.p...@gmail.com>
> To:          "ASSP development mailing list" <assp-test@lists.sourceforge.net>
> Date:        06.09.2022 16:16
> Subject:     [Assp-test] soft hyphen fooling Bayesian analysis
> ------------------------------
>
>
>
>
> Is there a way to improve the way that ASSP parses certain special,
> non-printing, characters?  I'm having trouble with spam emails that have
> their body heavily obfuscated with "soft hyphens" slipping through.  They
> all seem to have multipart bodies, first with an iso-8859-1 text part with
> *=AD* interspersed in words and then an html part with *&shy;* all
> over the place.  These are the "soft hyphen," a hyphen that only prints if
> it is needed to break the word to the next line.  It's clever.  The user
> doesn't see the character, but ASSP thinks it's a word boundary.
>
> The first part
> Content-Type: text/plain; charset="*iso-8859-1*"
> Content-Transfer-Encoding: quoted-printable
> will be plain text and have spammy words with *=AD* inserted in the
> middle of them, for example, "This is a sentence with spammy phrase." could
> be written something like
> This is a sentence with sp=ADammy p=ADhr=ADase.
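As a quick check of what that encoded line contains, the following Python sketch decodes it with the standard `quopri` module and then interprets the bytes as iso-8859-1. It only demonstrates the encoding; it is not how ASSP decodes mail.

```python
import quopri

# The quoted-printable example line from the message, as raw bytes.
raw = b"This is a sentence with sp=ADammy p=ADhr=ADase."

# =AD is the byte 0xAD, which in iso-8859-1 is U+00AD SOFT HYPHEN.
decoded = quopri.decodestring(raw).decode("iso-8859-1")

soft_hyphen_count = decoded.count("\u00ad")
visible_text = decoded.replace("\u00ad", "")
```

After decoding, the text contains three soft hyphens, and dropping them gives back the sentence as a mail client would normally render it.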
>
> The next mime part is the html, which does the same thing, but uses &shy;
> (html for soft hyphen) mid-word.  So, something like:
> <p>This is a sentence with sp&shy;ammy p&shy;hr&shy;ase in it</p>
>
> The whole body of the message is filled with these soft hyphens anywhere
> that there are spammy words/phrases, and in many cases, there are soft
> hyphens every couple of letters across the entire body.  When I do an
> analysis, it appears that the soft hyphen tricks ASSP into thinking that
> each part of the word is a separate word, so for sp&shy;ammy
> p&shy;hr&shy;ase, it thinks the words are
> sp ammy p hr ase
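The splitting effect can be reproduced with a small Python sketch. The tokenizer here is a plain `\w+` regex chosen for illustration; it is an assumption and not ASSP's real word splitter.

```python
import html
import re

# HTML fragment from the example, with &shy; inside the words.
markup_text = "sp&shy;ammy p&shy;hr&shy;ase"

# html.unescape turns &shy; into the soft hyphen character U+00AD.
text = html.unescape(markup_text)

# U+00AD is not a word character, so a simple tokenizer splits on it.
naive_tokens = re.findall(r"\w+", text)

# Dropping the soft hyphen before tokenizing rejoins the words.
cleaned_tokens = re.findall(r"\w+", text.replace("\u00ad", ""))
```

The naive tokens are the five fragments listed above, while the cleaned tokens are the two words the reader sees.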
>
> I am using HTML::strip.  Would TreeBuilder work better?  I'm concerned
> about performance there.
>
> Is there a way (and is it a good idea) to somehow instruct ASSP to treat
> certain html special characters as ones to ignore, and others to be treated
> as a word separator?  My thinking is that if it doesn't display, then it
> should be ignored when doing bayesian / HMM evaluation.
>
> https://cs.stanford.edu/people/miles/iso8859.html has a bunch of
> Control Characters and Special Characters that don't print - or in the case
> of the soft hyphen, only print when the contained word is at the end of a
> line.  I suspect that other characters will be abused in the same way.
>
> This kind of obfuscation goes hand in hand with my previous questions
> about considering some non-Latin characters that look like Latin characters
> as those Latin alphabet characters.
>
> Thanks
>
>
>
>
>
>
> _______________________________________________
> Assp-test mailing list
> Assp-test@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/assp-test
>
