Re: How does SA detect non-English language?

Robert Menschel Sat, 27 Aug 2005 10:19:17 -0700

Hello John,

Friday, August 26, 2005, 6:25:14 AM, you wrote:


JH> Hello,

JH> We have had a complaint from a user that some of his Japanese mail
JH> (being received by us) is always marked by SA as spam. As a University
JH> it is natural for us to receive foreign mail messages.

Understood.

JH>   X-Spam-Status: Yes, score=13.7 required=8.0 tests=BAYES_99,HTML_20_30,
JH>         HTML_MESSAGE,MANGLED_LOOK,SARE_HTML_P_MANY3,SARE_RAND_2,
JH>         SARE_RECV_IP_218216,SARE_SUB_ENC_ISO2022JP,SARE_SUB_PCT_LETTER,
JH>         SUBJ_ALL_CAPS autolearn=unavailable version=3.0.4

JH> Unfortunately at the time I had left included in our site-wide
JH> configuration some of the specific 'ENG' SARE rules, so that explains
JH> the SARE_SUB_ENC_ISO2022JP matching and bumping the score up a bit. The
JH> SARE_RECV_IP_218216 is also a bit worrying (the message may have passed
JH> through a known spam relay).

If you're using the latest SARE version, SARE_RECV_IP_218216 should be
scoring only 0.964, because we have detected ham coming through that
range of servers (though spam:ham > 100:1). If you can send me some
confirmed ham (full emails, headers and all), I can add those to my
corpus and that will help drive the score down.

MANGLED_LOOK is the larger concern, with a score of 2.3. Like the ENG
rules, the MANGLED rules file should not be used if you expect any
significant non-English ham.  I would remove that file from your
collection.

The 70_sare_obfu*.cf file set is slowly replacing MANGLED, and seems
to be successful in avoiding most language problems.

SARE_RAND_2 also scores 2.5. It tests for a specific string
suggesting that a broken ratware configuration inserted something like
%RND into the email. I suppose it's possible, but it seems unlikely
that a Japanese email would match that pattern.  If you can send me
the exact email that does so, maybe I can track that down.

SARE_HTML_P_MANY3 scores only 0.217, so that's not much of a concern.

SARE_SUB_PCT_LETTER, with a score of 1.152, is also a significant
contributor. It matches a percent sign followed by a single letter
and then a word break. There is no percent sign in the raw subject you
posted, so I assume it shows up in the subject after decoding, which
seems strange. Again, a copy of that exact email would help me analyze
this.
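
One possible explanation, offered as an assumption rather than anything
confirmed in this thread: ISO-2022-JP encodes each katakana character as
two 7-bit bytes, the first of which is 0x25, the ASCII percent sign. A
subject that is base64-encoded on the wire shows no percent sign, but
the decoded bytes can. The sketch below uses Python's iso2022_jp codec
and a hypothetical pattern, PCT_LETTER, that only approximates the rule
as described above. It is not the actual SARE rule text, and the sample
subject is made up.

```python
import re

# Hypothetical approximation of the description "percent sign, single
# letter, word break". This is NOT the real SARE_SUB_PCT_LETTER pattern.
PCT_LETTER = re.compile(rb"%[A-Za-z]\b")


def raw_iso2022jp(subject: str) -> bytes:
    """Return the 7-bit ISO-2022-JP bytes for a subject string."""
    return subject.encode("iso2022_jp")


def looks_like_pct_letter(raw: bytes) -> bool:
    """True if the raw bytes contain '%' + one letter + a word break."""
    return PCT_LETTER.search(raw) is not None


if __name__ == "__main__":
    sample = "テスト"  # made-up katakana subject ("test")
    raw = raw_iso2022jp(sample)
    print(raw)                         # inspect the bytes a rule would see
    print(looks_like_pct_letter(raw))
```

If the subject in question contains katakana, checking its decoded
bytes this way might show whether the percent-sign rules are reacting
to the encoding itself rather than to anything a spammer inserted.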

The biggest concern, as Matt pointed out, is your BAYES_99. If these
messages are indeed ham, then you need to train Bayes on them as ham,
because your Bayes system firmly believes that they are spam.
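
Training is typically done with sa-learn, which ships with
SpamAssassin. A minimal sketch, assuming the misclassified messages
have been collected in a directory or an mbox file. The path and the
helper name below are hypothetical, and the command would need to run
as the user whose Bayes database SpamAssassin consults at scan time.

```python
import subprocess
from typing import List


def sa_learn_ham_command(path: str, mbox: bool = False) -> List[str]:
    """Build an sa-learn command line that trains `path` as ham."""
    cmd = ["sa-learn", "--ham"]
    if mbox:
        cmd.append("--mbox")  # treat `path` as an mbox file
    cmd.append(path)
    return cmd


if __name__ == "__main__":
    # Hypothetical location of the confirmed Japanese ham.
    cmd = sa_learn_ham_command("/path/to/confirmed-ham")
    subprocess.run(cmd, check=True)
```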

Bob Menschel
