Re: Japanese False Postives with Spam Assassin 3.01 and RH WS 3.0

2004-12-01 Thread alan premselaar
Johnson, Robert F wrote:
> Hi,
>
> I have been having a high occurrence of Japanese false positives since
> upgrading from SpamAssassin 2.64 on RedHat 7.3 with MIMEDefang 2.31 to
> SpamAssassin 3.01 on RedHat Workstation 3.0 installed site wide via
> MIMEDefang 2.44.  I am wondering if this is due to the problem with Red
> Hat 9.0 Unicode UTF-8.  I had no issues with Japanese false positives in
> the RH 7.3 based environment.
>
> I have a few articles regarding this issue, but need some help
> understanding correct LANG configurations for SpamAssassin 3.01 on
> RedHat Workstation 3.0 installed site wide via MIMEDefang 2.44.
>
> I currently have the following set in /etc/sysconfig/i18n (we are US
> based):
>
> LANG=en_US
> SUPPORTED=en_US
>
> I compiled SpamAssassin from tarball with LANG set to en_US (export
> LANG=en_US).  Are these settings correct?  Could this be causing the
> Japanese false positives?
>
> Are there any other known issues that can cause Japanese false positives
> using SpamAssassin 3.01?
>
> Thanks for any help!
>
> Rob

Rob,
  Just a couple of obvious questions.  What are your ok_locales and 
ok_languages settings in your sa-mimedefang.cf file set to?

What rules are the Japanese emails hitting when they're tagged as false 
positives?

I'm based in Japan, recently upgraded to SA 3.01 with MD 2.49 using a 
MySQL-based Bayes database, and I've been noticing some quirkiness with 
Japanese email as well, but haven't really pinned it down yet.

alan


RE: Japanese False Postives with Spam Assassin 3.01 and RH WS 3.0

2004-12-01 Thread Johnson, Robert F


-----Original Message-----
From: alan premselaar [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 30, 2004 5:55 PM
To: Johnson, Robert F
Cc: users@spamassassin.apache.org
Subject: Re: Japanese False Postives with Spam Assassin 3.01 and RH WS 3.0

[Johnson, Robert F] 

Thanks for your reply.

I had ok_locales set to all but didn't have ok_languages explicitly set.
I think that is ok since the default value is supposed to be all.
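
The reasoning above (an unset option falls back to its default) can be sketched in a few lines of Python. This is only an illustration: effective_settings is a hypothetical helper, not part of SpamAssassin or MIMEDefang, and it assumes that both options default to "all" and that a later line in the file overrides an earlier one.

```python
# Assumed defaults, as described in the message above.
DEFAULTS = {"ok_locales": "all", "ok_languages": "all"}


def effective_settings(cf_text: str) -> dict[str, str]:
    """Return the ok_locales / ok_languages values a .cf file would yield."""
    settings = dict(DEFAULTS)
    for line in cf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        parts = line.split(None, 1)           # option name, then its value
        if len(parts) == 2 and parts[0] in DEFAULTS:
            settings[parts[0]] = parts[1].strip()
    return settings


# Only ok_locales is set explicitly, as in the configuration described above.
print(effective_settings("ok_locales all\n"))
```

Pointing the helper at the text of sa-mimedefang.cf would show whether any explicit ok_languages line is narrowing the accepted languages.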

Based on spot checking of a couple of dozen examples, I didn't see any
significant pattern of out-of-the-box rules being involved; it was mostly
SARE or WIKI rules.  The most heavily implicated were the following
(MANGLED and SARE_SUB_CASH_CHAR probably had the biggest impact):

SARE Rules
SARE_SUB_CASH_CHAR
SARE_RAND_2

WIKI Rules
MANGLED_LIST
MANGLED_LIPS
J_CHICKENPOX_12
J_CHICKENPOX_22
HTML_BACKHAIR_4

Out of the Box:
GAPPY_SUBJECT
FREE_SAMPLE
OBSCURED_EMAIL
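
A spot check like the one above can be turned into a count. The following is a minimal Python sketch, assuming the false positives were saved with an X-Spam-Status header that carries a comma-separated tests= field; a MIMEDefang filter may record hits in a different header, in which case the header name and parsing would need adjusting. The sample message is made up for illustration, and rules_hit and tally are hypothetical helper names.

```python
import re
from collections import Counter
from email import message_from_string


def rules_hit(raw_message: str) -> list[str]:
    """Return the rule names from the tests= field of X-Spam-Status, if any."""
    status = message_from_string(raw_message).get("X-Spam-Status", "")
    # A folded header continues on indented lines; rejoin the comma list.
    status = re.sub(r",\s+", ",", status)
    match = re.search(r"tests=(\S+)", status)
    if not match:
        return []
    return [name for name in match.group(1).split(",") if name and name != "none"]


def tally(raw_messages) -> Counter:
    """Count how many messages hit each rule."""
    counts = Counter()
    for raw in raw_messages:
        counts.update(set(rules_hit(raw)))
    return counts


# Made-up sample message; real input would be the saved false positives.
sample = (
    "X-Spam-Status: Yes, score=8.1 required=5.0 tests=GAPPY_SUBJECT,\n"
    "\tOBSCURED_EMAIL,SARE_SUB_CASH_CHAR autolearn=no version=3.0.1\n"
    "Subject: example\n"
    "\n"
    "body\n"
)
for rule, count in tally([sample]).most_common():
    print(rule, count)
```

Sorting by count makes it easier to see which rules are worth rescoring first.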

Rob


Re: Japanese False Postives with Spam Assassin 3.01 and RH WS 3.0

2004-12-01 Thread Daniel Quinlan
Johnson, Robert F [EMAIL PROTECTED] writes:

> Based on spot checking of a couple of dozen examples, I didn't see any
> significant pattern of out of the box rules being involved, mostly SARE
> or WIKI rules.  The most heavily implicated were the following:
> (MANGLED and SARE_SUB_CASH_CHAR probably had the biggest impact.)
>
> SARE Rules
> SARE_SUB_CASH_CHAR
> SARE_RAND_2
>
> WIKI Rules
> MANGLED_LIST
> MANGLED_LIPS
> J_CHICKENPOX_12
> J_CHICKENPOX_22
> HTML_BACKHAIR_4

The last of those is a default rule, but it has a score of almost zero.

> Out of the Box:
> GAPPY_SUBJECT
> FREE_SAMPLE
> OBSCURED_EMAIL

The problem doesn't sound like it's SpamAssassin, despite the subject
line of this email; rather, it's the third-party rulesets.

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/


Re: Japanese False Postives with Spam Assassin 3.01 and RH WS 3.0

2004-12-01 Thread alan premselaar
Daniel Quinlan wrote:
> Johnson, Robert F [EMAIL PROTECTED] writes:
>
>> Based on spot checking of a couple of dozen examples, I didn't see any
>> significant pattern of out of the box rules being involved, mostly SARE
>> or WIKI rules.  The most heavily implicated were the following:
>> (MANGLED and SARE_SUB_CASH_CHAR probably had the biggest impact.)
>>
>> SARE Rules
>> SARE_SUB_CASH_CHAR
>> SARE_RAND_2
>>
>> WIKI Rules
>> MANGLED_LIST
>> MANGLED_LIPS
>> J_CHICKENPOX_12
>> J_CHICKENPOX_22
>> HTML_BACKHAIR_4
>
> The last of those is a default rule, but it has a score of almost zero.
>
>> Out of the Box:
>> GAPPY_SUBJECT
>> FREE_SAMPLE
>> OBSCURED_EMAIL
>
> The problem doesn't sound like it's SpamAssassin, despite the subject
> line of this email; rather, it's the third-party rulesets.
>
> Daniel

I hit GAPPY_SUBJECT and OBSCURED_EMAIL *A LOT* ... and I don't have any 
3rd-party rulesets installed.

As a side note, I've recently been trying to update the 
JAPAN_UCE_SUBJECT rule, as there's another phrase that's being used 
recently, and for some reason it hasn't been triggering.

I think part of the problem is that I have to enter it in the ISO-2022-JP 
charset and it contains at least 2 escape(d) characters, so the regex 
might not be accurate. (Still working on that.)
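
The escaping concern can be illustrated with a short Python sketch. The phrase below is a stand-in chosen for illustration, not the phrase under investigation, and SpamAssassin rules use Perl regular expressions, so this only demonstrates the byte-level issue rather than a working rule.

```python
import re

# Stand-in phrase for illustration only.
phrase = "日本語"

# ISO-2022-JP wraps the two-byte text in escape sequences: ESC $ B switches
# to JIS X 0208 and ESC ( B switches back to ASCII.
encoded = phrase.encode("iso2022_jp")
print(encoded)

# Everything after each ESC byte is printable ASCII, so the encoded text can
# contain bytes that a regex engine treats as syntax ('$', '(', '|', '\').
try:
    re.compile(encoded)
    compiled_raw = True
except re.error:
    compiled_raw = False
print("raw bytes compile as a regex:", compiled_raw)

# Escaping every special byte yields a pattern that matches the literal text.
pattern = re.compile(re.escape(encoded))
print(bool(pattern.search(b"prefix " + encoded)))
```

The same idea applies when writing the pattern by hand: each metacharacter in the encoded bytes needs a backslash, and the ESC byte itself is usually written as a hex escape such as \x1b.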

alan


Re[2]: Japanese False Postives with Spam Assassin 3.01 and RH WS 3.0

2004-12-01 Thread Robert Menschel
Hello Robert,

Tuesday, November 30, 2004, 9:25:52 PM, Daniel wrote:

DQ> The problem doesn't sound like it's SpamAssassin despite the subject
DQ> line of this email, rather it's third-party rulesets.

I agree.

DQ> Johnson, Robert F [EMAIL PROTECTED] writes:

>> Based on spot checking of a couple of dozen examples, I didn't see any
>> significant pattern of out of the box rules being involved, mostly SARE
>> or WIKI rules.  The most heavily implicated were the following:
>> (MANGLED and SARE_SUB_CASH_CHAR probably had the biggest impact.)
>>
>> SARE Rules
>> SARE_SUB_CASH_CHAR
>> SARE_RAND_2

Can you email me a couple of examples that hit these rules,
preferably in a zip or gz file? I maintain the Subject rules file for
SARE, and would like to refine/rescore SARE_SUB_CASH_CHAR to help
avoid your FPs. I'll also forward the info to the SARE ninja who
maintains our Random rules file.

>> WIKI Rules
>> MANGLED_LIST
>> MANGLED_LIPS
>> J_CHICKENPOX_12
>> J_CHICKENPOX_22

All of these are language-related rules, which work well in English,
might be subject to an occasional misfire in a non-English Western
European language, and can readily misfire in any
non-Latin/non-Romance language. If you regularly get non-spam in
Japanese, you should probably drop the entire MANGLED and CHICKENPOX
families. If you're using Tripwire, you should drop that also since it
too can misfire on Japanese non-spam.
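
If removing the ruleset files outright is not convenient, individual rules can also be neutralised by giving them a score of 0 in a local .cf file, which SpamAssassin treats as disabling the rule. The Python sketch below generates such lines for rule names you have actually seen firing. zero_score_lines is a hypothetical helper, and the prefixes are inferred from the rule names quoted in this thread; the real families may contain rules with other names.

```python
def zero_score_lines(rule_names, prefixes=("MANGLED_", "J_CHICKENPOX_")):
    """Return 'score NAME 0' lines for each rule whose name starts with a prefix."""
    selected = sorted({name for name in rule_names if name.startswith(prefixes)})
    return [f"score {name} 0" for name in selected]


# Rule names taken from the false positives reported in this thread.
observed = [
    "MANGLED_LIST",
    "MANGLED_LIPS",
    "J_CHICKENPOX_12",
    "J_CHICKENPOX_22",
    "GAPPY_SUBJECT",
]
for line in zero_score_lines(observed):
    print(line)
```

The printed lines can be pasted into a site-wide configuration file; rules outside the chosen prefixes, such as GAPPY_SUBJECT here, are left untouched.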

Bob Menschel





Re[2]: Japanese False Postives with Spam Assassin 3.01 and RH WS 3.0

2004-12-01 Thread Johnson, Robert F
-----Original Message-----
From: Robert Menschel [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 01, 2004 9:56 AM
To: Johnson, Robert F; users@spamassassin.apache.org
Subject: Re[2]: Japanese False Postives with Spam Assassin 3.01 and RH WS 3.0

[Johnson, Robert F] Bob,

Thanks for the reply.

I will try to get some examples for your analysis.  I may have to attempt
a repro of the issue.  I will let you know soon.

Could the SARE team provide a guideline regarding the best SARE and WIKI
rulesets to work in an environment that supports the following
languages?

Japanese, Korean, traditional and simplified Chinese, English, assorted
European.

Maybe some sort of a local language compatibility matrix would be useful
to many users.  I would be happy to help put that together in any way I
could.

Regards,

Rob



Re[3]: Japanese False Postives with Spam Assassin 3.01 and RH WS 3.0

2004-12-01 Thread Robert Menschel
Hello Robert,

Wednesday, December 1, 2004, 10:29:12 AM, you wrote:

JRF> Could the SARE team provide a guideline regarding the best SARE
JRF> and WIKI rulesets to work in an environment that supports the
JRF> following languages?

Unfortunately, the SARE team is heavily North American, and as such
has not yet been able to develop good rules for non-European
languages.

The best we've done is to break out a few of our English-specific
rules into *_eng.cf files, so people who need to receive non-English
emails don't get burned by them.

That move isn't perfect, and we may need to add SARE_SUB_CASH_CHAR
to that list if we can't fix your problem with that rule.

Bob Menschel