Re: I am getting all external domain emails subject tagged as SpamSpam

2009-10-01 Thread empiric

more logs

Oct  1 13:22:20 mail amavis[17226]: (17226-02) LMTP RCPT
TO:u...@example.com ORCPT=rfc822;u...@example.com\r\n
Oct  1 13:22:20 mail amavis[17226]: (17226-02) LMTP 250 2.1.5 Recipient
u...@example.com OK
Oct  1 13:22:20 mail amavis[17226]: (17226-02) LMTP::10024
/var/lib/amavis/tmp/amavis-20091001T131825-17226:
mohsinaliz...@hotmail.com - moh...@example.com,u...@example.com
SIZE=1911 Received: from mail.example.com ([127.0.0.1]) by localhost
(mail.example.com [127.0.0.1]) (amavisd-new, port 10024) with LMTP; Thu,  1
Oct 2009 13:22:20 +0600 (PKST)
Oct  1 13:22:20 mail amavis[17226]: (17226-02) Checking: k-6-c3dQQGNL
mohsinaliz...@hotmail.com - moh...@example.com,u...@example.com
Oct  1 13:22:20 mail amavis[17226]: (17226-02) query_keys: u...@example.com,
user@, example.com, .example.com, .com.pk, .pk, .
Oct  1 13:22:20 mail amavis[17226]: (17226-02)
lookup_hash(u...@example.com), no matches
Oct  1 13:22:20 mail amavis[17226]: (17226-02) lookup (bypass_virus_checks)
= undef, u...@example.com does not match
Oct  1 13:22:20 mail amavis[17226]: (17226-02) lookup (bypass_header_checks)
= true,  u...@example.com matches, result=1,
matching_key=(constant:1)
Oct  1 13:22:20 mail amavis[17226]: (17226-02) query_keys: u...@example.com,
user@, example.com, .example.com, .com.pk, .pk, .
Oct  1 13:22:20 mail amavis[17226]: (17226-02)
lookup_hash(u...@example.com), no matches
Oct  1 13:22:20 mail amavis[17226]: (17226-02) lookup (bypass_banned_checks)
= undef, u...@example.com does not match
Oct  1 13:22:20 mail amavis[17226]: (17226-02) lookup (banned_filename), 1
matches for u...@example.com, results: (constant:DEFAULT)=DEFAULT
Oct  1 13:22:20 mail amavis[17226]: (17226-02) collect banned table[0]:
u...@example.com, tables: DEFAULT=Amavis::Lookup::RE=ARRAY(0x8c680e8)
Oct  1 13:22:20 mail amavis[17226]: (17226-02) skip banned check for
u...@example.com, same tables as previous, result =
Oct  1 13:22:20 mail amavis[17226]: (17226-02) p.path u...@example.com:
P=p003,L=1,M=multipart/alternative | P=p001,L=1/1,M=text/plain,T=txt
Oct  1 13:22:20 mail amavis[17226]: (17226-02) skip banned check for
u...@example.com, same tables as previous, result =
Oct  1 13:22:20 mail amavis[17226]: (17226-02) p.path u...@example.com:
P=p003,L=1,M=multipart/alternative | P=p002,L=1/2,M=text/html,T=html
Oct  1 13:22:31 mail amavis[17226]: (17226-02) query_keys: u...@example.com,
user@, example.com, .example.com, .com.pk, .pk, .
Oct  1 13:22:31 mail amavis[17226]: (17226-02)
lookup_hash(u...@example.com), no matches
Oct  1 13:22:31 mail amavis[17226]: (17226-02) lookup (bypass_virus_checks)
= undef, u...@example.com does not match
Oct  1 13:22:31 mail amavis[17226]: (17226-02) lookup (spam_tag2_level) =
true,  u...@example.com matches, result=4.31,
matching_key=(constant:4.31)
Oct  1 13:22:31 mail amavis[17226]: (17226-02) lookup (spam_tag3_level) =
undef, u...@example.com does not match
Oct  1 13:22:31 mail amavis[17226]: (17226-02) lookup (spam_kill_level) =
true,  u...@example.com matches, result=4.31,
matching_key=(constant:4.31)
Oct  1 13:22:31 mail amavis[17226]: (17226-02) lookup (bypass_spam_checks)
= true,  u...@example.com matches, result=1,
matching_key=(constant:1)
Oct  1 13:22:31 mail amavis[17226]: (17226-02) final_destiny PASS, recip
u...@example.com
Oct  1 13:22:31 mail amavis[17226]: (17226-02) lookup (clean_quarantine_to)
= true,  u...@example.com matches, result=clean-quarantine,
matching_key=(constant:clean-quarantine)
Oct  1 13:22:31 mail amavis[17226]: (17226-02) lookup = undef,
u...@example.com, no lookup tables
Oct  1 13:22:31 mail amavis[17226]: (17226-02) query_keys: u...@example.com,
user@, example.com, .example.com, .com.pk, .pk, .
Oct  1 13:22:31 mail amavis[17226]: (17226-02)
lookup_hash(u...@example.com), no matches
Oct  1 13:22:31 mail amavis[17226]: (17226-02) lookup_acl(u...@example.com)
matches key example.com, result=1
Oct  1 13:22:31 mail amavis[17226]: (17226-02) lookup (local_domains) =
true,  u...@example.com matches, result=1, matching_key=example.com
Oct  1 13:22:31 mail amavis[17226]: (17226-02) headers CLUSTERING:
u...@example.com joining cluster
Oct  1 13:22:31 mail amavis[17226]: (17226-02) (about to connect to
[127.0.0.1]:10025) FWD via SMTP: mohsinaliz...@hotmail.com -
moh...@example.com,u...@example.com
Oct  1 13:22:31 mail amavis[17226]: (17226-02) sending RCPT
TO:u...@example.com
Oct  1 13:22:31 mail amavis[17226]: (17226-02) response to RCPT TO for
u...@example.com: 250 2.1.5 Ok
Oct  1 13:22:32 mail amavis[17226]: (17226-02) FWD via SMTP:
mohsinaliz...@hotmail.com - moh...@example.com,u...@example.com, 250
2.6.0 Ok, id=17226-02, from MTA([127.0.0.1]:10025): 250 2.0.0 Ok: queued as
E0EAD19B349
Oct  1 13:22:32 mail amavis[17226]: (17226-02) dsn: from MTA 250 Clean
mohsinaliz...@hotmail.com - u...@example.com: on_succ=0, on_dly=1,
on_fail=1, never=0, warn_sender=, DSN_passed_on=1
Oct  1 13:22:32 mail amavis[17226]: (17226-02) DSN: SUCC from MTA 250 Clean,
no DSN requested: 
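
For anyone skimming the debug output above, the per-recipient lookup results can be pulled out with a short script. This is only a minimal Python sketch and not part of amavisd-new: the function name `summarize_lookups` and the regular expressions are assumptions based solely on the layout of this excerpt, and the sample text is copied from the log above.

```python
import re

# Each amavis entry in the excerpt carries a "(pid-seq)" marker such as "(17226-02)".
ENTRY_SPLIT = re.compile(r"\(\d+-\d+\)\s")
# "lookup (name) = status" as it appears in the excerpt (layout is an assumption).
LOOKUP = re.compile(r"lookup \((\w+)\)\s*=\s*(\w+)")
RESULT = re.compile(r"result=([^,\s]+)")


def summarize_lookups(log_text):
    """Return {lookup_name: result} for wrapped amavis debug output."""
    # The mail client wrapped the log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    summary = {}
    for entry in ENTRY_SPLIT.split(flat):
        match = LOOKUP.search(entry)
        if not match:
            continue
        name, status = match.groups()
        result = RESULT.search(entry)
        # Fall back to the bare status ("undef") when no result= field is logged.
        summary[name] = result.group(1) if result else status
    return summary


sample = """
Oct  1 13:22:31 mail amavis[17226]: (17226-02) lookup (bypass_virus_checks)
= undef, u...@example.com does not match
Oct  1 13:22:31 mail amavis[17226]: (17226-02) lookup (spam_tag2_level) =
true,  u...@example.com matches, result=4.31,
matching_key=(constant:4.31)
Oct  1 13:22:31 mail amavis[17226]: (17226-02) lookup (bypass_spam_checks)
= true,  u...@example.com matches, result=1,
matching_key=(constant:1)
"""

print(summarize_lookups(sample))
```

Run against the full log, this would list each lookup name next to the value amavis resolved for the recipient, which makes settings such as spam_tag2_level easier to spot.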

Re: SA 3.3.0 and sa-compile

2009-10-01 Thread Zdenek Herman
I have same problem.
Any solution ?

Regards

Zdenek Herman
zdenek.her...@ille.cz
tel: 777 730 218
http://www.cistaposta.cz



to...@starbridge.org napsal(a):

 Hi,
 I'm running SA 3.3.0 (3.3.0-alpha3-r808953) and I have a problem with
 compiled rules.

 sa-compile runs without errors, and SA seems to work fine when restarted.
 But some body rules are no longer detected.

 Example of a simple body rule (for testing):

 body     TONIO_SPAM_TEST /toniospam/i
 describe TONIO_SPAM_TEST Mentions Generic toniospamtest
 score    TONIO_SPAM_TEST 5

 If I comment out
 loadplugin Mail::SpamAssassin::Plugin::Rule2XSBody
 in v320.pre, body rules work again.

 I've tested with SA 3.2.5 and it works fine with Rule2XSBody active.
 I've tried deleting the compiled rules and compiling again: same result.

 Some info on my environment:
 debian testing
 perl v5.10.0
 xsubpp version 2.200401 (from debian perl package)
 re2c version 0.13.5-1

 Thanks for your help
 Regards
 Tonio

 NB: sorry for this second post, but i've made a mistake with the
 previous one (replying to  an other thread)



   


Re: .cn Oddity

2009-10-01 Thread John Hardin

On Thu, 1 Oct 2009, Warren Togami wrote:


uri T_CN_URL  /[^\/]+\.cn(?:$|\/|\?)/i
describe T_CN_URL Contains a URL in the .cn domain

uri T_CN_8_URL  /[\/.]+\w{8}\.cn(?:$|\/|\?)/i
describe T_CN_8_URL Contains a URL in the .cn domain of exactly 8 characters long


http://ruleqa.spamassassin.org/20090930-r820211-n/T_CN_URL/detail
Last night's masscheck.  63243 out of 124241 spam hits T_CN_URL, nearly 51%.

7263 T_CN_URL hits in 15517 spam corpus
7200 T_CN_8_URL hits in 15517 spam corpus

Does this make any sense?  This is funny.  Could someone add this rule to the 
sandbox?  I'm just curious.


I note that neither is anchored at the beginning of the URI, so they may 
be hitting on .cn embedded somewhere within the path part.


That doesn't explain 51%, though.
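
The anchoring concern can be checked outside SpamAssassin. What follows is a minimal Python sketch, assuming Python's `re` module treats these simple patterns the same way Perl does; the sample URIs and the helper `hits` are made up for illustration and do not come from any corpus.

```python
import re

# Direct transcriptions of the two rules quoted above (Perl /.../i -> re.IGNORECASE).
T_CN_URL = re.compile(r"[^/]+\.cn(?:$|/|\?)", re.IGNORECASE)
T_CN_8_URL = re.compile(r"[/.]+\w{8}\.cn(?:$|/|\?)", re.IGNORECASE)


def hits(pattern, uri):
    """True if the unanchored pattern matches anywhere in the URI."""
    return pattern.search(uri) is not None


# Hypothetical URIs: a real .cn host, a .com host with ".cn" only in the path,
# and a URI with no ".cn" at all.
samples = [
    "http://abcdefgh.cn/",
    "http://example.com/abcdefgh.cn",
    "http://example.com/page",
]

for uri in samples:
    print(uri, hits(T_CN_URL, uri), hits(T_CN_8_URL, uri))
```

Both patterns hit the second URI even though its host is in .com, which is the path-part match described above.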

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Therapeutic Phrenologist - send email for affordable rate schedule.
---
 Approximately 9051420 firearms legally purchased in the U.S. this year


Re: SA 3.3.0 and sa-compile

2009-10-01 Thread John Hardin

On Thu, 1 Oct 2009, Zdenek Herman wrote:


I have same problem.
Any solution ?

to...@starbridge.org napsal(a):


i'm running SA 3.3.0 (3.3.0-alpha3-r808953) and i've some problem with
compiled rules.

sa-compile runs without errors, and SA seems to works fine when 
restarted. But some body rules are now not detected.


A suggestion to both of you, based on sa-compile support requests seen 
earlier on the list: run sa-compile with the debug option turned on, 
publish the debugging output and intermediate files on a webserver 
somewhere, and post the URIs for that info here so they can be examined.




Re: SA 3.3.0 and sa-compile

2009-10-01 Thread Justin Mason
On Thu, Oct 1, 2009 at 16:15, John Hardin jhar...@impsec.org wrote:
 On Thu, 1 Oct 2009, Zdenek Herman wrote:

 I have same problem.
 Any solution ?

 to...@starbridge.org napsal(a):

 i'm running SA 3.3.0 (3.3.0-alpha3-r808953) and i've some problem with
 compiled rules.

 sa-compile runs without errors, and SA seems to works fine when
 restarted. But some body rules are now not detected.

 A suggestion to both of you, based on sa-compile support requests seen
 earlier on the list: run sa-compile with the debug option turned on, publish
 the debugging output and intermediate files on a webserver somewhere, and
 post the URIs for that info here so they can be examined.

even better: open a Bugzilla entry and do the same.  That's how we
track (possible) bugs and prioritize them.

-- 
--j.


Re: SA 3.3.0 and sa-compile

2009-10-01 Thread John Hardin

On Thu, 1 Oct 2009, Justin Mason wrote:


On Thu, Oct 1, 2009 at 16:15, John Hardin jhar...@impsec.org wrote:

On Thu, 1 Oct 2009, Zdenek Herman wrote:


I have same problem.
Any solution ?

to...@starbridge.org napsal(a):


i'm running SA 3.3.0 (3.3.0-alpha3-r808953) and i've some problem with
compiled rules.

sa-compile runs without errors, and SA seems to works fine when
restarted. But some body rules are now not detected.


A suggestion to both of you, based on sa-compile support requests seen
earlier on the list: run sa-compile with the debug option turned on, publish
the debugging output and intermediate files on a webserver somewhere, and
post the URIs for that info here so they can be examined.


even better: open a Bugzilla entry and do the same.  That's how we
track (possible) bugs and prioritize them.


And the bugzilla entry could have the logs as attachments. I wasn't sure 
if it was appropriate to open a bug yet, but if Justin suggests it then I 
guess it is...




Re: .cn Oddity

2009-10-01 Thread Ned Slider

John Hardin wrote:

On Thu, 1 Oct 2009, Warren Togami wrote:


uri T_CN_URL  /[^\/]+\.cn(?:$|\/|\?)/i
describe T_CN_URL Contains a URL in the .cn domain

uri T_CN_8_URL  /[\/.]+\w{8}\.cn(?:$|\/|\?)/i
describe T_CN_8_URL Contains a URL in the .cn domain of exactly 8 characters long


http://ruleqa.spamassassin.org/20090930-r820211-n/T_CN_URL/detail
Last night's masscheck.  63243 out of 124241 spam hits T_CN_URL, 
nearly 51%.


7263 T_CN_URL hits in 15517 spam corpus
7200 T_CN_8_URL hits in 15517 spam corpus

Does this make any sense?  This is funny.  Could someone add this rule 
to the sandbox?  I'm just curious.


I note that neither is anchored at the beginning of the URI, so they may 
be hitting on .cn embedded somewhere within the path part.


That doesn't explain 51%, though.



I run my own custom .cn tld URI rule, and whilst it's right down in 
percentage terms atm, in the past it has certainly hit on around 50% 
plus of all spam containing a URI. So depending on the corpus, I'm not 
surprised by the 51%.


uri LOCAL_URI_CN m{https?://.{1,40}\.cn\b}
describe LOCAL_URI_CN contains link to Chinese tld



Re: SA 3.3.0 and sa-compile

2009-10-01 Thread to...@starbridge.org

Justin Mason a écrit :
 On Thu, Oct 1, 2009 at 16:15, John Hardin jhar...@impsec.org wrote:
 On Thu, 1 Oct 2009, Zdenek Herman wrote:

 I have same problem.
 Any solution ?

 to...@starbridge.org napsal(a):

 i'm running SA 3.3.0 (3.3.0-alpha3-r808953) and i've some problem with
 compiled rules.

 sa-compile runs without errors, and SA seems to works fine when
 restarted. But some body rules are now not detected.
 A suggestion to both of you, based on sa-compile support requests seen
 earlier on the list: run sa-compile with the debug option turned on, publish
 the debugging output and intermediate files on a webserver somewhere, and
 post the URIs for that info here so they can be examined.

 even better: open a Bugzilla entry and do the same.  That's how we
 track (possible) bugs and prioritize them.

Thanks for your answers.
It's done:
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6214



Re: .cn Oddity

2009-10-01 Thread John Hardin

On Thu, 1 Oct 2009, Ned Slider wrote:


John Hardin wrote:

 On Thu, 1 Oct 2009, Warren Togami wrote:

  uri T_CN_URL  /[^\/]+\.cn(?:$|\/|\?)/i
  describe T_CN_URL Contains a URL in the .cn domain
 
  uri T_CN_8_URL  /[\/.]+\w{8}\.cn(?:$|\/|\?)/i
  describe T_CN_8_URL Contains a URL in the .cn domain of exactly 8 characters long
 
  http://ruleqa.spamassassin.org/20090930-r820211-n/T_CN_URL/detail
  Last night's masscheck.  63243 out of 124241 spam hits T_CN_URL, nearly 
  51%.
 
  7263 T_CN_URL hits in 15517 spam corpus

  7200 T_CN_8_URL hits in 15517 spam corpus
 
  Does this make any sense?  This is funny.  Could someone add this rule 
  to the sandbox?  I'm just curious.


 I note that neither is anchored at the beginning of the URI, so they may
 be hitting on .cn embedded somewhere within the path part.

 That doesn't explain 51%, though.


I run my own custom .cn tld URI rule, and whilst it's right down in 
percentage terms atm, in the past it has certainly hit on around 50% plus of 
all spam containing a URI. So depending on the corpus, I'm not surprised by 
the 51%.


uri LOCAL_URI_CN m{https?://.{1,40}\.cn\b}
describe LOCAL_URI_CN contains link to Chinese tld


Yours may still hit .cn in the path part. May I suggest:

  m;^https?://[^/?]+\.cn\b;
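
To illustrate the difference, here is a small Python sketch comparing the two patterns. It assumes Python's `re` module is close enough to Perl for these expressions; `ANCHORED`, `compare`, and the sample URIs are hypothetical names and data chosen for the example.

```python
import re

# Ned's rule as posted: the ".{1,40}" may run across "/" into the path.
LOCAL_URI_CN = re.compile(r"https?://.{1,40}\.cn\b")
# The suggested replacement: anchored at the start, host part only.
ANCHORED = re.compile(r"^https?://[^/?]+\.cn\b")


def compare(uri):
    """Return (original_hits, anchored_hits) for one URI."""
    return (
        LOCAL_URI_CN.search(uri) is not None,
        ANCHORED.search(uri) is not None,
    )


print(compare("http://abcdefgh.cn/x"))          # host really is in .cn
print(compare("http://example.com/a.cn"))       # ".cn" only in the path
print(compare("http://example.com/?r=foo.cn"))  # ".cn" only in the query
```

Only the first URI matches the anchored form; the original pattern matches all three.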

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  If healthcare is a Right means that the government is obligated
  to provide the people with hospitals, physicians, treatments and
  medications at low or no cost, then the right to free speech means
  the government is obligated to provide the people with printing
  presses and public address systems, the right to freedom of
  religion means the government is obligated to build churches for the
  people, and the right to keep and bear arms means the government is
  obligated to provide the people with guns, all at low or no cost.
---
 Approximately 9052800 firearms legally purchased in the U.S. this year


Re: .cn Oddity

2009-10-01 Thread Benny Pedersen

On tor 01 okt 2009 18:26:01 CEST, John Hardin wrote

m;^https?://[^/?]+\.cn\b;


replace ; with / no ?

m/\bhttps?://[^/?]+\.cn\b/i

--
xpoint



Re: .cn Oddity

2009-10-01 Thread jdow

From: John Hardin jhar...@impsec.org
Sent: Thursday, 2009/October/01 09:26



On Thu, 1 Oct 2009, Ned Slider wrote:


John Hardin wrote:

 On Thu, 1 Oct 2009, Warren Togami wrote:

  uri T_CN_URL  /[^\/]+\.cn(?:$|\/|\?)/i
  describe T_CN_URL Contains a URL in the .cn domain

  uri T_CN_8_URL  /[\/.]+\w{8}\.cn(?:$|\/|\?)/i
  describe T_CN_8_URL Contains a URL in the .cn domain of exactly 8 characters long


  http://ruleqa.spamassassin.org/20090930-r820211-n/T_CN_URL/detail
  Last night's masscheck.  63243 out of 124241 spam hits T_CN_URL, 
 nearly 51%.


  7263 T_CN_URL hits in 15517 spam corpus
  7200 T_CN_8_URL hits in 15517 spam corpus

  Does this make any sense?  This is funny.  Could someone add this 
 rule to the sandbox?  I'm just curious.


 I note that neither is anchored at the beginning of the URI, so they may
 be hitting on .cn embedded somewhere within the path part.

 That doesn't explain 51%, though.


I run my own custom .cn tld URI rule, and whilst it's right down in 
percentage terms atm, in the past it has certainly hit on around 50% plus 
of all spam containing a URI. So depending on the corpus, I'm not 
surprised by the 51%.


uri LOCAL_URI_CN m{https?://.{1,40}\.cn\b}
describe LOCAL_URI_CN contains link to Chinese tld


Yours may still hit .cn in the path part. May I suggest:

  m;^https?://[^/?]+\.cn\b;


Regardless of their correctness, would you care to expound on the success
of these two rules, John? I like what works not political correctness. I
think these are two interesting observations. Of course, they won't work
very well for somebody doing business with China or embedded within the
.cn TLD.

{^_-} 



Re: .cn Oddity

2009-10-01 Thread John Hardin

On Thu, 1 Oct 2009, Benny Pedersen wrote:


On tor 01 okt 2009 18:26:01 CEST, John Hardin wrote

m;^https?://[^/?]+\.cn\b;


replace ; with / no ?

m/\bhttps?://[^/?]+\.cn\b/i


No. The point to m; is so that you can embed / in the RE without escaping 
them. You are changing the RE delimiters.


m{...} is fine _if_ you don't use {m,n} syntax, in which case it becomes 
confusing.




Re: SA 3.3.0 and sa-compile

2009-10-01 Thread Benny Pedersen

On tor 01 okt 2009 18:09:38 CEST, to...@starbridge.org wrote

thank for your answers.
It's done:
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6214


also

spamassassin -D -t < msg > output.log 2>&1
and another time with the plugin disabled shows it work (this time  
with output.log)


add output.log to the ticket

--
xpoint



Re: Understanding the hostKarma Lists

2009-10-01 Thread jdow

From: Marc Perkel m...@perkel.com
Sent: Wednesday, 2009/September/30 16:41





Blaine Fleming wrote: 
Marc Perkel wrote:

 I like it.

RCVD_IN_HOSTKARMA_BL
RCVD_IN_HOSTKARMA_WL
RCVD_IN_HOSTKARMA_YL
RCVD_IN_HOSTKARMA_BR

Let's go with it.
   
Marc, have you updated your wiki to reflect the new rules?  I think that

will pretty well settle any debate or question people have.

--Blaine

 
Yes - the wiki is updated.




I installed it on my personal mail for testing, Marc. I forwarded an
email that failed within minutes of installing it. The bozo was in the
whitelist and hit quite a few rules including a 5.0001 Bayes 99. It
still got through with a 4.9 total because of the bogus whitelist
rule hit and its bogus score. Whitelists aren't is my rule.

{^_^}


Re: Understanding the hostKarma Lists

2009-10-01 Thread Warren Togami

On 10/01/2009 12:42 PM, jdow wrote:

From: Marc Perkel m...@perkel.com
Sent: Wednesday, 2009/September/30 16:41





Blaine Fleming wrote: Marc Perkel wrote:
I like it.

RCVD_IN_HOSTKARMA_BL
RCVD_IN_HOSTKARMA_WL
RCVD_IN_HOSTKARMA_YL
RCVD_IN_HOSTKARMA_BR

Let's go with it.
Marc, have you updated your wiki to reflect the new rules? I think that
will pretty well settle any debate or question people have.

--Blaine


Yes - the wiki is updated.



I installed it on my personal mail for testing, Marc. I forwarded an
email that failed within minutes of installing it. The bozo was in the
whitelist and hit quite a few rules including a 5.0001 Bayes 99. It
still got through with a 4.9 total because of the bogus whitelist
rule hit and its bogus score. Whitelists aren't is my rule.

{^_^}


spamassassin's default scores do not give big negative scores to any of 
the whitelist rules for a good reason.  They are mainly informational.


Warren


Re: .cn Oddity

2009-10-01 Thread John Hardin

On Thu, 1 Oct 2009, jdow wrote:


From: John Hardin jhar...@impsec.org


 Yours may still hit .cn in the path part. May I suggest:

   m;^https?://[^/?]+\.cn\b;


Regardless of their correctness, would you care to expound on the success
of these two rules, John? I like what works not political correctness.
I think these are two interesting observations. Of course, they won't 
work very well for somebody doing business with China or embedded within 
the .cn TLD.


what works is based on the accuracy of the corpora. If the corpora show 
lots of spam with .cn TLD URIs and little or no ham with such, then that 
rule will hit often, and have a good S/O, and get a high score.
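
As a rough illustration of the S/O idea, here is a Python sketch. It assumes S/O is the spam hit rate divided by the sum of the spam and ham hit rates, which is one common way to define it and may not match exactly what the masscheck tools compute. The spam counts are the ones quoted earlier in this thread; the ham counts are invented, since none were posted.

```python
def so_ratio(spam_hits, spam_total, ham_hits, ham_total):
    """Spam/overall ratio from hit rates (assumed definition, see above)."""
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    if spam_rate + ham_rate == 0:
        return None  # the rule never fired, so the ratio is undefined
    return spam_rate / (spam_rate + ham_rate)


# Spam figures from the masscheck quoted in the thread; ham figures are
# hypothetical placeholders.
example = so_ratio(spam_hits=63243, spam_total=124241,
                   ham_hits=50, ham_total=100000)
print(round(example, 3))
```

A value near 1.0 means the rule fires almost only on spam; a value near 0.5 means it does not separate spam from ham at all.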


I too am surprised that .cn TLDs appear in 51% of the spam corpus but I 
haven't looked into it in any detail. I can certainly check it against my 
own corpora and see if it's reasonable - but then again, I don't do any 
business with anyone in china, and I _do_ get a fair amount of bulk emails 
from manufacturers in china purportedly looking for business partners.




Re: .cn Oddity

2009-10-01 Thread Warren Togami

On 10/01/2009 01:05 PM, John Hardin wrote:

On Thu, 1 Oct 2009, jdow wrote:


From: John Hardin jhar...@impsec.org


Yours may still hit .cn in the path part. May I suggest:

m;^https?://[^/?]+\.cn\b;


Regardless of their correctness, would you care to expound on the success
of these two rules, John? I like what works not political correctness.
I think these are two interesting observations. Of course, they won't
work very well for somebody doing business with China or embedded
within the .cn TLD.


what works is based on the accuracy of the corpora. If the corpora
show lots of spam with .cn TLD URIs and little or no ham with such, then
that rule will hit often, and have a good S/O, and get a high score.

I too am surprised that .cn TLDs appear in 51% of the spam corpus but I
haven't looked into it in any detail. I can certainly check it against
my own corpora and see if it's reasonable - but then again, I don't do
any business with anyone in china, and I _do_ get a fair amount of bulk
emails from manufacturers in china purportedly looking for business
partners.



The Oddity I was pointing out at the beginning of the thread is not 
prevalence of .cn URI's, but rather most of them appear to be exactly 8 
characters long.  Could someone please commit my T_CN_8_URL rule to the 
sandbox so we can see if that trend holds beyond my own corpora?
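
For reference, this is how the length restriction in T_CN_8_URL behaves, sketched in Python on the assumption that `re` matches this pattern the way Perl does. The helper name `cn_label_is_8` and the sample URIs are made up for the example.

```python
import re

# T_CN_8_URL as posted: a "/" or "." followed by exactly 8 word characters and ".cn".
T_CN_8_URL = re.compile(r"[/.]+\w{8}\.cn(?:$|/|\?)", re.IGNORECASE)


def cn_label_is_8(uri):
    """True if the rule would fire on this URI."""
    return T_CN_8_URL.search(uri) is not None


for uri in (
    "http://abcdefgh.cn/",     # 8-character label
    "http://www.abcdefgh.cn",  # 8-character label behind a subdomain
    "http://abcdefghi.cn/",    # 9 characters
    "http://abcdefg.cn/",      # 7 characters
):
    print(uri, cn_label_is_8(uri))
```

Because the label has to start right after a "/" or ".", only labels of exactly 8 word characters directly before ".cn" match; hyphenated names never match because `\w` excludes "-".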


Warren


Re: .cn Oddity

2009-10-01 Thread Warren Togami

On 10/01/2009 01:16 PM, Warren Togami wrote:

On 10/01/2009 01:05 PM, John Hardin wrote:

On Thu, 1 Oct 2009, jdow wrote:


From: John Hardin jhar...@impsec.org


Yours may still hit .cn in the path part. May I suggest:

m;^https?://[^/?]+\.cn\b;


Regardless of their correctness, would you care to expound on the
success
of these two rules, John? I like what works not political correctness.
I think these are two interesting observations. Of course, they won't
work very well for somebody doing business with China or embedded
within the .cn TLD.


what works is based on the accuracy of the corpora. If the corpora
show lots of spam with .cn TLD URIs and little or no ham with such, then
that rule will hit often, and have a good S/O, and get a high score.

I too am surprised that .cn TLDs appear in 51% of the spam corpus but I
haven't looked into it in any detail. I can certainly check it against
my own corpora and see if it's reasonable - but then again, I don't do
any business with anyone in china, and I _do_ get a fair amount of bulk
emails from manufacturers in china purportedly looking for business
partners.



The Oddity I was pointing out at the beginning of the thread is not
prevalence of .cn URI's, but rather most of them appear to be exactly 8
characters long. Could someone please commit my T_CN_8_URL rule to the
sandbox so we can see if that trend holds beyond my own corpora?

Warren


(And yes I'm fully aware even this narrowed rule is prejudiced and 
unsafe.  This is partly out of curiosity, and also wondering if it 
could be made useful if meta booleaned with something else.)


Warren


Re: Hostkarma: to be or not to be in SA defaults

2009-10-01 Thread Marc Perkel



SM wrote:

Hi Marc,
At 09:32 30-09-2009, Marc Perkel wrote:
I have a lot of mighty servers set up and have servers at 4 locations. 
I have 50mb bought and using about 30 of it now. I am not sure what 
it takes to support a default SA inclusion. Does anyone know if what 
I described sounds like it is enough?


They can still be a soft target.  Most of the DNSBLs were unprepared 
to deal with denial of service attacks.  Some of them have closed down 
after an attack.  That can be a problem for users as most people have 
a configure and forget setup or it's a default vendor setup.


The bandwidth may be enough for current usage.  The more mirrors you 
have, the better.  If your DNSBL is effective, you might be able to 
get help with that.  The problems with your setup are no worse than 
those of other resources that are commonly used by users from this mailing list.


Someone pointed out that it's not a good idea to do more DNS lookups 
as it affects the performance of SpamAssassin.  It does not matter 
whether your DNSBL is included in the default configuration as people 
will use it if they believe that it is effective in stopping spam.  If 
you are concerned about marketing, then it may matter to you. :-)


Regards,
-sm



I guess that if HOSTKARMA were included in the default build then I will 
need more mirrors to handle the load.




Re: Understanding the hostKarma Lists

2009-10-01 Thread Marc Perkel




Updated that as well.

R-Elists wrote:

  
  
  marc
  
  dont forget this one
  
  http://wiki.apache.org/spamassassin/MarcPerkelsExperiments
  
  - rh
  
  

 From:
Marc Perkel [mailto:m...@perkel.com]
snip

Yes - the wiki is updated.

  





Re: .cn Oddity

2009-10-01 Thread Ned Slider

Warren Togami wrote:

On 10/01/2009 01:05 PM, John Hardin wrote:

On Thu, 1 Oct 2009, jdow wrote:


From: John Hardin jhar...@impsec.org


Yours may still hit .cn in the path part. May I suggest:

m;^https?://[^/?]+\.cn\b;


Regardless of their correctness, would you care to expound on the 
success

of these two rules, John? I like what works not political correctness.
I think these are two interesting observations. Of course, they won't
work very well for somebody doing business with China or embedded
within the .cn TLD.


what works is based on the accuracy of the corpora. If the corpora
show lots of spam with .cn TLD URIs and little or no ham with such, then
that rule will hit often, and have a good S/O, and get a high score.

I too am surprised that .cn TLDs appear in 51% of the spam corpus but I
haven't looked into it in any detail. I can certainly check it against
my own corpora and see if it's reasonable - but then again, I don't do
any business with anyone in china, and I _do_ get a fair amount of bulk
emails from manufacturers in china purportedly looking for business
partners.



The Oddity I was pointing out at the beginning of the thread is not 
prevalence of .cn URI's, but rather most of them appear to be exactly 8 
characters long.  Could someone please commit my T_CN_8_URL rule to the 
sandbox so we can see if that trend holds beyond my own corpora?


Warren



Warren,

Seems to hold true here to an extent. From my recent confirmed spam 
archive I see:


# cat spam* | grep '\.cn\b' | grep http | wc -l
1088

# cat spam* | grep '\.\w\{8\}\.cn\b' | grep http | wc -l
908

# cat spam* | grep '\/\w\{8\}\.cn\b' | grep http | wc -l
23


so 85% of .cn URIs also match the {8}.cn pattern. Not quite as high as 
your findings, but very high nevertheless.
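
The 85% figure follows from adding the two {8} counts and dividing by the total. Below is a minimal Python sketch of the same counting and arithmetic; the names `PATTERNS`, `count_lines`, and `share_of_eight` are invented for the example, and the regular expressions are straight transcriptions of the grep patterns above. Like the grep pipelines, it counts matching lines rather than distinct URIs, and a line matching both {8} patterns is counted in each.

```python
import re

PATTERNS = {
    "any_cn": re.compile(r"\.cn\b"),
    "dot_8_cn": re.compile(r"\.\w{8}\.cn\b"),   # e.g. www.abcdefgh.cn
    "slash_8_cn": re.compile(r"/\w{8}\.cn\b"),  # e.g. http://abcdefgh.cn
}


def count_lines(lines):
    """Count lines containing 'http' that match each pattern."""
    counts = dict.fromkeys(PATTERNS, 0)
    for line in lines:
        if "http" not in line:
            continue
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def share_of_eight(counts):
    """Percentage of .cn lines that also match one of the {8} patterns."""
    return 100.0 * (counts["dot_8_cn"] + counts["slash_8_cn"]) / counts["any_cn"]


# The counts reported above.
reported = {"any_cn": 1088, "dot_8_cn": 908, "slash_8_cn": 23}
print(round(share_of_eight(reported), 1))  # 85.6
```

Taken alone, the dotted form accounts for 908 of 1088 lines, about 83.5%.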






Re: .cn Oddity

2009-10-01 Thread jdow

From: Warren Togami wtog...@redhat.com
Sent: Thursday, 2009/October/01 10:24



On 10/01/2009 01:16 PM, Warren Togami wrote:

On 10/01/2009 01:05 PM, John Hardin wrote:

On Thu, 1 Oct 2009, jdow wrote:


From: John Hardin jhar...@impsec.org


Yours may still hit .cn in the path part. May I suggest:

m;^https?://[^/?]+\.cn\b;


Regardless of their correctness, would you care to expound on the
success
of these two rules, John? I like what works not political correctness.
I think these are two interesting observations. Of course, they won't
work very well for somebody doing business with China or embedded
within the .cn TLD.


what works is based on the accuracy of the corpora. If the corpora
show lots of spam with .cn TLD URIs and little or no ham with such, then
that rule will hit often, and have a good S/O, and get a high score.

I too am surprised that .cn TLDs appear in 51% of the spam corpus but I
haven't looked into it in any detail. I can certainly check it against
my own corpora and see if it's reasonable - but then again, I don't do
any business with anyone in china, and I _do_ get a fair amount of bulk
emails from manufacturers in china purportedly looking for business
partners.



The Oddity I was pointing out at the beginning of the thread is not
prevalence of .cn URI's, but rather most of them appear to be exactly 8
characters long. Could someone please commit my T_CN_8_URL rule to the
sandbox so we can see if that trend holds beyond my own corpora?

Warren


(And yes I'm fully aware even this narrowed rule is prejudiced and unsafe. 
This is partly out of curiosity, and also wondering if it could be made 
useful if meta booleaned with something else.)


Warren


I just had a thought, Warren. Look up Chinese numerology. 8 signifies
wealth or sudden prosperity. Conversely, I suspect few Chinese names
are four characters. Four is a pun on death. Some social sites might
like 5 letters - me. 7 is right out, it's a vulgar word in Cantonese.
9 is also slang or vulgar in Cantonese.

I wonder how many companies that deal with China have figured out that
an 888 toll free number is WONDERFUL, Wealth, wealth, wealth.

I understand numerology is quite important to the Chinese. (Of course,
I am not claiming to be an expert. The above is mostly Wikipoodle and
surmise.)

{^_-} 



Re: .cn Oddity

2009-10-01 Thread jdow

From: Ned Slider n...@unixmail.co.uk
Sent: Thursday, 2009/October/01 10:48



Warren Togami wrote:

On 10/01/2009 01:05 PM, John Hardin wrote:

On Thu, 1 Oct 2009, jdow wrote:


From: John Hardin jhar...@impsec.org


Yours may still hit .cn in the path part. May I suggest:

m;^https?://[^/?]+\.cn\b;
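
A minimal sketch of what anchoring the pattern to the host part changes, written in Python rather than as a SpamAssassin rule. The helper name host_is_cn and the sample URLs are made up for illustration; only the pattern itself comes from the suggestion above.

```python
import re

# Same pattern as suggested above, in Python syntax: the scheme, then host
# characters (everything up to the first "/" or "?"), then ".cn" at a word
# boundary. A ".cn" that only shows up later in the path cannot match.
HOST_CN = re.compile(r"^https?://[^/?]+\.cn\b")


def host_is_cn(url: str) -> bool:
    """Return True when ".cn" appears in the host part of the URL."""
    return HOST_CN.search(url) is not None


if __name__ == "__main__":
    for url in ("http://abcdefgh.cn/index.html",
                "http://www.example.com/mirror.cn/file"):
        print(url, host_is_cn(url))
```

Note that the pattern only requires ".cn" to appear somewhere in the host, so a host such as example.cn.example.org would also match.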


Regardless of their correctness, would you care to expound on the success
of these two rules, John? I like what works, not political correctness.
I think these are two interesting observations. Of course, they won't
work very well for somebody doing business with China or embedded
within the .cn TLD.


what works is based on the accuracy of the corpora. If the corpora
show lots of spam with .cn TLD URIs and little or no ham with such, then
that rule will hit often, and have a good S/O, and get a high score.

I too am surprised that .cn TLDs appear in 51% of the spam corpus but I
haven't looked into it in any detail. I can certainly check it against
my own corpora and see if it's reasonable - but then again, I don't do
any business with anyone in china, and I _do_ get a fair amount of bulk
emails from manufacturers in china purportedly looking for business
partners.



The oddity I was pointing out at the beginning of the thread is not the
prevalence of .cn URIs, but rather that most of them appear to be exactly 8
characters long.  Could someone please commit my T_CN_8_URL rule to the
sandbox so we can see if that trend holds beyond my own corpora?


Warren



Warren,

Seems to hold true here to an extent. From my recent confirmed spam 
archive I see:


# cat spam* | grep '\.cn\b' | grep http | wc -l
1088

# cat spam* | grep '\.\w\{8\}\.cn\b' | grep http | wc -l
908

# cat spam* | grep '\/\w\{8\}\.cn\b' | grep http | wc -l
23


so 85% of .cn URIs also match the {8}.cn pattern. Not quite as high as 
your findings, but very high nevertheless.
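
For anyone who wants to repeat the count on their own archive, here is a rough Python equivalent of the three grep pipelines above. It is only a sketch: the function names are invented here, and it assumes the 85% figure comes from adding the two eight-character counts (908 + 23 = 931 out of 1088), which is an inference from the numbers and not something stated in the message.

```python
import re

# Python versions of the three grep patterns used above.
ANY_CN = re.compile(r"\.cn\b")           # any ".cn" at a word boundary
DOT_8_CN = re.compile(r"\.\w{8}\.cn\b")  # ".xxxxxxxx.cn" (8 word chars after a dot)
SLASH_8_CN = re.compile(r"/\w{8}\.cn\b") # "/xxxxxxxx.cn" (8 word chars after a slash)


def count_lines(lines):
    """Count matching lines, like `grep PATTERN | grep http | wc -l`."""
    counts = {"cn": 0, "dot8": 0, "slash8": 0}
    for line in lines:
        if "http" not in line:
            continue
        if ANY_CN.search(line):
            counts["cn"] += 1
        if DOT_8_CN.search(line):
            counts["dot8"] += 1
        if SLASH_8_CN.search(line):
            counts["slash8"] += 1
    return counts


def share_of_8(counts):
    """Percentage of .cn lines that hit either eight-character pattern.

    Assumption: the two eight-character counts are simply added, so a line
    matching both patterns would be counted twice.
    """
    if counts["cn"] == 0:
        return 0.0
    return 100.0 * (counts["dot8"] + counts["slash8"]) / counts["cn"]


if __name__ == "__main__":
    # Counts reported in the message above.
    reported = {"cn": 1088, "dot8": 908, "slash8": 23}
    print(f"{share_of_8(reported):.1f}%")
```

To run it over real mail, pass an iterable of lines (for example an open file) to count_lines and feed the result to share_of_8.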


Based on my last note about Chinese numerology I bet if you have a large
Chinese ham corpus you'd pick up on 8 as a magic number there, too. I am
intrigued enough I'd LOVE to know if that's right.

{^_^} 



Re: .cn Oddity

2009-10-01 Thread John Hardin

On Thu, 1 Oct 2009, Warren Togami wrote:

The oddity I was pointing out at the beginning of the thread is not the
prevalence of .cn URIs, but rather that most of them appear to be exactly 8
characters long.  Could someone please commit my T_CN_8_URL rule to the
sandbox so we can see if that trend holds beyond my own corpora?


I've put a .CN 8 URI rule into my sandbox file but it may be a few days 
before it gets committed, my stuff is in flux right now...


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  USMC Rules of Gunfighting #9: Accuracy is relative: most combat
  shooting standards will be more dependent on pucker factor than
  the inherent accuracy of the gun.
---
 Approximately 9055560 firearms legally purchased in the U.S. this year


Re: DNSWL and JMF White false positives, what to do exactly?

2009-10-01 Thread mouss
Karsten Bräckelmann wrote:
 On Wed, 2009-09-30 at 23:35 +0200, mouss wrote:
 Warren Togami wrote:
 I scanned my spam folders and found a few false positives that hit on
 either DNSWL 
 FP with DNSWL?

 FP = False Positive = legitimate mail tagged as spam
 DNSWL = Whitelist
 
 False positive. Something, that matches (positive) the criterion for a
 certain test, but should not (false).
 
 if your system adds points because of dnswl, you have a serious problem. ..

 or do you mean FN (false negative)?
 
 Granted, the wording (FPs that hit ham rules) could need some polish,
 but I believe Warren was talking about spam that falsely hits ham rules.
 
 


you can certainly devise a system to detect alpha(foo) where alpha is a
function mapping a Banach space to a Hilbert Space, and define what FP,
FN, FX mean in the context you consider. you can also say let PI=69,
... . but conventions are here for a reason. they allow us to
understand each other more easily. the fact that children of today can
solve computation problems that great scientists of the old times
couldn't handle is thanks to conventions (think of a/b * c/d =
(a*c)/(b*d), which looks trivial today, but wasn't before).

when talking about spam or intrusion detection, FN means missing and
FP means false alarm. if we allow defining FN and FP differently, then
we'll need to rewrite a lot of books, reports, articles, ...




Re: DNSWL and JMF White false positives, what to do exactly?

2009-10-01 Thread mouss
RW wrote:
 On Wed, 30 Sep 2009 23:35:31 +0200
 mouss mo...@ml.netoyen.net wrote:
 
 Warren Togami wrote:
 I scanned my spam folders and found a few false positives that hit
 on either DNSWL 
 FP with DNSWL?

 FP = False Positive = legitimate mail tagged as spam
 DNSWL = Whitelist
 
 The term  false-positive can apply to any test. A test for ham
 that matches a spam is a false-positive, it's a matter of context.

spam too can be (re)defined. and actually any term. but it is assumed
here that we talk about spam detection. so false negative means miss
and false positive means false alarm. this is the common terminology
inherited from intrusion detection.

I used to have a clock that was anti-clockwise. but it was for fun. I
always understood what clockwise meant.


Re: DNSWL and JMF White false positives, what to do exactly?

2009-10-01 Thread Karsten Bräckelmann
On Fri, 2009-10-02 at 00:08 +0200, mouss wrote:
 Karsten Bräckelmann wrote:
  False positive. Something, that matches (positive) the criterion for a
  certain test, but should not (false).

I stand by what I said.

 you can certainly devise a system to detect alpha(foo) where alpha is a
 function mapping a Banach space to a Hilbert Space, and define what FP,
 FN, FX mean in the context you consider. you can also say let PI=69,
 ... . but conventions are here for a reason. they allow us to
 understand each other more easily. the fact that children of today can
 solve computation problems that great scientists of the old times
 couldn't handle is thanks to conventions (think of a/b * c/d =
 (a*c)/(b*d), which looks trivial today, but wasn't before).
 
 when talking about spam or intrusion detection, FN means missing and
 FP means false alarm. if we allow defining FN and FP differently, then
 we'll need to rewrite a lot of books, reports, articles, ...

IFF you are talking about the black box that spam detection is, that is
true.

If you are talking about a rule like FORGED_MUA_OUTLOOK, it appears to
be that simple. However, it is not. You are looking at a single test,
which -- if positive -- either is correct or wrong.

Same for RCVD_IN_DNSWL. If it positively matches, it is either correct
or wrong. A false positive is a match that is wrong, no matter the score
you assign the test.


This concept is NOT specific to spam detection, or even computer
science. As a matter of fact, when I first really grasped the concept, a
medical scientist explained it to me.

Yes, a FP for a rule that identifies *ham* actually evaluated positive
on a spam. It only appears to be spam centric on this list, cause it is
mainly dedicated to identifying spam, not ham.

You might want to ask wikipedia as well. And don't focus on the spam
filtering *example*, which again exclusively talks about a rule
identifying spam. Not ham.


-- 
char *t=\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4;
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;il;i++){ i%8? c=1:
(c=*++x); c128  (s+=h); if (!(h=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: DNSWL and JMF White false positives, what to do exactly?

2009-10-01 Thread LuKreme
On Oct 1, 2009, at 18:36, Karsten Bräckelmann guent...@rudersport.de  
wrote:



Same for RCVD_IN_DNSWL. If it positively matches, it is either correct
or wrong. A false positive is a match that is wrong, no matter the score
you assign the test.


Like others have said, you can make the words mean whatever you want.  
However, if you want to be understood you need to speak the Lingua  
Franca. If you choose to use a term differently than everyone else you  
WILL be misunderstood and corrected.


Saying everyone else is wrong isn't going to help.
 

Re: DNSWL and JMF White false positives, what to do exactly?

2009-10-01 Thread RW
On Fri, 02 Oct 2009 00:14:52 +0200
mouss mo...@ml.netoyen.net wrote:

 RW wrote:

  The term  false-positive can apply to any test. A test for ham
  that matches a spam is a false-positive, it's a matter of context.
 
 spam too can be (re)defined. and actually any term. but it is assumed
 here that we talk about spam detection. so false negative means miss
 and false positive means false alarm. this is the common terminology
 inherited from intrusion detection.

The term comes from statistics, not intrusion detection. I don't
know much about the latter; perhaps people in that field are a little
sloppy in their usage, or, more likely, all the tests are expressed as
tests for intrusion, so the same kind of issue doesn't arise.

The source of your confusion is that you are mixing-up the terminology
of the overall classification and individual test results. Think of it
this way: in a fingerprint comparison the meanings of TP, TN, FP and FN
are obvious and intrinsic to the test, it would be absurd to switch
them around depending on whether it's evidence for the defence or
prosecution.
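
To make the per-test reading concrete, here is a small Python sketch. The function name and the "spam"/"ham" string labels are illustrative only and are not part of SpamAssassin; the point is that the outcome depends on what the individual rule is designed to detect, not on the score it carries.

```python
def rule_outcome(rule_detects: str, rule_hit: bool, actual: str) -> str:
    """Classify a single rule result as TP, FP, TN or FN.

    rule_detects: what the rule is designed to identify ("spam" or "ham").
    rule_hit:     whether the rule fired on the message.
    actual:       what the message really is ("spam" or "ham").
    """
    is_target = (actual == rule_detects)
    if rule_hit:
        # The rule fired: right if the message is what the rule looks for.
        return "TP" if is_target else "FP"
    # The rule stayed silent: a miss if the message was what it looks for.
    return "FN" if is_target else "TN"


if __name__ == "__main__":
    # A whitelist-style rule (detects ham) that fires on a spam message.
    print(rule_outcome("ham", True, "spam"))
    # A spam rule that fires on a legitimate message.
    print(rule_outcome("spam", True, "ham"))
```

Under this reading both calls report a false positive for the rule itself, while for the overall spam/ham classification only the second case contributes towards a false alarm.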


Re: Do I need to do anything to maintain MySQL?

2009-10-01 Thread Steven W. Orr
On 09/24/09 09:21, quoth Benny Pedersen:
 On tor 24 sep 2009 04:57:57 CEST, Steven W. Orr wrote
 Since I haven't *ever* touched this table for cleanup, the above
 described cron job will not delete any rows for that period of time.
 
 you will have fewer problems with innodb than myisam
 
 here is my complete spamassassin sql setup, not showing tables that is
 standard here
 
 CREATE TABLE `awl` (
   `username` varchar(100) NOT NULL default '',
   `email` varchar(200) NOT NULL default '',
   `ip` varchar(10) NOT NULL default '',
   `count` int(11) default '0',
   `totscore` float default '0',
   `lastupdate` timestamp NOT NULL default CURRENT_TIMESTAMP on update
 CURRENT_TIMESTAMP,
   PRIMARY KEY  (`username`,`email`,`ip`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
 
 CREATE TABLE `bayes_seen` (
   `id` int(11) NOT NULL default '0',
   `msgid` varchar(200) character set utf8 collate utf8_bin NOT NULL
 default '',
   `flag` char(1) NOT NULL default '',
   `lastupdate` timestamp NOT NULL default CURRENT_TIMESTAMP on update
 CURRENT_TIMESTAMP,
   PRIMARY KEY  (`id`,`msgid`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
 
 these 2 tables will need to be expired from cron
 
 CREATE TABLE `bayes_token` (
   `id` int(11) NOT NULL default '0',
   `token` char(5) NOT NULL default '',
   `spam_count` int(11) NOT NULL default '0',
   `ham_count` int(11) NOT NULL default '0',
   `atime` int(11) NOT NULL default '0',
   PRIMARY KEY  (`id`,`token`),
   KEY `bayes_token_idx1` (`token`),
   KEY `bayes_token_idx2` (`id`,`atime`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
 
 last table will expire in standard way, this setup is
 working in 3.2.5 and its not bugging down my mysql server
 
 if you change your db to lastupdate now() then all data
 will get added as today even if they are not added for real
 today, but the expire will expire okay later
 

I have all my SA tables up and running using InnoDB and using the above table
definitions. I just have one question:

Will the cronjob that was described here earlier

#!/bin/sh
howfar='where lastupdate < date_sub(now(), interval 3 month)'
mysql -h localhost -u sa -pssaa spamassassin <<EOF
delete from awl $howfar ;
delete from bayes_seen $howfar ;
EOF

also clean up the bayes_token table, or is there another cron job I should use
for that?

And, why is bayes_token.atime int(11) instead of
timestamp NOT NULL default CURRENT_TIMESTAMP on update
?

Is this a part of the design or is it more efficient?

TIA

-- 
Time flies like the wind. Fruit flies like a banana. Stranger things have  .0.
happened but none stranger than this. Does your driver's license say Organ ..0
Donor?Black holes are where God divided by zero. Listen to me! We are all- 000
individuals! What if this weren't a hypothetical question?
steveo at syslang.net


