Re: why does that mail not get any bayes-classification

2016-06-10 Thread RW
On Sat, 11 Jun 2016 04:52:48 +0200
Reindl Harald wrote:

> On 10.06.2016 at 23:52, RW wrote:
> > On Fri, 10 Jun 2016 16:57:45 +0200
> > Reindl Harald wrote:
> >
> >> see attachment, no bayes tag at all looks like a major bug
> >> somewhere
> >
> > In the absence of any debug it's hard to say.
> 
> hence i attached the sample

An email is not debug. I can't run it on *your* system.

> > It is possible for no tokens to make it through the selection, in
> > which case there is no result. That's more likely than normal in
> > your case since you don't train on headers.  
> 
> if you would have looked at the message you would have seen that
> there is content and not only headers and it looks like the message
> has just incorrect mime-definitions (missing end headers)

Of course I looked at it. And I ran it through spamassassin.

Aside from header tokens, what made it past the token selection on my
database was only:

   'marcus','Marcus','enclosed','invoice','business' and 'thank'

It's quite possible that all the body tokens in that email were
in the neutral range on your system, which would cause Bayes to exit
without producing a classification. 

In the absence of any debug against your database, there is nothing
particularly suspicious here.
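
As a rough illustration of why an email can end up with no Bayes result at all, here is a minimal sketch of token selection. The names, the 0.3 cut-off and the plain averaging are assumptions made for this example; they are not SpamAssassin's actual code or combining method.

```python
NEUTRAL = 0.5
MIN_STRENGTH = 0.3  # assumed cut-off, chosen only for illustration


def select_tokens(token_probs, min_strength=MIN_STRENGTH):
    """Keep only tokens whose spam probability is far enough from neutral."""
    return {
        token: prob
        for token, prob in token_probs.items()
        if abs(prob - NEUTRAL) >= min_strength
    }


def classify(token_probs):
    """Return a naive average score, or None when no token was selected."""
    selected = select_tokens(token_probs)
    if not selected:
        return None  # nothing survived selection, so there is no result at all
    # Plain averaging is a stand-in here, not the real combining step.
    return sum(selected.values()) / len(selected)
```

With a database in which every body token sits close to 0.5, `classify` returns `None`, which corresponds to the "no bayes tag at all" case discussed here.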


Re: why does that mail not get any bayes-classification

2016-06-10 Thread David B Funk

On Sat, 11 Jun 2016, Reindl Harald wrote:

> On 10.06.2016 at 23:52, RW wrote:
>> On Fri, 10 Jun 2016 16:57:45 +0200
>> Reindl Harald wrote:
>>
>>> see attachment, no bayes tag at all looks like a major bug somewhere
>>
>> In the absence of any debug it's hard to say.
>
> hence i attached the sample
>
>> It is possible for no tokens to make it through the selection, in which
>> case there is no result. That's more likely than normal in your case
>> since you don't train on headers.
>
> if you would have looked at the message you would have seen that there is
> content and not only headers and it looks like the message has just incorrect
> mime-definitions (missing end headers)
>
> since thunderbird shows the attachment as well as the mail content that would
> be a way for spammers to completely trick out SA

There may be a bug but I don't think it is in the SA distro.

I took your sample and fed it to my SA kit. First time thru it hit BAYES_50, I
then did a "sa-learn --spam < /tmp/ignored_by_bayes_stripped.eml" and retested 
it. It then hit BAYES_999.


So I'd say standard SA + Bayes works on that message. Somebody at your site may
have made some modifications to your SA that are causing you problems.


--
Dave Funk  University of Iowa
College of Engineering
319/335-5751   FAX: 319/384-0549   1256 Seamans Center
Sys_admin/Postmaster/cell_admin    Iowa City, IA 52242-1527
#include 
Better is not better, 'standard' is better. B{



Re: Where to find DETAIL for spamassassin default RULES

2016-06-10 Thread Bill Cole

On 10 Jun 2016, at 3:09, jimimaseye wrote:

> REGEXP:  I don't mind having a go at reading them (I have written some
> myself) but, as you know, even though some are easy and obvious sometimes it
> can be like reading music - a blur of blobs, dots and squiggles that take a
> lot of deciphering.  Of course, many of them rely on 'functionality' of the
> plugins (which I can't say I would fully understand) and the understanding of
> a RULE structure (some are easy and obvious, some are very convoluted).
>
> (I recently developed this one from scratch:  It's an RFC2822 email address
> validator:
> ^(?=.{1,64}@)("[^<>@\\]+"|(?!\.|.*\.(\.|@))[^<>
> @\\"]+)@(\[(\d{1,3}\.){3}\d{1,3}\]|\[IPv6:(?:[A-Fa-f\d]{1,4}:){7}[A-Fa-f\d]{1,4}\]|(?=.{1,255}$)((?!-|\.|\d+($|\.))[a-zA-Z\d-]{0,62}[a-zA-Z\d])(|\.(?!-|\.|\d+($|\.))[a-zA-Z\d-]{0,62}[a-zA-Z\d]){1,126})$
>
> Very proud of it too.  )



So, you thought validating email addresses was a problem demanding a 
solution? And you "solved" it with a regular expression?


Congratulations on now having 2 problems. They should be very happy 
together.


Re: why does that mail not get any bayes-classification

2016-06-10 Thread Reindl Harald



On 10.06.2016 at 23:52, RW wrote:
> On Fri, 10 Jun 2016 16:57:45 +0200
> Reindl Harald wrote:
>
>> see attachment, no bayes tag at all looks like a major bug somewhere
>
> In the absence of any debug it's hard to say.

hence i attached the sample

> It is possible for no tokens to make it through the selection, in which
> case there is no result. That's more likely than normal in your case
> since you don't train on headers.

if you would have looked at the message you would have seen that there
is content and not only headers and it looks like the message has just
incorrect mime-definitions (missing end headers)

since thunderbird shows the attachment as well as the mail content that
would be a way for spammers to completely trick out SA






Re: Email with attachment caused 100% CPU usage.

2016-06-10 Thread Bill Cole

On 10 Jun 2016, at 3:40, Merijn van den Kroonenberg wrote:

>> On 9 Jun 2016, at 0:53, Henrik K wrote:
>>
>>> Garbage text/plain is known problem..
>>
>> text/html too. From GMail.
>>
>> Last week I had a *perfectly legitimate* message with a 151KB logical
>> single line of HTML (QP encoded of course) freeze up a server scaled for
>> 10k users.
>> [snip]
>
> Are there any publicly available mails which might cause these kinds of
> problems? It would be interesting for testing set-ups.


Not that I'm aware of. This particular case involved a customer 
organization who I wouldn't ever consider asking to share their mail for 
any reason, as they specialize in a field where the privacy issues would 
be insurmountable.


One could mock up such messages by hand rather easily: find & copy or 
create (maybe Word would be a choice tool for this...) a bloated HTML 
page, replace all the line breaks with spaces in a proper text editor. 
Attach that to an email message or just make it the sole body part of a 
pure text/html message.
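
The hand-made test message described above could be put together roughly as follows. This is only a sketch using Python's standard `email` package; the function name, the `repeat` parameter and the example.com addresses are made up for illustration, and it says nothing about how the original problem message was actually produced.

```python
from email.message import EmailMessage


def make_single_line_html_message(html, repeat=1):
    """Build a text/html-only message whose body is one long logical line."""
    # Replace every line break with a space, as suggested in the text.
    one_line = " ".join(html.splitlines())
    # Optionally repeat the page to reach a size worth testing with.
    body = " ".join([one_line] * repeat)

    msg = EmailMessage()
    msg["From"] = "sender@example.com"       # placeholder address
    msg["To"] = "recipient@example.com"      # placeholder address
    msg["Subject"] = "long-line HTML test message"
    # Quoted-printable soft-breaks the body into transport-safe lines,
    # but the decoded body is still a single logical line.
    msg.set_content(body, subtype="html", cte="quoted-printable")
    return msg
```

Writing `msg.as_bytes()` to a file gives something that can be fed to a test installation, for example on a scratch machine rather than a production server.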


It MAY be that this was a case where someone wrote something big-ish
and highly-formatted inside GMail or Google Docs and mailed it, so the
structure is GMail's fault. It could also be that some MUA or document
generation tool constructed the mail and it came through GMail via SMTP
submission. I do not know which it is, but I would bet on the latter,
given Google's core expertise in working with HTML. My reflexes in these
sorts of cases where I have to handle customer non-spam mail are to not
look too closely at anything non-critical to solving the problem at hand
and to swiftly forget anything unimportant I happen to see (which gets
easier every year...)


Re: Email with attachment caused 100% CPU usage.

2016-06-10 Thread Bill Cole

On 9 Jun 2016, at 1:40, Olivier wrote:

> Mark London writes:
>
>> On 6/8/2016 1:20 PM, John Hardin wrote:
>>> On Wed, 8 Jun 2016, Mark London wrote:
>>>> Hi - We received an email with several large postscript attachments,
>>>> and the content type was "text/plain".   This caused our spamassassin
>
> Sorry to jump in, but should SA trust the content-type or the file(1)
> type, or should try to compare both and do something if they mismatch?


No.

The root of this problem isn't a type mismatch. PostScript *IS* (or at
least *can be*) plain text. If my recollection is correct, PS was
originally specified to use "Base85" encoding for anything that went
outside the "ascii85" (33-117) subset of ASCII, which just happened to
be the digits used in Base85 encoding. Many Unix-ish machines have the
'atob' and 'btoa' utilities on them to do that encoding and decoding. As
long as a proper mail-safe transfer encoding is used (Base64 or QP),
there is no limit on the decoded "line" length in text/plain (Thanks,
Microsoft!) of PostScript. An unanchored and/or imprudently loose
regular expression can therefore end up doing a LOT of false starts in a
big chunk of PostScript that claims correctly to be text/plain with
(probably) Q-P encoding used only to soft-break it into transport-safe
lines. The root of the problem is using unanchored and/or imprudently
loose regular expression rules. If it's not an innocent PS today because
there was a type mismatch, it will be correctly-typed HTML tomorrow.


A handy heuristic: if a rule does not start with '^' immediately 
followed by something restrictive (even '^.{0,80}' followed by a string 
of literals) or has '.*' anywhere, it's risky.
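
The heuristic above is simple enough to turn into a quick check over a local rule file. The following is a minimal sketch; `looks_risky` is a made-up helper, it only does plain string tests, and it does not try to judge whether what follows the '^' is actually "restrictive". It will also flag an escaped `\.*`, which is a false positive.

```python
def looks_risky(pattern):
    """Flag a regex that is unanchored or contains '.*' anywhere."""
    unanchored = not pattern.startswith("^")
    has_dot_star = ".*" in pattern
    return unanchored or has_dot_star


if __name__ == "__main__":
    # Hypothetical patterns, not taken from any real rule set.
    for candidate in (r"invoice.*attached", r"^.{0,80}enclosed invoice"):
        print(candidate, "->", "risky" if looks_risky(candidate) else "ok")
```

A flagged pattern is only a candidate for review, not proof of a problem.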


This is probably most critical with 'rawbody' rules and rules that use 
the multiline option. It is important to understand that "rawbody" isn't 
the body parts of 'full' (pristine RFC2822 format with any CTE) but 
rather the text/* body parts with any Content-Transfer-Encoding 
transformation reversed. Due to the odd way SA deals with line ends 
(there are WONTFIX bugs on it and probably a mention on some wiki 
page...) you basically have to use the multiline option to anchor to 
line-ends, and that can inadvertently land you in trouble.




Re: Email with attachment caused 100% CPU usage.

2016-06-10 Thread Martin Gregorie
On Fri, 2016-06-10 at 18:26 -0400, Bill Cole wrote:
> It will be interesting to see the stats on scantimes this week to see
> if my tightening up on sloppy rules has an impact. I expect it will,
> since I now have a concrete theory to explain that long tail out to 2
> minutes, which before now I've ignored as pure noise.
> 
Thanks for the heads-up. 

It prodded me into scanning my local rule set for unguarded '.*' in
local rules. I found a few - more or less what I expected and virtually
all in my older rules - and limited them all to runs of up to 32
characters before running my spam corpus against them. This showed only
two instances of a rule now failing to fire. Finding the rule and
changing it to allow up to 64 characters to match in the 'don't care'
parts of the regex fixed that. 

FWIW this is a rule that looks for two consecutive URLs in body text. I
knew that some of these can be quite long, but since, in all the spam
I've inspected they were used as a more or less neat final line in the
message, they probably don't exceed 32 chars by much. So yes, while I
know that 64 chars is probably overkill, it's not so much overkill that
it's worth further fiddling.
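
For anyone wanting to try the same clean-up, here is a small sketch of the substitution described above: each unguarded '.*' is replaced by a bounded '.{0,N}'. The helper name and the sample patterns are invented for illustration. It skips a dot that is escaped with a backslash, but it does not understand character classes such as `[.*]`, so the result still needs checking against a corpus.

```python
import re


def bound_dot_star(pattern, limit=64):
    """Replace each unescaped '.*' in a regex with a bounded '.{0,limit}'."""
    # (?<!\\) skips '\.*', where the dot is a literal rather than a wildcard.
    return re.sub(r"(?<!\\)\.\*", ".{0,%d}" % limit, pattern)


if __name__ == "__main__":
    # Hypothetical rule body, loosely modelled on "two URLs close together".
    print(bound_dot_star(r"http://\S+.*http://\S+", 64))
```

Starting with a small bound and raising it only for the rules that stop firing, as described above, keeps the limits as tight as the corpus allows.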
   

Martin



Re: Email with attachment caused 100% CPU usage.

2016-06-10 Thread Sidney Markowitz
Merijn van den Kroonenberg wrote on 10/06/16 10:17 PM:
> What does this mean, can still a single operation take more than this
> time_limit?

There is a fundamental difficulty in perl that the built-in timer alarm
facility cannot always interrupt the built-in regular expression matching
facility. That means there is not a good way of protecting against a rule
pattern that can take a very long time to process certain input being given
that input. There is some effort to do that with the time_limit setting and
spawning of individual child processes in spamd that can be individually
killed if one of them gets hung up, but that only goes so far.
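
The general idea of doing the match in a separate process that can be killed from outside can be sketched like this. It is written in Python purely for illustration, the function name and exit codes are assumptions, and it is not how spamd is implemented.

```python
import subprocess
import sys

# Code run by the child: exit 0 on a match, 3 on no match.
_CHILD_CODE = (
    "import re, sys\n"
    "pattern, text = sys.argv[1], sys.stdin.read()\n"
    "sys.exit(0 if re.search(pattern, text) else 3)\n"
)


def search_with_timeout(pattern, text, seconds):
    """Return True or False for match/no match, or None if the child was killed."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", _CHILD_CODE, pattern],
            input=text,
            text=True,
            timeout=seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills and reaps the child before raising.
        return None
    if result.returncode == 0:
        return True
    if result.returncode == 3:
        return False
    raise RuntimeError("child failed with exit code %d" % result.returncode)
```

The cost is one process per match attempt, which is why this kind of protection tends to be applied per message or per batch of rules rather than per pattern.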

More technical discussion of how to better deal with this probably would be
better in the dev mailing list.

There have been a number of bug reports having to do with some "toxic" email
bogging down SpamAssassin which were tracked down to some use of, for example,
.* in a rule pattern. It can be tricky to fix a pattern to keep this from
happening. When you write your own rules for local use you are more likely to
run into this problem because you probably have less experience in writing
patterns to avoid it and don't run your rules through the many test cases of
our mass check system.

 Sidney



Re: Email with attachment caused 100% CPU usage.

2016-06-10 Thread Bill Cole

On 10 Jun 2016, at 6:17, Merijn van den Kroonenberg wrote:

[...]

> From the manual:
>
> This is a best-effort advisory setting, processing will not be abruptly
> aborted at an arbitrary point in processing when the time limit is
> exceeded, but only on reaching one of locations in the program flow
> equipped with a time test. Currently equipped with the test are the main
> checking loop, asynchronous DNS lookups, plugins which are calling
> external programs. Rule evaluation is guarded by starting a timer (alarm)
> on each set of compiled rules.


The last line is critical. The alarm isn't even on individual rules but 
rather on precompiled sets of rules. I already knew it was soft but that 
particular detail had not stuck in any of the many times I've read that 
man page. Thanks for quoting it.
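
To make the "soft" nature of such a limit concrete, here is a toy sketch of a limit that is only tested at checkpoints between steps. The function, its arguments and the injectable clock are assumptions for the example and do not mirror SpamAssassin's internals.

```python
import time


def run_with_advisory_limit(steps, time_limit, clock=time.monotonic):
    """Run (name, callable) steps in order, testing the limit only between steps.

    Returns (names_completed, limit_exceeded).
    """
    deadline = clock() + time_limit
    completed = []
    for name, step in steps:
        if clock() >= deadline:
            # The limit is noticed here, at a checkpoint, never mid-step.
            return completed, True
        step()  # a slow step runs to completion, however long it takes
        completed.append(name)
    return completed, clock() >= deadline
```

A single step that takes far longer than `time_limit` is never interrupted; the overrun is only noticed at the next checkpoint, which is the behaviour the manual text describes as best-effort.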



> What does this mean, can still a single operation take more than this
> time_limit? But I guess the timer on the rules means the rules at least
> cannot take more than time_limit, right?


Nope: time_limit is in seconds and I had it set to 270. The default is
300, which is 1/2 the canonical SMTP EOD timeout but matches the
canonical "server timeout" (how long a server should wait for a client
command; see RFC 5321), and I'm sure there's misunderstanding on that
distinction. Since under peak loads the plumbing of that particular
system can add whole seconds and some clients can be a bit impatient, I
gave it what I thought was very generous additional headroom although it
really didn't need it. The actual normal spamd scantime distribution
there is roughly 2/3 under 1 second, 90% under 2 seconds, 99% under 10
seconds, 99.9% under a minute. There's a long thin tail out to ~2
minutes, but until last week I'd never had anything actually hit the
timeout that I can recall, and the 1st bad rule was hardly fresh. I
think it is >8 years old and the others I found had not caused trouble
in their multiple (3?) years of existence.
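
A distribution summary of this kind can be computed from a list of scan times with a few lines of code. This is a minimal sketch; the function names are invented, and how the scan times are extracted from the spamd log is left out because it depends on the local log format.

```python
def fraction_under(scantimes, threshold):
    """Fraction of scans that finished in under `threshold` seconds."""
    if not scantimes:
        raise ValueError("no scan times given")
    return sum(1 for t in scantimes if t < threshold) / len(scantimes)


def summarize(scantimes, thresholds=(1, 2, 10, 60)):
    """Map each threshold in seconds to the fraction of scans below it."""
    return {limit: fraction_under(scantimes, limit) for limit in thresholds}
```

Comparing the output before and after tightening rules is one way to see whether the long tail actually shrinks.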


I figured out which rule was the proximal cause by running the message 
through the spamassassin script with rule debugging on so I could narrow 
down the bad rule based on what matched right before 
TIME_LIMIT_EXCEEDED. I didn't dtrace the process to nail it down, but my 
hypothesis is that ultimately Perl is calling its low-level internal 
equivalent of regexec() which, like POSIX regexec(), has no timeout 
facility: it runs until it matches or exhausts the input of starting 
points. At the Perl level, I've never encountered any way to limit the 
time '=~' takes to operate, but I'm no Larry Wall so maybe there's some 
arcane way to do that. It seems clear that if Perl has such a feature to 
break out of an operator routine that is taking too much clock time, it 
has not been used in SpamAssassin. I'd bet on there NOT being such a 
feature and further, that the process doing the match might not even die 
immediately with a SIGKILL while inside that call.


What most annoys me about this is that the potential for blowing up 
systems with REs is the first thing one learns about them in a formal 
setting (rather than just by reading man pages above the BUGS section.) 
Back when I first got that warning the emphasis was on an ability of REs 
to compile to disastrous scale but back then that meant a few megabytes, 
and who cares about that today. However, I also got the warning decades 
ago that '.*' could cause a RE to take a long time if you didn't take 
care to limit your input size and write the RE to rule out most starting 
points fast, but again absolute sizes matter and until last week I'd not 
envisioned "optimizing" HTML by removing all formally unnecessary 
whitespace including line breaks. This is obviously somewhat rare, but 
it's apparently A Thing HTML Parsers Like and this was a big hunk of 
HTML, so I guess optimizing parsing was important...


It will be interesting to see the stats on scantimes this week to see if 
my tightening up on sloppy rules has an impact. I expect it will, since 
I now have a concrete theory to explain that long tail out to 2 minutes, 
which before now I've ignored as pure noise.




Re: why does that mail not get any bayes-classification

2016-06-10 Thread RW
On Fri, 10 Jun 2016 16:57:45 +0200
Reindl Harald wrote:

> see attachment, no bayes tag at all looks like a major bug somewhere

In the absence of any debug it's hard to say.

It is possible for no tokens to make it through the selection, in which
case there is no result. That's more likely than normal in your case
since you don't train on headers.


Re: SA bayes file db permission issue

2016-06-10 Thread Martin Gregorie
On Fri, 2016-06-10 at 15:38 -0400, Joseph Brennan wrote:

> Look out for big-endian and little-endian, too. That affects
> databases. 
> This bit us once when we copied a berkeley db from solaris to linux. 
> Endian-ness is based on the cpu hardware, but apparently Macs and
> most hardware used for Linux (like Intel) are both little-endian-- so
> it is probably not the answer in this case.
> 
Has to be an implementation difference in that case, e.g. UTF-8 vs ASCII
or somebody decided that using an int was wasteful and used a short
instead.

> This is a nice test I found:
> echo -n I | od -to2 | awk '{ print substr($2,6,1); exit}'
> 
> 1 little-endian
> 0 big-endian
> 
Very nice indeed. Thanks for posting it.


Martin



Re: SA bayes file db permission issue

2016-06-10 Thread RW
On Fri, 10 Jun 2016 15:38:44 -0400
Joseph Brennan wrote:

>  wrote:
> 
> > The main database file is binary anyway.  
> 
> 
> Look out for big-endian and little-endian, too. That affects
> databases. This bit us once when we copied a berkeley db from solaris
> to linux. 

That may have changed; they are supposed to be compatible:

http://www.oracle.com/technetwork/database/berkeleydb/db-faq-095848.html

It's just a bit less efficient.

> Endian-ness is based on the cpu hardware, but apparently
> Macs and most hardware used for Linux (like Intel) are both
> little-endian-- so it is probably not the answer in this case.

IIRC older OS X Macs used big-endian PowerPC processors.


Re: SA bayes file db permission issue

2016-06-10 Thread Joseph Brennan



 wrote:


> The main database file is binary anyway.



Look out for big-endian and little-endian, too. That affects databases.
This bit us once when we copied a Berkeley DB from Solaris to Linux.
Endian-ness is based on the CPU hardware, but apparently Macs and most
hardware used for Linux (like Intel) are both little-endian, so it is
probably not the answer in this case.


This is a nice test I found:
echo -n I | od -to2 | awk '{ print substr($2,6,1); exit}'

1 little-endian
0 big-endian
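
The same check can be done without od and awk. This is a small sketch using only the Python standard library; the function name is made up for the example.

```python
import struct
import sys


def native_byte_order():
    """Return 'little' or 'big' by packing 1 as a native 16-bit integer."""
    # '=H' means native byte order, standard size (2 bytes), unsigned.
    first_byte = struct.pack("=H", 1)[0]
    # On a little-endian machine the low-order byte comes first.
    return "little" if first_byte == 1 else "big"


if __name__ == "__main__":
    # sys.byteorder reports the same thing directly.
    print(native_byte_order(), sys.byteorder)
```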

Joseph Brennan
Columbia U





Re: How SA reactes to a bunch of garbage characters

2016-06-10 Thread Matus UHLAR - fantomas

On 09.06.16 10:43, Olivier wrote:
> For years I have had the FuzzyOcr plugin running, but it helps little,
> because it has its own list of words to keep updated.
>
> I am wondering what would happen if, instead of using that own list of
> words, the result were injected back into the body of the main message.

I raised this issue some years ago. The result was that pushing OCR-ed data
back to SA for evaluating BAYES and other rules could cause trouble,
because freely available OCR software was not very precise.

> Most of the time, what will be injected back is plain garbage:
> w_T___l_e?_
>
> But other times the result is interesting, like a proper English sentence
> full of spam.

What exactly do you use for OCR? 10 years ago I made a comparison between
gocr, ocrad and tesseract, where gocr gave the best results.

Now, since Google sponsors tesseract development, the scanning looks much
much better, and I started thinking about trying that again.

> So how will SA react if I reinject the garbage? Will it just ignore it?

It would be nice to see the results.
I'm mostly afraid about FUZZY_* rules...

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Save the whales. Collect the whole set.


Re: SA bayes file db permission issue

2016-06-10 Thread RW
On Fri, 10 Jun 2016 00:08:01 +0100
Martin Gregorie wrote:

> On Thu, 2016-06-09 at 15:01 -0700, John Hardin wrote:
> > On Thu, 9 Jun 2016, Martin Gregorie wrote:
> >
> > > On Thu, 2016-06-09 at 16:54 -0400, Yu Qian wrote:
> > >> Ok, I found out. so the db files generated on Mac can not be
> > >> used on Linux. vice versa.
> > >
> > > Newline symbols differ: '\n' is 0x0a (LF) for Linux, 0x0d (CR) for
> > > Macs.
> >
> > WTF? I thought Mac's OS was based on Mach, which is an offshoot of
> > Unix?
> >
> Since Macs used CR from their debut I thought this got carried over
> into OS-X for file compatibility reasons. Seems that I was wrong
> (except for Excel for OS X, which still uses CR for CSV files).

The main database file is binary anyway.


Re: Email with attachment caused 100% CPU usage.

2016-06-10 Thread Merijn van den Kroonenberg
> On 10.06.2016 at 04:49, Bill Cole wrote:
>> On 9 Jun 2016, at 0:53, Henrik K wrote:
>>
>>> Garbage text/plain is known problem..
>>
>> text/html too. From GMail.
>>
>> Last week I had a *perfectly legitimate* message with a 151KB logical
>> single line of HTML (QP encoded of course) freeze up a server scaled for
>> 10k users. It did it slowly over a day, because it took a spamd child
>> ~20 minutes to scan
>
> why in the world do you allow a single spamd child to scan 20 minutes
> for a message and what happens if all your children have such mails to
> process - that's hardly scaled for 10k users on rainy days
>
> time_limit 20
>
> read the manual, it works like shortcircuit, meaning all other rules
> that already finished (RBL/URIBL in any case) will give their score, so
> you don't leave the machine wide open, while stopping easy DoS attacks
> with handcrafted mails

From the manual:

This is a best-effort advisory setting, processing will not be abruptly
aborted at an arbitrary point in processing when the time limit is
exceeded, but only on reaching one of locations in the program flow
equipped with a time test. Currently equipped with the test are the main
checking loop, asynchronous DNS lookups, plugins which are calling
external programs. Rule evaluation is guarded by starting a timer (alarm)
on each set of compiled rules.

What does this mean, can still a single operation take more than this
time_limit? But I guess the timer on the rules means the rules at least
cannot take more than time_limit, right?

> if the server is not a feature-phone, when you don't have a result within
> 20 seconds you will hardly get one 5 minutes later (besides that, in a
> proper setup rejecting based on the result, the client doesn't wait that
> long and comes again and again)




Re: Email with attachment caused 100% CPU usage.

2016-06-10 Thread Reindl Harald



On 10.06.2016 at 04:49, Bill Cole wrote:
> On 9 Jun 2016, at 0:53, Henrik K wrote:
>
>> Garbage text/plain is known problem..
>
> text/html too. From GMail.
>
> Last week I had a *perfectly legitimate* message with a 151KB logical
> single line of HTML (QP encoded of course) freeze up a server scaled for
> 10k users. It did it slowly over a day, because it took a spamd child
> ~20 minutes to scan


why in the world do you allow a single spamd child to scan 20 minutes
for a message and what happens if all your children have such mails to
process - that's hardly scaled for 10k users on rainy days


time_limit 20

read the manual, it works like shortcircuit, meaning all other rules
that already finished (RBL/URIBL in any case) will give their score, so
you don't leave the machine wide open, while stopping easy DoS attacks
with handcrafted mails


if the server is not a feature-phone, when you don't have a result within
20 seconds you will hardly get one 5 minutes later (besides that, in a
proper setup rejecting based on the result, the client doesn't wait that
long and comes again and again)






Re: Email with attachment caused 100% CPU usage.

2016-06-10 Thread Merijn van den Kroonenberg
> On 9 Jun 2016, at 0:53, Henrik K wrote:
>
>> Garbage text/plain is known problem..
>
> text/html too. From GMail.
>
> Last week I had a *perfectly legitimate* message with a 151KB logical
> single line of HTML (QP encoded of course) freeze up a server scaled for
> 10k users.
> [snip]

Are there any publicly available mails which might cause these kinds of
problems? It would be interesting for testing set-ups.




Re: Where to find DETAIL for spamassassin default RULES

2016-06-10 Thread jimimaseye
Thanks for the replies, guys.

So in essence, there is no user-friendly method as there was before.

On 09/06/2016 14:19, Joe Quinn wrote:
> I have a bookmark in Firefox that points to
> http://ruleqa.spamassassin.org/?rule=%s&srcpath=&g=Change which is the
> status page for the nightly rule updates and is likely what you are
> looking for.
>
> I give it a keyword too, so I can type "ruleqa RULENAME" and it will
> replace the "%s" with whatever I type. 

As for looking up and searching those nightly listings, it's true I can find
an individual rule, but then I can't see how to drill into it and see
its expression or detail - I can only see a load of links showing how
effective it is in tests (it's not really what I was looking for).  Am I
missing something?


REGEXP:  I don't mind having a go at reading them (I have written some
myself) but, as you know, even though some are easy and obvious sometimes it
can be like reading music - a blur of blobs, dots and squiggles that take a
lot of deciphering.  Of course, many of them rely on 'functionality' of the
plugins (which I can't say I would fully understand) and the understanding of
a RULE structure (some are easy and obvious, some are very convoluted).

(I recently developed this one from scratch:  It's an RFC2822 email address
validator:
^(?=.{1,64}@)("[^<>@\\]+"|(?!\.|.*\.(\.|@))[^<>
@\\"]+)@(\[(\d{1,3}\.){3}\d{1,3}\]|\[IPv6:(?:[A-Fa-f\d]{1,4}:){7}[A-Fa-f\d]{1,4}\]|(?=.{1,255}$)((?!-|\.|\d+($|\.))[a-zA-Z\d-]{0,62}[a-zA-Z\d])(|\.(?!-|\.|\d+($|\.))[a-zA-Z\d-]{0,62}[a-zA-Z\d]){1,126})$

Very proud of it too.  )



