Re: More text/plain questions

2014-07-02 Thread Karsten Bräckelmann
On Wed, 2014-07-02 at 19:10 -0600, Philip Prindeville wrote:
> On Jul 2, 2014, at 5:16 PM, Karsten Bräckelmann wrote:

> > That RE is a single, straightforward alternation with two alternatives.
> > 
> > The first one translates to a single char in a given, specific range.
> > Basically, anything but the ampersand. The second alternative is an
> > ampersand that is not followed by #x.
> > 
> > The (?!pattern) is a zero-width negative look-ahead assertion. Zero
> > width means it does not consume what it matches. Thus, the second
> > alternative will ultimately match a single ampersand only. The /g global
> > matching then continues where it left off after the last matching
> > attempt. In the case of that ampersand followed by #x, that still is
> > right after the ampersand.

> Okay, so what I was trying to do is skip any ampersand followed by
> #x; as part of the matched text (but include ampersands not
> followed by #x; as part of the match).

That is the result of the plain s/&#x[0-9A-F]{4};//g global substitution
I posted.

You should define what you ultimately want to achieve, not what you
right now think is a stepping-stone toward the solution.


> So that if I had the text:
> 
> This that & thos&#x0435;.
> 
> The first @match would be counted as $chars:
> 
> T,h,i,s, ,t,h,a,t, ,&, ,t,h,o,s,.
> 
> and the 2nd @match would be:
> 
> &#x0435;
> 
> counting as $uchars.
> 
> So in the first case, the e would be skipped over as part of the 
> capture.

Skipped over, since it is part of the capture. That kind of contradicts
itself...

Do you want all of those (HTML entity string) matches? The raw matches
themselves? Or is that just an attempt of debug visualization? Do you
actually want its number only?

This has quite an impact on the Perl code and logic / math involved.


Number of HTML entity escapes, and the length (in chars) of the remainder:

  my $number = $data =~ s/&#x[0-9A-F]{4};//g;

  print "number:  ", $number, "\n";
  print "other:   ", length $data," = '", $data, "'\n";


If you do need the complete HTML entity escapes: a quick hack to compute the remainder.

  my @matches = $data =~ /&#x[0-9A-F]{4};/g;

  print "matches: ", scalar @matches, " = ", join(',', @matches), "\n";
  print "other:   ", length ($data) - 8*(scalar @matches), "\n";
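Both snippets can be combined into a tiny self-contained script (the sample string here is made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample: "The R" with the 'e' encoded as an HTML entity.
my $data = "Th&#x0435; R";

# Grab the complete entity escapes first, while $data is intact...
my @matches = $data =~ /&#x[0-9A-F]{4};/g;

# ...then strip them; s///g in scalar context returns the number
# of substitutions performed.
my $number = $data =~ s/&#x[0-9A-F]{4};//g;

print "matches: ", scalar @matches, " = ", join(',', @matches), "\n";
print "number:  ", $number, "\n";
print "other:   ", length($data), " = '", $data, "'\n";
```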


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8||(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: More text/plain questions

2014-07-02 Thread Philip Prindeville

On Jul 2, 2014, at 5:16 PM, Karsten Bräckelmann  wrote:

> On Wed, 2014-07-02 at 14:44 -0600, Philip Prindeville wrote:
>> Okay, was tinkering with the code below but the zero-width lookahead is
>> not disqualifying ampersand followed by #x[0-9A-F]{4}; so the output
>> is bogus (you can run this and see what I mean).
>> 
>> What am I doing wrong?
> 
> You are using an overly complex and fugly test case. ;)  Seriously, a
> stripped down test string does not require more than about 4 instances
> of plain chars and HTML entities. Much easier on the eye.
> 
> 
>>my @matches = m/[\001-\045\047-\177]|&(?!#x[0-9A-F]{4};)/g;
> 
> That RE is a single, straightforward alternation with two alternatives.
> 
> The first one translates to a single char in a given, specific range.
> Basically, anything but the ampersand. The second alternative is an
> ampersand that is not followed by #x.
> 
> The (?!pattern) is a zero-width negative look-ahead assertion. Zero
> width means it does not consume what it matches. Thus, the second
> alternative will ultimately match a single ampersand only. The /g global
> matching then continues where it left off after the last matching
> attempt. In the case of that ampersand followed by #x, that still is
> right after the ampersand.
> 
>  line: Th&#x0435; R
>  matches: T,h,#,x,0,4,3,5,;, ,R

Okay, so what I was trying to do is skip any ampersand followed by #x; as 
part of the matched text (but include ampersands not followed by #x; as 
part of the match).

So that if I had the text:

This that & thos&#x0435;.

The first @match would be counted as $chars:

T,h,i,s, ,t,h,a,t, ,&, ,t,h,o,s,.

and the 2nd @match would be:

&#x0435;

counting as $uchars.

So in the first case, the e would be skipped over as part of the capture.

What’s the opposite of a zero-width lookahead?  I.e. a match that advances the 
cursor but doesn’t copy the matching text into the capture buffer?
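One Perl feature that comes close is \K: a zero-width assertion that advances the match position past everything matched before it, while excluding that prefix from the returned text. A minimal sketch (the sample string is made up):

```perl
use strict;
use warnings;

my $s = "Th&#x0435;m";   # made-up sample: the 'e' is entity-encoded

# Consume any run of entities, then \K drops them from the returned
# match, so only the following plain char comes back. The cursor
# still advances past the entities, so they are skipped, not counted.
my @plain = $s =~ /(?:&#x[0-9A-F]{4};)*\K[^&]/g;
print join(',', @plain), "\n";   # T,h,m
```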


> 
> The offending ampersand part of the HTML entity encoding correctly is
> not matched. The following chars do match the "anything but an
> ampersand" first alternative.
> 
> 
> I am unsure what you are trying to achieve. If you want to compare the
> number of HTML entities with the number of regular chars, wouldn't it be
> easier to simply drop them flat?
> 
>  $data =~ s/&#x[0-9A-F]{4};//g;
> 
> Or just plain match and count?
> 
>  @matches = $data =~ /&#x[0-9A-F]{4};/g;
> 
> 



Re: More text/plain questions

2014-07-02 Thread Karsten Bräckelmann
On Wed, 2014-07-02 at 14:44 -0600, Philip Prindeville wrote:
> Okay, was tinkering with the code below but the zero-width lookahead is
> not disqualifying ampersand followed by #x[0-9A-F]{4}; so the output
> is bogus (you can run this and see what I mean).
> 
> What am I doing wrong?

You are using an overly complex and fugly test case. ;)  Seriously, a
stripped down test string does not require more than about 4 instances
of plain chars and HTML entities. Much easier on the eye.


> my @matches = m/[\001-\045\047-\177]|&(?!#x[0-9A-F]{4};)/g;

That RE is a single, straightforward alternation with two alternatives.

The first one translates to a single char in a given, specific range.
Basically, anything but the ampersand. The second alternative is an
ampersand that is not followed by #x.

The (?!pattern) is a zero-width negative look-ahead assertion. Zero
width means it does not consume what it matches. Thus, the second
alternative will ultimately match a single ampersand only. The /g global
matching then continues where it left off after the last matching
attempt. In the case of that ampersand followed by #x, that still is
right after the ampersand.

  line: Th&#x0435; R
  matches: T,h,#,x,0,4,3,5,;, ,R

The offending ampersand part of the HTML entity encoding correctly is
not matched. The following chars do match the "anything but an
ampersand" first alternative.


I am unsure what you are trying to achieve. If you want to compare the
number of HTML entities with the number of regular chars, wouldn't it be
easier to simply drop them flat?

  $data =~ s/&#x[0-9A-F]{4};//g;

Or just plain match and count?

  @matches = $data =~ /&#x[0-9A-F]{4};/g;





Re: getting tons of SPAM

2014-07-02 Thread Karsten Bräckelmann
On Wed, 2014-07-02 at 14:11 -0700, motty cruz wrote:
> Bayes filter is not running: according to header,

Yes. As I pointed out to you yesterday.
  http://markmail.org/message/atqa6lv2mgplxlhg

I also mentioned the most likely cause for the BAYES_* rule hits
missing. No reaction on your part, though.

 "There's no BAYES_* rule hit. That means your manual training of ham and
  spam has been done as the wrong user. You need to do the training as the
  same user Amavis / SA runs as."


> # sa-learn --dump magic

What user did you run that command as? What user does Amavis run as?


For completeness, here's the additional advice again from the
above-referenced post: do NOT use catch-all.

 "Earlier header pastes suggest you are using catch-all. Just, don't.

  Not using catch-all will *significantly* reduce the amount of spam,
  simply by completely eliminating the bulk of spam to otherwise false
  addresses."


If you want the list to keep helping you, you should directly get back
to suggestions and comments, rather than repeating your question.





Re: getting tons of SPAM

2014-07-02 Thread John Hardin

On Wed, 2 Jul 2014, Jeremy McSpadden wrote:


pastebin .. and do not edit the message, do not remove headers or email 
addresses


Though you *can* mangle your own domain if you want to keep that private.

Please use "example.com" for that.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
 2 days until the 238th anniversary of the Declaration of Independence


Re: getting tons of SPAM

2014-07-02 Thread Jeff Mincy
   From: John Hardin 
   Date: Wed, 2 Jul 2014 14:45:07 -0700 (PDT)
   
   On Wed, 2 Jul 2014, motty cruz wrote:
   
   > Bayes filter is not running: according to header,
   >
   > X-Virus-Scanned: amavisd-new at fqdn.com
   > X-Spam-Flag: NO
   > X-Spam-Score: -0.009
   > X-Spam-Level:
   > X-Spam-Status: No, score=-0.009 tagged_above=-999 required=5.3
   >tests=[HTML_MESSAGE=0.001, T_RP_MATCHES_RCVD=-0.01]
   >autolearn=unavailable
   > Received: from
   >
   > # sa-learn --dump magic
   > Error Opening file /usr/local/share/GeoIP/GeoIPv6.dat
   > 0.000  0  3  0  non-token data: bayes db version
   > 0.000  0   3338  0  non-token data: nspam
   > 0.000  0784  0  non-token data: nham
   >
   > any ideas?
   
Note the "autolearn=unavailable" part.
The Bayes database is probably locked doing an expire.

Also, the GeoIP data file should be fixed:
 Error Opening file /usr/local/share/GeoIP/GeoIPv6.dat

   You need to post samples (to pastebin). We can't make comments on what 
   *should* be hitting unless we can see the message itself.

Yep.
-jeff


Re: getting tons of SPAM

2014-07-02 Thread Jeremy McSpadden
pastebin .. and do not edit the message, do not remove headers or email 
addresses


--
Jeremy McSpadden
Flux Labs | http://www.fluxlabs.net | Endless Solutions
Office : 850-250-5590x501 | Cell : 850-890-2543 | Fax : 850-254-2955


Re: getting tons of SPAM

2014-07-02 Thread motty cruz
Looks like Gmail won't allow outgoing email with samples of spam attached.




On Wed, Jul 2, 2014 at 2:45 PM, John Hardin  wrote:

> On Wed, 2 Jul 2014, motty cruz wrote:
>
>  Bayes filter is not running: according to header,
>>
>> X-Virus-Scanned: amavisd-new at fqdn.com
>> X-Spam-Flag: NO
>> X-Spam-Score: -0.009
>> X-Spam-Level:
>> X-Spam-Status: No, score=-0.009 tagged_above=-999 required=5.3
>>tests=[HTML_MESSAGE=0.001, T_RP_MATCHES_RCVD=-0.01]
>>autolearn=unavailable
>> Received: from
>>
>> # sa-learn --dump magic
>> Error Opening file /usr/local/share/GeoIP/GeoIPv6.dat
>> 0.000  0  3  0  non-token data: bayes db version
>> 0.000  0   3338  0  non-token data: nspam
>> 0.000  0784  0  non-token data: nham
>>
>> any ideas?
>>
>
> You need to post samples (to pastebin). We can't make comments on what
> *should* be hitting unless we can see the message itself.
>
>
> --
>  John Hardin KA7OHZhttp://www.impsec.org/~jhardin/
>  jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org
>  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
> ---
>   The third basic rule of firearms safety:
>   Keep your booger hook off the bang switch!
>
> ---
>  2 days until the 238th anniversary of the Declaration of Independence
>


Re: getting tons of SPAM

2014-07-02 Thread John Hardin

On Wed, 2 Jul 2014, motty cruz wrote:


Bayes filter is not running: according to header,

X-Virus-Scanned: amavisd-new at fqdn.com
X-Spam-Flag: NO
X-Spam-Score: -0.009
X-Spam-Level:
X-Spam-Status: No, score=-0.009 tagged_above=-999 required=5.3
   tests=[HTML_MESSAGE=0.001, T_RP_MATCHES_RCVD=-0.01]
   autolearn=unavailable
Received: from

# sa-learn --dump magic
Error Opening file /usr/local/share/GeoIP/GeoIPv6.dat
0.000  0  3  0  non-token data: bayes db version
0.000  0   3338  0  non-token data: nspam
0.000  0784  0  non-token data: nham

any ideas?


You need to post samples (to pastebin). We can't make comments on what 
*should* be hitting unless we can see the message itself.


--
 John Hardin KA7OHZhttp://www.impsec.org/~jhardin/
 jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  The third basic rule of firearms safety:
  Keep your booger hook off the bang switch!
---
 2 days until the 238th anniversary of the Declaration of Independence


Re: getting tons of SPAM

2014-07-02 Thread motty cruz
Bayes filter is not running: according to header,

X-Virus-Scanned: amavisd-new at fqdn.com
X-Spam-Flag: NO
X-Spam-Score: -0.009
X-Spam-Level:
X-Spam-Status: No, score=-0.009 tagged_above=-999 required=5.3
tests=[HTML_MESSAGE=0.001, T_RP_MATCHES_RCVD=-0.01]
autolearn=unavailable
Received: from

# sa-learn --dump magic
Error Opening file /usr/local/share/GeoIP/GeoIPv6.dat
0.000  0  3  0  non-token data: bayes db version
0.000  0   3338  0  non-token data: nspam
0.000  0784  0  non-token data: nham

any ideas?


On Wed, Jul 2, 2014 at 9:05 AM, Steve Bergman  wrote:

>
>  whereis sa-update
>> sa-update: /usr/local/bin/sa-update
>>
>
> Yeah. You're a /usr/*local*/bin guy.
>
> At age 51, I've become a /usr/bin guy. LOL.
>
> :-)
>


Re: More text/plain questions

2014-07-02 Thread Amir Caspi
On Jul 2, 2014, at 12:58 PM, David F. Skoll  wrote:

> I don't think so.  Any MUA that tried to convert "&#x0435;" to a
> Unicode character in a text/plain part with implicit US-ASCII charset
> and 7bit content transfer encoding is broken.  An MUA should display
> exactly "&#x0435;" in this situation.  It's a different story for
> text/html parts, of course.

For what it's worth, I just received a spam that basically is the same as what 
Philip complained about.  I've posted a spample here:

http://pastebin.com/Y2YGwL49

There _is_ a text/html part, and that's what's displaying in my MUA (Apple 
Mail).

Sadly, as can be seen from the spample, the score doesn't quite reach 5.0 ... 
Bayes training could help since it only scored BAYES_50, but I'm wondering if 
this character encoding is designed to sidestep Bayes -- how does Bayes treat 
these for tokens?  If you randomize the characters being replaced (from 
plaintext to encoded), then there are lots of combinations for any given word, 
which then means each combination is a different token, no?  I don't know if 
spammers are taking the "care" to randomize the letter replacement, but if so, 
does this scheme actually "foil" Bayes due to each permutation being considered 
a different token?  If so, is there a way to mitigate that?

I'm wondering if we shouldn't write a rule looking for lots of &#[0-9]{3}; 
patterns... say, 500 of them in one email.  Or, would we expect legitimate 
emails to have these?
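A sketch of what such a rule might look like (the name, score, and repeat threshold here are invented, and the pattern would need tuning against a ham corpus before use):

```
# Hypothetical rule -- name, score, and the {100} threshold are
# illustrative only:
rawbody  MANY_NUM_ENTITIES  /(?:&#x?[0-9A-F]{3,4};[^&]{0,20}){100}/i
describe MANY_NUM_ENTITIES  Body contains a long run of numeric HTML entities
score    MANY_NUM_ENTITIES  1.5
```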

Is there also a rule for a UTF-8-encoded Subject line?  If so, it didn't pop.

--- Amir



Re: More text/plain questions

2014-07-02 Thread John Hardin

On Wed, 2 Jul 2014, Philip Prindeville wrote:


On Jul 2, 2014, at 12:37 PM, John Hardin  wrote:


On Wed, 2 Jul 2014, Philip Prindeville wrote:


Given that it’s text/plain with an implicit charset=“us-ascii” and an implicit 
content-transfer-encoding of 7bit, the sequence &#x[0-9A-F]{4} doesn’t really 
parse into a 16-bit character, would it? That would be a broken MUA that made such 
a leap...


Nope. The content-transfer-encoding is only for the *transfer* part of the 
process. Once the content reaches the MUA that content can be further parsed by 
the MUA according to other encoding rules, such as these escape sequences for 
Unicode characters. That's perfectly valid. How else would you send, for 
example, a c-cedille in spanish text via a 7-bit-clean channel?


This is a trick question, right?

You do that with base64 or quoted-printable, which are the interoperable 
standards.


Apologies, you are right. I was focused on something else this morning and 
dashed off a fast - and wrong - answer.


--
 John Hardin KA7OHZhttp://www.impsec.org/~jhardin/
 jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  WSJ on the Financial Stimulus package: "...today there are 700,000
  fewer jobs than [the administration] predicted we would have if we
  had done nothing at all."
---
 2 days until the 238th anniversary of the Declaration of Independence

Re: More text/plain questions

2014-07-02 Thread Philip Prindeville
Okay, was tinkering with the code below but the zero-width lookahead is not 
disqualifying ampersand followed by #x[0-9A-F]{4}; so the output is bogus (you 
can run this and see what I mean).

What am I doing wrong?



#!/usr/bin/perl -w

use warnings;
use strict;

my $data = <<__EOF__;
Thе Rеаl RеаѕоnThе 
Ꮯоmіng 
Ꮯоllарѕе...Thе 
rеаl rеаѕоn ᎳHY 
HоmеlаndSеcurіtу 
rеcеntlу рurchаѕеd1.7 
Bіllіоn Rоundѕ оf 
аmmunіtіоn...Ꮃhаt Yоu 
Muѕt Dо Tо Ꭼnѕurе 
YоurSаfеtуHоmеlаnd 
ѕеcurіtу іѕ thеrе 
tо ѕеcurеthе hоmеlаnd 
оnlу... Sо thеѕе 
Ьullеtѕаrе rеаlу 
mеаnt fоr thеThіѕ іѕ 
аn 
еmаіlаdvеrtіѕеmеnt
 thаt wаѕ ѕеnt tо уоu 
Ьу Ρаtrіоt Survіvаl 
Ρlаn. If уоuwіѕh tо 
nоlоngеr rеcеіvе 
mеѕѕаgеѕ thаt 
рrоmоtе ѕurvіvаl 
tірѕ, 
рlеаѕеclіck hеrе 
tо unѕuЬѕcrіЬе.4 Unstable as 
water, thou shalt not excel because thou wentest up to thy fathers bed then 
defiledst thou it he went up to my couch.34 And Pharaohnechoh made Eliakim the 
son of Josiah king in the room of Josiah his father, and turned his name to 
Jehoiakim, and took Jehoahaz away and he came to Egypt, and died there.37  And 
the thing was good in the eyes of Pharaoh, and in the eyes o!
f all his servants.
__EOF__

my $chars = 0;
my $uchars = 0;

for (split("\n", $data)) {
    print STDERR "line: ", $_, "\n";

    my @matches = m/[\001-\045\047-\177]|&(?!#x[0-9A-F]{4};)/g;
    print STDERR "matches: ", join(',', @matches), " count ", scalar @matches, "\n";
    $chars += scalar @matches;
    print STDERR "chars: ", $chars, "\n";

    @matches = m/&#x[0-9A-F]{4};/g;
    print STDERR "matches: ", join(',', @matches), " count ", scalar @matches, "\n";
    $uchars += scalar @matches;
    print STDERR "uchars: ", $uchars, "\n";

    print STDERR "\n";
}



Re: More text/plain questions

2014-07-02 Thread Philip Prindeville

On Jul 2, 2014, at 12:37 PM, John Hardin  wrote:

> On Wed, 2 Jul 2014, Philip Prindeville wrote:
> 
>> Given that it’s text/plain with an implicit charset=“us-ascii” and an 
>> implicit content-transfer-encoding of 7bit, the sequence &#x[0-9A-F]{4} 
>> doesn’t really parse into a 16-bit character, would it? That would be a 
>> broken MUA that made such a leap...
> 
> Nope. The content-transfer-encoding is only for the *transfer* part of the 
> process. Once the content reaches the MUA that content can be further parsed 
> by the MUA according to other encoding rules, such as these escape sequences 
> for Unicode characters. That's perfectly valid. How else would you send, for 
> example, a c-cedille in spanish text via a 7-bit-clean channel?

This is a trick question, right?

You do that with base64 or quoted-printable, which are the interoperable 
standards.

You don’t pick some implicit encoding which no one else has agreed upon.


> 
>> Wouldn’t that normally render as the character ‘&’, ‘#’, ‘x’, etc. rather 
>> than the unicode16 or UTF-8 character with that hex value?
> 
> I'd only expect that in a very old MUA (i.e. that does not support Unicode), 
> or display of the raw message content at user request.


How is it supposed to guess what the encoding implicitly means?  We have the 
MIME spec so that all of this is formally specified.


> 
>> I wouldn’t want a message where someone gives a couple of examples of 
>> encoding &#x0400; for instance being flagged as SPAM, but if the text is 20% 
>> or more of these sequences then I would say that’s SPAM-sign.
> 
> That's valid 7-bit encoding for transfer. It's relying on the user's MUA to 
> convert the encoded Unicode values to glyphs for display.

No, 7-bit CTE means it’s 7-bit content. Period.

If you want 8-bit or 16-bit or 32-bit content over a 7-bit CHANNEL, you use a 
7-bit safe encoding like base64 or quoted-printable.

Citing RFC-2045:

6.  Content-Transfer-Encoding Header Field

   Many media types which could be usefully transported via email are
   represented, in their "natural" format, as 8bit character or binary
   data.  Such data cannot be transmitted over some transfer protocols.
   For example, RFC 821 (SMTP) restricts mail messages to 7bit US-ASCII
   data with lines no longer than 1000 characters including any trailing
   CRLF line separator.

   It is necessary, therefore, to define a standard mechanism for
   encoding such data into a 7bit short line format.  Proper labelling
   of unencoded material in less restrictive formats for direct use over
   less restrictive transports is also desireable.  This document
   specifies that such encodings will be indicated by a new "Content-
   Transfer-Encoding" header field.  This field has not been defined by
   any previous standard.

…

6.2.  Content-Transfer-Encodings Semantics

   …

   The quoted-printable and base64 encodings transform their input from
   an arbitrary domain into material in the "7bit" range, thus making it
   safe to carry over restricted transports.  The specific definition of
   the transformations are given below.

   The proper Content-Transfer-Encoding label must always be used.
   Labelling unencoded data containing 8bit characters as "7bit" is not
   allowed, nor is labelling unencoded non-line-oriented data as
   anything other than "binary" allowed.

   …

   NOTE ON THE RELATIONSHIP BETWEEN CONTENT-TYPE AND CONTENT-TRANSFER-
   ENCODING: It may seem that the Content-Transfer-Encoding could be
   inferred from the characteristics of the media that is to be encoded,
   or, at the very least, that certain Content-Transfer-Encodings could
   be mandated for use with specific media types.  There are several
   reasons why this is not the case. First, given the varying types of
   transports used for mail, some encodings may be appropriate for some
   combinations of media types and transports but not for others.  (For
   example, in an 8bit transport, no encoding would be required for text
   in certain character sets, while such encodings are clearly required
   for 7bit SMTP.)

So you can’t infer the content-type from the content-transfer-encoding or 
vice-versa.

And RFC-2046:

4.1.2.  Charset Parameter

   A critical parameter that may be specified in the Content-Type field
   for "text/plain" data is the character set.  This is specified with a
   "charset" parameter, as in:

 Content-type: text/plain; charset=iso-8859-1

   Unlike some other parameter values, the values of the charset
   parameter are NOT case sensitive.  The default character set, which
   must be assumed in the absence of a charset parameter, is US-ASCII.

so you can’t render Unicode or UTF-8 or ISO-8859-X characters, because the 
charset is implicitly US-ASCII and doesn’t have any characters beyond 
0x7F (7 bits).

In short, it’s not Unicode unless it EXPLICITLY SAYS UNICODE.

And see also RFC-2152, which I won’t quote here.

Lastly, RFC-3629:

8.  MIME registration


   This memo ser

Re: More text/plain questions

2014-07-02 Thread David F. Skoll
On Wed, 2 Jul 2014 11:37:33 -0700 (PDT)
John Hardin  wrote:

> Nope. The content-transfer-encoding is only for the *transfer* part
> of the process. Once the content reaches the MUA that content can be
> further parsed by the MUA according to other encoding rules, such as
> these escape sequences for Unicode characters.

I don't think so.  Any MUA that tried to convert "&#x0435;" to a
Unicode character in a text/plain part with implicit US-ASCII charset
and 7bit content transfer encoding is broken.  An MUA should display
exactly "&#x0435;" in this situation.  It's a different story for
text/html parts, of course.

> That's perfectly valid. How else would you send, for example, a
> c-cedille in spanish text via a 7-bit-clean channel?

With the appropriate charset and content-transfer-encoding, such as
ISO-8859-1, quoted-printable, and =E7.
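A minimal sketch of such a part (headers abbreviated, body text invented):

```
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Le gar=E7on re=E7oit un caf=E9.
```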

> I would say that's more a case of those characters shouldn't be
> present if the language is en-us than an encoding issue. The presence
> of lots of those is either a sign that the text isn't English, or is
> obfuscated. How do you reliably tell the language of the message?

I would say the presence of &#xABCD; in a text/plain part is either
a bug in spam-generating software or a researcher trying to send
something to a colleague. :)

Regards,

David.


Pyzor with aliases.

2014-07-02 Thread Steve Bergman
I've been watching today, and have pretty much confirmed that if you use 
Pyzor with spamass-milter and have it run as the recipient user, you do 
need to include a "pyzor --homedir /whateverdir/" setting in local.cf. 
Otherwise you will get mysterious, unlogged crashes from Pyzor, with 
unfindable backtraces, which let the email through to the recipient but 
also result in an embarrassing bounce message back to the sender.


Many here may already know all this. But I wanted to have it all here, 
succinctly, for new folks facing the same issues that I was earlier this 
week.


-Steve Bergman


Re: More text/plain questions

2014-07-02 Thread John Hardin

On Wed, 2 Jul 2014, John Hardin wrote:


On Wed, 2 Jul 2014, Philip Prindeville wrote:


 Given that it’s text/plain with an implicit charset=“us-ascii” and an
 implicit content-transfer-encoding of 7bit, the sequence &#x[0-9A-F]{4}
 doesn’t really parse into a 16-bit character, would it? That would be a
 broken MUA that made such a leap...


Nope. The content-transfer-encoding is only for the *transfer* part of the 
process. Once the content reaches the MUA that content can be further parsed 
by the MUA according to other encoding rules, such as these escape sequences 
for Unicode characters. That's perfectly valid. How else would you send, for 
example, a c-cedille in spanish text via a 7-bit-clean channel?



 Wouldn’t that normally render as the character ‘&’, ‘#’, ‘x’, etc. rather
 than the unicode16 or UTF-8 character with that hex value?


I'd only expect that in a very old MUA (i.e. that does not support Unicode), 
or display of the raw message content at user request.


...that said, I primarily use a text-based MUA, and it did not render 
Unicode glyphs for that sample...


--
 John Hardin KA7OHZhttp://www.impsec.org/~jhardin/
 jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Of the twenty-two civilizations that have appeared in history,
  nineteen of them collapsed when they reached the moral state the
  United States is in now.  -- Arnold Toynbee
---
 2 days until the 238th anniversary of the Declaration of Independence

Re: More text/plain questions

2014-07-02 Thread John Hardin

On Wed, 2 Jul 2014, Philip Prindeville wrote:

Given that it’s text/plain with an implicit charset=“us-ascii” and an 
implicit content-transfer-encoding of 7bit, the sequence &#x[0-9A-F]{4} 
doesn’t really parse into a 16-bit character, would it? That would be a 
broken MUA that made such a leap...


Nope. The content-transfer-encoding is only for the *transfer* part of the 
process. Once the content reaches the MUA that content can be further 
parsed by the MUA according to other encoding rules, such as these escape 
sequences for Unicode characters. That's perfectly valid. How else would 
you send, for example, a c-cedille in spanish text via a 7-bit-clean 
channel?


Wouldn’t that normally render as the character ‘&’, ‘#’, ‘x’, etc. 
rather than the unicode16 or UTF-8 character with that hex value?


I'd only expect that in a very old MUA (i.e. that does not support 
Unicode), or display of the raw message content at user request.


I wouldn’t want a message where someone gives a couple of examples of 
encoding &#x0400; for instance being flagged as SPAM, but if the text is 
20% or more of these sequences then I would say that’s SPAM-sign.


That's valid 7-bit encoding for transfer. It's relying on the user's MUA 
to convert the encoded Unicode values to glyphs for display.


I would say that's more a case of those characters shouldn't be present if 
the language is en-us than an encoding issue. The presence of lots of 
those is either a sign that the text isn't English, or is obfuscated. How 
do you reliably tell the language of the message?


It would probably be a good idea to add those sequences to the replacetags 
letter REs so that the FUZZY rules will catch them.
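A sketch of the idea (the tag letter and rule here are hypothetical; the stock letter tags live in 20_replace.cf):

```
loadplugin Mail::SpamAssassin::Plugin::ReplaceTags

# Hypothetical: let the 'E' letter tag also match the HTML-entity
# form of a Cyrillic 'e', then opt a rule into tag replacement.
replace_tag   E  (?:e|&#x0435;)
body          FUZZY_EXAMPLE  /<E>xample/i
replace_rules FUZZY_EXAMPLE
```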


--
 John Hardin KA7OHZhttp://www.impsec.org/~jhardin/
 jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Of the twenty-two civilizations that have appeared in history,
  nineteen of them collapsed when they reached the moral state the
  United States is in now.  -- Arnold Toynbee
---
 2 days until the 238th anniversary of the Declaration of Independence

More text/plain questions

2014-07-02 Thread Philip Prindeville
I got the following MIME body part below, and I’m wondering if it would make 
sense to filter on this as well.

Given that it’s text/plain with an implicit charset=“us-ascii” and an implicit 
content-transfer-encoding of 7bit, the sequence &#x[0-9A-F]{4} doesn’t really 
parse into a 16-bit character, would it? That would be a broken MUA that made 
such a leap...

Wouldn’t that normally render as the character ‘&’, ‘#’, ‘x’, etc. rather than 
the unicode16 or UTF-8 character with that hex value?

There might be times when someone has sent an attachment improperly encoded 
this way which might have embedded binary values in it, but that’s kind of 
buggy anyway… it should have been done as base64 and application/octet-stream 
in the worst of cases if it has arbitrary binary data.

I wouldn’t want a message where someone gives a couple of examples of encoding 
&#x0400; for instance being flagged as SPAM, but if the text is 20% or more of 
these sequences then I would say that’s SPAM-sign.

Anyway, here’s the body I saw:

--1388-8200-b67c-e579-9c27-df36-12fa-a2eb
Content-Type: text/plain;

Thе Rеаl RеаѕоnThе 
Ꮯоmіng 
Ꮯоllарѕе...Thе 
rеаl rеаѕоn ᎳHY 
HоmеlаndSеcurіtу 
rеcеntlу рurchаѕеd1.7 
Bіllіоn Rоundѕ оf 
аmmunіtіоn...Ꮃhаt Yоu 
Muѕt Dо Tо Ꭼnѕurе 
YоurSаfеtуHоmеlаnd 
ѕеcurіtу іѕ thеrе 
tо ѕеcurеthе hоmеlаnd 
оnlу... Sо thеѕе 
Ьullеtѕаrе rеаlу 
mеаnt fоr thеThіѕ іѕ 
аn 
еmаіlаdvеrtіѕеmеnt
 thаt wаѕ ѕеnt tо уоu 
Ьу Ρаtrіоt Survіvаl 
Ρlаn. If уоuwіѕh tо 
nоlоngеr rеcеіvе 
mеѕѕаgеѕ thаt 
рrоmоtе ѕurvіvаl 
tірѕ, 
рlеаѕеclіck hеrе 
tо unѕuЬѕcrіЬе.4 Unstable as 
water, thou shalt not excel because thou wentest up to thy fathers bed then 
defiledst thou it he went up to my couch.34 And Pharaohnechoh made Eliakim the 
son of Josiah king in the room of Josiah his father, and turned his name to 
Jehoiakim, and took Jehoahaz away and he came to Egypt, and died there.37  And 
the thing was good in the eyes of Pharaoh, and in the eyes o!
f all his servants.

--1388-8200-b67c-e579-9c27-df36-12fa-a2eb

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman



On 07/02/2014 11:12 AM, John Hardin wrote:


A week or so back they briefly listed some of the MailControl.com MTAs,
due to apparent exploits. They were quickly removed, though.


So the message here is that some DNSBLs are better than others about 
including and removing addresses quickly and responsibly. Perhaps. I 
take no position on that.


But that does not address the issue of collateral damage to users which 
share an ISP's email server with someone else who happened to get a spam 
through and reported back to the DNSBL.


Not long ago, I had another client blocked from sending response emails 
to their on-line customers about their purchases. Turned out one of the 
users on the hosting provider's system had sent some spam. Now the 
hosting provider (Webfaction) is quite responsible, very diligent, and 
has *fantastic* support. (I can recommend them for dynamic language 
language apps with no reservations.) But guess what? The DNSBL's 
interface for interacting with them was down. For over a week. (We're 
sorry, but... Please come back when... No guaranty as to...) And emails 
to the affected customers were blocked for all that time.


I use DNSBL's. But I don't like them. SA is indispensable. I like it. 
But it's a huge compilation of kluges that happen to mostly work.


Expedient. Pragmatic. Not a real solution to the actual problem.

-Steve



Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman



On 07/02/2014 11:10 AM, Jim Popovitch wrote:


Just a heads-up... that sort of biting comment is probably not welcome


I'm familiar with adapting to the relative insularities of various 
lists. But thanks for the heads-up, Jim.


-Steve


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread John Hardin

On Wed, 2 Jul 2014, Axb wrote:

If a sender's IP is listed @Spamhaus , he has a serious problem reaching 
many, many destinations. If he's been exploited, you get good evidence and 
fast delisting processing and I have yet to see a real FP with ZEN.


A week or so back they briefly listed some of the MailControl.com MTAs, 
due to apparent exploits. They were quickly removed, though.


--
 John Hardin KA7OHZhttp://www.impsec.org/~jhardin/
 jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  There is no better measure of the unthinking contempt of the
  environmentalist movement for civilization than their call to
  turn off the lights and sit in the dark.-- Sultan Knish
---
 2 days until the 238th anniversary of the Declaration of Independence


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Jim Popovitch
On Wed, Jul 2, 2014 at 11:54 AM, Steve Bergman  wrote:
>> I suggest you join the SDLU list where you can discuss anti spam
>> philosophy.
>>
>
> Thanks. I suggest that you consult for an ISP-dependent business someday.
> ;-)
>
> It's an education, too.
>
> -Steve


Just a heads-up... that sort of biting comment is probably not welcome
on the SDLU list.

-Jim P.


Re: getting tons of SPAM

2014-07-02 Thread Steve Bergman



whereis sa-update
sa-update: /usr/local/bin/sa-update


Yeah. You're a /usr/*local*/bin guy.

At age 51, I've become a /usr/bin guy. LOL.

:-)


Re: getting tons of SPAM

2014-07-02 Thread Axb

On 07/02/2014 05:32 PM, Steve Bergman wrote:



On 07/02/2014 10:10 AM, Axb wrote:



writing rules for the stuff SA tends to miss seems like a good place to
start off.


Well, there's a full time job, eh? Hope it pays well, because it's
tedious, eternal, and thankless.


It pays quite well. Tuning SA isn't trivial and it does take time to 
keep your boss/clients happy.


Comes to a point where you either enjoy the job or outsource it.


Spam is always changing. Seems like it
might be better for a central organization to provide such an evolving
rule set so that everybody doesn't have to go through the exercise in
parallel every morning. We could make the rules available to everyone,
via a tool... I don't know... I suppose we could call it sa-update or
something. And release them regularly.


 Hey.. lets make a sa-update

oh wait..

whereis sa-update
sa-update: /usr/local/bin/sa-update





Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman

I suggest you join the SDLU list where you can discuss anti spam
philosophy.



Thanks. I suggest that you consult for an ISP-dependent business 
someday. ;-)


It's an education, too.

-Steve


Re: getting tons of SPAM

2014-07-02 Thread Steve Bergman

> There used to be a nightly (?) set of rules that were designed just for
> current spam, or does my memory serve me false? The name escapes me but
> it ceased some time back.

Are you thinking of the "sought" rule-set, which was generated and 
updated every 4 hours from SA spamtraps?


It's still marked as "active" on the wiki. But I'm not sure it is really 
still actively updated. And when I was using it, it got few hits that 
other SA rules did not catch. My perception is that its rules have been 
largely integrated into sa-update's stock rules. Not sure.


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb

On 07/02/2014 05:39 PM, Steve Bergman wrote:

On 07/02/2014 09:48 AM, Axb wrote:


If an IP is exploited/sends spam and a legitimate msg is rejected then
somebody hasn't done due diligence and I see the reject as legitimate.



The legitimate senders and receivers of the good message, neither of
whose companies have anything to do with the spam, would not see it
that way. And I agree with their perspective. Some of the perspectives
I'm reading here seem really off in the ether. I get the impression that
some are so frustrated with SA's limitations that they are willing to
resort to desperate measures which normal users would instantly
recognize as insane.

No rudeness intended. But some of the things I'm reading here are just
bizarre.


I suggest you join the SDLU list where you can discuss anti spam 
philosophy.


It's a great resource for knowledge.

List Guidelines: http://www.new-spam-l.com/admin/faq.html
List Information: https://spammers.dontlike.us/mailman/listinfo/list

The Mailop list is also a good place to lurk and bathe in hundreds of 
years of mail related experience


http://chilli.nosignal.org/mailman/listinfo/mailop





Re: getting tons of SPAM

2014-07-02 Thread jpff
There used to be a nightly (?) set of rules that were designed just for 
current spam, or does my memory serve me false? The name escapes me but it 
ceased some time back.

==John ff
  who finds spamhaus+clamAV+spamassassin_with_Bayes works well enough


On Wed, 2 Jul 2014, Steve Bergman wrote:




On 07/02/2014 10:10 AM, Axb wrote:



writing rules for the stuff SA tends to miss seems like a good place to
start off.


Well, there's a full time job, eh? Hope it pays well, because it's tedious, 
eternal, and thankless. Spam is always changing. Seems like it might be 
better for a central organization to provide such an evolving rule set so 
that everybody doesn't have to go through the exercise in parallel every 
morning. We could make the rules available to everyone, via a tool... I don't 
know... I suppose we could call it sa-update or something. And release them 
regularly.


I don't see individual people writing custom rules for themselves as being 
the answer.


Help us Obi-Wan Thomas Bayes! You're our only hope!

-Steve



Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman

On 07/02/2014 09:48 AM, Axb wrote:


If an IP is exploited/sends spam and a legitimate msg is rejected then
somebody hasn't done due diligence and I see the reject as legitimate.



The legitimate senders and receivers of the good message, neither of 
whose companies have anything to do with the spam, would not see it 
that way. And I agree with their perspective. Some of the perspectives 
I'm reading here seem really off in the ether. I get the impression that 
some are so frustrated with SA's limitations that they are willing to 
resort to desperate measures which normal users would instantly 
recognize as insane.


No rudeness intended. But some of the things I'm reading here are just 
bizarre.


-Steve


Re: getting tons of SPAM

2014-07-02 Thread Steve Bergman



On 07/02/2014 10:10 AM, Axb wrote:



writing rules for the stuff SA tends to miss seems like a good place to
start off.


Well, there's a full time job, eh? Hope it pays well, because it's 
tedious, eternal, and thankless. Spam is always changing. Seems like it 
might be better for a central organization to provide such an evolving 
rule set so that everybody doesn't have to go through the exercise in 
parallel every morning. We could make the rules available to everyone, 
via a tool... I don't know... I suppose we could call it sa-update or 
something. And release them regularly.


I don't see individual people writing custom rules for themselves as 
being the answer.


Help us Obi-Wan Thomas Bayes! You're our only hope!

-Steve


Re: getting tons of SPAM

2014-07-02 Thread Matus UHLAR - fantomas

On 02.07.14 07:52, motty cruz wrote:

I am using the following RBLs :

reject_rbl_client b.barracudacentral.org,
reject_rbl_client zen.spamhaus.org,
reject_rbl_client bl.spamcop.net,
reject_rbl_client all.spamrats.com

any other suggestions? spam still flowing:



any suggestions?


paste full headers on pastebin. OR, did you already?


--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Windows found: (R)emove, (E)rase, (D)elete


Re: getting tons of SPAM

2014-07-02 Thread Axb

On 07/02/2014 04:52 PM, motty cruz wrote:


 header.from=sentineli...@e.safenet-inc.com header.d=
e.safenet-inc.com

very low score for spammy email.

any suggestions?


writing rules for the stuff SA tends to miss seems like a good place to 
start off.


and if you come asking for help with a specific type, then use pastebin 
for samples.


Only showing a SA report won't activate crystal balls.





Re: getting tons of SPAM

2014-07-02 Thread Steve Bergman



On 07/02/2014 09:52 AM, motty cruz wrote:

I am using the following RBLs :

  reject_rbl_client b.barracudacentral.org
,
  reject_rbl_client zen.spamhaus.org ,
  reject_rbl_client bl.spamcop.net ,
  reject_rbl_client all.spamrats.com 

any other suggestions?


I think your best hope is the Bayesian filter. How is it being trained?

I have my own ideas on this, which I am possibly reconsidering. But you 
can view the recent recommendations I have gotten on this list. My vote 
still goes to having users do ongoing manual training with individual 
bayes_* databases. There are a couple of people making pitches for 
autolearn and system-wide databases. I don't think more DNSBL's are the 
solution. As I mentioned in a previous post, you can't snipe spam with a 
bazooka.


-Steve


Re: getting tons of SPAM

2014-07-02 Thread motty cruz
I am using the following RBLs :

 reject_rbl_client b.barracudacentral.org,
 reject_rbl_client zen.spamhaus.org,
 reject_rbl_client bl.spamcop.net,
 reject_rbl_client all.spamrats.com

any other suggestions? spam still flowing:

X-Virus-Scanned: amavisd-new at fqdn.com
X-Spam-Flag: NO
X-Spam-Score: -0.129
X-Spam-Level:
X-Spam-Status: No, score=-0.129 tagged_above=-999 required=5.3
tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,
HEADER_FROM_DIFFERENT_DOMAINS=0.001, HTML_IMAGE_RATIO_08=0.001,
HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001,
RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01,
SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, T_RP_MATCHES_RCVD=-0.01]
autolearn=unavailable autolearn_force=no
Authentication-Results: rico.fqdn.com (amavisd-new);
dkim=pass (1024-bit key) header.d=e.safenet-inc.com;
domainkeys=pass (1024-bit key)
header.from=sentineli...@e.safenet-inc.com header.d=
e.safenet-inc.com

very low score for spammy email.

any suggestions?


On Wed, Jul 2, 2014 at 7:49 AM, motty cruz  wrote:

> I am using the following RBLs:
>
>
>
> On Tue, Jul 1, 2014 at 10:08 PM, Steve Bergman 
> wrote:
>
>> On 07/01/2014 11:15 PM, Daniel Staal wrote:
>>
>>  You probably can.  ;)  But I'm sure Windstream didn't get you every
>>> piece of mail immediately after it was sent - just as soon as they could
>>> after they got it.
>>>
>>
>> Yeah. I'm conservatively holding myself to higher standards than is
>> perhaps warranted. But I think that those standards are along the lines of
>> what my long-time customer thought they were getting from Windstream. And
>> if Windstream had too many issues, I think I would have heard about it.
>>
>> And their servers *did* become unavailable for short periods from time to
>> time.
>>
>> But once I'm satisfied that I've reached parity, the real fun starts. We
>> were on POP3. Now we're on our own IMAP. And there is Dovecot full text
>> search in our near future. It will be fun to be able to go beyond and show
>> off a little. My client company's CEO does a lot of full text searching
>> over his email history.
>>
>>
>>   I'm not even saying I like greylisting - I'm just
>>
>>> saying you should work to set user expectations to reality,
>>>
>>
>> When trust died on the Internet, telnet died, but somehow the
>> unbelievably naive email system did not. It was never prepared for spammer
>> abuse. And we're still accommodating to 7 bit systems for crying out loud.
>> If it were material I suppose it would make a fine antique in someone's
>> collection. Right along side the PDP-11.
>>
>>
>>  which is
>>
>>> that email sometimes takes time to get delivered and (rarely) gets
>>> lost.  If something is absolutely time-critical, they should treat email
>>> as a backup,
>>>
>>
>> I think that it's largely a matter of *people's* expectations and
>> understanding. If a mail gets missed, folks can understand an occasional "I
>> never got your email, we'll send someone over right away".
>>
>> What I object to is the idea of regular and unpredictable delays as
>> introduced by greylisting. And it's just plain ugly from an aesthetic
>> standpoint. But then so are our current email protocols. But I do think
>> that can be fixed.
>>
>> Never did like texting. And that's the alternative.
>>
>> -Steve
>>
>>
>


Re: getting tons of SPAM

2014-07-02 Thread motty cruz
I am using the following RBLs:



On Tue, Jul 1, 2014 at 10:08 PM, Steve Bergman  wrote:

> On 07/01/2014 11:15 PM, Daniel Staal wrote:
>
>  You probably can.  ;)  But I'm sure Windstream didn't get you every
>> piece of mail immediately after it was sent - just as soon as they could
>> after they got it.
>>
>
> Yeah. I'm conservatively holding myself to higher standards than is
> perhaps warranted. But I think that those standards are along the lines of
> what my long-time customer thought they were getting from Windstream. And
> if Windstream had too many issues, I think I would have heard about it.
>
> And their servers *did* become unavailable for short periods from time to
> time.
>
> But once I'm satisfied that I've reached parity, the real fun starts. We
> were on POP3. Now we're on our own IMAP. And there is Dovecot full text
> search in our near future. It will be fun to be able to go beyond and show
> off a little. My client company's CEO does a lot of full text searching
> over his email history.
>
>
>   I'm not even saying I like greylisting - I'm just
>
>> saying you should work to set user expectations to reality,
>>
>
> When trust died on the Internet, telnet died, but somehow the unbelievably
> naive email system did not. It was never prepared for spammer abuse. And
> we're still accommodating to 7 bit systems for crying out loud. If it were
> material I suppose it would make a fine antique in someone's collection.
> Right along side the PDP-11.
>
>
>  which is
>
>> that email sometimes takes time to get delivered and (rarely) gets
>> lost.  If something is absolutely time-critical, they should treat email
>> as a backup,
>>
>
> I think that it's largely a matter of *people's* expectations and
> understanding. If a mail gets missed, folks can understand an occasional "I
> never got your email, we'll send someone over right away".
>
> What I object to is the idea of regular and unpredictable delays as
> introduced by greylisting. And it's just plain ugly from an aesthetic
> standpoint. But then so are our current email protocols. But I do think
> that can be fixed.
>
> Never did like texting. And that's the alternative.
>
> -Steve
>
>


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb

On 07/02/2014 04:40 PM, Steve Bergman wrote:


You are discussing DNSBLs but not being specific.



I'm specific in that all the DNSBL's blacklist IP addresses or blocks.
And that in today's world many, many companies share sets of mail
servers with many other companies and individuals.


If an IP is exploited/sends spam and a legitimate msg is rejected then 
somebody hasn't done due diligence and I see the reject as legitimate.


If I need to open up, I have options such as DNSWL, etc.








Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman




You are discussing DNSBLs but not being specific.



I'm specific in that all the DNSBL's blacklist IP addresses or blocks. 
And that in today's world many, many companies share sets of mail 
servers with many other companies and individuals.



I'll let others sell you this Hoover.


No sale necessary. I continue to recognize the overall expediency of the 
DNSBL kluge, and continue to use it myself.


I wouldn't buy a Hoover anyway. I'm a Kirby kind of guy. I have a 1969 
Dual Sanitronic 80 that my grandmother gave our family new, as a 
Christmas gift.


https://c1.staticflickr.com/7/6071/6056367963_f06f08c7f6_z.jpg

A 1976 Classic III that I picked up at a garage sale.

http://cdn3.volusion.com/maxg3.xen6j/v/vspfiles/photos/KirbyClassicIII-4.jpg?1329982229

And a really cool model 516, manufactured in 1956 that someone had set 
out on the curb for garbage pickup, which I rescued and restored.


http://www.1377731.com/kirby/516_5.jpg

All stock photos. Not mine.


-Steve


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb

On 07/02/2014 03:54 PM, Steve Bergman wrote:



On 07/02/2014 06:45 AM, Axb wrote:


I'm pretty sure, a huge amount of SA users trust Spamhaus' ZEN at smtp
level for outright rejects.


At this point, I'm using the defaults, other than upping BAYES_999
enough to total 5.0 when added to BAYES_99.



If a sender's IP is listed @Spamhaus , he has a serious problem reaching
many, many destinations.


Many, many destinations? Or a high percentage of destinations? I
recently had to explain to the owner of the company why an important
email from one of his business associates at another company was
blocked. I told him that they were on a couple of spam block lists
(which they were) and that contributed to the mail's rejection.

I made the same pitch. "This should affect their outgoing mail to many
sites, etc.". But I'm not sure I believe it. When I interact with people
who've had their emails rejected (often related to DNSBLs) I've been
listening for any mention of other mails of theirs to other companies
being blocked. But when the DNSBL rules in SA are the major contributors
to the rejecting, it seems that we are the only domain they interact
with which is doing so. Entries in the DNSBL databases do great
collateral damage.

And of course none of these companies are spammers. They're with this or
that ISP who has, at one time, had someone exploit their servers to send
spam.

DNSBL's are like a guy with a bazooka trying to play sniper.



You are discussing DNSBLs but not being specific.

With millions of sessions/day I'm glad Spamhaus keeps my servers from 
melting.


I'll let others sell you this Hoover.







Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman



On 07/02/2014 06:45 AM, Axb wrote:


I'm pretty sure, a huge amount of SA users trust Spamhaus' ZEN at smtp
level for outright rejects.


At this point, I'm using the defaults, other than upping BAYES_999 
enough to total 5.0 when added to BAYES_99.
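For reference, that kind of adjustment goes in local.cf. A sketch, assuming the stock BAYES_99 score of 3.5 (verify the score your installed ruleset actually ships before copying the number):

```
# local.cf -- lift BAYES_999 so that BAYES_99 + BAYES_999 alone reach
# the default 5.0 required score. The 1.5 assumes BAYES_99 scores 3.5;
# adjust to match your ruleset.
score BAYES_999 1.5
```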




If a sender's IP is listed @Spamhaus , he has a serious problem reaching
many, many destinations.


Many, many destinations? Or a high percentage of destinations? I 
recently had to explain to the owner of the company why an important 
email from one of his business associates at another company was 
blocked. I told him that they were on a couple of spam block lists 
(which they were) and that contributed to the mail's rejection.


I made the same pitch. "This should affect their outgoing mail to many 
sites, etc.". But I'm not sure I believe it. When I interact with people 
who've had their emails rejected (often related to DNSBLs) I've been 
listening for any mention of other mails of theirs to other companies 
being blocked. But when the DNSBL rules in SA are the major contributors 
to the rejecting, it seems that we are the only domain they interact 
with which is doing so. Entries in the DNSBL databases do great 
collateral damage.


And of course none of these companies are spammers. They're with this or 
that ISP who has, at one time, had someone exploit their servers to send 
spam.


DNSBL's are like a guy with a bazooka trying to play sniper.

-Steve


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb

On 07/02/2014 10:47 AM, Steve Bergman wrote:


The DNSBL's are problematic because so many ISP's mail servers are on
them. We get quite a few emails from employees at companies whose ISPs
are on Spamhaus lists, or whatever, due to nothing that has anything to
do with them.


I'm pretty sure, a huge amount of SA users trust Spamhaus' ZEN at smtp 
level for outright rejects.


If a sender's IP is listed @Spamhaus , he has a serious problem reaching 
many, many destinations. If he's been exploited, you get good evidence 
and fast delisting processing and I have yet to see a real FP with ZEN.


Consider it being better a sender gets a hard reject than having msgs 
land in some spam folder and remain unseen.


but then...


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb

On 07/02/2014 10:47 AM, Steve Bergman wrote:


But for all the discussion today, we never really had a good talk about
postscreen, which is something I'd like to hear someone expound a bit upon.


Probably the wrong list... review the Postfix list archives.


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb

On 07/02/2014 10:47 AM, Steve Bergman wrote:


I'll add you to the list of people telling me that jumping out of an
airplane at 20,000 feet with nothing but a parachute and a pair of
underwear is fun.


Yep... it is...
though you could catch a cold...


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman



On 07/02/2014 03:05 AM, Dave Funk wrote:


Unless you've explicitly disabled them, the network based rules (razor,
pyzor, dcc, DNS based rules, RBLs, URIBLs, etc) constitute an external
'reputation' system to pass judgment on messages.


Actually, DCC is not included in the default due to arbitrary 
restrictions on request volume for the public servers. 100,000 per day 
or something. And neither is Pyzor, presumably for similar reasons? 
Razor2 is in by default.


I use all these, but have reservations about them. DCC Pyzor and Razor2 
are lists of bulk email. Not specifically of *unsolicited* bulk email. 
Many of my users are on lists of various sorts.


The DNSBL's are problematic because so many ISP's mail servers are on 
them. We get quite a few emails from employees at companies whose ISPs 
are on Spamhaus lists, or whatever, due to nothing that has anything to 
do with them.




It's not uncommon to take a low-scoring spam and find that it gets a
higher score on retest as it has been added to various bad-boy lists.


Except that the "bad-boy" lists flag more ham than spam.



This is also one way that gray-listing helps.


Review the thread. You don't want to talk to me about greylisting. ;-)

But for all the discussion today, we never really had a good talk about 
postscreen, which is something I'd like to hear someone expound a bit upon.




I've used site-wide Bayes with auto-learning at a site with ~3000 users
and have had to flush & restart our Bayes database twice in 10 years.



I'll add you to the list of people telling me that jumping out of an 
airplane at 20,000 feet with nothing but a parachute and a pair of 
underwear is fun.


-Steve


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman



On 07/02/2014 02:39 AM, Dave Funk wrote:


Steve,
For some reason you seem to be hung-up on Bayes "autolearning".


Skip down the thread. I was demonstrated to be wrong. :-)



Is it possible that you're confusing it with "Auto-White listing"? (which is now
deprecated and has -nothing- to do with Bayes).


No. I know the difference. AWL, planned to be replaced with TxRep and 
all that. (I'd mention that TxRep has problems, but it's too late at 
night for me to engage in yet another argument.)




SA's Bayesian scorer is a system based upon a method that parses a
message, extracts 'tokens' from it and uses an algorithm to calculate a
score for the message based upon a dictionary of previously seen tokens
and their relative merit.


Yeah. Bayesian statistics is pretty cool.


or via an automated process from within SA as it scores messages
(known as 'auto' learning). So regardless of whether manual or auto
learning is utilized, tokens are added to the dictionary.


See, that's where things stop making sense to me. I would not expect the 
Bayesian filter to do any better than its training. And if its 
training is via input from static rules (plus DNSBL's and DCC's) I would 
not expect it to be able to do any better. And it's not hard to imagine 
pathological behavior developing. But people are telling me different. 
And I'm open to considering alternative possibilities.



It's also
possible to employ both auto & manual learning methods in the same
installation.


That would be the scenario I am considering.


There can be one dictionary used for scoring all messages processed (called
"site wide Bayes") or many separate dictionaries, one used for each
recognized user ("per user Bayes"). Either way, the dictionary(s) need to
be updated (and the update process could be either manual, auto, or both).


Yes. I've been devoted to individual fileDB's, each individually trained 
for a particular user's spam^Wemail stream. People are telling me that 
system-wide databases work well.



It's been this way for the past 10+ years AFAIK (well, maybe 10 years
ago it didn't have as many options for back-end database storage, mostly
limited to Berkeley-DB type methods).



I think it was around 2003, in SA 2.5(?) that SA got a Bayesian 
classifier. IIRC, there was a project called dspam (which I think is 
still around). For a while the dspam guys were pushing the idea that 
*dspam* was a modern spam filter, and SA was old, clunky, and too 
outdated to use.


Anyway, in the very early versions of SA Bayes, everything was 
system-wide. Later they added the option to use individual user files. 
And the only info I've seen that described autolearn and how it worked 
was a mailing list post from 2004 which specifically stated that it was 
system-wide, in memory, and was lost upon restart. Maybe that's correct 
and maybe it's not.


But today, it looks to be user-specific, if configured that way. I'm 
still working out whether I want to use it, and if so, how.


-Steve


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Dave Funk

On Wed, 2 Jul 2014, Steve Bergman wrote:

Well... I just turned on autolearn for a moment, deleted the bayes_* files on 
the test account I use, and sent myself a message from my usual outside 
account. And new bayes_* files were created. So I was wrong, and I win. More 
options.


So now I can proceed to the "what does this mean?" phase.

If I leave things as they are, then training is perfect if the users are 
diligent. But if they are not, then... what? I see plenty of spams getting 
through with a 0.0 score. IIRC, the autolearn spam threshold is 7? Pretty 
much everything there is spam.


But I'm not sure I quite buy having the static rules of SA training Bayes. 
Isn't Bayes just learning to emulate the static rules, with all their 
imperfections?


Unless you've explicitly disabled them, the network based rules (razor,
pyzor, dcc, DNS based rules, RBLs, URIBLs, etc) constitute an external
'reputation' system to pass judgment on messages.
It's not uncommon to take a low-scoring spam and find that it gets a
higher score on retest as it has been added to various bad-boy lists.

This is also one way that gray-listing helps. If you stiff-arm the first
pass of a spam run a later check may hit it more accurately as it's been
added to block-lists in the mean-time.


If it starts going wrong, doesn't that mean the errors are going to spiral 
out of control?


That is a possible risk of relying solely on auto-learning.
The autolearn system has been carefully crafted and tuned over the years
to try to prevent a feed-back loop from throwing it into a tail-spin.
For example the internal scoring system used to determine if a message
is spam or ham WRT the choice for auto-learning explicitly excludes
the Bayes score (and other particular kinds of scores such as white/black
lists) to try to prevent tail-eating.
Occasional judicious manual learning can help to 'tweak' things when Bayes
looks like it's not in top shape. (IE manual learning of FPs & FNs).
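The knobs involved live in local.cf. A sketch using the documented option names (the threshold values shown are the 3.x documented defaults; tune them to taste):

```
# local.cf -- auto-learning controls. Messages scoring below the nonspam
# threshold are learned as ham, those above the spam threshold as spam;
# everything in between is left alone.
bayes_auto_learn 1
bayes_auto_learn_threshold_nonspam 0.1
bayes_auto_learn_threshold_spam    12.0
```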

I've used site-wide Bayes with auto-learning at a site with ~3000 users
and have had to flush & restart our Bayes database twice in 10 years.

Dave

--
Dave Funk  University of Iowa
College of Engineering
319/335-5751   FAX: 319/384-0549   1256 Seamans Center
Sys_admin/Postmaster/cell_adminIowa City, IA 52242-1527
#include 
Better is not better, 'standard' is better. B{


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman



On 07/02/2014 02:14 AM, Axb wrote:


You don't need to trust me or believe me (I'm not selling anything -
just commenting on what works for me)


Well, I know you know what I meant.


Ever thought of running a newer distro in a VM, only for SA and let
spamass-milter use that?
That would mean you can play with SA 3.4 without having to redo all your
mail infra?



I'm pushing to do our Ubuntu 14.04 upgrade soon to get the Dovecot full 
text search. And then a memory upgrade. And these days I just max them 
out on memory. 4GB -> 32GB. Plus adding a 4TB RAID1.


So it ought to be able to handle almost anything. And I've just 
confirmed that SA 3.4 made it into 14.04.


That should, at least, avert all those annoying "time to upgrade" 
responses like I got here earlier.


It's very late here. 2:45AM, I see. But it's been a lot of fun arguing 
with you guys today. And thanks for all the help. Pyzor seems to be 
functioning fine now.


General rules of thumb to keep in mind:

Whenever there are inexplicable problems, it's probably SELinux causing 
them. And if not that, regular old POSIX permissions.


And if ever there is an article of clothing you need but can't find 
anywhere in the house, there's usually a dog sleeping on it. Or possibly 
a cat.


-Steve


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Dave Funk

On Wed, 2 Jul 2014, Steve Bergman wrote:


On 07/01/2014 11:49 PM, Karsten Bräckelmann wrote:


Those do not tell you about using file or SQL based databases?


They do. But not specifically with respect to autolearn.

You never

thought about googling for "spamassassin per user" and friends? You
never checked the SA wiki?


I have, indeed. No reference to autolearn and persistent storage. The lack of 
mention is notable.


I'd expect people to be lining up to tell me I'm mistaken if I absolutely 
were.


Can you point me to a change log somewhere documenting autolearn moving from 
in-memory and system-wide to per user and persistent?


I don't hold a strong opinion on this. It would be nice if I were wrong. It 
would open more options.


I'm just waiting for evidence that it's the case. My perception is that It's 
not.


-Steve


Steve,
For some reason you seem to be hung up on Bayes "autolearning". Is it
possible that you're confusing it with "Auto-Whitelisting"? (which is now
deprecated and has -nothing- to do with Bayes).

SA's Bayesian scorer parses a message, extracts 'tokens' from it, and
uses an algorithm to calculate a score for the message based upon a
dictionary of previously seen tokens and their relative merit.

The dictionary is created and updated by a process called 'learning'
wherein already-classified messages are tokenized and their tokens are
stored in the dictionary along with a merit value derived from their
instance count and a factor taken from being classified as spam or ham.
This learning process can be either externally driven (known as 'manual'
learning) or via an automated process from within SA as it scores messages
(known as 'auto' learning). So regardless of whether manual or auto
learning is utilized, tokens are added to the dictionary. It's also
possible to employ both auto & manual learning methods in the same
installation.
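The learn-then-score cycle described above can be sketched in a few lines of Python. This is a toy illustration only, not SpamAssassin's actual implementation (SA's real Bayes uses chi-squared probability combining, token expiry, and much smarter tokenization, all of which this omits):

```python
from collections import defaultdict
import math

class ToyBayes:
    """Toy token dictionary: learn from classified messages, score new ones."""

    def __init__(self):
        self.spam = defaultdict(int)   # token -> count seen in spam
        self.ham = defaultdict(int)    # token -> count seen in ham

    def tokenize(self, text):
        # SA's real tokenizer is far more sophisticated than this.
        return text.lower().split()

    def learn(self, text, is_spam):
        # 'Manual' (sa-learn) and 'auto' learning both end up doing this:
        # tokens are counted per class in the dictionary.
        for tok in self.tokenize(text):
            (self.spam if is_spam else self.ham)[tok] += 1

    def score(self, text):
        # Combine per-token spamminess as naive-Bayes log-odds,
        # with +1 smoothing for never-seen tokens.
        logodds = 0.0
        for tok in self.tokenize(text):
            logodds += math.log((self.spam[tok] + 1) / (self.ham[tok] + 1))
        return 1 / (1 + math.exp(-logodds))  # near 1.0 means spammy

b = ToyBayes()
b.learn("cheap pills buy now", is_spam=True)
b.learn("meeting agenda for monday", is_spam=False)
print(b.score("buy cheap pills"))   # well above 0.5
print(b.score("monday meeting"))    # well below 0.5
```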

There can be one dictionary used for scoring all messages processed (called
"site wide Bayes") or many separate dictionaries, one used for each
recognized user ("per user Bayes"). Either way, the dictionary(s) need to
be updated (and the update process could be either manual, auto, or both).

The Bayes dictionary(s) need to be stored somehow; the usual method is
via some kind of database. It could be a simple file-based DB, some kind
of fancy SQL-server-based system, or something else. This is a DBA'ish kind
of choice as to what particular technology is used to store the
dictionary DB (usually on disk in some way; it could be in some kind of
memory-resident set of tables, or something else???).
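The site-wide vs. per-user and backend choices described above map onto a handful of local.cf options. A sketch (the paths and DSN here are placeholder examples, not recommendations; check the Mail::SpamAssassin::Conf documentation for your version):

```
# local.cf sketch -- values are examples only
use_bayes        1
bayes_auto_learn 1     # 'auto' learning; manual sa-learn still works too

# Site-wide Bayes with the default file-based store:
bayes_path /var/lib/spamassassin/bayes

# ...or an SQL backend instead (supports per-user dictionaries):
# bayes_store_module Mail::SpamAssassin::BayesStore::SQL
# bayes_sql_dsn      DBI:mysql:spamassassin:localhost
```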

So you have a multi-dimensional matrix WRT your Bayes system
configuration, and manual VS auto learning is just one factor.

It's been this way for the past 10+ years AFAIK (well, maybe 10 years
ago it didn't have as many options for back-end database storage, mostly
limited to Berkeley-DB type methods).

I hope this helps you.


--
Dave Funk  University of Iowa
College of Engineering
319/335-5751   FAX: 319/384-0549   1256 Seamans Center
Sys_admin/Postmaster/cell_adminIowa City, IA 52242-1527
#include 
Better is not better, 'standard' is better. B{

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman



On 07/02/2014 02:02 AM, Axb wrote:


and don't count on that - they may do it the first week, new toy,
but for how long?


Not new. They'd previously been training SA with Evolution for some 
years. I have some confidence in many of them doing it right.




Also: keep in mind each user's Bayes folder also gets a bayes_seen file
which grows and grows and grows and never gets truncated.


Well, I have the maximum number of Bayes tokens set at 2,000,000. Is 
bayes_seen likely to become a problem with ~100 users and 4TB of disk space?


My highest-volume email user has accumulated only 320k of "seen" in 10 
days. And I assume that repeat spams don't add to it.




Do you really want to spend time watching each user's Bayes?


Not really. But I'll do whatever is necessary.

-Steve



Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb

On 07/02/2014 09:01 AM, Steve Bergman wrote:

Axb,

I'm not sure I quite believe it. And I'm not quite sure I trust you. But
you do make an attractive pitch. Excellent spam filtering, system-wide,
with no responsibility for training on the part of the users?


You don't need to trust me or believe me (I'm not selling anything, 
just commenting on what works for me).


You can try it and, after a couple of weeks, see if it works for you; 
then, if necessary, come up with new methods for extra training, or dump 
the concept entirely.


Bayes is yet another scoring mechanism in SA. If you have enough 
traffic, you can wipe the data any time and it's not like you're 
switching SA off totally.


During the dev/test process of the Redis backend, as stuff changed on a 
daily basis I was forced to purge the Bayes data several times/week.

It even became a running joke (wave Henrik/Marc).


This sounds like the kind of "too good to be true" message that I'd
expect to receive in a spam mail.


:-)



But hmm. This is good dream material for tonight. I wonder if our Ubuntu
14.04 upgrade has SA 3.4 with Redis built in. I do hear that the Redis
backend is amazing.


Ever thought of running a newer distro in a VM, only for SA and let 
spamass-milter use that?
That would mean you can play with SA 3.4 without having to redo all your 
mail infra?




Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb

On 07/02/2014 08:48 AM, Steve Bergman wrote:

Someone, please convince me that I should turn it on.


autolearn doesn't mean you cannot also train manually...


Should I turn it on and take my "train as ham" entry out of .forward? Or
should I not?


manually training ham from unreviewed data?
bad idea.


I suppose that largely depends upon my individual users' levels of
diligence.


and don't count on that - they may do it the first week, new toy, 
but for how long?


Also: keep in mind each user's Bayes folder also gets a bayes_seen file 
which grows and grows and grows and never gets truncated.


Do you really want to spend time watching each user's Bayes?
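For reference, the manual training mentioned above is usually driven with sa-learn against mailboxes a human has already reviewed. A sketch (the maildir paths are examples only):

```
# Teach from folders that have actually been reviewed
sa-learn --spam ~user/Maildir/.Junk/cur/
sa-learn --ham  ~user/Maildir/.ReviewedHam/cur/

# Inspect the dictionary: ham/spam message counts, token count
sa-learn --dump magic
```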





Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman

Axb,

I'm not sure I quite believe it. And I'm not quite sure I trust you. But 
you do make an attractive pitch. Excellent spam filtering, system-wide, 
with no responsibility for training on the part of the users?


This sounds like the kind of "too good to be true" message that I'd 
expect to receive in a spam mail.


But hmm. This is good dream material for tonight. I wonder if our Ubuntu 
14.04 upgrade has SA 3.4 with Redis built in. I do hear that the Redis 
backend is amazing.


-Steve