Re: More text/plain questions

2014-08-05 Thread Quanah Gibson-Mount
--On Wednesday, July 23, 2014 9:39 PM +0100 Martin Gregorie wrote:



On Wed, 2014-07-23 at 11:45 -0600, Amir 'CG' Caspi wrote:


I'm definitely considering writing a rule to catch &#x0[0-9]{3};
patterns.  I'm definitely worried it could cause FPs, but are there
common circumstances where legitimate emails would include dozens to
hundreds of these?  (The latest FNs only include a few dozen, not the
hundreds seen in the spample above.)


This works for me:

describe MG_HEX_HTML  Body contains too many HTML hex encodings
body     MG_HEX_HTML  /(.{0,3}\&\#x[0-9A-F]{4};){5}/
score    MG_HEX_HTML  3.5

It is also used in a meta, along with some other simple local rules, to
give hex-bearing spam an extra kick up the rear. I found that, in my
mailstream anyway, there was generally not much else to write rules
against, hence the high score. Spam arriving here gets quarantined: I
look at the sender and subject as a matter of course and, if it looks
like a possible FP, I'll look at the text too (I wrote a PHP viewer for
quarantined spam a long time ago) but it appears that, after the brief
squall of hex spam which made me write the rule, the promised spamstorm
ended and so far has failed to restart.


I've seen this rule hit several times for me today, all on definite spam.

--Quanah



--

Quanah Gibson-Mount
Server Architect
Zimbra, Inc.

Zimbra ::  the leader in open source messaging and collaboration


Re: More text/plain questions

2014-07-25 Thread Kevin A. McGrail

On 7/25/2014 6:19 PM, Amir Caspi wrote:

On Jul 25, 2014, at 4:11 PM, Kevin A. McGrail  wrote:


You should look at the patch on bug 7068 
(https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7068)

Yes, but this is within the code itself.  I was referring to how to do this in 
a local.cf, for example...

 Amir

It requires a plugin, sorry.  Don't think you could do it without it.


Re: More text/plain questions

2014-07-25 Thread Amir Caspi
On Jul 25, 2014, at 4:11 PM, Kevin A. McGrail  wrote:

> You should look at the patch on bug 7068 
> (https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7068)

Yes, but this is within the code itself.  I was referring to how to do this in 
a local.cf, for example...

 Amir

Re: More text/plain questions

2014-07-25 Thread Kevin A. McGrail

On 7/25/2014 5:55 PM, Amir Caspi wrote:

On Jul 24, 2014, at 4:08 PM, Philip Prindeville wrote:


In text/plain with CTE of ‘7bit’ or ‘8bit’ it’s meaningless to use Unicode HTML 
entity encodings.  It’s obviously not HTML.

If you want Unicode in text/plain, it should be in base64 or quoted-printable 
CTE.

Sure, but these spams also have text/html sections with the same characters.  
How do you check if the unicode entities are in the text/plain section, versus 
the entire body?  My understanding was that body rules run on both text/plain 
and text/html -- is there a way to distinguish which section those entities are 
in?

You should look at the patch on bug 7068 
(https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7068)


if ($ctype eq 'text/plain' && ($cte eq '' || $cte eq '7bit' || $cte eq '8bit')) {
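The condition in that excerpt can be illustrated outside SpamAssassin. What follows is only a minimal sketch in Python using the standard-library email package, not the plugin code from bug 7068. The helper name `plain_parts_with_entities` and the sample message are hypothetical, made up for illustration.

```python
import re
from email import message_from_string

# Hex HTML entity such as &#x0430; (same shape as the patterns in this thread)
HEX_ENTITY = re.compile(r"&#x[0-9A-Fa-f]{4};")

def plain_parts_with_entities(raw_message):
    """Count hex entities in each text/plain part whose
    Content-Transfer-Encoding is absent, 7bit or 8bit."""
    msg = message_from_string(raw_message)
    counts = []
    for part in msg.walk():
        # Skip multipart containers and every non-plain part (e.g. text/html)
        if part.get_content_type() != "text/plain":
            continue
        cte = (part.get("Content-Transfer-Encoding") or "").strip().lower()
        if cte not in ("", "7bit", "8bit"):
            continue
        counts.append(len(HEX_ENTITY.findall(part.get_payload())))
    return counts

# Hypothetical two-part message: entities appear in both parts,
# but only the text/plain part is counted.
SAMPLE = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="b1"\n'
    "\n"
    "--b1\n"
    "Content-Type: text/plain\n"
    "\n"
    "&#x0430;&#x0435;&#x043E; plain part\n"
    "--b1\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>&#x0430;&#x0435;</p>\n"
    "--b1--\n"
)

print(plain_parts_with_entities(SAMPLE))
```

The point of the sketch is that the decision is made per MIME part, which is why it cannot be expressed as an ordinary body rule that sees the whole rendered body.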


regards,
KAM


Re: More text/plain questions

2014-07-25 Thread Kevin A. McGrail

On 7/23/2014 2:27 PM, Paul Stead wrote:

KAM's rules are also helping add a few extra points

I try.


https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7068

and
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7063

I've also implemented several rules to try and catch these types of 
emails.

7063 is in trunk.

7068 is in testing now.

Regards,
KAM


Re: More text/plain questions

2014-07-25 Thread Amir Caspi
On Jul 24, 2014, at 4:08 PM, Philip Prindeville wrote:

> In text/plain with CTE of ‘7bit’ or ‘8bit’ it’s meaningless to use Unicode 
> HTML entity encodings.  It’s obviously not HTML.
> 
> If you want Unicode in text/plain, it should be in base64 or quoted-printable 
> CTE.

Sure, but these spams also have text/html sections with the same characters.  
How do you check if the unicode entities are in the text/plain section, versus 
the entire body?  My understanding was that body rules run on both text/plain 
and text/html -- is there a way to distinguish which section those entities are 
in?

--- Amir



Re: More text/plain questions

2014-07-24 Thread Philip Prindeville

On Jul 24, 2014, at 4:48 PM, Amir 'CG' Caspi  wrote:

> On 2014-07-24 16:11, Philip Prindeville wrote:
> 
>> You might have a shorter wait if you move to CentOS 6.5 instead.
> I would, but the VPS software I'm using does not run on CentOS 6.x, only 5.x.
> It's rather old software and I should convert to something else, but it's
> not worth the time I don't have, so I'm stuck with it.
>> And I can help you with the RPM’s.  I’m a fedora/epel packager.
> Awesome.  Perhaps you want to make an SA 3.4 package for EPEL 5? ;-)  Of 
> course, that helps more than just me...
>  
> --- Amir

Already done.

I have no means to test it, however.




Re: More text/plain questions

2014-07-24 Thread Amir 'CG' Caspi
 

On 2014-07-24 16:11, Philip Prindeville wrote: 

> You might have a shorter wait if you move to CentOS 6.5 instead.

I would, but the VPS software I'm using does not run on CentOS 6.x, only
5.x. It's rather old software and I should convert to something else,
but it's not worth the time I don't have, so I'm stuck with it. 

> And I can help you with the RPM's. I'm a fedora/epel packager.

Awesome. Perhaps you want to make an SA 3.4 package for EPEL 5? ;-) Of
course, that helps more than just me... 

--- Amir 

Re: More text/plain questions

2014-07-24 Thread Philip Prindeville

On Jul 23, 2014, at 1:21 PM, Amir 'CG' Caspi  wrote:

> On 2014-07-23 13:14, Axb wrote:
>> doesn't your VPS offer you shell access?
>> if yes, uninstall the SA rpm stuff and install SA 3.4 from source/trunk.
> 
> I think I didn't explain properly.  I'm running the dedicated server on which 
> there is VPS software.  I need RPMs so that they get distributed to all the 
> client sites.  Installing from source/trunk at the root level won't 
> distribute the tools to the individual sites.
> 
> This is why I need 3.4 packaged as an rpm.
> 
> I'm hoping someone will take up that task.  3.3.x was packaged as an rpm (on 
> EPEL and other repos), so hopefully 3.4 will be, too.
> 
> Thanks.
> 
> --- Amir
> 

Sigh.

Okay, I just did a blind build from fedpkg of spamassassin/master.

http://fedorapeople.org/~philipp/spamassassin-3.4.0-7.el5.x86_64.rpm

No warranties that this actually works.

If you need i686 binaries I can make those too.



Re: More text/plain questions

2014-07-24 Thread Philip Prindeville

On Jul 23, 2014, at 12:54 PM, Amir 'CG' Caspi  wrote:

>> 
>> Hope the patches above get pushed into production
> Indeed, though I'm still running SA v3.3.x ... I'm on a CentOS 5.10 platform 
> and, because of the virtual-hosting control panel I use, I need my 
> software distributed in RPMs.  Until someone builds a proper 3.4 rpm for 
> CentOS/RHEL 5, I'm stuck.  (I could be the one to build it, but I'm certainly 
> no expert at RPMs.)
> 
> --- Amir
> 

You might have a shorter wait if you move to CentOS 6.5 instead.

And I can help you with the RPM’s.  I’m a fedora/epel packager.

-Philip



Re: More text/plain questions

2014-07-24 Thread Philip Prindeville

On Jul 23, 2014, at 11:45 AM, Amir 'CG' Caspi  wrote:

> On 2014-07-02 15:04, Amir Caspi wrote:
>> For what it's worth, I just received a spam that basically is the same
>> as what Philip complained about.  I've posted a spample here:
>> http://pastebin.com/Y2YGwL49
> [...]
>> I'm wondering if we shouldn't write a rule looking for lots of
>> &#x0[0-9]{3}; patterns... say, 500 of them in one email.  Or, would we
>> expect legitimate emails to have these?
> 
> So, to follow up on this... over the past couple of weeks I've been getting a 
> lot more FNs than normal, and almost every single one of these is an "encoded 
> character" spam like the example above.  Bayes training does appear to work, 
> in that many of these FNs are already at BAYES_999... but there aren't enough 
> other rules hit to cause the FNs to cross the 5.0 threshold.  (Other, similar 
> spams do cross the threshold, usually due to RAZOR and/or PYZOR hits.)
> 
> Since these are basically unicode character encodings, is there a move to 
> translate all charsets to UTF-8 (or some other fixed standard) before 
> applying body and/or URI rules?  That would, presumably, help with trying to 
> catch these.
> 
> I'm definitely considering writing a rule to catch &#x0[0-9]{3}; patterns.  
> I'm definitely worried it could cause FPs, but are there common circumstances 
> where legitimate emails would include dozens to hundreds of these?  (The 
> latest FNs only include a few dozen, not the hundreds seen in the spample 
> above.)
> 
> Otherwise, I'm not sure what "template" rule I could write to catch these 
> things, and they're increasing in frequency (with more and more being missed 
> as FNs).
> 
> Thanks.
> 
> -- Amir
> 


In text/plain with CTE of ‘7bit’ or ‘8bit’ it’s meaningless to use Unicode HTML 
entity encodings.  It’s obviously not HTML.

If you want Unicode in text/plain, it should be in base64 or quoted-printable 
CTE.

-Philip



Re: More text/plain questions

2014-07-23 Thread Paul Stead

On 23/07/14 21:24, Axb wrote:

look at the HTML source, sharply - there's tons of little traits to
dump in a meta rule

I have these 'traits' in my custom Clamav rules, but that's another
list... :)
--
Paul Stead, Zen Internet
Systems Engineer


Re: More text/plain questions

2014-07-23 Thread Axb

On 07/23/2014 09:54 PM, Paul Stead wrote:


Making use of the meta rules seems to be the best here - this spam is
being very tricky to catch - I'll mirror my previous statement that the
suggested patches do pick up on this spam too



look at the HTML source, sharply - there's tons of little traits to dump 
in a meta rule


Re: More text/plain questions

2014-07-23 Thread Martin Gregorie
On Wed, 2014-07-23 at 21:49 +0200, Axb wrote:

> Centos 5.x is rather dated. Not sure there'd be such an old Fedora 
> equivalent offering SA 3.4 rpms.
> 
I'll say - a quick search shows that Centos 7.x is current, and SA 3.4.0
arrived after Fedora 20 was released.

> He'd have to find the equivalent Fedora version or just adapt a SA 3.4 
> spec file and make  his own RPMs. It's not that hard...
> 
Actually, it's a bit mean of me to mention Fedora since it's more akin to
Debian unstable than to a clone of an LTS distro (Centos being an RHEL
clone) or even something like Debian stable. Fedora releases generally
happen about every 6 months and become unsupported after a bit over a
year.


Martin





Re: More text/plain questions

2014-07-23 Thread Axb

On 07/23/2014 10:06 PM, Amir 'CG' Caspi wrote:

On 2014-07-23 13:38, Axb wrote:

If you're using spamd, why not run a/multiple dedicated VMs for SA 3.4
and have your other VMs use the spamd on the SA VMs  ?


There is a dedicated spamd.  It's the other tools that need to be
distributed, like sa-learn.  Bayes rules are handled per-user.  (No, I
don't plan on changing this any time soon, it would be a herculean
effort given the system setup.)


so apparently you're left with deploying your own rpms by hacking a 
recent spec, or hiring someone who will do it for you.


DIY ensures you're not left in the rain.




Re: More text/plain questions

2014-07-23 Thread Amir 'CG' Caspi

On 2014-07-23 13:38, Axb wrote:

If you're using spamd, why not run a/multiple dedicated VMs for SA 3.4
and have your other VMs use the spamd on the SA VMs  ?


There is a dedicated spamd.  It's the other tools that need to be 
distributed, like sa-learn.  Bayes rules are handled per-user.  (No, I 
don't plan on changing this any time soon, it would be a herculean 
effort given the system setup.)


--- Amir



Re: More text/plain questions

2014-07-23 Thread Paul Stead

On 23/07/14 20:44, John Hardin wrote:

On Wed, 23 Jul 2014, Paul Stead wrote:


body __LOC_COUNT_UNI /x[0-9A-F]{4};/
tflags   __LOC_COUNT_UNI multiple


Recommend maxhits on that.

Apologies, I omitted the max hits...


If you're only looking for 10+ hits, then maxhits=11 will allow you to
detect them with the minimum of wasted work.


I have more rules to match up to 50, but you are right - good advice for
anyone copying these, though I do prefer Martin's approach:

On 23/07/14 20:39, Martin Gregorie wrote:

body MG_HEX_HTML /(.{0,3}\&\#x[0-9A-F]{4};){5}/


Making use of the meta rules seems to be the best here - this spam is
being very tricky to catch - I'll mirror my previous statement that the
suggested patches do pick up on this spam too
--
Paul Stead, Zen Internet
Systems Engineer


Re: More text/plain questions

2014-07-23 Thread Axb

On 07/23/2014 09:43 PM, Martin Gregorie wrote:

On Wed, 2014-07-23 at 13:21 -0600, Amir 'CG' Caspi wrote:


I'm hoping someone will take up that task.  3.3.x was packaged as an rpm
(on EPEL and other repos), so hopefully 3.4 will be, too.


3.4.0 is the standard SA package for Fedora, so I'd expect to find it on
RHEL and their various clones as well.


Centos 5.x is rather dated. Not sure there'd be such an old Fedora 
equivalent offering SA 3.4 rpms.


He'd have to find the equivalent Fedora version or just adapt a SA 3.4 
spec file and make  his own RPMs. It's not that hard...




Re: More text/plain questions

2014-07-23 Thread John Hardin

On Wed, 23 Jul 2014, Paul Stead wrote:


On 23/07/14 19:54, Amir 'CG' Caspi wrote:


Care to share?  Counting encoded chars is easy, of course.


I use the following to count the encoded chars:

body __LOC_COUNT_UNI /x[0-9A-F]{4};/
tflags   __LOC_COUNT_UNI multiple


Recommend maxhits on that.


We can make some vars if we want:

meta __LOC_HAS_0_UNI (__LOC_COUNT_UNI == 0)
meta __LOC_HAS_10_UNI (__LOC_COUNT_UNI >= 10)


If you're only looking for 10+ hits, then maxhits=11 will allow you to 
detect them with the minimum of wasted work.
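The saving from a hit cap can be pictured with a small sketch. This is Python rather than SpamAssassin rule syntax, and `count_hits` is a hypothetical helper, not part of SA: it stops scanning as soon as the cap is reached, which is the effect being recommended here.

```python
import re
from itertools import islice

# Same pattern as the __LOC_COUNT_UNI body rule quoted above
ENTITY_TAIL = re.compile(r"x[0-9A-F]{4};")

def count_hits(text, maxhits=None):
    """Count regex hits, stopping early once maxhits is reached."""
    hits = ENTITY_TAIL.finditer(text)   # lazy: matches are produced on demand
    if maxhits is not None:
        hits = islice(hits, maxhits)    # never pull more than maxhits matches
    return sum(1 for _ in hits)

body = "&#x0430;" * 500                 # hypothetical body with 500 entities

print(count_hits(body))                 # walks the whole body
print(count_hits(body, maxhits=11))     # stops after the 11th match
print(count_hits(body, maxhits=11) >= 10)
```

A cap of 11 is enough to answer "are there at least 10?", so everything after the 11th match is work that a `>= 10` test never needs.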


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Where are my space habitats? Where is my flying car?
  It's 2010 and all I got from the SF books of my youth
  is the lousy dystopian government.  -- perlhaqr
---
 783 days since the first successful private support mission to ISS (SpaceX)


Re: More text/plain questions

2014-07-23 Thread Martin Gregorie
On Wed, 2014-07-23 at 13:21 -0600, Amir 'CG' Caspi wrote:

> I'm hoping someone will take up that task.  3.3.x was packaged as an rpm 
> (on EPEL and other repos), so hopefully 3.4 will be, too.
> 
3.4.0 is the standard SA package for Fedora, so I'd expect to find it on
RHEL and their various clones as well.


Martin





Re: More text/plain questions

2014-07-23 Thread Martin Gregorie
On Wed, 2014-07-23 at 11:45 -0600, Amir 'CG' Caspi wrote:

> I'm definitely considering writing a rule to catch &#x0[0-9]{3}; 
> patterns.  I'm definitely worried it could cause FPs, but are there 
> common circumstances where legitimate emails would include dozens to 
> hundreds of these?  (The latest FNs only include a few dozen, not the 
> hundreds seen in the spample above.)
> 
This works for me:

describe MG_HEX_HTML  Body contains too many HTML hex encodings
body     MG_HEX_HTML  /(.{0,3}\&\#x[0-9A-F]{4};){5}/
score    MG_HEX_HTML  3.5

It is also used in a meta, along with some other simple local rules, to
give hex-bearing spam an extra kick up the rear. I found that, in my
mailstream anyway, there was generally not much else to write rules
against, hence the high score. Spam arriving here gets quarantined: I
look at the sender and subject as a matter of course and, if it looks
like a possible FP, I'll look at the text too (I wrote a PHP viewer for
quarantined spam a long time ago) but it appears that, after the brief
squall of hex spam which made me write the rule, the promised spamstorm
ended and so far has failed to restart.
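For anyone wondering what the MG_HEX_HTML expression accepts, here is a rough sketch that runs the same regular expression through Python's `re` module. The sample strings are invented for illustration, and SpamAssassin applies body rules to its own rendering of the message, so results on real mail may differ.

```python
import re

# Same expression as the MG_HEX_HTML body rule
MG_HEX_HTML = re.compile(r"(.{0,3}\&\#x[0-9A-F]{4};){5}")

# Hypothetical test strings
samples = {
    "five adjacent": "&#x0421;&#x0430;&#x043B;&#x0435;&#x0021;",
    "short gaps": "a &#x0430;b &#x0435;c &#x043E;d &#x0440;e &#x0441;",
    "only four": "&#x0430;&#x0435;&#x043E;&#x0440;",
    "long gaps": " word ".join(["&#x0430;"] * 5),
}

for name, text in samples.items():
    print(name, bool(MG_HEX_HTML.search(text)))
```

The rule needs five hex entities in a row, each preceded by at most three other characters, so entities scattered through ordinary prose do not trigger it. As written the pattern has no /i modifier, so entities with lowercase hex digits such as &#x043e; would not match.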
  

Martin






Re: More text/plain questions

2014-07-23 Thread Axb

On 07/23/2014 09:21 PM, Amir 'CG' Caspi wrote:

On 2014-07-23 13:14, Axb wrote:

doesn't your VPS offer you shell access?
if yes, uninstall the SA rpm stuff and install SA 3.4 from source/trunk.


I think I didn't explain properly.  I'm running the dedicated server on
which there is VPS software.  I need RPMs so that they get distributed
to all the client sites.  Installing from source/trunk at the root level
won't distribute the tools to the individual sites.

This is why I need 3.4 packaged as an rpm.

I'm hoping someone will take up that task.  3.3.x was packaged as an rpm
(on EPEL and other repos), so hopefully 3.4 will be, too.


If you're using spamd, why not run a/multiple dedicated VMs for SA 3.4 
and have your other VMs use the spamd on the SA VMs  ?







Re: More text/plain questions

2014-07-23 Thread Paul Stead

On 23/07/14 19:54, Amir 'CG' Caspi wrote:

Care to share?  Counting encoded chars is easy, of course.

I use the following to count the encoded chars:

body __LOC_COUNT_UNI /x[0-9A-F]{4};/
tflags   __LOC_COUNT_UNI multiple

We can make some vars if we want:

meta __LOC_HAS_0_UNI (__LOC_COUNT_UNI == 0)
meta __LOC_HAS_10_UNI (__LOC_COUNT_UNI >= 10)

I've noticed that they all come through as VERP emails -

header  __LOC_VERP X-Envelope-From =~ /\=.*\.(com|net|org|biz)\@/

And a list of keywords that I've noticed:

header  __LOC_VERP_AMAZON   X-Envelope-From =~ /^amazon\-?_?coupons\-/i

Then add them together in a meta score

meta LOC_UNI_SPAM (!BAYES_00) && ( __LOC_VERP + __LOC_VERP_AMAZON + __LOC_HAS_10_UNI >= 3)
score LOC_UNI_SPAM 0.001

This seems to only be catching the bad stuff, you could of course add some more 
magic:

meta LOC_UNI_SPAM_99 (BAYES_99 && LOC_UNI_SPAM)
score LOC_UNI_SPAM_99 .
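The arithmetic in those meta rules can be spelled out as follows. This is a sketch in Python, not SpamAssassin syntax: the `hits` dictionary is a hypothetical stand-in for the per-message rule results, where a rule that hit counts as 1 and a rule that did not counts as 0.

```python
def loc_uni_spam(hits):
    """Mirror of: (!BAYES_00) && (__LOC_VERP + __LOC_VERP_AMAZON
    + __LOC_HAS_10_UNI >= 3)."""
    def h(name):
        return hits.get(name, 0)   # a rule that did not hit counts as 0
    total = h("__LOC_VERP") + h("__LOC_VERP_AMAZON") + h("__LOC_HAS_10_UNI")
    return not h("BAYES_00") and total >= 3

def loc_uni_spam_99(hits):
    """Mirror of: (BAYES_99 && LOC_UNI_SPAM)."""
    return bool(hits.get("BAYES_99", 0)) and loc_uni_spam(hits)

# Hypothetical rule results for one message
example = {"__LOC_VERP": 1, "__LOC_VERP_AMAZON": 1,
           "__LOC_HAS_10_UNI": 1, "BAYES_99": 1}
print(loc_uni_spam(example), loc_uni_spam_99(example))
```

Since each of the three sub-rules contributes at most 1, the `>= 3` threshold amounts to requiring all of them. Lowering it to 2 would turn the meta into "any two of three".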

...checking whether the MIME-encoding is text/plain may not be sufficient

Though it's totally possible, I haven't gone as far as checking the encoding 
types etc, apart from the links to the patches I included...

SA v3.3.x ...
Me too, the patch works fine with it, I'm awaiting the Debian build for the 
production boxes, but running from source isn't too difficult either.

Though I'm aware they're not the best for generic spam, they seem okay on 
these specific types (I suggest they're from the same source, looking at the styles of 
the email) - I've yet to test the rules on production.

I've also noticed the following traits, but I'm not sure how to write rules for them:

* All emails have a message ID where the recipients email address is contained in md5 - 
.@domain.com
* All emails to the same recipient have the same MIME boundary - possibly a 
hash of the recipient address

Paul

--
Paul Stead, Zen Internet
Systems Engineer


Re: More text/plain questions

2014-07-23 Thread Amir 'CG' Caspi

On 2014-07-23 13:14, Axb wrote:

doesn't your VPS offer you shell access?
if yes, uninstall the SA rpm stuff and install SA 3.4 from 
source/trunk.


I think I didn't explain properly.  I'm running the dedicated server on 
which there is VPS software.  I need RPMs so that they get distributed 
to all the client sites.  Installing from source/trunk at the root level 
won't distribute the tools to the individual sites.


This is why I need 3.4 packaged as an rpm.

I'm hoping someone will take up that task.  3.3.x was packaged as an rpm 
(on EPEL and other repos), so hopefully 3.4 will be, too.


Thanks.

--- Amir



Re: More text/plain questions

2014-07-23 Thread Axb

On 07/23/2014 08:54 PM, Amir 'CG' Caspi wrote:

Indeed, though I'm still running SA v3.3.x ... I'm on a CentOS 5.10
platform and, because of the virtual-hosting control panel I use, I
need my software distributed in RPMs. Until someone builds a proper 3.4
rpm for CentOS/RHEL 5, I'm stuck. (I could be the one to build it, but
I'm certainly no expert at RPMs.)


doesn't your VPS offer you shell access?
if yes, uninstall the SA rpm stuff and install SA 3.4 from source/trunk.

if not, you're stuck with a KIA in the hope somebody will update it to a 
Lexus.


Re: More text/plain questions

2014-07-23 Thread Amir 'CG' Caspi
 

On 2014-07-23 12:23, Paul Stead wrote: 

> I've also implemented several rules to try and catch these types of emails.

Care to share? Counting encoded chars is easy, of course. 

One thing to note, webmail and my MUA often will render the encoded
characters in their translated format, not literally (as hashes). I'm
not sure if this is because the MIME encoding isn't claiming to be
text/plain, or because the browser/MUA are trying to be helpful by not
being strict... I haven't looked too deeply into it yet. Thus, checking
whether the MIME-encoding is text/plain may not be sufficient, because
not all of them might be trying to claim text/plain. 

> Hope the patches above get pushed into production

Indeed, though I'm still running SA v3.3.x ... I'm on a CentOS 5.10
platform and, because of the virtual-hosting control panel I use, I
need my software distributed in RPMs. Until someone builds a proper 3.4
rpm for CentOS/RHEL 5, I'm stuck. (I could be the one to build it, but
I'm certainly no expert at RPMs.) 

--- Amir 

Re: More text/plain questions

2014-07-23 Thread Paul Stead

KAM's rules are also helping add a few extra points

On 23/07/14 19:23, Paul Stead wrote:
On 23/07/14 18:45, Amir 'CG' Caspi wrote:
So, to follow up on this... over the past couple of weeks I've been getting a lot more 
FNs than normal, and almost every single one of these is an "encoded character" 
spam like the example above.  Bayes training does appear to work, in that many of these 
FNs are already at BAYES_999... but there aren't enough other rules hit to cause the FNs 
to cross the 5.0 threshold.  (Other, similar spams do cross the threshold, usually due to 
RAZOR and/or PYZOR hits.)

Same here - I've had one particular user furious about this, laughable but 
still annoying.

I'm definitely considering writing a rule to catch &#x0[0-9]{3}; patterns.  I'm 
definitely worried it could cause FPs, but are there common circumstances where 
legitimate emails would include dozens to hundreds of these?  (The latest FNs only 
include a few dozen, not the hundreds seen in the spample above.)

You might find the following useful

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7068
and
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7063

I've also implemented several rules to try and catch these types of emails.

Namely counting the encoded chars and recognising other traits I've noticed 
with this type of mail.

Hope the patches above get pushed into production

--
Paul Stead, Zen Internet
Systems Engineer


Re: More text/plain questions

2014-07-23 Thread Paul Stead

On 23/07/14 18:45, Amir 'CG' Caspi wrote:
So, to follow up on this... over the past couple of weeks I've been getting a lot more 
FNs than normal, and almost every single one of these is an "encoded character" 
spam like the example above.  Bayes training does appear to work, in that many of these 
FNs are already at BAYES_999... but there aren't enough other rules hit to cause the FNs 
to cross the 5.0 threshold.  (Other, similar spams do cross the threshold, usually due to 
RAZOR and/or PYZOR hits.)

Same here - I've had one particular user furious about this, laughable but 
still annoying.

I'm definitely considering writing a rule to catch &#x0[0-9]{3}; patterns.  I'm 
definitely worried it could cause FPs, but are there common circumstances where 
legitimate emails would include dozens to hundreds of these?  (The latest FNs only 
include a few dozen, not the hundreds seen in the spample above.)

You might find the following useful

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7068
and
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7063

I've also implemented several rules to try and catch these types of emails.

Namely counting the encoded chars and recognising other traits I've noticed 
with this type of mail.

Hope the patches above get pushed into production

Paul

--
Paul Stead, Zen Internet
Systems Engineer


Re: More text/plain questions

2014-07-23 Thread Amir 'CG' Caspi

On 2014-07-02 15:04, Amir Caspi wrote:

For what it's worth, I just received a spam that basically is the same
as what Philip complained about.  I've posted a spample here:

http://pastebin.com/Y2YGwL49

[...]

I'm wondering if we shouldn't write a rule looking for lots of
&#x0[0-9]{3}; patterns... say, 500 of them in one email.  Or, would we
expect legitimate emails to have these?


So, to follow up on this... over the past couple of weeks I've been 
getting a lot more FNs than normal, and almost every single one of these 
is an "encoded character" spam like the example above.  Bayes training 
does appear to work, in that many of these FNs are already at 
BAYES_999... but there aren't enough other rules hit to cause the FNs to 
cross the 5.0 threshold.  (Other, similar spams do cross the threshold, 
usually due to RAZOR and/or PYZOR hits.)


Since these are basically unicode character encodings, is there a move 
to translate all charsets to UTF-8 (or some other fixed standard) before 
applying body and/or URI rules?  That would, presumably, help with 
trying to catch these.


I'm definitely considering writing a rule to catch &#x0[0-9]{3}; 
patterns.  I'm definitely worried it could cause FPs, but are there 
common circumstances where legitimate emails would include dozens to 
hundreds of these?  (The latest FNs only include a few dozen, not the 
hundreds seen in the spample above.)


Otherwise, I'm not sure what "template" rule I could write to catch 
these things, and they're increasing in frequency (with more and more 
being missed as FNs).


Thanks.

-- Amir



Re: More text/plain questions

2014-07-07 Thread David F. Skoll
On Mon, 07 Jul 2014 19:29:11 -0400
Daniel Staal  wrote:

> Just to start the discussion: I'd say default to UTF-8 if not
> otherwise specified and can't be worked out.  (How hard to work on
> 'working it out' is a question, of course.)  It's the growing
> standard, as far as I can tell.

+1.  UTF-8 is the best choice.  (Modern) Perl handles it very nicely.
Even non-UTF-8 messages should be recoded into UTF-8 for body rules;
otherwise, making a rule that looks for things like "抵押" will be
well-nigh impossible.

Regards,

David.


Re: More text/plain questions

2014-07-07 Thread Daniel Staal
--As of July 7, 2014 5:20:01 PM -0400, Kevin A. McGrail is alleged to have said:



On 7/7/2014 5:09 PM, Philip Prindeville wrote:

On Jul 7, 2014, at 7:15 AM, Kevin A. McGrail  wrote:


On 7/7/2014 2:28 AM, John Wilcock wrote:

Le 05/07/2014 19:08, Philip Prindeville a écrit :

As for encoding a cyrillic small a: there are many ways to do this.
iso-8859-4, utf-8, jp2212, gb2312, win1252, etc. I don’t think this
would be very efficient—there are just too many charsets possible.

Normalising the input message to UTF-8 before body checks would help
somewhat with that. I seem to remember there's been talk of doing this.


Yes, or utf-16...  I think that will be necessary to keep SA effective
in the modern world sooner than later.


Okay, but… if the message body is non-ASCII and the CTE is 8bit or
base64 and no explicit charset has been given, how do you know which
translation to perform?

I get a lot of Han SPAM in GB2312 where the charset is never specified
(apparently it’s a national default in China, despite the requirements
stated in RFC-2045 and -2046).

Sorry, I haven't even started delving into the devilish details but I
know it's looming as a needed feature.


--As for the rest, it is mine.

Just to start the discussion: I'd say default to UTF-8 if not otherwise 
specified and can't be worked out.  (How hard to work on 'working it out' 
is a question, of course.)  It's the growing standard, as far as I can tell.


Even if it's wrong in a particular case, it would probably be useful: It 
would give rule writers something to work with.
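To make the proposal concrete, here is one way the fallback could look. It is a sketch only, in Python, with a hypothetical helper `to_text`. It does not describe how SpamAssassin handles charsets, and it makes no attempt at the "working it out" step.

```python
def to_text(raw, declared_charset=None):
    """Decode body bytes, falling back to UTF-8 when no usable
    charset is declared."""
    for charset in (declared_charset, "utf-8"):
        if not charset:
            continue
        try:
            return raw.decode(charset)
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name, or bytes that are invalid in it
            continue
    # Last resort: replacement characters instead of giving up,
    # so that rules still have some text to run against.
    return raw.decode("utf-8", errors="replace")

gb = "抵押".encode("gb2312")

print(to_text(gb, "gb2312"))   # charset declared: decodes cleanly
print(to_text(gb))             # charset missing: guessed wrong, but usable
```

The second call shows the "wrong in a particular case" situation: the GB2312 bytes are not valid UTF-8, so the result contains replacement characters, but the text can still be handed to the rules.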


Daniel T. Staal

---
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---


Re: More text/plain questions

2014-07-07 Thread Kevin A. McGrail

On 7/7/2014 5:09 PM, Philip Prindeville wrote:

On Jul 7, 2014, at 7:15 AM, Kevin A. McGrail  wrote:


On 7/7/2014 2:28 AM, John Wilcock wrote:

Le 05/07/2014 19:08, Philip Prindeville a écrit :

As for encoding a cyrillic small a: there are many ways to do this.
iso-8859-4, utf-8, jp2212, gb2312, win1252, etc. I don’t think this
would be very efficient—there are just too many charsets possible.

Normalising the input message to UTF-8 before body checks would help somewhat 
with that. I seem to remember there's been talk of doing this.


Yes, or utf-16...  I think that will be necessary to keep SA effective in the 
modern world sooner than later.


Okay, but… if the message body is non-ASCII and the CTE is 8bit or base64 and 
no explicit charset has been given, how do you know which translation to 
perform?

I get a lot of Han SPAM in GB2312 where the charset is never specified 
(apparently it’s a national default in China, despite the requirements stated 
in RFC-2045 and -2046).

Sorry, I haven't even started delving into the devilish details but I 
know it's looming as a needed feature.


regards,
KAM


Re: More text/plain questions

2014-07-07 Thread Philip Prindeville

On Jul 7, 2014, at 7:15 AM, Kevin A. McGrail  wrote:

> On 7/7/2014 2:28 AM, John Wilcock wrote:
>> Le 05/07/2014 19:08, Philip Prindeville a écrit :
>>> As for encoding a cyrillic small a: there are many ways to do this.
>>> iso-8859-4, utf-8, jp2212, gb2312, win1252, etc. I don’t think this
>>> would be very efficient—there are just too many charsets possible.
>> 
>> Normalising the input message to UTF-8 before body checks would help 
>> somewhat with that. I seem to remember there's been talk of doing this.
>> 
> Yes, or utf-16...  I think that will be necessary to keep SA effective in the 
> modern world sooner than later.


Okay, but… if the message body is non-ASCII and the CTE is 8bit or base64 and 
no explicit charset has been given, how do you know which translation to 
perform?

I get a lot of Han SPAM in GB2312 where the charset is never specified 
(apparently it’s a national default in China, despite the requirements stated 
in RFC-2045 and -2046).

-Philip



Re: More text/plain questions

2014-07-07 Thread Kevin A. McGrail

On 7/7/2014 2:28 AM, John Wilcock wrote:

Le 05/07/2014 19:08, Philip Prindeville a écrit :

As for encoding a cyrillic small a: there are many ways to do this.
iso-8859-4, utf-8, jp2212, gb2312, win1252, etc. I don’t think this
would be very efficient—there are just too many charsets possible.


Normalising the input message to UTF-8 before body checks would help 
somewhat with that. I seem to remember there's been talk of doing this.


Yes, or utf-16...  I think that will be necessary to keep SA effective 
in the modern world sooner than later.


Re: More text/plain questions

2014-07-06 Thread John Wilcock

Le 05/07/2014 19:08, Philip Prindeville a écrit :

As for encoding a cyrillic small a: there are many ways to do this.
iso-8859-4, utf-8, jp2212, gb2312, win1252, etc. I don’t think this
would be very efficient—there are just too many charsets possible.


Normalising the input message to UTF-8 before body checks would help 
somewhat with that. I seem to remember there's been talk of doing this.


--
John


Re: More text/plain questions

2014-07-05 Thread Philip Prindeville

On Jul 4, 2014, at 12:08 AM, haman...@t-online.de wrote:

> 
> Hi,
> 
> while this is certainly not correct - and likely does not display in every
> mail client - it would probably work in several webmailers. Perhaps this is
> the configuration the author of that crap tested.
> Now, I am somewhat reluctant to classify badly formatted mails as spam: there
> are many systems around, even from major players, that send legitimate mails
> like order confirmation, delivery notification, opted-in newsletters but do
> many of the formal things more right than wrong.
> On the other side, looking at the actual characters shows that the message is
> spam: these are cyrillic letters that happen to look exactly like western ones
> (a, e, o or such) so the obvious intent is to avoid detection of the strings.
> We have seen the same with IDN domain names that might use a cyrillic a to
> register a domain that looks like, e.g. paypal.com
> The list of characters is fairly short, so maybe checking for these characters
> in all commonly used variants (html entities, utf8 encoded, +u0430, \u0430,
> IDN encoded) would be a good spam indication
> 
> Regards
> Wolfgang
> 
> 

I think you’re overlooking what a lot of tests already do: test for poor 
formatting.

INVALID_DATE
UNPARSEABLE_RELAY
HTML_MISSING_CTYPE
MISSING_HEADERS
MISSING_DATE

As for encoding a Cyrillic small a: there are many ways to do this. iso-8859-5, 
utf-8, jp2212, gb2312, win1251, etc. I don’t think this would be very 
efficient; there are just too many charsets possible.

-Philip





Re: More text/plain questions

2014-07-03 Thread hamann . w
>> >> I got the following MIME body part below, and I’m wondering if it would 
>> >> make sense to filter on this as well.
>> >> Given that it’s text/plain with an implicit charset=“us-ascii” and an 
>> >> implicit content-transfer-encoding of 7bit, the sequence &#x[0-9A-F]{4} 
>> >> doesn’t really parse into a 16-bit character, would it? That would be a 
>> >> broken MUA that made such a leap...
>> >> Wouldn’t that normally render as the character ‘&’, ‘#’, ‘x’, etc. rather 
>> >> than the unicode16 or UTF-8 character with that hex value?
>> >> There might be times when someone has sent an attachment improperly 
>> >> encoded this way which might have embedded binary values in it, but 
>> >> that’s kind of buggy anyway… it should have been done as base64 and 
>> >> application/octet-stream in the worst of cases if it has arbitrary binary 
>> >> data.
>> >> I wouldn’t want a message where someone gives a couple of examples of 
>> >> encoding &#x0400; for instance being flagged as SPAM, but if the text is 
>> >> 20% or more of these sequences then I would say that’s SPAM-sign.
>> >> Anyway, here’s the body I saw:
>> >> --1388-8200-b67c-e579-9c27-df36-12fa-a2eb
>> Content-Type: text/plain;
>> >> Thе Rеаl 
>> >> RеаѕоnThе Ꮯоmіng 
>> >> Ꮯоllарѕе...Thе 
>> >> rеаl rеаѕоn ᎳHY 
>> >> HоmеlаndSеcurіtу 
>> >> rеcеntlу рurchаѕеd1.7 
>> >> Bіllіоn Rоundѕ оf 
>> >> аmmunіtіоn...Ꮃhаt Yоu 
>> >> Muѕt Dо Tо Ꭼnѕurе 
>> >> YоurSаfеtуHоmеlаnd 
>> >> ѕеcurіtу іѕ thеrе 
>> >> tо ѕеcurеthе 
>> >> hоmеlаnd оnlу... Sо 
>> >> thеѕе Ьullеtѕаrе 
>> >> rеаlу mеаnt fоr 
>> >> thеThіѕ іѕ аn 
>> >> еmаіlаdvеrtіѕеmеnt
>> >>  thаt wаѕ ѕеnt tо 
>> >> уоu Ьу Ρаtrіоt 
>> >> Survіvаl Ρlаn. If 
>> >> уоuwіѕh tо 
>> >> nоlоngеr rеcеіvе 
>> >> mеѕѕаgеѕ thаt 
>> >> рrоmоtе ѕurvіvаl 
>> >> tірѕ, 
>> >> рlеаѕеclіck hеrе 
>> >> tо unѕuЬѕcrіЬе.4 
>> >> Unstable as water, thou shalt not excel because thou wentest up to thy 
>> >> fathers bed then defiledst thou it he went up to my couch.34 And 
>> >> Pharaohnechoh made Eliakim the son of Josiah king in the room of Josiah 
>> >> his father, and turned his name to Jehoiakim, and took Jehoahaz away and 
>> >> he came to Egypt, and died there.37  And the thing was good in the eyes 
>> >> of Pharaoh, and in the eyes o!
>> f all his servants.
>> >> --1388-8200-b67c-e579-9c27-df36-12fa-a2eb

Hi,

While this is certainly not correct, and likely does not display in every mail
client, it would probably work in several webmailers. Perhaps this is the
configuration the author of that crap tested.

Now, I am somewhat reluctant to classify badly formatted mails as spam: there
are many systems around, even from major players, that send legitimate mails
(order confirmations, delivery notifications, opted-in newsletters) but get
many of the formal things more wrong than right.

On the other hand, looking at the actual characters shows that the message is
spam: these are Cyrillic letters that happen to look exactly like Western ones
(a, e, o or such), so the obvious intent is to avoid detection of the strings.
We have seen the same with IDN domain names that might use a Cyrillic a to
register a domain that looks like, e.g., paypal.com.

The list of characters is fairly short, so maybe checking for these characters
in all commonly used variants (HTML entities, UTF-8 encoded, U+0430, \u0430,
IDN encoded) would be a good spam indication.

Regards
Wolfgang
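
As a rough illustration of the homoglyph check suggested above, the following sketch decodes the `&#xNNNN;` form and counts words that mix Latin and Cyrillic letters. It is an assumption-laden example, not an existing SpamAssassin rule: `mixed_script_words` is a made-up name, and only the four-hex-digit entity form from this thread is handled.

```perl
use strict;
use warnings;

# Hypothetical helper: count whitespace-separated words that contain both
# Latin and Cyrillic letters once &#xNNNN; entities have been decoded.
sub mixed_script_words {
    my ($text) = @_;
    # Turn each hex character reference into the character it names.
    $text =~ s/&#x([0-9A-F]{4});/chr(hex($1))/ge;
    my $count = 0;
    for my $word (split /\s+/, $text) {
        $count++ if $word =~ /\p{Latin}/ && $word =~ /\p{Cyrillic}/;
    }
    return $count;
}

print mixed_script_words('Th&#x0435; R&#x0435;&#x0430;l deal'), "\n";
```

Text written entirely in Cyrillic scores zero here, which is the point: it is the mixing within a single word that suggests obfuscation rather than a foreign-language message.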




Re: More text/plain questions

2014-07-03 Thread Kevin A. McGrail

On 7/2/2014 5:04 PM, Amir Caspi wrote:

Is there also a rule for UTF8-encoded Subject line?  If so, it didn't pop.
Just a quick note about this part of your email.  This is extremely 
common to use UTF-8 and I doubt it would be an indicator of spam vs 
ham.  I wouldn't even bother looking...


Re: More text/plain questions

2014-07-02 Thread Karsten Bräckelmann
On Wed, 2014-07-02 at 19:10 -0600, Philip Prindeville wrote:
> On Jul 2, 2014, at 5:16 PM, Karsten Bräckelmann  
> wrote:

> > That RE is a single, straight-forward alternation with two alternatives.
> > 
> > The first one translates to a single char in a given, specific range.
> > Basically, anything but the ampersand. The second alternative is an
> > ampersand, that is not followed by #x.
> > 
> > The (?!pattern) is a zero-width negative look-ahead assertion. A zero
> > width means, it does not consume what it matches. Thus, the second
> alternative ultimately will match a single ampersand only. The /g global
> matching then continues where it left off after the last matching
> > attempt. In the case of that ampersand followed by #x, that still is
> > right after the ampersand.

> Okay, so what I was trying to do is skip any ampersand followed by
> #x; as part of the matched text (but include ampersands not
> followed by #x; as part of the match).

That is the result of the plain s/&#x[0-9A-F]{4};//g global substitution
I posted.

You should define what you ultimately want to achieve, not what you
right now think is a stepping stone and part of the solution.


> So that if I had the text:
> 
> This that & thos&#x0435;.
> 
> The first @match would be counted as $chars:
> 
> T,h,i,s, ,t,h,a,t, ,&, ,t,h,o,s,.
> 
> and the 2nd @match would be:
> 
> &#x0435;
> 
> counting as $uchars.
> 
> So in the first case, the &#x0435; would be skipped over as part of the 
> capture.

Skipped over, since it is part of the capture. That kind of contradicts
itself...

Do you want all of those (HTML entity string) matches? The raw matches
themselves? Or is that just an attempt of debug visualization? Do you
actually want its number only?

This has quite an impact on the Perl code and logic / math involved.


Number of HTML entity escapes, length (in chars) of the remainder:

  my $number = $data =~ s/&#x[0-9A-F]{4};//g;

  print "number:  ", $number, "\n";
  print "other:   ", length $data," = '", $data, "'\n";


If you do need the complete HTML entity escapes, a quick hack to compute the remainder:

  my @matches = $data =~ /&#x[0-9A-F]{4};/g;

  print "matches: ", scalar @matches, " = ", join(',', @matches), "\n";
  print "other:   ", length ($data) - 8*(scalar @matches), "\n";
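
Building on the two snippets above, the "20% or more of these sequences" threshold floated earlier in the thread could be computed roughly as follows. This is only a sketch: `entity_share` is a hypothetical name, and it counts each entity as one displayed character, which is one of several reasonable ways to define the ratio.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of displayed characters that arrive as
# &#xNNNN; entities (each 8-char entity counts as one displayed character).
sub entity_share {
    my ($data) = @_;
    my $entities = () = $data =~ /&#x[0-9A-F]{4};/g;    # count the matches
    my $other    = length($data) - 8 * $entities;       # remaining plain chars
    my $visible  = $entities + $other;
    return $visible ? $entities / $visible : 0;
}

my $share = entity_share('Th&#x0435; R');
printf "entity share: %.2f\n", $share;
print "looks obfuscated\n" if $share >= 0.20;
```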


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: More text/plain questions

2014-07-02 Thread Philip Prindeville

On Jul 2, 2014, at 5:16 PM, Karsten Bräckelmann  wrote:

> On Wed, 2014-07-02 at 14:44 -0600, Philip Prindeville wrote:
>> Okay, was tinkering with the code below but the zero-width lookahead is
>> not disqualifying ampersand followed by #x[0-9A-F]{4}; so the output
>> is bogus (you can run this and see what I mean).
>> 
>> What am I doing wrong?
> 
> You are using an overly complex and fugly test case. ;)  Seriously, a
> stripped down test string does not require more than about 4 instances
> of plain chars and HTML entities. Much easier on the eye.
> 
> 
>>my @matches = m/[\001-\045\047-\177]|&(?!#x[0-9A-F]{4};)/g;
> 
> That RE is a single, straight-forward alternation with two alternatives.
> 
> The first one translates to a single char in a given, specific range.
> Basically, anything but the ampersand. The second alternative is an
> ampersand, that is not followed by #x.
> 
> The (?!pattern) is a zero-width negative look-ahead assertion. A zero
> width means, it does not consume what it matches. Thus, the second
> alternative ultimately will match a single ampersand only. The /g global
> matching then continues where it left off after the last matching
> attempt. In the case of that ampersand followed by #x, that still is
> right after the ampersand.
> 
>  line: Th&#x0435; R
>  matches: T,h,#,x,0,4,3,5,;, ,R

Okay, so what I was trying to do is skip any ampersand followed by #x; as 
part of the matched text (but include ampersands not followed by #x; as 
part of the match).

So that if I had the text:

This that & thos&#x0435;.

The first @match would be counted as $chars:

T,h,i,s, ,t,h,a,t, ,&, ,t,h,o,s,.

and the 2nd @match would be:

&#x0435;

counting as $uchars.

So in the first case, the &#x0435; would be skipped over as part of the capture.

What’s the opposite of a zero-width lookahead?  I.e. a match that advances the 
cursor but doesn’t copy the matching text into the capture buffer?
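
One possible way to get the effect asked about here (consume the entity without returning it) is to let the entity match in an alternative that has no capture group, and capture only the plain characters. The sketch below is an illustration under that assumption; `split_plain_and_entities` is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical helper: return (plain chars, entities) as two array refs.
sub split_plain_and_entities {
    my ($line) = @_;
    my @entities = $line =~ /(&#x[0-9A-F]{4};)/g;
    # The first alternative consumes a whole entity but captures nothing, so
    # it yields undef in list context; the second captures one plain char.
    my @plain = grep { defined } $line =~ /&#x[0-9A-F]{4};|(.)/sg;
    return (\@plain, \@entities);
}

my ($plain, $entities) = split_plain_and_entities('This that & thos&#x0435;.');
print "chars:  ", scalar @$plain,    " = ", join(',', @$plain),    "\n";
print "uchars: ", scalar @$entities, " = ", join(',', @$entities), "\n";
```

Because the entity alternative is tried first at every position, a bare ampersand that does not start an entity still falls through to the single-character capture.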


> 
> The offending ampersand part of the HTML entity encoding correctly is
> not matched. The following chars do match the "anything but an
> ampersand" first alternative.
> 
> 
> I am unsure what you are trying to achieve. If you want to compare the
> number of HTML entities with the number of regular chars, wouldn't it be
> easier to simply drop them flat?
> 
>  $data =~ s/&#x[0-9A-F]{4};//g;
> 
> Or just plain match and count?
> 
>  @matches = $data =~ /&#x[0-9A-F]{4};/g;
> 
> 



Re: More text/plain questions

2014-07-02 Thread Karsten Bräckelmann
On Wed, 2014-07-02 at 14:44 -0600, Philip Prindeville wrote:
> Okay, was tinkering with the code below but the zero-width lookahead is
> not disqualifying ampersand followed by #x[0-9A-F]{4}; so the output
> is bogus (you can run this and see what I mean).
> 
> What am I doing wrong?

You are using an overly complex and fugly test case. ;)  Seriously, a
stripped down test string does not require more than about 4 instances
of plain chars and HTML entities. Much easier on the eye.


> my @matches = m/[\001-\045\047-\177]|&(?!#x[0-9A-F]{4};)/g;

That RE is a single, straight-forward alternation with two alternatives.

The first one translates to a single char in a given, specific range.
Basically, anything but the ampersand. The second alternative is an
ampersand, that is not followed by #x.

The (?!pattern) is a zero-width negative look-ahead assertion. A zero
width means, it does not consume what it matches. Thus, the second
alternative ultimately will match a single ampersand only. The /g global
matching then continues where it left off after the last matching
attempt. In the case of that ampersand followed by #x, that still is
right after the ampersand.

  line: Th&#x0435; R
  matches: T,h,#,x,0,4,3,5,;, ,R

The offending ampersand part of the HTML entity encoding correctly is
not matched. The following chars do match the "anything but an
ampersand" first alternative.


I am unsure what you are trying to achieve. If you want to compare the
number of HTML entities with the number of regular chars, wouldn't it be
easier to simply drop them flat?

  $data =~ s/&#x[0-9A-F]{4};//g;

Or just plain match and count?

  @matches = $data =~ /&#x[0-9A-F]{4};/g;





Re: More text/plain questions

2014-07-02 Thread Amir Caspi
On Jul 2, 2014, at 12:58 PM, David F. Skoll  wrote:

> I don't think so.  Any MUA that tried to convert "&#x0435;" to a
> Unicode character in a text/plain part with implicit US-ASCII charset
> and 7bit content transfer encoding is broken.  An MUA should display
> exactly "&#x0435;" in this situation.  It's a different story for
> text/html parts, of course.

For what it's worth, I just received a spam that basically is the same as what 
Philip complained about.  I've posted a spample here:

http://pastebin.com/Y2YGwL49

There _is_ a text/html part, and that's what's displaying in my MUA (Apple 
Mail).

Sadly, as can be seen from the spample, the score doesn't quite reach 5.0 ... 
Bayes training could help since it only scored BAYES_50, but I'm wondering if 
this character encoding is designed to sidestep Bayes -- how does Bayes treat 
these for tokens?  If you randomize the characters being replaced (from 
plaintext to encoded), then there are lots of combinations for any given word, 
which then means each combination is a different token, no?  I don't know if 
spammers are taking the "care" to randomize the letter replacement, but if so, 
does this scheme actually "foil" Bayes due to each permutation being considered 
a different token?  If so, is there a way to mitigate that?
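
On the mitigation question, one conceivable approach is to fold known look-alike characters back to ASCII before tokenizing, so every permutation of an obfuscated word collapses to the same token. The sketch below is an assumption for illustration, not how SpamAssassin's Bayes tokenizer works: `fold_homoglyphs` is a hypothetical name and the mapping covers only a handful of Cyrillic letters.

```perl
use strict;
use warnings;

# Hypothetical helper: decode &#xNNNN; entities, then map a few Cyrillic
# look-alikes (U+0430 a, U+0435 e, U+043E o, U+0440 p, U+0441 c, U+0443 y,
# U+0445 x, U+0455 s, U+0456 i) onto the ASCII letters they imitate.
sub fold_homoglyphs {
    my ($text) = @_;
    $text =~ s/&#x([0-9A-F]{4});/chr(hex($1))/ge;
    $text =~ tr/\x{0430}\x{0435}\x{043E}\x{0440}\x{0441}\x{0443}\x{0445}\x{0455}\x{0456}/aeopcyxsi/;
    return $text;
}

print fold_homoglyphs('R&#x0435;&#x0430;l r&#x0435;&#x0430;&#x0455;&#x043E;n'), "\n";
```

Folding like this would also mangle genuine Cyrillic text, so it would probably make sense only for words that already mix scripts.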

I'm wondering if we shouldn't write a rule looking for lots of &#x0[0-9]{3}; 
patterns... say, 500 of them in one email.  Or, would we expect legitimate 
emails to have these?

Is there also a rule for UTF8-encoded Subject line?  If so, it didn't pop.

--- Amir



Re: More text/plain questions

2014-07-02 Thread John Hardin

On Wed, 2 Jul 2014, Philip Prindeville wrote:


On Jul 2, 2014, at 12:37 PM, John Hardin  wrote:


On Wed, 2 Jul 2014, Philip Prindeville wrote:


Given that it’s text/plain with an implicit charset=“us-ascii” and an implicit 
content-transfer-encoding of 7bit, the sequence &#x[0-9A-F]{4} doesn’t really 
parse into a 16-bit character, would it? That would be a broken MUA that made such 
a leap...


Nope. The content-transfer-encoding is only for the *transfer* part of the 
process. Once the content reaches the MUA that content can be further parsed by 
the MUA according to other encoding rules, such as these escape sequences for 
Unicode characters. That's perfectly valid. How else would you send, for 
example, a c-cedille in spanish text via a 7-bit-clean channel?


This is a trick question, right?

You do that with base64 or quoted-printable, which are the interoperable 
standards.


Apologies, you are right. I was focused on something else this morning and 
dashed off a fast - and wrong - answer.


--
 John Hardin KA7OHZhttp://www.impsec.org/~jhardin/
 jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  WSJ on the Financial Stimulus package: "...today there are 700,000
  fewer jobs than [the administration] predicted we would have if we
  had done nothing at all."
---
 2 days until the 238th anniversary of the Declaration of Independence

Re: More text/plain questions

2014-07-02 Thread Philip Prindeville
Okay, was tinkering with the code below but the zero-width lookahead is not 
disqualifying ampersand followed by #x[0-9A-F]{4}; so the output is bogus (you 
can run this and see what I mean).

What am I doing wrong?



#!/usr/bin/perl -w

use warnings;
use strict;

my $data = <<__EOF__;
Thе Rеаl RеаѕоnThе 
Ꮯоmіng 
Ꮯоllарѕе...Thе 
rеаl rеаѕоn ᎳHY 
HоmеlаndSеcurіtу 
rеcеntlу рurchаѕеd1.7 
Bіllіоn Rоundѕ оf 
аmmunіtіоn...Ꮃhаt Yоu 
Muѕt Dо Tо Ꭼnѕurе 
YоurSаfеtуHоmеlаnd 
ѕеcurіtу іѕ thеrе 
tо ѕеcurеthе hоmеlаnd 
оnlу... Sо thеѕе 
Ьullеtѕаrе rеаlу 
mеаnt fоr thеThіѕ іѕ 
аn 
еmаіlаdvеrtіѕеmеnt
 thаt wаѕ ѕеnt tо уоu 
Ьу Ρаtrіоt Survіvаl 
Ρlаn. If уоuwіѕh tо 
nоlоngеr rеcеіvе 
mеѕѕаgеѕ thаt 
рrоmоtе ѕurvіvаl 
tірѕ, 
рlеаѕеclіck hеrе 
tо unѕuЬѕcrіЬе.4 Unstable as 
water, thou shalt not excel because thou wentest up to thy fathers bed then 
defiledst thou it he went up to my couch.34 And Pharaohnechoh made Eliakim the 
son of Josiah king in the room of Josiah his father, and turned his name to 
Jehoiakim, and took Jehoahaz away and he came to Egypt, and died there.37  And 
the thing was good in the eyes of Pharaoh, and in the eyes o!
f all his servants.
__EOF__

my $chars = 0;     # running total of plain (non-entity) characters
my $uchars = 0;    # running total of &#xNNNN; entities

for (split("\n", $data)) {
    print STDERR "line: ", $_, "\n";

    # any 7-bit char except '&' (\046), or an '&' not followed by an entity body
    my @matches = m/[\001-\045\047-\177]|&(?!#x[0-9A-F]{4};)/g;
    print STDERR "matches: ", join(',', @matches), " count ", scalar @matches, "\n";
    $chars += scalar @matches;    # no "my" here, or the running total is shadowed
    print STDERR "chars: ", $chars, "\n";

    @matches = m/&#x[0-9A-F]{4};/g;
    print STDERR "matches: ", join(',', @matches), " count ", scalar @matches, "\n";
    $uchars += scalar @matches;   # likewise accumulate into the outer counter
    print STDERR "uchars: ", $uchars, "\n";

    print STDERR "\n";
}



Re: More text/plain questions

2014-07-02 Thread Philip Prindeville

On Jul 2, 2014, at 12:37 PM, John Hardin  wrote:

> On Wed, 2 Jul 2014, Philip Prindeville wrote:
> 
>> Given that it’s text/plain with an implicit charset=“us-ascii” and an 
>> implicit content-transfer-encoding of 7bit, the sequence &#x[0-9A-F]{4} 
>> doesn’t really parse into a 16-bit character, would it? That would be a 
>> broken MUA that made such a leap...
> 
> Nope. The content-transfer-encoding is only for the *transfer* part of the 
> process. Once the content reaches the MUA that content can be further parsed 
> by the MUA according to other encoding rules, such as these escape sequences 
> for Unicode characters. That's perfectly valid. How else would you send, for 
> example, a c-cedille in spanish text via a 7-bit-clean channel?

This is a trick question, right?

You do that with base64 or quoted-printable, which are the interoperable 
standards.

You don’t pick some implicit encoding which no one else has agreed upon.


> 
>> Wouldn’t that normally render as the character ‘&’, ‘#’, ‘x’, etc. rather 
>> than the unicode16 or UTF-8 character with that hex value?
> 
> I'd only expect that in a very old MUA (i.e. that does not support Unicode), 
> or display of the raw message content at user request.


How is it supposed to guess what the encoding implicitly means?  We have the 
MIME spec so that all of this is formally specified.


> 
>> I wouldn’t want a message where someone gives a couple of examples of 
>> encoding &#x0400; for instance being flagged as SPAM, but if the text is 20% 
>> or more of these sequences then I would say that’s SPAM-sign.
> 
> That's valid 7-bit encoding for transfer. It's relying on the user's MUA to 
> convert the encoded Unicode values to glyphs for display.

No, 7-bit CTE means it’s 7-bit content. Period.

If you want 8-bit or 16-bit or 32-bit content over a 7-bit CHANNEL, you use a 
7-bit safe encoding like base64 or quoted-printable.

Citing RFC-2045:

6.  Content-Transfer-Encoding Header Field

   Many media types which could be usefully transported via email are
   represented, in their "natural" format, as 8bit character or binary
   data.  Such data cannot be transmitted over some transfer protocols.
   For example, RFC 821 (SMTP) restricts mail messages to 7bit US-ASCII
   data with lines no longer than 1000 characters including any trailing
   CRLF line separator.

   It is necessary, therefore, to define a standard mechanism for
   encoding such data into a 7bit short line format.  Proper labelling
   of unencoded material in less restrictive formats for direct use over
   less restrictive transports is also desireable.  This document
   specifies that such encodings will be indicated by a new "Content-
   Transfer-Encoding" header field.  This field has not been defined by
   any previous standard.

…

6.2.  Content-Transfer-Encodings Semantics

   …

   The quoted-printable and base64 encodings transform their input from
   an arbitrary domain into material in the "7bit" range, thus making it
   safe to carry over restricted transports.  The specific definition of
   the transformations are given below.

   The proper Content-Transfer-Encoding label must always be used.
   Labelling unencoded data containing 8bit characters as "7bit" is not
   allowed, nor is labelling unencoded non-line-oriented data as
   anything other than "binary" allowed.

   …

   NOTE ON THE RELATIONSHIP BETWEEN CONTENT-TYPE AND CONTENT-TRANSFER-
   ENCODING: It may seem that the Content-Transfer-Encoding could be
   inferred from the characteristics of the media that is to be encoded,
   or, at the very least, that certain Content-Transfer-Encodings could
   be mandated for use with specific media types.  There are several
   reasons why this is not the case. First, given the varying types of
   transports used for mail, some encodings may be appropriate for some
   combinations of media types and transports but not for others.  (For
   example, in an 8bit transport, no encoding would be required for text
   in certain character sets, while such encodings are clearly required
   for 7bit SMTP.)

So you can’t infer the content-type from the content-transfer-encoding or 
vice-versa.

And RFC-2046:

4.1.2.  Charset Parameter

   A critical parameter that may be specified in the Content-Type field
   for "text/plain" data is the character set.  This is specified with a
   "charset" parameter, as in:

 Content-type: text/plain; charset=iso-8859-1

   Unlike some other parameter values, the values of the charset
   parameter are NOT case sensitive.  The default character set, which
   must be assumed in the absence of a charset parameter, is US-ASCII.

so you can’t render Unicode or UTF-8 or ISO-8859-X characters because the 
charset is implicitly US-ASCII and doesn’t have any characters beyond 0x7F 
(0111 1111 binary).

In short, it’s not Unicode unless it EXPLICITLY SAYS UNICODE.
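
The RFC 2046 default quoted above can be expressed as a small lookup. This is a simplified sketch rather than a full MIME parameter parser: `effective_charset` is a hypothetical name, and things like RFC 2231 continuations or comments in the header are ignored.

```perl
use strict;
use warnings;

# Hypothetical helper: return the charset a text/plain part must be read as.
# Per RFC 2046 section 4.1.2, a missing charset parameter means US-ASCII.
sub effective_charset {
    my ($content_type) = @_;
    return 'us-ascii' unless defined $content_type;
    if ($content_type =~ /;\s*charset\s*=\s*"?([^\s";]+)"?/i) {
        return lc $1;    # charset values are not case sensitive
    }
    return 'us-ascii';
}

print effective_charset('text/plain;'), "\n";
print effective_charset('text/plain; charset=ISO-8859-1'), "\n";
```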

And see also RFC-2152, which I won’t quote here.

Lastly, RFC-3629:

8.  MIME registration


   This memo ser

Re: More text/plain questions

2014-07-02 Thread David F. Skoll
On Wed, 2 Jul 2014 11:37:33 -0700 (PDT)
John Hardin  wrote:

> Nope. The content-transfer-encoding is only for the *transfer* part
> of the process. Once the content reaches the MUA that content can be
> further parsed by the MUA according to other encoding rules, such as
> these escape sequences for Unicode characters.

I don't think so.  Any MUA that tried to convert "&#x0435;" to a
Unicode character in a text/plain part with implicit US-ASCII charset
and 7bit content transfer encoding is broken.  An MUA should display
exactly "&#x0435;" in this situation.  It's a different story for
text/html parts, of course.

> That's perfectly valid. How else would you send, for example, a
> c-cedille in spanish text via a 7-bit-clean channel?

With the appropriate charset and content-transfer-encoding, such as
ISO-8859-1, quoted-printable, and =E7.
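
For illustration, the combination named here (ISO-8859-1 charset, quoted-printable transfer encoding, c-cedilla as =E7) can be reproduced with Perl's core Encode and MIME::QuotedPrint modules. The sample word is an arbitrary choice, and the matching Content-Type and Content-Transfer-Encoding headers are assumed to be set by whatever builds the message.

```perl
use strict;
use warnings;
use Encode ();
use MIME::QuotedPrint qw(encode_qp decode_qp);

# A c-cedilla (U+00E7) in otherwise plain text.
my $text  = "gar\x{E7}on\n";

# Characters -> ISO-8859-1 bytes -> 7-bit-safe quoted-printable.
my $bytes = Encode::encode('iso-8859-1', $text);
my $wire  = encode_qp($bytes);
print $wire;

# The receiving side reverses both steps using the declared headers.
my $back = Encode::decode('iso-8859-1', decode_qp($wire));
print "round trip ok\n" if $back eq $text;
```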

> I would say that's more a case of those characters shouldn't be
> present if the language is en-us than an encoding issue. The presence
> of lots of those is either a sign that the text isn't English, or is
> obfuscated. How do you reliably tell the language of the message?

I would say the presence of &#xABCD; in a text/plain part is either
a bug in spam-generating software or a researcher trying to send
something to a colleague. :)

Regards,

David.


Re: More text/plain questions

2014-07-02 Thread John Hardin

On Wed, 2 Jul 2014, John Hardin wrote:


On Wed, 2 Jul 2014, Philip Prindeville wrote:


 Given that it’s text/plain with an implicit charset=“us-ascii” and an
 implicit content-transfer-encoding of 7bit, the sequence &#x[0-9A-F]{4}
 doesn’t really parse into a 16-bit character, would it? That would be a
 broken MUA that made such a leap...


Nope. The content-transfer-encoding is only for the *transfer* part of the 
process. Once the content reaches the MUA that content can be further parsed 
by the MUA according to other encoding rules, such as these escape sequences 
for Unicode characters. That's perfectly valid. How else would you send, for 
example, a c-cedille in spanish text via a 7-bit-clean channel?



 Wouldn’t that normally render as the character ‘&’, ‘#’, ‘x’, etc. rather
 than the unicode16 or UTF-8 character with that hex value?


I'd only expect that in a very old MUA (i.e. that does not support Unicode), 
or display of the raw message content at user request.


...that said, I primarily use a text-based MUA, and it did not render 
Unicode glyphs for that sample...



Re: More text/plain questions

2014-07-02 Thread John Hardin

On Wed, 2 Jul 2014, Philip Prindeville wrote:

Given that it’s text/plain with an implicit charset=“us-ascii” and an 
implicit content-transfer-encoding of 7bit, the sequence &#x[0-9A-F]{4} 
doesn’t really parse into a 16-bit character, would it? That would be a 
broken MUA that made such a leap...


Nope. The content-transfer-encoding is only for the *transfer* part of the 
process. Once the content reaches the MUA that content can be further 
parsed by the MUA according to other encoding rules, such as these escape 
sequences for Unicode characters. That's perfectly valid. How else would 
you send, for example, a c-cedille in spanish text via a 7-bit-clean 
channel?


Wouldn’t that normally render as the character ‘&’, ‘#’, ‘x’, etc. 
rather than the unicode16 or UTF-8 character with that hex value?


I'd only expect that in a very old MUA (i.e. that does not support 
Unicode), or display of the raw message content at user request.


I wouldn’t want a message where someone gives a couple of examples of 
encoding &#x0400; for instance being flagged as SPAM, but if the text is 
20% or more of these sequences then I would say that’s SPAM-sign.


That's valid 7-bit encoding for transfer. It's relying on the user's MUA 
to convert the encoded Unicode values to glyphs for display.


I would say that's more a case of those characters shouldn't be present if 
the language is en-us than an encoding issue. The presence of lots of 
those is either a sign that the text isn't English, or is obfuscated. How 
do you reliably tell the language of the message?


It would probably be a good idea to add those sequences to the replacetags 
letter REs so that the FUZZY rules will catch them.

