Re: More text/plain questions
--On Wednesday, July 23, 2014 9:39 PM +0100 Martin Gregorie wrote: On Wed, 2014-07-23 at 11:45 -0600, Amir 'CG' Caspi wrote: I'm definitely considering writing a rule to catch �[0-9]{3}; patterns. I'm definitely worried it could cause FPs, but are there common circumstances where legitimate emails would include dozens to hundreds of these? (The latest FNs only include a few dozen, not the hundreds seen in the spample above.) This works for me:
describe MG_HEX_HTML Body contains too many HTML hex encodings
body MG_HEX_HTML /(.{0,3}\&\#x[0-9A-F]{4};){5}/
score MG_HEX_HTML 3.5
It is also used in a meta, along with some other simple local rules, to give hex-bearing spam an extra kick up the rear. I found that, in my mailstream anyway, there was generally not much else to write rules against, hence the high score. Spam arriving here gets quarantined: I look at the sender and subject as a matter of course and, if it looks like a possible FP, I'll look at the text too (I wrote a PHP viewer for quarantined spam a long time ago) but it appears that, after the brief squall of hex spam which made me write the rule, the promised spamstorm ended and so far has failed to restart. I've seen this rule hit several times for me today, all on definite spam. --Quanah -- Quanah Gibson-Mount Server Architect Zimbra, Inc. Zimbra :: the leader in open source messaging and collaboration
Re: More text/plain questions
On 7/25/2014 6:19 PM, Amir Caspi wrote: On Jul 25, 2014, at 4:11 PM, Kevin A. McGrail wrote: You should look at the patch on bug 7068 (https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7068) Yes, but this is within the code itself. I was referring to how to do this in a local.cf, for example... Amir It requires a plugin, sorry. Don't think you could do it without it.
Re: More text/plain questions
On Jul 25, 2014, at 4:11 PM, Kevin A. McGrail wrote: > You should look at the patch on bug 7068 > (https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7068) Yes, but this is within the code itself. I was referring to how to do this in a local.cf, for example... Amir
Re: More text/plain questions
On 7/25/2014 5:55 PM, Amir Caspi wrote: On Jul 24, 2014, at 4:08 PM, Philip Prindeville wrote: In text/plain with CTE of ‘7bit’ or ‘8bit’ it’s meaningless to use Unicode HTML entity encodings. It’s obviously not HTML. If you want Unicode in text/plain, it should be in base64 or quoted-printable CTE. Sure, but these spams also have text/html sections with the same characters. How do you check if the unicode entities are in the text/plain section, versus the entire body? My understanding was that body rules run on both text/plain and text/html -- is there a way to distinguish which section those entities are in? You should look at the patch on bug 7068 (https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7068):
if ($ctype eq 'text/plain' && ($cte eq '' || $cte eq '7bit' || $cte eq '8bit')) {
regards, KAM
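Editor's sketch of the condition quoted above, in Python rather than the Perl of the actual patch: it walks the MIME parts and counts hex character references only in text/plain parts whose Content-Transfer-Encoding is empty, 7bit or 8bit. The function name, the `&#x....;` pattern and the threshold of 5 are assumptions made for illustration; this is not the code from bug 7068.

```python
import re
from email import message_from_string

# Assumed pattern for hex character references such as &#x0435;
HEX_ENTITY = re.compile(r"&#x[0-9A-Fa-f]{4};")
# CTE values under which a text/plain body is taken literally (no decoding step)
UNENCODED_CTES = {"", "7bit", "8bit"}


def plain_parts_with_entities(raw_message: str, threshold: int = 5) -> list:
    """Return the entity count of each unencoded text/plain part at or above threshold."""
    msg = message_from_string(raw_message)
    hits = []
    for part in msg.walk():
        # Skip multipart containers, text/html and every other type
        if part.get_content_type() != "text/plain":
            continue
        cte = (part.get("Content-Transfer-Encoding") or "").strip().lower()
        if cte not in UNENCODED_CTES:
            continue
        payload = part.get_payload()
        if not isinstance(payload, str):
            continue
        count = len(HEX_ENTITY.findall(payload))
        if count >= threshold:
            hits.append(count)
    return hits
```

Because the text/html alternative is skipped, the same entities appearing there do not contribute to the result, which is the distinction Amir asks about.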
Re: More text/plain questions
On 7/23/2014 2:27 PM, Paul Stead wrote: KAM's rules are also helping add a few extra points I try. https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7068 and https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7063 I've also implemented several rules to try and catch these types of emails. 7063 is in trunk. 7068 is in testing now. Regards, KAM
Re: More text/plain questions
On Jul 24, 2014, at 4:08 PM, Philip Prindeville wrote: > In text/plain with CTE of ‘7bit’ or ‘8bit’ it’s meaningless to use Unicode > HTML entity encodings. It’s obviously not HTML. > > If you want Unicode in text/plain, it should be in base64 or quoted-printable > CTE. Sure, but these spams also have text/html sections with the same characters. How do you check if the unicode entities are in the text/plain section, versus the entire body? My understanding was that body rules run on both text/plain and text/html -- is there a way to distinguish which section those entities are in? --- Amir
Re: More text/plain questions
On Jul 24, 2014, at 4:48 PM, Amir 'CG' Caspi wrote: > On 2014-07-24 16:11, Philip Prindeville wrote: > >> You might have a shorter wait if you move to CentOS 6.5 instead. > I would, but the VPS software I'm using does not run on CentOS 6.x, only 5.x. > It's rather old software and I should convert to something else, but it's > not worth the time I don't have, so I'm stuck with it. >> And I can help you with the RPM’s. I’m a fedora/epel packager. > Awesome. Perhaps you want to make an SA 3.4 package for EPEL 5? ;-) Of > course, that helps more than just me... > > --- Amir Already done. I have no means to test it, however.
Re: More text/plain questions
On 2014-07-24 16:11, Philip Prindeville wrote: > You might have a shorter wait if you move to CentOS 6.5 instead. I would, but the VPS software I'm using does not run on CentOS 6.x, only 5.x. It's rather old software and I should convert to something else, but it's not worth the time I don't have, so I'm stuck with it. > And I can help you with the RPM's. I'm a fedora/epel packager. Awesome. Perhaps you want to make an SA 3.4 package for EPEL 5? ;-) Of course, that helps more than just me... --- Amir
Re: More text/plain questions
On Jul 23, 2014, at 1:21 PM, Amir 'CG' Caspi wrote: > On 2014-07-23 13:14, Axb wrote: >> doesn't your VPS offer you shell access? >> if yes, uninstall the SA rpm stuff and install SA 3.4 from source/trunk. > > I think I didn't explain properly. I'm running the dedicated server on which > there is VPS software. I need RPMs so that they get distributed to all the > client sites. Installing from source/trunk at the root level won't > distribute the tools to the individual sites. > > This is why I need 3.4 packaged as an rpm. > > I'm hoping someone will take up that task. 3.3.x was packaged as an rpm (on > EPEL and other repos), so hopefully 3.4 will be, too. > > Thanks. > > --- Amir > Sigh. Okay, I just did a blind build from fedpkg of spamassassin/master. http://fedorapeople.org/~philipp/spamassassin-3.4.0-7.el5.x86_64.rpm No warranties that this actually works. If you need i686 binaries I can make those too.
Re: More text/plain questions
On Jul 23, 2014, at 12:54 PM, Amir 'CG' Caspi wrote: >> >> Hope the patches above get pushed into production > Indeed, though I'm still running SA v3.3.x ... I'm on a CentOS 5.10 platform > and, because of the virtual-hosting control panel I use, I need my > software distributed in RPMs. Until someone builds a proper 3.4 rpm for > CentOS/RHEL 5, I'm stuck. (I could be the one to build it, but I'm certainly > no expert at RPMs.) > > --- Amir > You might have a shorter wait if you move to CentOS 6.5 instead. And I can help you with the RPM’s. I’m a fedora/epel packager. -Philip
Re: More text/plain questions
On Jul 23, 2014, at 11:45 AM, Amir 'CG' Caspi wrote: > On 2014-07-02 15:04, Amir Caspi wrote: >> For what it's worth, I just received a spam that basically is the same >> as what Philip complained about. I've posted a spample here: >> http://pastebin.com/Y2YGwL49 > [...] >> I'm wondering if we shouldn't write a rule looking for lots of >> �[0-9]{3}; patterns... say, 500 of them in one email. Or, would we >> expect legitimate emails to have these? > > So, to follow up on this... over the past couple of weeks I've been getting a > lot more FNs than normal, and almost every single one of these is an "encoded > character" spam like the example above. Bayes training does appear to work, > in that many of these FNs are already at BAYES_999... but there aren't enough > other rules hit to cause the FNs to cross the 5.0 threshold. (Other, similar > spams do cross the threshold, usually due to RAZOR and/or PYZOR hits.) > > Since these are basically unicode character encodings, is there a move to > translate all charsets to UTF-8 (or some other fixed standard) before > applying body and/or URI rules? That would, presumably, help with trying to > catch these. > > I'm definitely considering writing a rule to catch �[0-9]{3}; patterns. > I'm definitely worried it could cause FPs, but are there common circumstances > where legitimate emails would include dozens to hundreds of these? (The > latest FNs only include a few dozen, not the hundreds seen in the spample > above.) > > Otherwise, I'm not sure what "template" rule I could write to catch these > things, and they're increasing in frequency (with more and more being missed > as FNs). > > Thanks. > > -- Amir > In text/plain with CTE of ‘7bit’ or ‘8bit’ it’s meaningless to use Unicode HTML entity encodings. It’s obviously not HTML. If you want Unicode in text/plain, it should be in base64 or quoted-printable CTE. -Philip
Re: More text/plain questions
On 23/07/14 21:24, Axb wrote: look at the HTML source, sharply - there's tons of little traits to dump in a meta rule I have these 'traits' in my custom Clamav rules, but that's another list... :) -- Paul Stead, Zen Internet Systems Engineer
Re: More text/plain questions
On 07/23/2014 09:54 PM, Paul Stead wrote: Making use of the meta rules seems to be the best here - this spam is being very tricky to catch - I'll mirror my previous statement that the suggested patches do pick up on this spam too look at the HTML source, sharply - there's tons of little traits to dump in a meta rule
Re: More text/plain questions
On Wed, 2014-07-23 at 21:49 +0200, Axb wrote: > Centos 5.x is rather dated. > > Not sure there'd be such an old Fedora > equivalent offering SA 3.4 rpms. > I'll say - a quick search shows that Centos 7.x is current, and SA 3.4.0 arrived after Fedora 20 was released. > He'd have to find the equivalent Fedora version or just adapt a SA 3.4 > spec file and make his own RPMs. It's not that hard... > Actually, it's a bit mean of me to mention Fedora since it's more akin to Debian unstable than to a clone of an LTS distro (Centos being an RHEL clone) or even something like Debian stable. Fedora releases generally happen about every 6 months and become unsupported after a bit over a year. Martin
Re: More text/plain questions
On 07/23/2014 10:06 PM, Amir 'CG' Caspi wrote: On 2014-07-23 13:38, Axb wrote: If you're using spamd, why not run a/multiple dedicated VMs for SA 3.4 and have your other VMs use the spamd on the SA VMs ? There is a dedicated spamd. It's the other tools that need to be distributed, like sa-learn. Bayes rules are handled per-user. (No, I don't plan on changing this any time soon, it would be a herculean effort given the system setup.) so apparently you're left with deploying your own rpms by hacking a recent spec or hire someone who will do it for you. DIY ensures you're not left in the rain.
Re: More text/plain questions
On 2014-07-23 13:38, Axb wrote: If you're using spamd, why not run a/multiple dedicated VMs for SA 3.4 and have your other VMs use the spamd on the SA VMs ? There is a dedicated spamd. It's the other tools that need to be distributed, like sa-learn. Bayes rules are handled per-user. (No, I don't plan on changing this any time soon, it would be a herculean effort given the system setup.) --- Amir
Re: More text/plain questions
On 23/07/14 20:44, John Hardin wrote: On Wed, 23 Jul 2014, Paul Stead wrote: body __LOC_COUNT_UNI /x[0-9A-F]{4};/ tflags __LOC_COUNT_UNI multiple Recommend maxhits on that. Apologies, I omitted the max hits... If you're only looking for 10+ hits, then maxhits=11 will allow you to detect them with the minimum of wasted work. I have more rules to match up to 50, but you are right - good advice for anyone copying these, though I do prefer Martin's approach: On 23/07/14 20:39, Martin Gregorie wrote: body MG_HEX_HTML /(.{0,3}\&\#x[0-9A-F]{4};){5}/ Making use of the meta rules seems to be the best here - this spam is being very tricky to catch - I'll mirror my previous statement that the suggested patches do pick up on this spam too -- Paul Stead, Zen Internet Systems Engineer
Re: More text/plain questions
On 07/23/2014 09:43 PM, Martin Gregorie wrote: On Wed, 2014-07-23 at 13:21 -0600, Amir 'CG' Caspi wrote: I'm hoping someone will take up that task. 3.3.x was packaged as an rpm (on EPEL and other repos), so hopefully 3.4 will be, too. 3.4.0 is the standard SA package for Fedora, so I'd expect to find it on RHEL and their various clones as well. Centos 5.x is rather dated. Not sure there'd be such an old Fedora equivalent offering SA 3.4 rpms. He'd have to find the equivalent Fedora version or just adapt a SA 3.4 spec file and make his own RPMs. It's not that hard...
Re: More text/plain questions
On Wed, 23 Jul 2014, Paul Stead wrote: On 23/07/14 19:54, Amir 'CG' Caspi wrote: Care to share? Counting encoded chars is easy, of course. I use the following to count the encoded chars:
body __LOC_COUNT_UNI /x[0-9A-F]{4};/
tflags __LOC_COUNT_UNI multiple
Recommend maxhits on that. We can make some vars if we want:
meta __LOC_HAS_0_UNI (__LOC_COUNT_UNI == 0)
meta __LOC_HAS_10_UNI (__LOC_COUNT_UNI >= 10)
If you're only looking for 10+ hits, then maxhits=11 will allow you to detect them with the minimum of wasted work. -- John Hardin KA7OHZ http://www.impsec.org/~jhardin/ jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- Where are my space habitats? Where is my flying car? It's 2010 and all I got from the SF books of my youth is the lousy dystopian government. -- perlhaqr --- 783 days since the first successful private support mission to ISS (SpaceX)
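Editor's sketch of why the maxhits advice saves work, written in Python as a stand-in for the rule engine: counting stops as soon as the cap is reached, so a body with hundreds of entities costs no more than one with eleven. The helper names and the `&#x....;` pattern are illustrative assumptions, not SpamAssassin internals.

```python
import re

# Assumed pattern for hex character references such as &#x0435;
HEX_ENTITY = re.compile(r"&#x[0-9A-Fa-f]{4};")


def count_hits(text: str, maxhits: int) -> int:
    """Count matches lazily, stopping once maxhits have been seen."""
    count = 0
    for _ in HEX_ENTITY.finditer(text):
        count += 1
        if count >= maxhits:
            break  # no need to scan the rest of the body
    return count


def has_at_least(text: str, wanted: int = 10) -> bool:
    # Capping at wanted + 1 mirrors the "maxhits=11" suggestion for a ">= 10" test
    return count_hits(text, maxhits=wanted + 1) >= wanted
```

If higher thresholds are also tested (Paul mentions rules up to 50), the cap has to be at least the largest threshold in use.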
Re: More text/plain questions
On Wed, 2014-07-23 at 13:21 -0600, Amir 'CG' Caspi wrote: > I'm hoping someone will take up that task. 3.3.x was packaged as an rpm > (on EPEL and other repos), so hopefully 3.4 will be, too. > 3.4.0 is the standard SA package for Fedora, so I'd expect to find it on RHEL and their various clones as well. Martin
Re: More text/plain questions
On Wed, 2014-07-23 at 11:45 -0600, Amir 'CG' Caspi wrote: > I'm definitely considering writing a rule to catch �[0-9]{3}; > patterns. I'm definitely worried it could cause FPs, but are there > common circumstances where legitimate emails would include dozens to > hundreds of these? (The latest FNs only include a few dozen, not the > hundreds seen in the spample above.) > This works for me:
describe MG_HEX_HTML Body contains too many HTML hex encodings
body MG_HEX_HTML /(.{0,3}\&\#x[0-9A-F]{4};){5}/
score MG_HEX_HTML 3.5
It is also used in a meta, along with some other simple local rules, to give hex-bearing spam an extra kick up the rear. I found that, in my mailstream anyway, there was generally not much else to write rules against, hence the high score. Spam arriving here gets quarantined: I look at the sender and subject as a matter of course and, if it looks like a possible FP, I'll look at the text too (I wrote a PHP viewer for quarantined spam a long time ago) but it appears that, after the brief squall of hex spam which made me write the rule, the promised spamstorm ended and so far has failed to restart. Martin
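Editor's sketch for trying Martin's pattern against sample text outside SpamAssassin. The regex is the one from the rule above (the backslashes before `&` and `#` are not needed in Python); the function name is an assumption, and how SpamAssassin renders the body before applying a body rule is not modelled here.

```python
import re

# Five hex character references in a row, each preceded by at most three other characters
MG_HEX_HTML = re.compile(r"(.{0,3}&#x[0-9A-F]{4};){5}")


def mg_hex_html(body: str) -> bool:
    """Return True if the body contains a dense run of five hex entities."""
    return MG_HEX_HTML.search(body) is not None
```

The `.{0,3}` gap is what separates obfuscated words, where entities sit a few letters apart, from a message that merely quotes one or two entities as examples.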
Re: More text/plain questions
On 07/23/2014 09:21 PM, Amir 'CG' Caspi wrote: On 2014-07-23 13:14, Axb wrote: doesn't your VPS offer you shell access? if yes, uninstall the SA rpm stuff and install SA 3.4 from source/trunk. I think I didn't explain properly. I'm running the dedicated server on which there is VPS software. I need RPMs so that they get distributed to all the client sites. Installing from source/trunk at the root level won't distribute the tools to the individual sites. This is why I need 3.4 packaged as an rpm. I'm hoping someone will take up that task. 3.3.x was packaged as an rpm (on EPEL and other repos), so hopefully 3.4 will be, too. If you're using spamd, why not run a/multiple dedicated VMs for SA 3.4 and have your other VMs use the spamd on the SA VMs ?
Re: More text/plain questions
On 23/07/14 19:54, Amir 'CG' Caspi wrote: Care to share? Counting encoded chars is easy, of course. I use the following to count the encoded chars:
body __LOC_COUNT_UNI /x[0-9A-F]{4};/
tflags __LOC_COUNT_UNI multiple
We can make some vars if we want:
meta __LOC_HAS_0_UNI (__LOC_COUNT_UNI == 0)
meta __LOC_HAS_10_UNI (__LOC_COUNT_UNI >= 10)
I've noticed that they all come through as VERP emails:
header __LOC_VERP X-Envelope-From =~ /\=.*\.(com|net|org|biz)\@/
And a list of keywords that I've noticed:
header __LOC_VERP_AMAZON X-Envelope-From =~ /^amazon\-?_?coupons\-/i
Then add them together in a meta score:
meta LOC_UNI_SPAM (!BAYES_00) && ( __LOC_VERP + __LOC_VERP_AMAZON + __LOC_HAS_10_UNI >= 3)
score LOC_UNI_SPAM 0.001
This seems to only be catching the bad stuff, you could of course add some more magic:
meta LOC_UNI_SPAM_99 (BAYES_99 && LOC_UNI_SPAM)
score LOC_UNI_SPAM_99 .
...checking whether the MIME-encoding is text/plain may not be sufficient Though it's totally possible, I haven't gone as far as checking the encoding types etc, apart from the links to the patches I included... SA v3.3.x ... Me too, the patch works fine with it, I'm awaiting the Debian build for the production boxes, but running from source isn't too difficult either. Though I'm aware they're not the best for generic spam, they seem okay on these specific types (I suggest from the same source, looking at the styles of the email) - I've yet to test the rules on production. I've also noticed the following traits but I'm not sure how to find them:
* All emails have a message ID where the recipient's email address is contained in md5 - .@domain.com
* All emails to the same recipient have the same MIME boundary - possibly a hash of the recipient address
Paul -- Paul Stead, Zen Internet Systems Engineer
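Editor's sketch of how the LOC_UNI_SPAM meta above combines its sub-rules, in Python for experimentation. The two envelope regexes are transcribed from Paul's header rules; the `&#x....;` entity pattern, the function name and the plain boolean standing in for BAYES_00 are assumptions. The meta needs all three traits at once (sum >= 3) and is suppressed when Bayes considers the mail clean.

```python
import re

# Assumed pattern for hex character references such as &#x0435;
HEX_ENTITY = re.compile(r"&#x[0-9A-Fa-f]{4};")
# Transcribed from __LOC_VERP and __LOC_VERP_AMAZON
VERP = re.compile(r"=.*\.(com|net|org|biz)@")
VERP_AMAZON = re.compile(r"^amazon-?_?coupons-", re.IGNORECASE)


def loc_uni_spam(envelope_from: str, body: str, bayes_00: bool) -> bool:
    """Mirror (!BAYES_00) && (__LOC_VERP + __LOC_VERP_AMAZON + __LOC_HAS_10_UNI >= 3)."""
    has_10_uni = len(HEX_ENTITY.findall(body)) >= 10
    verp = VERP.search(envelope_from) is not None
    verp_amazon = VERP_AMAZON.search(envelope_from) is not None
    # Booleans add up as 0/1, just like rule hits in a meta expression
    return (not bayes_00) and (verp + verp_amazon + has_10_uni >= 3)
```

Since every term is required, the expression is equivalent to a plain AND of the three sub-rules; the sum form only starts to matter if the threshold is lowered or more traits are added.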
Re: More text/plain questions
On 2014-07-23 13:14, Axb wrote: doesn't your VPS offer you shell access? if yes, uninstall the SA rpm stuff and install SA 3.4 from source/trunk. I think I didn't explain properly. I'm running the dedicated server on which there is VPS software. I need RPMs so that they get distributed to all the client sites. Installing from source/trunk at the root level won't distribute the tools to the individual sites. This is why I need 3.4 packaged as an rpm. I'm hoping someone will take up that task. 3.3.x was packaged as an rpm (on EPEL and other repos), so hopefully 3.4 will be, too. Thanks. --- Amir
Re: More text/plain questions
On 07/23/2014 08:54 PM, Amir 'CG' Caspi wrote: Indeed, though I'm still running SA v3.3.x ... I'm on a CentOS 5.10 platform and, because of the virtual-hosting control panel I use, I need my software distributed in RPMs. Until someone builds a proper 3.4 rpm for CentOS/RHEL 5, I'm stuck. (I could be the one to build it, but I'm certainly no expert at RPMs.) doesn't your VPS offer you shell access? if yes, uninstall the SA rpm stuff and install SA 3.4 from source/trunk. if not, you're stuck with a KIA in the hope somebody will update it to a Lexus.
Re: More text/plain questions
On 2014-07-23 12:23, Paul Stead wrote: > I've also implemented several rules to try and catch these types of emails. Care to share? Counting encoded chars is easy, of course. One thing to note, webmail and my MUA often will render the encoded characters in their translated format, not literally (as hashes). I'm not sure if this is because the MIME encoding isn't claiming to be text/plain, or because the browser/MUA are trying to be helpful by not being strict... I haven't looked too deeply into it yet. Thus, checking whether the MIME-encoding is text/plain may not be sufficient, because not all of them might be trying to claim text/plain. > Hope the patches above get pushed into production Indeed, though I'm still running SA v3.3.x ... I'm on a CentOS 5.10 platform and, because of the virtual-hosting control panel I use, I need my software distributed in RPMs. Until someone builds a proper 3.4 rpm for CentOS/RHEL 5, I'm stuck. (I could be the one to build it, but I'm certainly no expert at RPMs.) --- Amir
Re: More text/plain questions
KAM's rules are also helping add a few extra points On 23/07/14 19:23, Paul Stead wrote: On 23/07/14 18:45, Amir 'CG' Caspi wrote: So, to follow up on this... over the past couple of weeks I've been getting a lot more FNs than normal, and almost every single one of these is an "encoded character" spam like the example above. Bayes training does appear to work, in that many of these FNs are already at BAYES_999... but there aren't enough other rules hit to cause the FNs to cross the 5.0 threshold. (Other, similar spams do cross the threshold, usually due to RAZOR and/or PYZOR hits.) Same here - I've had one particular user furious about this, laughable but still annoying. I'm definitely considering writing a rule to catch �[0-9]{3}; patterns. I'm definitely worried it could cause FPs, but are there common circumstances where legitimate emails would include dozens to hundreds of these? (The latest FNs only include a few dozen, not the hundreds seen in the spample above.) You might find the following useful https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7068 and https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7063 I've also implemented several rules to try and catch these types of emails. Namely counting the encoded chars and recognising other traits I've noticed with this type of mail. Hope the patches above get pushed into production -- Paul Stead, Zen Internet Systems Engineer
Re: More text/plain questions
On 23/07/14 18:45, Amir 'CG' Caspi wrote: So, to follow up on this... over the past couple of weeks I've been getting a lot more FNs than normal, and almost every single one of these is an "encoded character" spam like the example above. Bayes training does appear to work, in that many of these FNs are already at BAYES_999... but there aren't enough other rules hit to cause the FNs to cross the 5.0 threshold. (Other, similar spams do cross the threshold, usually due to RAZOR and/or PYZOR hits.) Same here - I've had one particular user furious about this, laughable but still annoying. I'm definitely considering writing a rule to catch �[0-9]{3}; patterns. I'm definitely worried it could cause FPs, but are there common circumstances where legitimate emails would include dozens to hundreds of these? (The latest FNs only include a few dozen, not the hundreds seen in the spample above.) You might find the following useful https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7068 and https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7063 I've also implemented several rules to try and catch these types of emails. Namely counting the encoded chars and recognising other traits I've noticed with this type of mail. Hope the patches above get pushed into production Paul -- Paul Stead, Zen Internet Systems Engineer
Re: More text/plain questions
On 2014-07-02 15:04, Amir Caspi wrote: For what it's worth, I just received a spam that basically is the same as what Philip complained about. I've posted a spample here: http://pastebin.com/Y2YGwL49 [...] I'm wondering if we shouldn't write a rule looking for lots of �[0-9]{3}; patterns... say, 500 of them in one email. Or, would we expect legitimate emails to have these? So, to follow up on this... over the past couple of weeks I've been getting a lot more FNs than normal, and almost every single one of these is an "encoded character" spam like the example above. Bayes training does appear to work, in that many of these FNs are already at BAYES_999... but there aren't enough other rules hit to cause the FNs to cross the 5.0 threshold. (Other, similar spams do cross the threshold, usually due to RAZOR and/or PYZOR hits.) Since these are basically unicode character encodings, is there a move to translate all charsets to UTF-8 (or some other fixed standard) before applying body and/or URI rules? That would, presumably, help with trying to catch these. I'm definitely considering writing a rule to catch �[0-9]{3}; patterns. I'm definitely worried it could cause FPs, but are there common circumstances where legitimate emails would include dozens to hundreds of these? (The latest FNs only include a few dozen, not the hundreds seen in the spample above.) Otherwise, I'm not sure what "template" rule I could write to catch these things, and they're increasing in frequency (with more and more being missed as FNs). Thanks. -- Amir
Re: More text/plain questions
On Mon, 07 Jul 2014 19:29:11 -0400 Daniel Staal wrote: > Just to start the discussion: I'd say default to UTF-8 if not > otherwise specified and can't be worked out. (How hard to work on > 'working it out' is a question, of course.) It's the growing > standard, as far as I can tell. +1. UTF-8 is the best choice. (Modern) Perl handles it very nicely. Even non-UTF-8 messages should be recoded into UTF-8 for body rules; otherwise, making a rule that looks for things like "抵押" will be well-nigh impossible. Regards, David.
Re: More text/plain questions
--As of July 7, 2014 5:20:01 PM -0400, Kevin A. McGrail is alleged to have said: On 7/7/2014 5:09 PM, Philip Prindeville wrote: On Jul 7, 2014, at 7:15 AM, Kevin A. McGrail wrote: On 7/7/2014 2:28 AM, John Wilcock wrote: Le 05/07/2014 19:08, Philip Prindeville a écrit : As for encoding a cyrillic small a: there are many ways to do this. iso-8859-4, utf-8, jp2212, gb2312, win1252, etc. I don’t think this would be very efficient—there are just too many charsets possible. Normalising the input message to UTF-8 before body checks would help somewhat with that. I seem to remember there's been talk of doing this. Yes, or utf-16... I think that will be necessary to keep SA effective in the modern world sooner than later. Okay, but… if the message body is non-ASCII and the CTE is 8bit or base64 and no explicit charset has been given, how do you know which translation to perform? I get a lot of Han SPAM in GB2312 where the charset is never specified (apparently it’s a national default in China, despite the requirements stated in RFC-2045 and -2046). Sorry, I haven't even started delving into the devilish details but I know it's looming as a needed feature. --As for the rest, it is mine. Just to start the discussion: I'd say default to UTF-8 if not otherwise specified and can't be worked out. (How hard to work on 'working it out' is a question, of course.) It's the growing standard, as far as I can tell. Even if it's wrong in a particular case, it would probably be useful: It would give rule writers something to work with. Daniel T. Staal --- This email copyright the author. Unless otherwise noted, you are expressly allowed to retransmit, quote, or otherwise use the contents for non-commercial purposes. This copyright will expire 5 years after the author's death, or in 30 years, whichever is longer, unless such a period is in excess of local copyright law. ---
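Editor's sketch of the fallback Daniel proposes, in Python: try the declared charset if there is one, fall back to UTF-8, and as a last resort decode with replacement characters so that rule writers still get text to match against. The function name is an assumption, and no charset guessing ("working it out") is attempted.

```python
def to_utf8_text(raw: bytes, declared_charset=None) -> str:
    """Decode body bytes: declared charset first, then UTF-8, then UTF-8 with replacement."""
    candidates = []
    if declared_charset:
        candidates.append(declared_charset)
    candidates.append("utf-8")
    for name in candidates:
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            # Unknown charset label or bytes that do not fit it: try the next candidate
            continue
    # Wrong in some cases (e.g. unlabeled GB2312), but never fails outright
    return raw.decode("utf-8", errors="replace")
```

For the unlabeled GB2312 mail Philip describes, this yields mostly replacement characters rather than Han text, which is exactly the gap a detection step would have to fill.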
Re: More text/plain questions
On 7/7/2014 5:09 PM, Philip Prindeville wrote: On Jul 7, 2014, at 7:15 AM, Kevin A. McGrail wrote: On 7/7/2014 2:28 AM, John Wilcock wrote: Le 05/07/2014 19:08, Philip Prindeville a écrit : As for encoding a cyrillic small a: there are many ways to do this. iso-8859-4, utf-8, jp2212, gb2312, win1252, etc. I don’t think this would be very efficient—there are just too many charsets possible. Normalising the input message to UTF-8 before body checks would help somewhat with that. I seem to remember there's been talk of doing this. Yes, or utf-16... I think that will be necessary to keep SA effective in the modern world sooner than later. Okay, but… if the message body is non-ASCII and the CTE is 8bit or base64 and no explicit charset has been given, how do you know which translation to perform? I get a lot of Han SPAM in GB2312 where the charset is never specified (apparently it’s a national default in China, despite the requirements stated in RFC-2045 and -2046). Sorry, I haven't even started delving into the devilish details but I know it's looming as a needed feature. regards, KAM
Re: More text/plain questions
On Jul 7, 2014, at 7:15 AM, Kevin A. McGrail wrote: > On 7/7/2014 2:28 AM, John Wilcock wrote: >> Le 05/07/2014 19:08, Philip Prindeville a écrit : >>> As for encoding a cyrillic small a: there are many ways to do this. >>> iso-8859-4, utf-8, jp2212, gb2312, win1252, etc. I don’t think this >>> would be very efficient—there are just too many charsets possible. >> >> Normalising the input message to UTF-8 before body checks would help >> somewhat with that. I seem to remember there's been talk of doing this. >> > Yes, or utf-16... I think that will be necessary to keep SA effective in the > modern world sooner than later. Okay, but… if the message body is non-ASCII and the CTE is 8bit or base64 and no explicit charset has been given, how do you know which translation to perform? I get a lot of Han SPAM in GB2312 where the charset is never specified (apparently it’s a national default in China, despite the requirements stated in RFC-2045 and -2046). -Philip
Re: More text/plain questions
On 7/7/2014 2:28 AM, John Wilcock wrote: Le 05/07/2014 19:08, Philip Prindeville a écrit : As for encoding a cyrillic small a: there are many ways to do this. iso-8859-4, utf-8, jp2212, gb2312, win1252, etc. I don’t think this would be very efficient—there are just too many charsets possible. Normalising the input message to UTF-8 before body checks would help somewhat with that. I seem to remember there's been talk of doing this. Yes, or utf-16... I think that will be necessary to keep SA effective in the modern world sooner than later.
Re: More text/plain questions
Le 05/07/2014 19:08, Philip Prindeville a écrit : As for encoding a cyrillic small a: there are many ways to do this. iso-8859-4, utf-8, jp2212, gb2312, win1252, etc. I don’t think this would be very efficient—there are just too many charsets possible. Normalising the input message to UTF-8 before body checks would help somewhat with that. I seem to remember there's been talk of doing this. -- John
Re: More text/plain questions
On Jul 4, 2014, at 12:08 AM, haman...@t-online.de wrote: > > Hi, > > while this is certainly not correct - and likely does not display in every > mail client - it would > probably work in several webmailers. Perhaps this is the configuration the > author of that > crap tested. > Now, I am somewhat reluctant to classify badly formatted mails as spam: there > are many > systems around, even from major players, that send legitimate mails like > order confirmation, > delivery notification, opted-in newsletters but do many of the formal things > more right than wrong > On the other side, looking at the actual characters shows that the message is > spam: these are > cyrillic letters that happen to look exactly like western ones (a, e, o or > such) so the obvious intent > is to avoid detection of the strings. We have seen the same with IDN domain > names that might > use a cyrillic a to register a domain that looks like, e.g. paypal.com > The list of characters is fairly short, so maybe checking for these > characters in all commonly > used variants (html entities, utf8 encoded, +u0430, \u0430. IDN encoded) > would be a good > spam indication > > Regards > Wolfgang > > I think you’re overlooking what a lot of tests already do: test for poor formatting. INVALID_DATE UNPARSEABLE_RELAY HTML_MISSING_CTYPE MISSING_HEADERS MISSING_DATE As for encoding a cyrillic small a: there are many ways to do this. iso-8859-4, utf-8, jp2212, gb2312, win1252, etc. I don’t think this would be very efficient—there are just too many charsets possible. -Philip
Re: More text/plain questions
>> I got the following MIME body part below, and I’m wondering if it would
>> make sense to filter on this as well.
>>
>> Given that it’s text/plain with an implicit charset=“us-ascii” and an
>> implicit content-transfer-encoding of 7bit, the sequence &#x[0-9A-F]{4}
>> doesn’t really parse into a 16-bit character, would it? That would be a
>> broken MUA that made such a leap...
>>
>> Wouldn’t that normally render as the character ‘&’, ‘#’, ‘x’, etc. rather
>> than the unicode16 or UTF-8 character with that hex value?
>>
>> There might be times when someone has sent an attachment improperly
>> encoded this way which might have embedded binary values in it, but
>> that’s kind of buggy anyway… it should have been done as base64 and
>> application/octet-stream in the worst of cases if it has arbitrary binary
>> data.
>>
>> I wouldn’t want a message where someone gives a couple of examples of
>> encoding &#x0400; for instance being flagged as SPAM, but if the text is
>> 20% or more of these sequences then I would say that’s SPAM-sign.
>>
>> Anyway, here’s the body I saw:
>>
>> --1388-8200-b67c-e579-9c27-df36-12fa-a2eb
>> Content-Type: text/plain;
>>
>> Thе Rеаl
>> RеаѕоnThе Ꮯоmіng
>> Ꮯоllарѕе...Thе
>> rеаl rеаѕоn ᎳHY
>> HоmеlаndSеcurіtу
>> rеcеntlу рurchаѕеd1.7
>> Bіllіоn Rоundѕ оf
>> аmmunіtіоn...Ꮃhаt Yоu
>> Muѕt Dо Tо Ꭼnѕurе
>> YоurSаfеtуHоmеlаnd
>> ѕеcurіtу іѕ thеrе
>> tо ѕеcurеthе
>> hоmеlаnd оnlу... Sо
>> thеѕе Ьullеtѕаrе
>> rеаlу mеаnt fоr
>> thеThіѕ іѕ аn
>> еmаіlаdvеrtіѕеmеnt
>> thаt wаѕ ѕеnt tо
>> уоu Ьу Ρаtrіоt
>> Survіvаl Ρlаn. If
>> уоuwіѕh tо
>> nоlоngеr rеcеіvе
>> mеѕѕаgеѕ thаt
>> рrоmоtе ѕurvіvаl
>> tірѕ,
>> рlеаѕеclіck hеrе
>> tо unѕuЬѕcrіЬе.4
>> Unstable as water, thou shalt not excel because thou wentest up to thy
>> fathers bed then defiledst thou it he went up to my couch.34 And
>> Pharaohnechoh made Eliakim the son of Josiah king in the room of Josiah
>> his father, and turned his name to Jehoiakim, and took Jehoahaz away and
>> he came to Egypt, and died there.37 And the thing was good in the eyes
>> of Pharaoh, and in the eyes of all his servants.
>>
>> --1388-8200-b67c-e579-9c27-df36-12fa-a2eb

Hi,

while this is certainly not correct - and likely does not display in every mail client - it would probably work in several webmailers. Perhaps this is the configuration the author of that crap tested.

Now, I am somewhat reluctant to classify badly formatted mails as spam: there are many systems around, even from major players, that send legitimate mails like order confirmation, delivery notification, opted-in newsletters but do many of the formal things more right than wrong.

On the other side, looking at the actual characters shows that the message is spam: these are cyrillic letters that happen to look exactly like western ones (a, e, o or such) so the obvious intent is to avoid detection of the strings. We have seen the same with IDN domain names that might use a cyrillic a to register a domain that looks like, e.g. paypal.com

The list of characters is fairly short, so maybe checking for these characters in all commonly used variants (html entities, utf8 encoded, +u0430, \u0430, IDN encoded) would be a good spam indication.

Regards
Wolfgang
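A rough sketch of the lookalike check suggested above, assuming the body has already been decoded into Perl characters. The helper name mixed_script_words is hypothetical, and flagging words that mix Latin and Cyrillic letters is only one possible heuristic; nothing in the thread says SpamAssassin does this.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the whitespace-separated words that contain
# both Latin and Cyrillic letters. Ordinary Russian or English text should
# rarely produce such words; lookalike obfuscation produces many.
sub mixed_script_words {
    my ($text) = @_;
    return grep { /\p{Latin}/ && /\p{Cyrillic}/ } split /\s+/, $text;
}

# "The Real reason", with U+0435 and U+0430 standing in for Latin e and a.
my @hits = mixed_script_words("Th\x{0435} R\x{0435}\x{0430}l reason");
print scalar(@hits), " mixed-script words\n";
```

Working on decoded characters sidesteps the "too many charsets" objection for this one check, since the comparison happens after whatever decoding the mail parser has already done.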
Re: More text/plain questions
On 7/2/2014 5:04 PM, Amir Caspi wrote: Is there also a rule for UTF8-encoded Subject line? If so, it didn't pop. Just a quick note about this part of your email. This is extremely common to use UTF-8 and I doubt it would be an indicator of spam vs ham. I wouldn't even bother looking...
Re: More text/plain questions
On Wed, 2014-07-02 at 19:10 -0600, Philip Prindeville wrote:
> On Jul 2, 2014, at 5:16 PM, Karsten Bräckelmann wrote:
> > That RE is a single, straight-forward alternation with two alternatives.
> >
> > The first one translates to a single char in a given, specific range.
> > Basically, anything but the ampersand. The second alternative is an
> > ampersand, that is not followed by #x.
> >
> > The (?!pattern) is a zero-width negative look-ahead assertion. A zero
> > width means, it does not consume what it matches. Thus, the second
> > alternation ultimately will match a single ampersand only. The /g global
> > matching then continues where it left off after the last matching
> > attempt. In the case of that ampersand followed by #x, that still is
> > right after the ampersand.
>
> Okay, so what I was trying to do is skip any ampersand followed by
> #x; as part of the matched text (but include ampersands not
> followed by #x; as part of the match).

That is the result of the plain s/&#x[0-9A-F]{4};//g global substitution I posted.

You should define what you ultimately want to achieve. Not what you right now think is a stepping stone and part of the solution.

> So that if I had the text:
>
>   This that & those.
>
> The first @match would be counted as $chars:
>
>   T,h,i,s, ,t,h,a,t, ,&, ,t,h,o,s,.
>
> and the 2nd @match would be:
>
>   e
>
> counting as $uchars.
>
> So in the first case, the e would be skipped over as part of the
> capture.

Skipped over, since it is part of the capture. That kind of contradicts itself...

Do you want all of those (HTML entity string) matches? The raw matches themselves? Or is that just an attempt at debug visualization? Do you actually want its number only?

This has quite an impact on the Perl code and logic / math involved.

Number of HTML entity escapes, length(char) of remainder:

  my $number = $data =~ s/&#x[0-9A-F]{4};//g;
  print "number: ", $number, "\n";
  print "other: ", length $data, " = '", $data, "'\n";

If you do need the complete HTML entity escapes, a quick hack to compute the remainder:

  my @matches = $data =~ /&#x[0-9A-F]{4};/g;
  print "matches: ", scalar @matches, " = ", join(',', @matches), "\n";
  print "other: ", length ($data) - 8*(scalar @matches), "\n";

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
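The two counts above can be combined into the "20% or more" share mentioned earlier in the thread. This is a minimal sketch under assumptions: the helper name entity_stats is hypothetical, the share is measured in characters of raw text, and only the 4-digit uppercase hex form discussed here is counted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns (number of &#xNNNN; escapes,
# number of other characters, share of the raw text taken by escapes).
sub entity_stats {
    my ($data) = @_;
    my $total = length $data;
    # "= () =" forces list context on the /g match and yields the match count.
    my $escapes = () = $data =~ /&#x[0-9A-F]{4};/g;
    my $other   = $total - 8 * $escapes;    # each escape is 8 chars long
    my $share   = $total ? (8 * $escapes) / $total : 0;
    return ($escapes, $other, $share);
}

my ($escapes, $other, $share) = entity_stats('Th&#x0435; R&#x0435;&#x0430;l');
printf "escapes=%d other=%d share=%.2f\n", $escapes, $other, $share;
```

Unlike the s///g variant, this leaves $data untouched, which matters if the same text is examined by other checks afterwards.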
Re: More text/plain questions
On Jul 2, 2014, at 5:16 PM, Karsten Bräckelmann wrote:

> On Wed, 2014-07-02 at 14:44 -0600, Philip Prindeville wrote:
>> Okay, was tinkering with the code below but the zero-width lookahead is
>> not disqualifying ampersand followed by #x[0-9A-F]{4}; so the output
>> is bogus (you can run this and see what I mean).
>>
>> What am I doing wrong?
>
> You are using an overly complex and fugly test case. ;) Seriously, a
> stripped down test string does not require more than about 4 instances
> of plain chars and HTML entities. Much easier on the eye.
>
>> my @matches = m/[\001-\045\047-\177]|&(?!#x[0-9A-F]{4};)/g;
>
> That RE is a single, straight-forward alternation with two alternatives.
>
> The first one translates to a single char in a given, specific range.
> Basically, anything but the ampersand. The second alternative is an
> ampersand, that is not followed by #x.
>
> The (?!pattern) is a zero-width negative look-ahead assertion. A zero
> width means, it does not consume what it matches. Thus, the second
> alternation ultimately will match a single ampersand only. The /g global
> matching then continues where it left off after the last matching
> attempt. In the case of that ampersand followed by #x, that still is
> right after the ampersand.
>
>   line: Th&#x0435; R
>   matches: T,h,#,x,0,4,3,5,;, ,R

Okay, so what I was trying to do is skip any ampersand followed by #x; as part of the matched text (but include ampersands not followed by #x; as part of the match).

So that if I had the text:

  This that & those.

The first @match would be counted as $chars:

  T,h,i,s, ,t,h,a,t, ,&, ,t,h,o,s,.

and the 2nd @match would be:

  e

counting as $uchars.

So in the first case, the e would be skipped over as part of the capture.

What’s the opposite of a zero-width lookahead? I.e. a match that advances the cursor but doesn’t copy the matching text into the capture buffer?

> The offending ampersand part of the HTML entity encoding correctly is
> not matched. The following chars do match the "anything but an
> ampersand" first alternative.
>
> I am unsure what you are trying to achieve. If you want to compare the
> number of HTML entities with the number of regular chars, wouldn't it be
> easier to simply drop them flat?
>
>   $data =~ s/&#x[0-9A-F]{4};//g;
>
> Or just plain match and count?
>
>   @matches = $data =~ /&#x[0-9A-F]{4};/g;
>
> --
> char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
> main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
> (c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
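One way to get "advance the cursor without capturing" is an alternation in which the escape is matched with no capture group and every other character is captured. The sketch below is an illustration, not something posted in the thread; the helper name split_chars_and_escapes is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns (array ref of plain characters, escape count).
sub split_chars_and_escapes {
    my ($line) = @_;
    # First alternative consumes a whole &#xNNNN; escape and captures nothing,
    # so the list-context result holds undef for it. Second alternative
    # captures any other single character, including a bare '&'.
    my @chars  = grep { defined } $line =~ /&#x[0-9A-F]{4};|(.)/g;
    my $uchars = () = $line =~ /&#x[0-9A-F]{4};/g;
    return (\@chars, $uchars);
}

my ($chars, $uchars) = split_chars_and_escapes('Th&#x0435; R & D');
print "chars:  ", join(',', @$chars), " count ", scalar @$chars, "\n";
print "uchars: $uchars\n";
```

Because the escape alternative comes first, an ampersand that starts an escape is consumed along with the rest of it, and the engine never gets the chance to match the trailing "#x0435;" as plain characters.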
Re: More text/plain questions
On Wed, 2014-07-02 at 14:44 -0600, Philip Prindeville wrote:
> Okay, was tinkering with the code below but the zero-width lookahead is
> not disqualifying ampersand followed by #x[0-9A-F]{4}; so the output
> is bogus (you can run this and see what I mean).
>
> What am I doing wrong?

You are using an overly complex and fugly test case. ;) Seriously, a stripped down test string does not require more than about 4 instances of plain chars and HTML entities. Much easier on the eye.

> my @matches = m/[\001-\045\047-\177]|&(?!#x[0-9A-F]{4};)/g;

That RE is a single, straight-forward alternation with two alternatives.

The first one translates to a single char in a given, specific range. Basically, anything but the ampersand. The second alternative is an ampersand, that is not followed by #x.

The (?!pattern) is a zero-width negative look-ahead assertion. A zero width means, it does not consume what it matches. Thus, the second alternation ultimately will match a single ampersand only. The /g global matching then continues where it left off after the last matching attempt. In the case of that ampersand followed by #x, that still is right after the ampersand.

  line: Th&#x0435; R
  matches: T,h,#,x,0,4,3,5,;, ,R

The offending ampersand part of the HTML entity encoding correctly is not matched. The following chars do match the "anything but an ampersand" first alternative.

I am unsure what you are trying to achieve. If you want to compare the number of HTML entities with the number of regular chars, wouldn't it be easier to simply drop them flat?

  $data =~ s/&#x[0-9A-F]{4};//g;

Or just plain match and count?

  @matches = $data =~ /&#x[0-9A-F]{4};/g;

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
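The stripped-down test case described above fits in a few lines. The regex is the one from the original script; the wrapper name plain_char_matches is hypothetical and only there to make the behaviour easy to call and check.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the RE under discussion.
sub plain_char_matches {
    my ($line) = @_;
    # Alternative 1: any char in \001-\177 except '&' (octal 046).
    # Alternative 2: an '&' NOT followed by an #xNNNN; escape tail.
    return $line =~ /[\001-\045\047-\177]|&(?!#x[0-9A-F]{4};)/g;
}

# The '&' of the escape fails both alternatives, so the engine simply moves
# on one position; '#', 'x', the digits and ';' then match alternative 1.
my @matches = plain_char_matches('Th&#x0435; R');
print "matches: ", join(',', @matches), "\n";   # matches: T,h,#,x,0,4,3,5,;, ,R
```

The look-ahead rejects only the ampersand itself. It does nothing to stop the characters after it from matching on later attempts, which is why the output looked bogus.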
Re: More text/plain questions
On Jul 2, 2014, at 12:58 PM, David F. Skoll wrote:
> I don't think so. Any MUA that tried to convert "&#x0435;" to a
> Unicode character in a text/plain part with implicit US-ASCII charset
> and 7bit content transfer encoding is broken. An MUA should display
> exactly "&#x0435;" in this situation. It's a different story for
> text/html parts, of course.

For what it's worth, I just received a spam that basically is the same as what Philip complained about. I've posted a spample here:

http://pastebin.com/Y2YGwL49

There _is_ a text/html part, and that's what's displaying in my MUA (Apple Mail).

Sadly, as can be seen from the spample, the score doesn't quite reach 5.0 ... Bayes training could help since it only scored BAYES_50, but I'm wondering if this character encoding is designed to sidestep Bayes -- how does Bayes treat these for tokens? If you randomize the characters being replaced (from plaintext to encoded), then there are lots of combinations for any given word, which then means each combination is a different token, no? I don't know if spammers are taking the "care" to randomize the letter replacement, but if so, does this scheme actually "foil" Bayes due to each permutation being considered a different token? If so, is there a way to mitigate that?

I'm wondering if we shouldn't write a rule looking for lots of &#[0-9]{3}; patterns... say, 500 of them in one email. Or, would we expect legitimate emails to have these?

Is there also a rule for UTF8-encoded Subject line? If so, it didn't pop.

--- Amir
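On the token question, one conceivable mitigation is to decode numeric character references before tokenizing, so that differently escaped spellings of the same character collapse into one form. The sketch below only illustrates that idea; decode_numeric_refs is a hypothetical name, and the thread does not establish whether SpamAssassin's Bayes tokenizer does anything like this.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace &#xHHHH; and &#DDD; references with the
# characters they name, in a single pass so decoded text is not re-scanned.
sub decode_numeric_refs {
    my ($text) = @_;
    $text =~ s/&#(?:x([0-9A-Fa-f]{1,6})|([0-9]{1,7}));/chr(defined $1 ? hex($1) : $2)/ge;
    return $text;
}

# Hex and decimal escapes of U+0435 decode to the same string.
for my $raw ('r&#x0435;al', 'r&#1077;al') {
    my $decoded = decode_numeric_refs($raw);
    print "$raw => ", join(' ', map { sprintf 'U+%04X', ord } split //, $decoded), "\n";
}
```

Decoding does not remove the lookalike letters themselves; a word with a Cyrillic "е" is still a different token from the all-Latin word, just one token instead of many.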
Re: More text/plain questions
On Wed, 2 Jul 2014, Philip Prindeville wrote:

> On Jul 2, 2014, at 12:37 PM, John Hardin wrote:
>
>> On Wed, 2 Jul 2014, Philip Prindeville wrote:
>>
>>> Given that it’s text/plain with an implicit charset=“us-ascii” and an
>>> implicit content-transfer-encoding of 7bit, the sequence &#x[0-9A-F]{4}
>>> doesn’t really parse into a 16-bit character, would it? That would be a
>>> broken MUA that made such a leap...
>>
>> Nope. The content-transfer-encoding is only for the *transfer* part of the
>> process. Once the content reaches the MUA that content can be further parsed
>> by the MUA according to other encoding rules, such as these escape sequences
>> for Unicode characters. That's perfectly valid. How else would you send, for
>> example, a c-cedille in spanish text via a 7-bit-clean channel?
>
> This is a trick question, right? You do that with base64 or
> quoted-printable, which are the interoperable standards.

Apologies, you are right. I was focused on something else this morning and dashed off a fast - and wrong - answer.

-- 
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
WSJ on the Financial Stimulus package: "...today there are 700,000 fewer jobs than [the administration] predicted we would have if we had done nothing at all."
---
2 days until the 238th anniversary of the Declaration of Independence
Re: More text/plain questions
Okay, was tinkering with the code below but the zero-width lookahead is not disqualifying ampersand followed by #x[0-9A-F]{4}; so the output is bogus (you can run this and see what I mean). What am I doing wrong? #!/usr/bin/perl -w use warnings; use strict; my $data = <<__EOF__; Thе Rеаl RеаѕоnThе Ꮯоmіng Ꮯоllарѕе...Thе rеаl rеаѕоn ᎳHY HоmеlаndSеcurіtу rеcеntlу рurchаѕеd1.7 Bіllіоn Rоundѕ оf аmmunіtіоn...Ꮃhаt Yоu Muѕt Dо Tо Ꭼnѕurе YоurSаfеtуHоmеlаnd ѕеcurіtу іѕ thеrе tо ѕеcurеthе hоmеlаnd оnlу... Sо thеѕе Ьullеtѕаrе rеаlу mеаnt fоr thеThіѕ іѕ аn еmаіlаdvеrtіѕеmеnt thаt wаѕ ѕеnt tо уоu Ьу Ρаtrіоt Survіvаl Ρlаn. If уоuwіѕh tо nоlоngеr rеcеіvе mеѕѕаgеѕ thаt рrоmоtе ѕurvіvаl tірѕ, рlеаѕеclіck hеrе tо unѕuЬѕcrіЬе.4 Unstable as water, thou shalt not excel because thou wentest up to thy fathers bed then defiledst thou it he went up to my couch.34 And Pharaohnechoh made Eliakim the son of Josiah king in the room of Josiah his father, and turned his name to Jehoiakim, and took Jehoahaz away and he came to Egypt, and died there.37 And the thing was good in the eyes of Pharaoh, and in the eyes o! f all his servants. __EOF__ my $chars = 0; my $uchars = 0; for (split("\n", $data)) { print STDERR "line: ", $_, "\n"; my @matches = m/[\001-\045\047-\177]|&(?!#x[0-9A-F]{4};)/g; print STDERR "matches: ", join(',', @matches), " count ", scalar @matches, "\n"; my $chars += scalar @matches; print STDERR "chars: ", $chars, "\n"; @matches = m/[0-9A-F]{4};/g; print STDERR "matches: ", join(',', @matches), " count ", scalar @matches, "\n"; my $uchars += scalar @matches; print STDERR "uchars: ", $uchars, "\n"; print STDERR "\n"; }
Re: More text/plain questions
On Jul 2, 2014, at 12:37 PM, John Hardin wrote:

> On Wed, 2 Jul 2014, Philip Prindeville wrote:
>
>> Given that it’s text/plain with an implicit charset=“us-ascii” and an
>> implicit content-transfer-encoding of 7bit, the sequence &#x[0-9A-F]{4}
>> doesn’t really parse into a 16-bit character, would it? That would be a
>> broken MUA that made such a leap...
>
> Nope. The content-transfer-encoding is only for the *transfer* part of the
> process. Once the content reaches the MUA that content can be further parsed
> by the MUA according to other encoding rules, such as these escape sequences
> for Unicode characters. That's perfectly valid. How else would you send, for
> example, a c-cedille in spanish text via a 7-bit-clean channel?

This is a trick question, right? You do that with base64 or quoted-printable, which are the interoperable standards. You don’t pick some implicit encoding which no one else has agreed upon.

>> Wouldn’t that normally render as the character ‘&’, ‘#’, ‘x’, etc. rather
>> than the unicode16 or UTF-8 character with that hex value?
>
> I'd only expect that in a very old MUA (i.e. that does not support Unicode),
> or display of the raw message content at user request.

How is it supposed to guess what the encoding implicitly means? We have the MIME spec so that all of this is formally specified.

>> I wouldn’t want a message where someone gives a couple of examples of
>> encoding &#x0400; for instance being flagged as SPAM, but if the text is 20%
>> or more of these sequences then I would say that’s SPAM-sign.
>
> That's valid 7-bit encoding for transfer. It's relying on the user's MUA to
> convert the encoded Unicode values to glyphs for display.

No, 7-bit CTE means it’s 7-bit content. Period. If you want 8-bit or 16-bit or 32-bit content over a 7-bit CHANNEL, you use a 7-bit safe encoding like base64 or quoted-printable.

Citing RFC-2045:

   6. Content-Transfer-Encoding Header Field

   Many media types which could be usefully transported via email are
   represented, in their "natural" format, as 8bit character or binary
   data. Such data cannot be transmitted over some transfer protocols.
   For example, RFC 821 (SMTP) restricts mail messages to 7bit US-ASCII
   data with lines no longer than 1000 characters including any trailing
   CRLF line separator.

   It is necessary, therefore, to define a standard mechanism for
   encoding such data into a 7bit short line format. Proper labelling
   of unencoded material in less restrictive formats for direct use over
   less restrictive transports is also desireable. This document
   specifies that such encodings will be indicated by a new
   "Content-Transfer-Encoding" header field. This field has not been
   defined by any previous standard.

   …

   6.2. Content-Transfer-Encodings Semantics

   …

   The quoted-printable and base64 encodings transform their input from
   an arbitrary domain into material in the "7bit" range, thus making it
   safe to carry over restricted transports. The specific definition of
   the transformations are given below.

   The proper Content-Transfer-Encoding label must always be used.
   Labelling unencoded data containing 8bit characters as "7bit" is not
   allowed, nor is labelling unencoded non-line-oriented data as anything
   other than "binary" allowed.

   …

   NOTE ON THE RELATIONSHIP BETWEEN CONTENT-TYPE AND
   CONTENT-TRANSFER-ENCODING: It may seem that the
   Content-Transfer-Encoding could be inferred from the characteristics
   of the media that is to be encoded, or, at the very least, that
   certain Content-Transfer-Encodings could be mandated for use with
   specific media types. There are several reasons why this is not the
   case. First, given the varying types of transports used for mail, some
   encodings may be appropriate for some combinations of media types and
   transports but not for others. (For example, in an 8bit transport, no
   encoding would be required for text in certain character sets, while
   such encodings are clearly required for 7bit SMTP.)

So you can’t infer the content-type from the content-transfer-encoding or vice-versa.

And RFC-2046:

   4.1.2. Charset Parameter

   A critical parameter that may be specified in the Content-Type field
   for "text/plain" data is the character set. This is specified with a
   "charset" parameter, as in:

     Content-type: text/plain; charset=iso-8859-1

   Unlike some other parameter values, the values of the charset
   parameter are NOT case sensitive. The default character set, which
   must be assumed in the absence of a charset parameter, is US-ASCII.

so you can’t render Unicode or UTF-8 or ISO-8859-X characters because the charset is implicitly US-ASCII and doesn’t have any characters beyond 0111 1111 binary (0x7F).

In short, it’s not Unicode unless it EXPLICITLY SAYS UNICODE.

And see also RFC-2152, which I won’t quote here.

Lastly, RFC-3629:

   8. MIME registration

   This memo ser
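The default-charset rule quoted from RFC 2046 is easy to state in code. This is a deliberately naive sketch, not a MIME parser: effective_charset and is_7bit_range are hypothetical helpers, the header parsing is a single regex, and the 7bit check looks only at the octet range, not at the line-length and NUL restrictions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: charset named in a Content-Type value, lowercased,
# or "us-ascii" when the parameter (or the whole header) is absent.
sub effective_charset {
    my ($content_type) = @_;
    return 'us-ascii' unless defined $content_type;
    return lc $1 if $content_type =~ /;\s*charset\s*=\s*"?([^\s";]+)"?/i;
    return 'us-ascii';
}

# Hypothetical helper: true if every octet is in the 7-bit range.
sub is_7bit_range {
    my ($body) = @_;
    return $body !~ /[^\x00-\x7F]/ ? 1 : 0;
}

my $body = 'Th&#x0435; R&#x0435;&#x0430;l';
print "charset: ", effective_charset('text/plain;'), "\n";
print "7bit range: ", is_7bit_range($body), "\n";
```

For the sample part, both checks agree with the argument made above: the part is plain US-ASCII, so the "&#x0435;" sequences are just eight ordinary ASCII characters each.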
Re: More text/plain questions
On Wed, 2 Jul 2014 11:37:33 -0700 (PDT) John Hardin wrote:

> Nope. The content-transfer-encoding is only for the *transfer* part
> of the process. Once the content reaches the MUA that content can be
> further parsed by the MUA according to other encoding rules, such as
> these escape sequences for Unicode characters.

I don't think so. Any MUA that tried to convert "&#x0435;" to a Unicode character in a text/plain part with implicit US-ASCII charset and 7bit content transfer encoding is broken. An MUA should display exactly "&#x0435;" in this situation. It's a different story for text/html parts, of course.

> That's perfectly valid. How else would you send, for example, a
> c-cedille in spanish text via a 7-bit-clean channel?

With the appropriate charset and content-transfer-encoding, such as ISO-8859-1, quoted-printable, and =E7.

> I would say that's more a case of those characters shouldn't be
> present if the language is en-us than an encoding issue. The presence
> of lots of those is either a sign that the text isn't English, or is
> obfuscated. How do you reliably tell the language of the message?

I would say the presence of &#xABCD; in a text/plain part is either a bug in spam-generating software or a researcher trying to send something to a colleague. :)

Regards,

David.
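The "=E7" answer can be reproduced with the MIME::QuotedPrint module that ships with Perl. A minimal sketch; the sample word and the helper name qp_latin1 are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MIME::QuotedPrint qw(encode_qp decode_qp);

# Hypothetical helper: quoted-printable form of a string of ISO-8859-1 bytes.
# The empty second argument asks encode_qp not to add soft line breaks.
sub qp_latin1 {
    my ($bytes) = @_;
    return encode_qp($bytes, "");
}

my $latin1 = "fran\xE7ais";    # \xE7 is c-cedilla in ISO-8859-1
print "Content-Type: text/plain; charset=ISO-8859-1\n";
print "Content-Transfer-Encoding: quoted-printable\n\n";
print qp_latin1($latin1), "\n";
```

The encoded body contains only 7-bit characters, and the two headers tell the receiving MUA exactly how to turn it back into a c-cedilla, which is the explicit labelling that the bare "&#x....;" sequences lack.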
Re: More text/plain questions
On Wed, 2 Jul 2014, John Hardin wrote:

> On Wed, 2 Jul 2014, Philip Prindeville wrote:
>
>> Given that it’s text/plain with an implicit charset=“us-ascii” and an
>> implicit content-transfer-encoding of 7bit, the sequence &#x[0-9A-F]{4}
>> doesn’t really parse into a 16-bit character, would it? That would be a
>> broken MUA that made such a leap...
>
> Nope. The content-transfer-encoding is only for the *transfer* part of the
> process. Once the content reaches the MUA that content can be further parsed
> by the MUA according to other encoding rules, such as these escape sequences
> for Unicode characters. That's perfectly valid. How else would you send, for
> example, a c-cedille in spanish text via a 7-bit-clean channel?
>
>> Wouldn’t that normally render as the character ‘&’, ‘#’, ‘x’, etc. rather
>> than the unicode16 or UTF-8 character with that hex value?
>
> I'd only expect that in a very old MUA (i.e. that does not support Unicode),
> or display of the raw message content at user request.

...that said, I primarily use a text-based MUA, and it did not render Unicode glyphs for that sample...

-- 
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
Of the twenty-two civilizations that have appeared in history, nineteen of them collapsed when they reached the moral state the United States is in now. -- Arnold Toynbee
---
2 days until the 238th anniversary of the Declaration of Independence
Re: More text/plain questions
On Wed, 2 Jul 2014, Philip Prindeville wrote:

> Given that it’s text/plain with an implicit charset=“us-ascii” and an
> implicit content-transfer-encoding of 7bit, the sequence &#x[0-9A-F]{4}
> doesn’t really parse into a 16-bit character, would it? That would be a
> broken MUA that made such a leap...

Nope. The content-transfer-encoding is only for the *transfer* part of the process. Once the content reaches the MUA that content can be further parsed by the MUA according to other encoding rules, such as these escape sequences for Unicode characters. That's perfectly valid. How else would you send, for example, a c-cedille in spanish text via a 7-bit-clean channel?

> Wouldn’t that normally render as the character ‘&’, ‘#’, ‘x’, etc. rather
> than the unicode16 or UTF-8 character with that hex value?

I'd only expect that in a very old MUA (i.e. that does not support Unicode), or display of the raw message content at user request.

> I wouldn’t want a message where someone gives a couple of examples of
> encoding &#x0400; for instance being flagged as SPAM, but if the text is
> 20% or more of these sequences then I would say that’s SPAM-sign.

That's valid 7-bit encoding for transfer. It's relying on the user's MUA to convert the encoded Unicode values to glyphs for display.

I would say that's more a case of those characters shouldn't be present if the language is en-us than an encoding issue. The presence of lots of those is either a sign that the text isn't English, or is obfuscated. How do you reliably tell the language of the message?

It would probably be a good idea to add those sequences to the replacetags letter REs so that the FUZZY rules will catch them.

-- 
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
Of the twenty-two civilizations that have appeared in history, nineteen of them collapsed when they reached the moral state the United States is in now. -- Arnold Toynbee
---
2 days until the 238th anniversary of the Declaration of Independence