Re: [sniffer] OT: Language filtering in Declude, was Possible blip?

Scott Fisher Fri, 21 May 2004 14:39:49 -0700

Wouldn't it be better to reverse the order?

Run the subject and header tests on the majority of the mail.
Then run the body tests behind a TESTSFAILED END NOTCONTAINS CHINESE line.
You should end up with fewer body searches this way.
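
A rough Python sketch of that ordering, offered as an illustration and not as Declude syntax: the cheap header check runs on every message, and the body is scanned only when the header check hits. The function names are made up for this example, and the marker list is copied from the HEADERS lines of the CHINESE filter quoted below.

```python
# Header markers taken from the HEADERS lines of the quoted CHINESE filter.
HEADER_MARKERS = (
    "=?gb2312?b?",
    "=?big5?b?",
    "charset=gb2312",
    'charset="gb2312"',
    "charset=big5",
    'charset="big5"',
)


def header_indicates_chinese(headers):
    """Cheap test: case-insensitive substring search over the headers only."""
    lowered = headers.lower()
    return any(marker in lowered for marker in HEADER_MARKERS)


def body_has_high_bit(body):
    """Expensive test: walk the body bytes looking for any byte >= 0x80."""
    return any(byte >= 0x80 for byte in body)


def classify(headers, body):
    """Return (candidate, body_was_scanned).

    The body scan is skipped entirely when the headers show no Chinese
    charset marker, which is the saving the reversed order is after.
    """
    if not header_indicates_chinese(headers):
        return False, False
    return body_has_high_bit(body), True
```

The trade-off in this arrangement is that a message whose only charset indication is in the body never reaches the body scan.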


Scott Fisher
Director of IT
Farm Progress Companies

>>> [EMAIL PROTECTED] 05/21/04 04:28PM >>>
I think you might have possibly identified the group of required 
characters.  I'll give that a try.  I'm not sure if any Cyrillic stuff 
has been passing through but this bears watching as well and I might 
have to change my list there as well.
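
For anyone who wants to test the byte ranges outside of Declude, here is a minimal Python sketch. It assumes the usual EUC-CN encoding of GB2312, where Chinese characters use a lead byte of B0 to F7 and a trail byte of A1 to FE (slightly narrower than the A0 to FF in the quoted message below). The function name is hypothetical.

```python
def count_gb2312_hanzi_pairs(data):
    """Count byte pairs that fall in the GB2312 (EUC-CN) Chinese character range.

    Lead byte 0xB0-0xF7, trail byte 0xA1-0xFE. A matched pair is consumed
    as a unit; any other byte is skipped one at a time.
    """
    count = 0
    i = 0
    while i < len(data) - 1:
        lead, trail = data[i], data[i + 1]
        if 0xB0 <= lead <= 0xF7 and 0xA1 <= trail <= 0xFE:
            count += 1
            i += 2
        else:
            i += 1
    return count


if __name__ == "__main__":
    sample = b"Buy now " + "中文".encode("gb2312")
    print(count_gb2312_hanzi_pairs(sample))
```

Two adjacent accented ISO-8859-1 characters fall in the same byte ranges, so a pair count alone cannot tell the two apart; it only makes sense alongside the charset markers.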

I am also tagging BIG5; however, almost all spam comes in GB2312.  Here's 
what I'm searching for in the CHINESE filter:

    # CHINESE v1.0.0

    SKIPIFWEIGHT    25
    MAXWEIGHT    10

    TESTSFAILED    END    NOTCONTAINS    HIGHBIT

    SUBJECT        END    CONTAINS    charset=gb2312
    SUBJECT        END    CONTAINS    charset="gb2312"
    SUBJECT        END    CONTAINS    charset=big5
    SUBJECT        END    CONTAINS    charset="big5"

    HEADERS        10    CONTAINS    =?gb2312?b?
    HEADERS        10    CONTAINS    =?big5?b?
    HEADERS        10    CONTAINS    charset=gb2312
    HEADERS        10    CONTAINS    charset="gb2312"
    HEADERS        10    CONTAINS    charset=big5
    HEADERS        10    CONTAINS    charset="big5"

    BODY        10    CONTAINS    charset=gb2312"
    BODY        10    CONTAINS    charset=3dgb2312"
    BODY        10    CONTAINS    charset=big5"
    BODY        10    CONTAINS    charset=3dbig5"
    BODY        10    CONTAINS    content=zh-cn"
    BODY        10    CONTAINS    content=3dzh-cn"


The END statements for the subject are meant as a precaution, although 
it's probably not necessary with the HIGHBIT filter ending on US-ASCII 
and ISO-8859-1 (plus a language definition hit for 'content="en-us"').
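
The control flow of the filter above can be modelled in a few lines of Python. This is only a sketch of how I read the directives (skip when the message already has enough weight, end when HIGHBIT did not hit, end on a subject match, otherwise add 10 per hit up to the cap); it is not Declude's documented behavior, and the function and parameter names are invented for the example.

```python
SUBJECT_END_MARKERS = ["charset=gb2312", 'charset="gb2312"',
                       "charset=big5", 'charset="big5"']
HEADER_MARKERS = ["=?gb2312?b?", "=?big5?b?",
                  "charset=gb2312", 'charset="gb2312"',
                  "charset=big5", 'charset="big5"']
BODY_MARKERS = ['charset=gb2312"', 'charset=3dgb2312"',
                'charset=big5"', 'charset=3dbig5"',
                'content=zh-cn"', 'content=3dzh-cn"']


def chinese_filter_score(subject, headers, body, tests_failed,
                         current_weight=0, skip_if_weight=25, max_weight=10):
    """Return the weight this filter would add under the assumptions above."""
    # SKIPIFWEIGHT: the message already scores high enough, so do no work.
    if current_weight >= skip_if_weight:
        return 0
    # TESTSFAILED END NOTCONTAINS HIGHBIT: no high bit characters, stop.
    if "HIGHBIT" not in tests_failed:
        return 0
    # SUBJECT END CONTAINS: a subject that discusses charsets ends the filter.
    subject_l = subject.lower()
    if any(marker in subject_l for marker in SUBJECT_END_MARKERS):
        return 0
    headers_l = headers.lower()
    body_l = body.lower()
    # Each HEADERS or BODY line that matches adds 10 points.
    score = 10 * sum(marker in headers_l for marker in HEADER_MARKERS)
    score += 10 * sum(marker in body_l for marker in BODY_MARKERS)
    # MAXWEIGHT: cap the filter's total contribution.
    return min(score, max_weight)
```

With MAXWEIGHT at 10 and every line worth 10, a single hit already reaches the cap, so the filter behaves as a yes/no test gated by HIGHBIT.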

I do believe that you can apply a similar technique to spam in Spanish, 
but since the character set is the same as English, you would be 
searching for those 'content=' markers in combination with special 
characters (a short list in this case).  We hardly see any Spanish spam, 
or at least any held Spanish spam, so I'm doing nothing about it.  Spanish 
is of course a lot more common in US E-mail.  It may be that some Spanish 
spam isn't identified as Spanish since that's not necessary for proper 
display in most E-mail clients, but I have seen no proof of that.

Matt



Scott Fisher wrote:

>Interesting. I generally just punish people if GB2312, BIG5 or such are in the 
>headers. This is overwhelmingly spam, but like you said there is English in some of 
>those messages.
>
>It looks like the GB2312 Chinese characters will have a B0 to F7 as the high byte 
>and an A0 to FF as the low byte. 
>If the GB2312 Chinese is present, I would think most every character should be one of 
>these:
>°±²³ µ¶· *º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷
>
>Checking some of my e-mails confirms that.
>
>The bad news is that requires another body filter. It's too bad there wasn't a 
>BODY256 filter type where only the first 256 bytes would be checked. That would 
>certainly be enough to score up these, and wouldn't be a CPU hog. I'm not certain 
>that I'd want to throw another body filter at my few Chinese spams.
>
>How often do you get a body indication of GB2312 / Cyrillic charactersets with no 
>header indication?
>
>It's an interesting subject because those few Chinese spams that get through to 
>three of my accounts frustrate me.
>Got any tips for Spanish spam?
>
>Scott Fisher
>Director of IT
>Farm Progress Companies
>
>>>>[EMAIL PROTECTED] 05/21/04 03:17PM >>>
>
>No, just one, but it won't score unless there is a header or body 
>indication of the GB2312 or Windows-1251 charactersets.  I'm using a 
>combo filter in Declude where the HIGHBIT filter is non-scoring, and the 
>CHINESE and CYRILLIC filters contain a line that says:
>
>    TESTSFAILED      END      NOTCONTAINS      HIGHBIT
>
>I'm pretty sure that the CHINESE and CYRILLIC filters will always hit 
>where appropriate unless the HIGHBIT test doesn't hit.  I have about 65 
>different high bit characters in that filter presently, all copied from 
>spam.  If Scott was around, I would ask him how the NONENGLISH test is 
>tripped because that might accomplish the same goals, however I'm not 
>sure if it also scores the definition of a characterset, in which case 
>it would have false positives in this scenario.
>
>Matt
>
>
>
>Scott Fisher wrote:
>
>>Interesting.
>>
>>Are you searching for 2 character pairs with GB2312?
>>
>>Scott Fisher
>>Director of IT
>>Farm Progress Companies
>>
>>>>>[EMAIL PROTECTED] 05/21/04 01:46PM >>>
>>
>>Scott,
>>
>>Regarding my Cyrillic and Chinese filters, I did a review of a full 
>>week's held spam, looking for foreign languages and patterns to tag.  I 
>>found from other research that the primary Chinese characterset, GB2312, 
>>contains the Western Latin characterset, and so someone could send an 
>>E-mail with this characterset defined and still have English as the 
>>message.  Because of this I do more than just look for the offending 
>>characterset, I've built a combo filter that looks for both high bit 
>>characters such as ¥ as well as body or header hits for encoding of 
>>GB2312 (Chinese/Korean) or Windows-1251 (Cyrillic).  I also have Declude 
>>END statements for appearances of US-ASCII and ISO-8859-1, so messages 
>>like this one that are referencing such patterns won't trip the filter.  
>>It seems to be stopping about 80% to 90% of the stuff, but I'm guessing 
>>that the stuff that is getting through didn't hit one of the high bit 
>>characters in my filter and I might need to simply expand my list a 
>>bit.  Unfortunately I have no idea what characters are most common, so 
>>I'm just eyeballing it from sources.
>>
>>I had one false positive on a Yahoo Groups posting that referenced 
>>163.com, a Chinese free Web mail provider that inserts Chinese language 
>>footers.  The message was in English, but encoded in GB2312 and didn't 
>>indicate any sign of English besides the actual text.  Because of this, 
>>I might throw in an exception for the word "the " (followed by a space) 
>>just as a test to see if text in English is present, but I have to 
>>review that.  This message was also BASE64 encoded and that might be an 
>>appropriate exception???  The last pattern that I might look at is using 
>>the new MailPolice test for identifying Web-mail providers, and 
>>excepting them from the filter because they have issues with encoding 
>>languages I've found.
>>
>>Hope this helps.
>>
>>Matt
>>
>>
>>
>>Scott Fisher wrote:
>>
>>>2 thoughts from me:
>>>
>>>1. Right on about the Nigerian scams, and about possibly keeping these rules longer. As I was 
>>>forwarding out a Nigerian scam to the spam mailbox, I too wondered how long the 
>>>Nigerian rules were kept in play. I might also add that Nigeria's twin sister, the 
>>>International Lottery spam, and Stock spams might also be kept longer. I noticed an 
>>>increase in the Stock spams this week. 
>>>
>>>2. I've been tracking different character sets for a couple of weeks, the Chinese, 
>>>Cyrillic and Korean look promising. I get false hits on Greek, Thai, and Vietnamese 
>>>Headers.
>>>
>>>Scott Fisher
>>>Director of IT
>>>Farm Progress Companies
>>>
>>>>>>[EMAIL PROTECTED] 05/21/04 12:42PM >>>
>>>
>>>Pete,
>>>
>>>Our Hold range has returned to more normal territory on Thursday.  
>>>Here's the stats from the week as a whole on what has been very 
>>>consistent traffic.  Out of all E-mail processed, both good and bad, the 
>>>%Hold represents what scored between 10-24 points on our system and 
>>>needed review, the %Sniffer represents all Sniffer hits except for Gray, 
>>>the %Spam is what we scanned and didn't deliver (generally about 99.8% 
>>>of spam is caught at a score of 10 which this is based on), and the 
>>>Sniffer/Spam is the percentage of Sniffer hits as a portion of messages 
>>>scoring 10 or more.
>>>
>>>  Day      %Hold    %Sniffer    %Spam    Sniffer/Spam
>>>  Mon:     1.86%     77.27%     80.37%     96.14%
>>>  Tue:     2.83%     74.53%     79.37%     93.39%
>>>  Wed:     2.13%     77.60%     79.66%     97.41%
>>>  Thur:    1.95%     76.50%     80.66%     94.84%
>>>
>>>The only change that we made to our system was to add two smaller 
>>>domains later in the week, and we introduced filters for Cyrillic and 
>>>Chinese languages on Wednesday morning which have cut our hold file down 
>>>by 0.38 percentage points on Thursday, which explains how our %Hold is 
>>>lower on Thursday than on Wednesday with a lower Sniffer hit rate on spam.
>>>
>>>I did note two high volume untagged static spammers on Tuesday that we 
>>>blacklisted locally, and that combined with the increase in Sniffer 
>>>change rates (spam storm) might account for the changes that I saw.  I 
>>>am wondering though about the recommendations that you have made for 
>>>possibly fine tuning our rule base.  Again though, please keep in mind 
>>>that I still feel that performance is overall very, very good.
>>>
>>>One of my thoughts regarding minimum rule strengths and grace periods is 
>>>that all groups aren't necessarily the same.  For instance Nigerian 
>>>scams are low volume and sporadic, and my system performs the worst on 
>>>these things.  Maybe lower rule strengths and longer grace periods makes 
>>>much more sense for the Phishing category than it does for many other 
>>>categories for instance.  Is that possible?
>>>
>>>I also looked up the rule strengths on your site and found that about 
>>>50%, or maybe more, have a strength below 1, and maybe lowering that is 
>>>worth testing out so long as I don't massively increase the number of 
>>>records.  I do think though that I would like to test out extending the 
>>>grace period.  Most of my false positives are not on things that this 
>>>would affect, and that might give niche sources a little extra coverage 
>>>if I understand things correctly.
>>>
>>>I'll follow your directions and contact you directly regarding any 
>>>affirmative changes, but I thought it might be beneficial to keep this 
>>>discussion public since some other stats hounds might find this 
>>>information to be of use :)
>>>
>>>If you can glean anything from the numbers that I gave you, please add 
>>>your thoughts.
>>>
>>>Thanks,
>>>
>>>Matt
>>>
>>>
>>>
>>>
>>>
>>>Pete McNeil wrote:
>>>
>>>>At 05:00 PM 5/19/2004, you wrote:
>>>>
>>>><snip/>
>>>>
>>>>>I haven't yet upgraded to the most recent release, I'm still on the 
>>>>>prior beta.  I'll probably do that this evening.  I tend to wait on 
>>>>>upgrades until there has been enough time for bugs to surface unless 
>>>>>I am already looking for a fix.  I'm sure that the extra verification 
>>>>>of the rulebase will help prevent the potential of problems, and I 
>>>>>guess this has the possibility of being caused by a bit of corrupted 
>>>>>data, though that's probably reaching.
>>>>
>>>>There were no substantive changes from the beta to the production 
>>>>version. Largely just a removal of monitoring code.
>>>>
>>>>>Again, regardless if there was a blip, Sniffer still does a wonderful 
>>>>>job of tagging lots and lots of E-mail, just not quite as much as the 
>>>>>day before.
>>>>
>>>>Last night I was able to adjust the rule strength analysis window back 
>>>>to its original settings. About 5 days of data were lost - but those 
>>>>days will be recovered quickly. Please let me know if this adjustment 
>>>>improved your conditions.
>>>>
>>>>I've noted that on a number of other lists there seem to be posts 
>>>>about a sudden increase in spam over the past few days. We are 
>>>>definitely seeing this also - approximately a 25% or more increase in 
>>>>new rule additions in the past 4 days:
>>>>
>>>>http://www.sortmonster.com/MessageSniffer/Performance/ChangeRates.jsp 
>>>>
>>>>Specifically note from about 4 days ago...
>>>>
>>>>
>>>>Days Ago Adjustments
>>>>-------- -----------
>>>>
>>>>0        356
>>>>1        508
>>>>2        391
>>>>3        410
>>>>4        410
>>>>5        326
>>>>6        309
>>>>7        371
>>>>8        292
>>>>9        347
>>>>10       309
>>>>
>>>>
>>>>
>>>>( 5-10 : 1954/6 -> 325.67, 0-4 : 2075/5 -> 415, 325.67/415 -> 78.47% )
>>>>Note that day 0 is not complete. So applying a "fudge factor" 78.47% 
>>>>_looks like_ 75%.
>>>>Besides, 92% of statistics are made up on the spot anyway %^b
>>>>I think a number of things are combined here... I just want to get a 
>>>>good handle on them and make sure we are doing the best we can.
>>>>
>>>>I've noted, Matt, that your rulebase tuning parameters are set at the 
>>>>defaults. If you would like to adjust these to be more aggressive then 
>>>>please let me know off list (support@). More aggressive settings will 
>>>>keep more rules active in your rulebase at lower strengths and will 
>>>>also allow new rules more time to gain strength before being 
>>>>evaluated. Respectively the current defaults are:
>>>>
>>>>Minimum Rule Strength: 1.0
>>>>Grace Period: 5 days.
>>>>
>>>>Adjusting these settings can significantly increase the size of your 
>>>>rulebase file.
>>>>
>>>>Best,
>>>>_M

-- 
=====================================================
MailPure custom filters for Declude JunkMail Pro.
http://www.mailpure.com/software/ 
=====================================================


This E-Mail came from the Message Sniffer mailing list. For information and 
(un)subscription instructions go to 
http://www.sortmonster.com/MessageSniffer/Help/Help.html
