http://bugzilla.spamassassin.org/show_bug.cgi?id=3023
------- Additional Comments From [EMAIL PROTECTED] 2004-02-09 07:38 -------
Subject: RE: New: Detecting random garbage in emails

> This rule should not provide any negative scores even if all of the tokens
> were previously seen before (obviously, you don't want to give bonus points
> to spam messages which you previously learnt).
>
> Any thoughts on this?

I looked at this idea about a month ago and posted the following two notes to
Chris's rules emporium list. They may offer some background/theory. I think
there's some promise in this approach as one possible scoring rule, but there
may be some false positives. I think you may need to skip this rule when there
are fewer than, say, 50 unique words in the text, simply because the sample
won't be large enough.

--------------------------------------------------------------

On the theme of finding spams with a high concentration of "noise words", I
followed this theme for a bit:

http://linkage.rockefeller.edu/wli/zipf/

in particular (as applied to English text):

http://web.archive.org/web/20001005120011/hobart.cs.umass.edu/~allan/cs646-f97/char_of_text.html

Non-folding url: http://tinyurl.com/38j64

The basic idea here is that a word's frequency of occurrence drops at a fixed
rate with its rank. For English, the unique words represent about 1/10th of
the total sample (i.e., in a text of 1000 words, expect 100 to be unique), and
of those unique words about 1/2 will appear only once.

So, I tried hacking together a Perl script that tabulates the number of unique
words in a text and the number of those which occur only once, expecting the
ratio to be about 0.5 for ordinary text, and expecting that ratio to be higher
in spammy messages that use random words or gibberish. I couldn't get the
statistics to play in my favor, but the experiment was complicated by the fact
that so many spam messages use html and base64 encoding, and I excluded those
e-mails because I don't have a handy perl setup that decodes and strips html.
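The tabulation described above (count the distinct words, then the fraction of
them that occur exactly once) can be sketched briefly. This is an illustrative
Python sketch of the same idea, not the Perl script discussed later in the
thread; the function name is invented for this example:

```python
import re
from collections import Counter

def once_only_fraction(text):
    """Return (distinct, once, once/distinct) for the words in text.

    Mirrors the tabulation described above: lowercase the text, keep
    only alphabetic runs as words, count each distinct word, and report
    how many of them occur exactly once.  Per Zipf's law, ordinary
    English text should land near 0.5; random-word gibberish should
    land higher.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    distinct = len(counts)
    once = sum(1 for c in counts.values() if c == 1)
    return distinct, once, (once / distinct if distinct else 0.0)

# Tiny example: "the the cat sat" has 3 distinct words, 2 used once.
distinct, once, frac = once_only_fraction("the the cat sat")
```

As in the Perl script, a real rule would refuse to fire on very short texts,
since the ratio is meaningless for small samples.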
So, I can't conclusively say whether this approach might work in practice or
not. What I was hoping for was that there might be a cut-off, say at 70%,
where if 70% or more of the total unique words in the text were used only
once, then that might be a spam marker. While it appeared generally true that
spam used more words only once in the text, I couldn't quite convince myself
that this would be a decisive criterion.

Here's the Perl script:

#!/usr/bin/perl
use strict;

my %count = ();
my %word_length = ();
my $total = 0;
my $unique_words = 0;
my $repeated_words = 0;
my $debug = 0;

while (<STDIN>) {
    # Replace each run of non-letters with a single space, then split
    # on whitespace and tally each lowercased word.
    tr/A-Za-z/ /cs;
    foreach my $word (split(' ', lc $_)) {
        ++$total;
        ++$word_length{length($word)};
        my $c = ++$count{$word};
        ++$unique_words   if ($c == 1);  # first occurrence of this word
        ++$repeated_words if ($c == 2);  # word is now known to repeat
    }
}

# Too small a sample for the ratio to mean anything.
exit 0 if ($unique_words < 50);

print "total words: $total unique_words: $unique_words " .
      "repeated words: $repeated_words\n";
printf "fraction of words appearing only once: %7.3f\n",
       ($unique_words - $repeated_words) / $unique_words;

if ($debug) {
    my @sort_by_len = map [$_, $word_length{$_}],
                      sort { $word_length{$b} <=> $word_length{$a} }
                      keys %word_length;
    print "word lengths\n";
    foreach my $s (@sort_by_len) { print join(":", @{$s}) . "\n"; }

    my @lines = ();
    my ($w, $c);
    push(@lines, sprintf("%7d\t%s\n", $c, $w)) while (($w, $c) = each(%count));
    @lines = sort { $b cmp $a } @lines;
    # Print the most frequent half of the word list.
    for ($c = 0; $c < @lines / 2; ++$c) { print $lines[$c]; }
}

--------------------------------------------------------------

> -----Original Message-----
> From: Robert Menschel [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, January 14, 2004 7:03 PM
> To: Gary Funck
> Cc: [EMAIL PROTECTED]
> Subject: Re: [RulesEmporium] Zipf's law - an experiment
>
> Hello Gary,
>
> I haven't taken the time to read the links you provided, and maybe that's
> why I'm confused.
> Wednesday, January 14, 2004, 8:42:38 AM, you wrote:
>
> GF> The basic idea here is that the frequency of occurrence drops
> GF> at a fixed rate of the rank of a word. For English, the unique words
> GF> represent about 1/10th of the total sample (ie, in a text of 1000 words,
> GF> expect 100 to be unique), and of those unique words about 1/2 will
> GF> appear only once.
>
> That means the other 1/2 of the unique words appear more than once. If
> they appear more than once, what makes them unique?

Yeah, I was struggling with the terminology. "Unique" means "occurs one or
more times", in the same sense that "sort -u" and "uniq" discard duplicates.
Thus, in the following familiar paragraph:

"Four score and seven years ago our fathers brought forth on this continent, a
new nation, conceived in Liberty, and dedicated to the proposition that all
men are created equal. Now we are engaged in a great civil war, testing
whether that nation or any nation so conceived and so dedicated, can long
endure. We are met on a great battle-field of that war. We have come to
dedicate a portion of that field, as a final resting place for those who here
gave their lives that that nation might live. It is altogether fitting and
proper that we should do this. But, in a larger sense, we can not dedicate -
we can not consecrate - we can not hallow - this ground. The brave men, living
and dead, who struggled here, have consecrated it, far above our poor power to
add or detract. The world will little note, nor long remember what we say
here, but it can never forget what they did here. It is for us the living,
rather, to be dedicated here to the unfinished work which they who fought here
have thus far so nobly advanced.
It is rather for us to be here dedicated to the great task remaining before us
- that from these honored dead we take increased devotion to that cause for
which they gave the last full measure of devotion - that we here highly
resolve that these dead shall not have died in vain - that this nation, under
God, shall have a new birth of freedom - and that government of the people, by
the people, for the people, shall not perish from the earth."

There are 272 total words, of which 138 are unique (used once or more); of the
unique words, 48 are repeated (used more than once), so the fraction of words
used only once to the number of unique words is (138 - 48) / 138 = 0.652.
Zipf's Law would tell us that, on average across a large collection of English
text, the ratio is usually close to 0.5. Well, we all know Lincoln had a large
vocabulary, so we shouldn't be surprised that the ratio is on the high side.

> GF> Here's the Perl script:
>
> I couldn't run this as is against my corpus, for the same reason you
> couldn't. But if someone can make this into an eval rule, now that I'm
> doing mass checks on my own machine I can do mass-checks on evals.

Yeah, I'd appreciate that also. I was thinking the rule might have cutoffs for
70-80 percent, 80-90 percent, and 90-100 percent of words not repeated.

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
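The 70-80 / 80-90 / 90-100 percent cutoffs proposed at the end of the thread
could be expressed as a simple bucketing function. This is a hypothetical
Python sketch; the bucket names are invented for illustration and are not
actual SpamAssassin rule names:

```python
def gibberish_bucket(once_fraction):
    """Map the once-only word fraction to one of the proposed buckets.

    Returns None below the lowest cutoff, so texts with a normal
    (near-0.5) ratio trigger no rule at all.  The thresholds follow
    the 70/80/90 percent cutoffs suggested in the message above.
    """
    if once_fraction >= 0.9:
        return "GIBBERISH_90_100"  # hypothetical rule name
    if once_fraction >= 0.8:
        return "GIBBERISH_80_90"
    if once_fraction >= 0.7:
        return "GIBBERISH_70_80"
    return None

# The Gettysburg Address ratio of 0.652 falls below every cutoff,
# so it would (correctly) trigger none of the buckets.
```

Splitting one continuous ratio into a few scored buckets matches how
SpamAssassin scores tiered rules generally, letting a mass-check assign each
range its own score rather than fitting a single threshold.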
