http://bugzilla.spamassassin.org/show_bug.cgi?id=3023
------- Additional Comments From [EMAIL PROTECTED] 2004-02-09 07:38 -------
Subject: RE: New: Detecting random garbage in emails

> This rule should not provide any negative scores even if all of the tokens
> were previously seen before (obviously, you don't want to give bonus points
> to spam messages which you previously learnt).
>
> Any thoughts on this?

I looked at this idea about a month ago and posted the following two notes to
Chris's rules emporium list. They may offer some background/theory. I think
there's some promise in this approach as one possible scoring rule, but there
may be some false positives. I think you may need to skip this rule when there
are fewer than, say, 50 unique words in the text, simply because the sample
won't be large enough.

--------------------------------------------------------------

On the theme of finding spams with a high concentration of "noise words", I
followed this theme for a bit:

http://linkage.rockefeller.edu/wli/zipf/

in particular (as applied to English text):

http://web.archive.org/web/20001005120011/hobart.cs.umass.edu/~allan/cs646-f97/char_of_text.html

Non-folding url: http://tinyurl.com/38j64

The basic idea here is that a word's frequency of occurrence drops at a fixed
rate with its rank. For English, the unique words represent about 1/10th of
the total sample (i.e., in a text of 1000 words, expect 100 to be unique), and
of those unique words about 1/2 will appear only once.

So, I tried hacking together a Perl script that tabulates the number of unique
words in a text and the number of those which occur only once, expecting the
ratio to be about 0.5 for ordinary text, and expecting that ratio to be higher
in spammy messages that use random words or gibberish. I couldn't get the
statistics to play in my favor, but the experiment was complicated by the fact
that so many spam messages use html and base64 encoding, and I excluded those
e-mails because I don't have a handy perl setup that decodes and strips html.
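The tabulation described above (count the distinct words, then the fraction of
them that occur exactly once) can be sketched briefly. This is an illustrative
Python sketch of the same idea, not the Perl script discussed later in the
thread; the function name is invented for this example:

```python
import re
from collections import Counter

def once_only_fraction(text):
    """Return (distinct, once, once/distinct) for the words in text.

    Mirrors the tabulation described above: lowercase the text, keep
    only alphabetic runs as words, count each distinct word, and report
    how many of them occur exactly once.  Per Zipf's law, ordinary
    English text should land near 0.5; random-word gibberish should
    land higher.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    distinct = len(counts)
    once = sum(1 for c in counts.values() if c == 1)
    return distinct, once, (once / distinct if distinct else 0.0)

# Tiny example: "the the cat sat" has 3 distinct words, 2 used once.
distinct, once, frac = once_only_fraction("the the cat sat")
```

As in the Perl script, a real rule would refuse to fire on very short texts,
since the ratio is meaningless for small samples.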
So, I can't conclusively say whether this approach might work in practice or
not. What I was hoping for was that there might be a cut-off, say at 70%,
where if 70% or more of the total unique words in the text were used only
once, then that might be a spam marker. While it appeared generally true that
spam used more words only once in the text, I couldn't quite convince myself
that this would be a decisive criterion.

Here's the Perl script:

#!/usr/bin/perl
use strict;

my %count = ();
my %word_length = ();
my $total = 0;
my $unique_words = 0;
my $repeated_words = 0;
my $debug = 0;

while (<STDIN>) {
    # Replace each run of non-letters with a single space, then split
    # on whitespace and tally each lowercased word.
    tr/A-Za-z/ /cs;
    foreach my $word (split(' ', lc $_)) {
        ++$total;
        ++$word_length{length($word)};
        my $c = ++$count{$word};
        ++$unique_words   if ($c == 1);  # first occurrence of this word
        ++$repeated_words if ($c == 2);  # word is now known to repeat
    }
}

# Too small a sample for the ratio to mean anything.
exit 0 if ($unique_words < 50);

print "total words: $total unique_words: $unique_words " .
      "repeated words: $repeated_words\n";
printf "fraction of words appearing only once: %7.3f\n",
       ($unique_words - $repeated_words) / $unique_words;

if ($debug) {
    my @sort_by_len = map [$_, $word_length{$_}],
                      sort { $word_length{$b} <=> $word_length{$a} }
                      keys %word_length;
    print "word lengths\n";
    foreach my $s (@sort_by_len) { print join(":", @{$s}) . "\n"; }

    my @lines = ();
    my ($w, $c);
    push(@lines, sprintf("%7d\t%s\n", $c, $w)) while (($w, $c) = each(%count));
    @lines = sort { $b cmp $a } @lines;
    # Print the most frequent half of the word list.
    for ($c = 0; $c < @lines / 2; ++$c) { print $lines[$c]; }
}

--------------------------------------------------------------

> -----Original Message-----
> From: Robert Menschel [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, January 14, 2004 7:03 PM
> To: Gary Funck
> Cc: [EMAIL PROTECTED]
> Subject: Re: [RulesEmporium] Zipf's law - an experiment
>
> Hello Gary,
>
> I haven't taken the time to read the links you provided, and maybe that's
> why I'm confused.
> Wednesday, January 14, 2004, 8:42:38 AM, you wrote:
>
> GF> The basic idea here is that the frequency of occurrence drops
> GF> at a fixed rate of the rank of a word. For English, the unique words
> GF> represent about 1/10th of the total sample (ie, in a text of 1000 words,
> GF> expect 100 to be unique), and of those unique words about 1/2 will
> GF> appear only once.
>
> That means the other 1/2 of the unique words appear more than once. If
> they appear more than once, what makes them unique?

Yeah, I was struggling with the terminology. "Unique" means "occurs one or
more times", in the same sense that "sort -u" and "uniq" discard duplicates.
Thus, in the following familiar paragraph:

"Four score and seven years ago our fathers brought forth on this continent, a
new nation, conceived in Liberty, and dedicated to the proposition that all
men are created equal. Now we are engaged in a great civil war, testing
whether that nation or any nation so conceived and so dedicated, can long
endure. We are met on a great battle-field of that war. We have come to
dedicate a portion of that field, as a final resting place for those who here
gave their lives that that nation might live. It is altogether fitting and
proper that we should do this. But, in a larger sense, we can not dedicate -
we can not consecrate - we can not hallow - this ground. The brave men, living
and dead, who struggled here, have consecrated it, far above our poor power to
add or detract. The world will little note, nor long remember what we say
here, but it can never forget what they did here. It is for us the living,
rather, to be dedicated here to the unfinished work which they who fought here
have thus far so nobly advanced.
It is rather for us to be here dedicated to the great task remaining before us
- that from these honored dead we take increased devotion to that cause for
which they gave the last full measure of devotion - that we here highly
resolve that these dead shall not have died in vain - that this nation, under
God, shall have a new birth of freedom - and that government of the people, by
the people, for the people, shall not perish from the earth."

There are 272 total words, of which 138 are unique (used once or more); of the
unique words, 48 are repeated (used more than once), so the fraction of words
used only once to the number of unique words is (138 - 48) / 138 = 0.652.
Zipf's Law would tell us that, on average across a large collection of English
text, the ratio is usually close to 0.5. Well, we all know Lincoln had a large
vocabulary, so we shouldn't be surprised that the ratio is on the high side.

> GF> Here's the Perl script:
>
> I couldn't run this as is against my corpus, for the same reason you
> couldn't. But if someone can make this into an eval rule, now that I'm
> doing mass checks on my own machine I can do mass-checks on evals.

Yeah, I'd appreciate that also. I was thinking the rule might have cutoffs for
70-80 percent, 80-90 percent, and 90-100 percent of words not repeated.

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
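The 70-80 / 80-90 / 90-100 percent cutoffs proposed at the end of the thread
could be expressed as a simple bucketing function. This is a hypothetical
Python sketch; the bucket names are invented for illustration and are not
actual SpamAssassin rule names:

```python
def gibberish_bucket(once_fraction):
    """Map the once-only word fraction to one of the proposed buckets.

    Returns None below the lowest cutoff, so texts with a normal
    (near-0.5) ratio trigger no rule at all.  The thresholds follow
    the 70/80/90 percent cutoffs suggested in the message above.
    """
    if once_fraction >= 0.9:
        return "GIBBERISH_90_100"  # hypothetical rule name
    if once_fraction >= 0.8:
        return "GIBBERISH_80_90"
    if once_fraction >= 0.7:
        return "GIBBERISH_70_80"
    return None

# The Gettysburg Address ratio of 0.652 falls below every cutoff,
# so it would (correctly) trigger none of the buckets.
```

Splitting one continuous ratio into a few scored buckets matches how
SpamAssassin scores tiered rules generally, letting a mass-check assign each
range its own score rather than fitting a single threshold.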
