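Building on Adam's Perl script, this version prints every word it sees
that begins with a rare tuple (a two-letter prefix from the list below).
At end of input it also lists the two-letter prefixes that never appeared.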

my (@rare_tuples) = qw/bb bc bd bf bg bh bj bk bm bn bp bq bs bt bv bw bx bz
cb cc cd cf cg cj ck cm cn cp cq cs ct cv cw cx
db dc dd df dg dj dk dl dm dn dp dq ds dt dv dx dz
eh ez
fb fc fd ff fg fh fk fm fn fp fq fs ft fv fw fx fy fz
gb gc gd gf gg gj gk gm gp gq gs gt gv gw gx gz
hb hc hd hf hg hh hj hk hl hm hn hp hq hr hs ht hv hw hx hz
ie ih ii ij ik iq iu iw ix iy iz
jb jc jd jf jg jh jj jk jl jm jn jp jq jr js jt jv jw jx jy jz
kb kc kd kf kg kj kk km kp kq ks kt kv kw kx ky kz
lb lc ld lf lg lh lj lk lm ln lp lq lr ls lt lv lw lx lz
mb mc md mf mg mj mk ml mm mp mq ms mt mv mw mx mz
nb nc nd nf ng nh nj nk nl nm nn np nq nr ns nv nw nx nz
oj oq
pb pc pd pg pj pk pm pp pq pv pw px pz
qa qb qc qd qe qf qg qh qi qj qk ql qm qn qo qp qq qr qs qt qv qw qx qy qz
rb rc rd rf rg rj rk rl rm rn rp rq rr rs rt rv rw rx rz
sb sd sf sg sj sr ss sx sz
tb tc td tf tg tj tk tl tm tn tp tq tt tv tx tz
ua uc ue uf uh ui uj uk uo uq uu uv uw ux uy uz
vb vc vd vf vg vh vj vk vl vm vn vp vq vr vs vt vv vw vx vz
wb wc wd wf wg wj wk wl wm wn wp wq ws wt wu wv ww wx wz
xb xc xd xf xg xh xj xk xl xm xn xo xp xq xr xs xt xu xv xw xx xz
yb yc yd yf yg yh yj yk yl ym yn yp yq yr ys yv yw yx yy yz
zb zc zd zf zg zh zj zk zm zn zp zq zr zs zt zu zv zw zx zz/;

# build a lookup hash so each prefix check is a single hash access
my %is_rare = map { $_ => 1 } @rare_tuples;
my %found;
while (<>) {
    chomp;
    # change every non-letter character into white space
    s/[^A-Za-z]/ /g;
    foreach my $word ( split ) {
        if (length($word) >= 2) {
            # lowercase the prefix so capitalized words match the list
            my $xx = lc substr($word, 0, 2);
            $found{$xx} = 1;
            print "$word\n" if ($is_rare{$xx});
        }
    }
}
# list every two-letter prefix that never started a word in the input
foreach my $first ('a'..'z') {
    foreach my $second ('a'..'z') {
        print "$first$second " if !$found{$first.$second};
    }
    print "\n";
}
print "\n";

To run it against subject lines, try the following:

cat *.mbox | formail -zxSubject: -s | perl words.pl

where words.pl is the script above.
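
If formail is not installed, the subject extraction can be approximated in
Perl as well. This is only a rough sketch, not a replacement for formail: the
sub name extract_subjects and the sample addresses are made up, and it assumes
each message starts with a "From " separator line and that headers end at the
first blank line.

```perl
use strict;
use warnings;

# Pull the Subject header out of each message in mbox-formatted text.
# Assumes "From " separator lines and a blank line after the headers;
# folded continuation lines are joined with a single space.
sub extract_subjects {
    my @lines = @_;
    my (@subjects, $in_header, $collecting);
    for my $line (@lines) {
        (my $text = $line) =~ s/\r?\n\z//;
        if ($text =~ /^From /) {              # start of a new message
            ($in_header, $collecting) = (1, 0);
        }
        elsif (!$in_header) {
            next;                             # body line, ignore it
        }
        elsif ($text eq '') {                 # blank line ends the headers
            ($in_header, $collecting) = (0, 0);
        }
        elsif ($collecting && $text =~ /^[ \t]+(.*)/) {
            $subjects[-1] .= " $1";           # folded continuation line
        }
        elsif ($text =~ /^Subject:\s*(.*)/i) {
            push @subjects, $1;
            $collecting = 1;
        }
        else {
            $collecting = 0;                  # some other header
        }
    }
    return @subjects;
}

# Made-up two-message mailbox, for illustration only.
my @mbox = (
    "From alice\@example.com Mon Jan  1 00:00:00 2004\n",
    "Subject: Cheap meds\n",
    "  xkqj\n",
    "To: bob\@example.com\n",
    "\n",
    "Subject: not a header\n",
    "From carol\@example.com Mon Jan  1 00:00:01 2004\n",
    "Subject: Meeting notes\n",
    "\n",
    "body\n",
);
print "$_\n" for extract_subjects(@mbox);
```

The "Subject:" line in the first message body is skipped because the blank
line has already ended that message's headers.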

At first glance, this picks up a *lot* of garbage words in spam subject
lines and not so many in ham subject lines.
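
To put a rough number on "a lot", one could tally hits per corpus instead of
printing each word. A minimal sketch, assuming a four-entry subset of the
tuple list above and made-up subject lines; rare_word_counts is a name I
picked for illustration.

```perl
use strict;
use warnings;

# Small subset of the full rare-tuple list, to keep the example short.
my %is_rare = map { $_ => 1 } qw/bq xk zz qj/;

# Count words of two or more letters, and how many of those begin
# with a rare tuple (compared case-insensitively).
sub rare_word_counts {
    my @subjects = @_;
    my ($words, $rare) = (0, 0);
    for my $subject (@subjects) {
        (my $text = $subject) =~ s/[^A-Za-z]/ /g;
        for my $word (split ' ', $text) {
            next if length($word) < 2;
            $words++;
            $rare++ if $is_rare{ lc substr($word, 0, 2) };
        }
    }
    return ($words, $rare);
}

# Made-up spam subjects, for illustration only.
my ($words, $rare) = rare_word_counts('Cheap meds xkqj zzplk', 'Refinance now bqx');
printf "%d of %d words begin with a rare tuple\n", $rare, $words;
```

Running the same count over the ham subjects gives two ratios that can be
compared directly.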

If it were changed to output the tuples found one per line, you could sort
the spam and ham lists and use the Unix 'comm' command to isolate the tuples
unique to spam. Not perfect, but with a sufficiently large corpus you might
gain some confidence that this helps detect garbage subject lines.
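
Here is a sketch of that variation. The sub names and the sample subjects are
made up for illustration. leading_tuples returns the sorted, unique prefixes
(lowercased, which is my assumption), and only_in_first does in memory what
'comm -23' does with two sorted files.

```perl
use strict;
use warnings;

# Return the sorted, unique two-letter prefixes of every word in the
# given lines, splitting on non-letters as the script above does.
sub leading_tuples {
    my @lines = @_;
    my %seen;
    for my $line (@lines) {
        (my $text = $line) =~ s/[^A-Za-z]/ /g;
        for my $word (split ' ', $text) {
            next if length($word) < 2;
            $seen{ lc substr($word, 0, 2) } = 1;
        }
    }
    return sort keys %seen;
}

# Entries present in the first list but absent from the second.
sub only_in_first {
    my ($first, $second) = @_;
    my %in_second = map { $_ => 1 } @$second;
    return grep { !$in_second{$_} } @$first;
}

# Made-up sample subjects, for illustration only.
my @spam = ('Cheap meds xkqj zzplk', 'Refinance now bqx');
my @ham  = ('Re: Cheap hosting question', 'Meeting notes for now');

my @spam_only = only_in_first([ leading_tuples(@spam) ],
                              [ leading_tuples(@ham) ]);
print "$_\n" for @spam_only;
```

With real corpora you would print the output of leading_tuples for each
corpus to its own file and let comm do the comparison.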




