any bayes and Junk folder Re: spam filtering

Bob Mon, 14 Feb 2005 18:14:17 -0800

John Peacock wrote:

Elliot F wrote:
John, did you try any other bayesian filters before going to dspam? I was thinking of doing the same thing as I'm using crm114...
I looked at crm114, but didn't test it...
With dspam, I was able to build a shared profile using the SA corpus, and run with that for about two weeks on behalf of the ~400 users I manage. Then I was able to change one line and set everyone loose with their own configuration...

I wound up writing a process which scans the users' "Junk" folder on a daily basis and resubmits those false negatives for retraining. I am able to force everyone to use the web-based "Quarantine" folder for handling false positives...
John

I just wanted to follow up on the idea of grabbing the Junk
folder, and to suggest that SpamAssassin's bayes could do as
well as any other bayes, depending mostly on how the
administrator sets up ham and spam training systemwide and
for individuals, including grabbing their Thunderbird Junk
folders via NFS?!

To put it another way, it sounds like SpamAssassin's bayes
might not be getting the same training as the other bayes
filters being touted here lately. On the other hand, if another
bayes is more convenient to train, it's fair to call it "better".

To build at least an initial "SA corpus", I grab my own
Thunderbird Junk folder via NFS and empty it, and I also
grab my carefully un-false-positived Thunderbird Inbox
via NFS. In both cases I split the folder into individual
messages with an awk script (hey, not bad, not perl, but SHORT)
and pipe each message to "safecat" to avoid file-naming
collisions. Then I sa-learn the spam and ham. I guess it's too
intrusive to grab users' Junk folders via NFS, RIGHT? But
terrorism justifies ignoring user rights? Oh well.
_________________________________________________________

# bigmailfile_to_files.awk called by rip_ham_spam
# Splits a thunderbird mail folder on its "From - " separator lines and
# pipes each message to its own safecat process, which names the file.
BEGIN { cmd = "nice -n 19 /usr/bin/safecat tmp ." ; seen = 0 }
{
 if ( $1 == "From" && $2 == "-" ) {
  # Closing the pipe ends the previous message; the next print reopens it.
  if ( seen ) { close(cmd) }
  seen = 1
 }
 print $0 | cmd
}
END { close(cmd) }
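For anyone who wants to try the splitting step without safecat installed, here is a rough sketch of the same idea. The function name split_mbox and the msg.NNNNN file names are made up for illustration and are not part of my scripts. Unlike the script above, it skips any lines before the first separator, and it does nothing about name collisions, which is the whole reason safecat is in the real pipeline.

```shell
# Hypothetical helper, not part of the original scripts: split a
# Thunderbird mail folder on its "From - " separator lines into
# numbered files under an output directory.
split_mbox() {
    mbox=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    awk -v dir="$outdir" '
        # Same separator test as bigmailfile_to_files.awk.
        $1 == "From" && $2 == "-" {
            if (out != "") close(out)
            n++
            out = sprintf("%s/msg.%05d", dir, n)
        }
        # Lines before the first separator are skipped.
        out != "" { print > out }
    ' "$mbox"
}
```

Calling it as split_mbox Junk /tmp/junk-split should leave one file per message in /tmp/junk-split, each starting with its "From - " line.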

________________________________________________________
#!/bin/bash
# rip_ham_spam calls above bigmailfile_to_files.awk to separate
# big thunderbird mail folders into individual files which are piped
# to safecat for naming(namespace collision avoidance)
if [ -f "$1" ]
then nice -n 19 awk -f /home/bb/bin/bigmailfile_to_files.awk "$1"
else
inbox=/home/bb/nfs/.thunderbird/*.default/Mail/k-kdom.bushiedarpa.con/Inbox
if [ -f $inbox ] && [ "$USER" = "root" ]
 then pushd /home/bb/m > /dev/null
 junk=/home/bb/nfs/.thunderbird/*.default/Mail/k-kdom.bushiedarpa.con/Junk
 pushd spam > /dev/null
 [ -d "tmp" ] || mkdir tmp
 cp $junk tmp/target
 nice -n 19 awk -f /home/bb/bin/bigmailfile_to_files.awk tmp/target
 rm -r tmp
 echo > ${junk}
 for spamslice in $( nice -n 19 ls -1 | nice -n 19 sort -r |\
    nice -n 19 sed -n -e '5000,$p' |\
     nice -n 19 tr '\n' ' ' )
   do rm $spamslice 2> /dev/null
  done
 popd > /dev/null
 [ -d "ham" ] && rm -r ham
 mkdir -p ham/tmp
 cd ham
 cp $inbox tmp/target
 nice -n 19 awk -f /home/bb/bin/bigmailfile_to_files.awk tmp/target
 rm -r tmp
 for hamslice in $( nice -n 19 ls -1 | nice -n 19 sort -r |\
    nice -n 19 sed -n -e "$(( 2 * $( ls -1 ../spam | wc -l ) )),\$p" |\
     nice -n 19 tr '\n' ' ' )
   do rm $hamslice 2> /dev/null
  done
 popd > /dev/null
 nice -n 19 chown -R bb.home /home/bb/m
 sa-learn -C /usr/share/spamassassin --clear
 sa-learn -C /usr/share/spamassassin --no-sync --ham /home/bb/m/ham
 sa-learn -C /usr/share/spamassassin --no-sync --spam /home/bb/m/spam
 sa-learn -C /usr/share/spamassassin --sync
 chown -R spamd.spamd /usr/share/spamassassin
fi
fi
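The trimming loops in the script can be pulled out into one small function. This is only a sketch: the name trim_corpus is made up, and it assumes file names contain no newlines and sort oldest-first by name (as I believe safecat's timestamp-first names do). Note that it keeps exactly the given number of files, whereas '5000,$p' above keeps 4999.

```shell
# Hypothetical helper, not part of the original scripts: keep only the
# last "keep" file names (in sort order) in a directory, delete the rest.
trim_corpus() {
    dir=$1
    keep=$2
    # Subshell so the caller's working directory is left alone.
    ( cd "$dir" || exit 1
      ls -1 | sort -r | sed -n "$(( keep + 1 )),\$p" |
      while IFS= read -r name
      do rm -f -- "$name"
      done )
}
```

With that in place, the spam loop would read trim_corpus /home/bb/m/spam 5000, and the ham loop would pass twice the spam count instead.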

I also honeypot web crawlers and Usenet with a dozen
nonexistent email addresses to collect some juicy spam. One
thing I do with that spam is take all the IPs from it and make
my own blacklist database, which goes through tcprules. The
resulting rules set RBLSMTPD, which is looked at by dnsbl
even though it runs under pperl. That hits on three to five
percent of my spam denials, I would say, because those
spammers hit me several times over a short interval.

update_blacklist() { # regress to bashism
 [ "$( hostname )" = "heinous.harmless.info" ] || return 0
 for a in /home/bb/m/sa-rbl/new/* # To: [EMAIL PROTECTED]
  do [ -f "$a" ] && \
   ( mv $a /home/bb/m/spam
     chown bb.home "/home/bb/m/spam/$( echo $a | sed 's/^.*\///g' )" )
  done
 pushd /home/bb/m/spam > /dev/null
 for oldspam in $( nice -n 19 ls -1 | nice -n 19 sort -r |\
    nice -n 19 sed -n -e '5000,$p' | nice -n 19 tr '\n' ' ' )
   do rm $oldspam 2> /dev/null
  done
 ( for spork in *
    do nice -n 19 sed -n -e '/^Received:/p' $spork | nice -n 19 tr '][)(' '\n\n\n\n'
   done ) | nice -n 19 sed -n -e '/^[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*$/p' |\
  nice -n 19 sort | nice -n 19 uniq | nice -n 19 sort -n |\
  nice -n 19 sed -e '/[0-9][0-9][0-9][0-9]/d' -e '/[3-9][0-9][0-9]/d' \
   -e '/2[6-9][0-9]/d' -e '/25[6-9]/d' | grep -v "37.79.123." | grep -v "^192" |\
  grep -v "^127" |\
  nice -n 19 tee -a /var/www/spammer.convicts.com/badlist | nice -n 19 sed \
   -e 's/^.*$/&:allow,RBLSMTPD="IP listed at http:\/\/convicts.com\/"/1'
 nice -n 19 chown -R bb.home /home/bb/m
 popd > /dev/null
}
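The chain of sed deletions that weeds out impossible octets can also be written as an explicit check. The sketch below is one way to do it, not what I actually run: is_ipv4, tcprules_line and filter_ips are made-up names, and the skip list only mirrors the "^127" and "^192" greps above (the local-network exclusion is left out).

```shell
# Hypothetical helpers, not part of the original function.

# Succeed only for four dot-separated decimal octets, each 0..255.
is_ipv4() {
    case $1 in
        *[!0-9.]*|.*|*.|*..*|'') return 1 ;;
    esac
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    [ $# -eq 4 ] || return 1
    for octet
    do
        [ ${#octet} -le 3 ] && [ "$octet" -le 255 ] || return 1
    done
    return 0
}

# Print one tcprules line that sets RBLSMTPD for the given address.
tcprules_line() {
    printf '%s:allow,RBLSMTPD="%s"\n' "$1" "$2"
}

# Read candidate addresses on stdin, one per line, and print rules for
# the valid ones, skipping the same prefixes the greps above skip.
filter_ips() {
    while IFS= read -r ip
    do
        is_ipv4 "$ip" || continue
        case $ip in
            127.*|192.*) continue ;;
        esac
        tcprules_line "$ip" "IP listed at http://convicts.com/"
    done
}
```

The output of the sort and uniq stage could be piped through filter_ips in place of the sed deletions, the greps and the final sed. The tee to the badlist would have to move, since filter_ips emits finished rules rather than bare addresses.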

-Bob
