On 4/20/2014 3:14 PM, Dan Mahoney, System Admin wrote:
> All,
>
> Most of my users aren't command-line friendly. I'd like to basically
> have my IMAP server default to handing out two imap mailboxes that get
> auto-crontabbed to training bayes.
>
We do this, but you *really* need to trust your users to classify things correctly. So maybe you only advertise it to your "power" or wise/discerning users.

In our IMAP setup (Dovecot/Pigeonhole) users all have a "Junk" folder in the root of their mailbox. Everybody's mailbox is a separate directory in the Maildir format (one file per message).

Users that we trust are instructed to create "Junk/TrainAsSpam" and "Junk/TrainAsNotSpam" folders under "Junk/". Then they put their misclassified messages into those folders. The daily cron jobs then inspect the message files in those folders and run sa-learn on them.

...

The key bit of the script that we use is:

    find $UD/cur/ $UD/new/ -type f -name '*' -mtime -$DAYS -exec ${SALEARN} --ham {} \;

$UD is the path to the user's training folder, e.g.:

    /var/vmail/example.com/username/Maildir/.Junk.TrainAsNotSpam

$DAYS is the maximum age, in days, of the messages to look at. A value of 3 typically works fine if you run this daily.

$SALEARN is simply the path to the sa-learn command, "/usr/bin/sa-learn".

...

Naturally, I make no warranty for fitness of purpose of the attached scripts. Nor is this the only way to skin the cat.
#!/bin/bash

FIND=/usr/bin/find
GREP=/bin/grep
RM=/bin/rm
SED=/bin/sed
SORT=/bin/sort
SALEARN=/usr/bin/sa-learn

BASE="/var/vmail/"
DAYS=3

echo ""
echo "Train SpamAssassin (sa-learn) on mail newer than ${DAYS} days."
echo "Started at: " `date`
echo ""

# since RHEL5/CentOS5 don't have "sort -R" option to randomize, use the following example
# echo -e "2\n1\n3\n5\n4" | perl -MList::Util -e 'print List::Util::shuffle <>'

DIRS=`$FIND $BASE -maxdepth 3 -name subscriptions | \
    $SED -n 's:^/var/vmail/::p' | $SED 's:/subscriptions$:/:' | \
    perl -MList::Util -e 'print List::Util::shuffle <>'`

# keep track of directories processed so far
DCNT=0

for DIR in ${DIRS}
do
    UD="${BASE}${DIR}.Junk.TrainAsNotSpam"
    if [ -d "$UD/cur" ]
    then
        echo "`date` - Process: $DIR"
        echo " folder: $UD"
        echo " files:" `find $UD/cur/ $UD/new/ -type f -name '*' | wc -l`
        echo " recent:" `find $UD/cur/ $UD/new/ -type f -name '*' -mtime -$DAYS | wc -l`
        sleep 1
        find $UD/cur/ $UD/new/ -type f -name '*' -mtime -$DAYS -exec ${SALEARN} --ham {} \;
        sleep 1
        echo ""
    fi

    # the following is debug code, to stop the script after N directories
    DCNT=$(($DCNT+1))
    #echo "DCNT: $DCNT"
    #if [[ $DCNT -ge 10 ]]; then exit 0; fi
done

echo "Finished at:" `date`
echo ""
#!/bin/bash

FIND=/usr/bin/find
GREP=/bin/grep
RM=/bin/rm
SED=/bin/sed
SORT=/bin/sort
SALEARN=/usr/bin/sa-learn

BASE="/var/vmail/"
DAYS=3

echo ""
echo "Train SpamAssassin (sa-learn) on mail newer than ${DAYS} days."
echo "Started at: " `date`
echo ""

# since RHEL5/CentOS5 don't have "sort -R" option to randomize, use the following example
# echo -e "2\n1\n3\n5\n4" | perl -MList::Util -e 'print List::Util::shuffle <>'

DIRS=`$FIND $BASE -maxdepth 3 -name subscriptions | \
    $SED -n 's:^/var/vmail/::p' | $SED 's:/subscriptions$:/:' | \
    perl -MList::Util -e 'print List::Util::shuffle <>'`

# keep track of directories processed so far
DCNT=0

for DIR in ${DIRS}
do
    UD="${BASE}${DIR}.Junk.TrainAsSpam"
    if [ -d "$UD/cur" ]
    then
        echo "`date` - Process: $DIR"
        echo " folder: $UD"
        echo " files:" `find $UD/cur/ $UD/new/ -type f -name '*' | wc -l`
        echo " recent:" `find $UD/cur/ $UD/new/ -type f -name '*' -mtime -$DAYS | wc -l`
        sleep 1
        find $UD/cur/ $UD/new/ -type f -name '*' -mtime -$DAYS -exec ${SALEARN} --spam {} \;
        sleep 1
        echo ""
    fi

    # the following is debug code, to stop the script after N directories
    DCNT=$(($DCNT+1))
    #echo "DCNT: $DCNT"
    #if [[ $DCNT -ge 10 ]]; then exit 0; fi
done

echo "Finished at:" `date`
echo ""
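
The two scripts above differ only in the folder name and the sa-learn flag, so they could be folded into one helper. What follows is only a rough sketch of that idea, not what we actually run: train_folder is a made-up name, and SALEARN and DAYS are carried over from the scripts. It also swaps "\;" for "+", so find hands many message files to each sa-learn run instead of starting one process per message (sa-learn accepts several file arguments).

```shell
#!/bin/bash
# Sketch only: train_folder is a hypothetical helper, not part of the
# attached scripts. SALEARN and DAYS mean the same as in those scripts.
SALEARN="${SALEARN:-/usr/bin/sa-learn}"
DAYS="${DAYS:-3}"

# usage: train_folder <maildir-folder> <ham|spam>
train_folder() {
    ud=$1
    kind=$2

    case "$kind" in
        ham|spam) ;;
        *) echo "train_folder: kind must be ham or spam" >&2; return 2 ;;
    esac

    # Skip users who have not created the training folder.
    [ -d "$ud/cur" ] || return 0

    for sub in cur new; do
        if [ -d "$ud/$sub" ]; then
            # "{} +" batches the matching files into as few sa-learn runs
            # as possible; nothing is run when no file matches.
            find "$ud/$sub" -type f -mtime "-$DAYS" \
                -exec "$SALEARN" "--$kind" {} +
        fi
    done
}

# Example calls, using the folder layout described in the message:
# train_folder "/var/vmail/example.com/username/Maildir/.Junk.TrainAsNotSpam" ham
# train_folder "/var/vmail/example.com/username/Maildir/.Junk.TrainAsSpam" spam
```

Inside the existing loop, the body would then shrink to two train_folder calls per user, one for each folder.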