On 4/20/2014 3:14 PM, Dan Mahoney, System Admin wrote:
> All,
> 
> Most of my users aren't command-line friendly.  I'd like to basically
> have my IMAP server default to handing out two imap mailboxes that get
> auto-crontabbed to training bayes.
> 

We do this, but you *really* need to trust your users to classify things
correctly.  So maybe you only advertise it to your "power" or
wise/discerning users.

In our IMAP setup (Dovecot/Pigeonhole) users all have a "Junk" folder in
the root of their mailbox.  Everybody's mailbox is a separate directory
in the Maildir format (one file per message).

Users that we trust are instructed to create "Junk/TrainAsSpam" and
"Junk/TrainAsNotSpam" folders under "Junk/".  Then they put their
misclassified messages into those folders.  The daily cron jobs then
inspect message files in those folders and run sa-learn on them.
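
In case the on-disk naming is not obvious: with the Maildir++ layout, a
nested IMAP folder such as "Junk/TrainAsSpam" is stored as a dot-prefixed
directory, with "." between the levels.  A rough sketch of that mapping is
below.  It assumes "/" is the separator your IMAP clients see (that depends
on your namespace configuration), and maildir_path is just a made-up helper
name, not something Dovecot ships.

```shell
#!/bin/bash
# Map an IMAP folder name (with "/" as separator) to its Maildir++ directory.
# maildir_path is a hypothetical helper for illustration only.
maildir_path() {
    local maildir="$1"   # e.g. /var/vmail/example.com/username/Maildir
    local folder="$2"    # e.g. Junk/TrainAsSpam

    # Strip any trailing "/" from the base, prefix the folder with ".",
    # and turn every "/" in the folder name into ".".
    printf '%s/.%s\n' "${maildir%/}" "${folder//\//.}"
}

maildir_path /var/vmail/example.com/username/Maildir Junk/TrainAsSpam
```

The resulting directory is where the cron job looks for the cur/ and new/
subdirectories.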

...

The key bit of the script that we use is:

find $UD/cur/ $UD/new/ -type f -name '*' -mtime -$DAYS \
    -exec ${SALEARN} --ham {} \;

$UD is the path to the user's directory, e.g.:

/var/vmail/example.com/username/Maildir/.Junk.TrainAsNotSpam

$DAYS is the maximum age, in days, of the messages to look at.  A value
of 3 works fine if you run this daily.

$SALEARN is simply the path to the sa-learn command, "/usr/bin/sa-learn".
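
If starting one sa-learn process per message turns out to be slow, a
possible variation is to let find batch the files with "{} +" (sa-learn
accepts several files on one command line).  This is only a sketch, not
what our attached scripts do: learn_recent is a hypothetical function
name, and the defaults for SALEARN and DAYS are the values described
above.

```shell
#!/bin/bash
# Sketch: train on recent messages in one Maildir folder, handing many
# files to each sa-learn call. learn_recent is a hypothetical helper.
SALEARN="${SALEARN:-/usr/bin/sa-learn}"
DAYS="${DAYS:-3}"

learn_recent() {
    local mode="$1"   # --ham or --spam
    local ud="$2"     # Maildir folder, e.g. .../Maildir/.Junk.TrainAsSpam

    # Nothing to do if the user never created the training folder.
    [ -d "$ud/cur" ] || return 0

    # Quoting the paths keeps folder names with spaces intact; "{} +"
    # passes as many files as fit to a single sa-learn invocation.
    find "$ud/cur/" "$ud/new/" -type f -mtime "-$DAYS" \
        -exec "$SALEARN" "$mode" {} +
}
```

It would be called once per user and folder, for example
learn_recent --spam "$UD" inside the loop.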

...

Naturally, I make no warranty for fitness of purpose of the attached
scripts.  Nor is this the only way to skin the cat.
#!/bin/bash

FIND=/usr/bin/find
GREP=/bin/grep
RM=/bin/rm
SED=/bin/sed
SORT=/bin/sort
SALEARN=/usr/bin/sa-learn

BASE="/var/vmail/"
DAYS=3

echo ""
echo "Train SpamAssassin (sa-learn) on mail newer than ${DAYS} days."
echo "Started at: " `date`
echo ""

# since RHEL5/CentOS5 don't have "sort -R" option to randomize, use the
# following example
# echo -e "2\n1\n3\n5\n4" | perl -MList::Util -e 'print List::Util::shuffle <>'

DIRS=`$FIND $BASE -maxdepth 3 -name subscriptions | \
    $SED -n 's:^/var/vmail/::p' | $SED 's:/subscriptions$:/:' | \
    perl -MList::Util -e 'print List::Util::shuffle <>'`

# keep track of directories processed so far
DCNT=0

for DIR in ${DIRS}
do
    UD="${BASE}${DIR}.Junk.TrainAsNotSpam"

    if [ -d "$UD/cur" ]
    then
        echo "`date` - Process: $DIR"
        echo " folder: $UD"
        echo "  files:" `find $UD/cur/ $UD/new/ -type f -name '*' | wc -l`
        echo " recent:" `find $UD/cur/ $UD/new/ -type f -name '*' -mtime -$DAYS | wc -l`
        sleep 1

        find $UD/cur/ $UD/new/ -type f -name '*' -mtime -$DAYS \
            -exec ${SALEARN} --ham {} \;
        sleep 1

        echo ""
    fi

    # the following is debug code, to stop the script after N directories
    DCNT=$(($DCNT+1))
    #echo "DCNT: $DCNT" 
    #if [[ $DCNT -ge 10 ]]; then exit 0; fi
done

echo "Finished at:" `date`
echo ""


#!/bin/bash

FIND=/usr/bin/find
GREP=/bin/grep
RM=/bin/rm
SED=/bin/sed
SORT=/bin/sort
SALEARN=/usr/bin/sa-learn

BASE="/var/vmail/"
DAYS=3

echo ""
echo "Train SpamAssassin (sa-learn) on mail newer than ${DAYS} days."
echo "Started at: " `date`
echo ""

# since RHEL5/CentOS5 don't have "sort -R" option to randomize, use the
# following example
# echo -e "2\n1\n3\n5\n4" | perl -MList::Util -e 'print List::Util::shuffle <>'

DIRS=`$FIND $BASE -maxdepth 3 -name subscriptions | \
    $SED -n 's:^/var/vmail/::p' | $SED 's:/subscriptions$:/:' | \
    perl -MList::Util -e 'print List::Util::shuffle <>'`

# keep track of directories processed so far
DCNT=0

for DIR in ${DIRS}
do
    UD="${BASE}${DIR}.Junk.TrainAsSpam"

    if [ -d "$UD/cur" ] 
    then
        echo "`date` - Process: $DIR"
        echo " folder: $UD"
        echo "  files:" `find $UD/cur/ $UD/new/ -type f -name '*' | wc -l`
        echo " recent:" `find $UD/cur/ $UD/new/ -type f -name '*' -mtime -$DAYS | wc -l`
        sleep 1

        find $UD/cur/ $UD/new/ -type f -name '*' -mtime -$DAYS \
            -exec ${SALEARN} --spam {} \;
        sleep 1

        echo ""
    fi

    # the following is debug code, to stop the script after N directories
    DCNT=$(($DCNT+1))
    #echo "DCNT: $DCNT"
    #if [[ $DCNT -ge 10 ]]; then exit 0; fi
done

echo "Finished at:" `date`
echo ""

