On Sun, Nov 20, 2005 at 01:57:15PM +0000, Henry Stern wrote:
> Does your implementation respect the additional constraints required by
> SpamAssassin?  The constraints are as follows:
> 
> 1.  Only "nice" rules may have scores less than 0.
> 2.  No rule may have a score above 5.
Both of these could be added as constraints to the SVM's original
optimization problem. However, they might break the maximum-margin
hyperplane guarantee and make efficient solutions impossible, so
this is not advisable. I know that this has been done for linear
regression, where it is feasible, so that might be a better way.
Initially I used linear regression: it performs similarly and seems
less sensitive to unequal class distributions, but performs worse on
small datasets (<1000 examples) - about what would be expected.
Logistic regression may also be an option, as it is more
similar to an SVM.
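
To illustrate what I mean by doing it via regression instead: under a
linear model the two constraints become simple per-coefficient bounds,
and box-constrained least squares remains efficiently solvable. A
minimal sketch, assuming a 0/1 rule-hit matrix X, a toy per-mail
target y, and an is_nice flag per rule (all names and inputs are mine,
not taken from SA or my scripts):

  # Minimal sketch: fit SA-style scores by box-constrained least squares.
  # X is a (mails x rules) 0/1 rule-hit matrix, y a per-mail target
  # (e.g. +1 spam / -1 ham), is_nice a boolean flag per rule -- all of
  # these are assumed inputs, not anything taken from SA.
  import numpy as np
  from scipy.optimize import lsq_linear

  def fit_scores(X, y, is_nice, max_score=5.0):
      lower = np.where(is_nice, -np.inf, 0.0)   # constraint 1
      upper = np.full(X.shape[1], max_score)    # constraint 2
      return lsq_linear(X, y, bounds=(lower, upper)).x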


> Constraint 1 is required because it must be impossible for a spammer to
> add content to their messages that will reduce its score.  This is a
> hard-learned lesson.  In the past, we had some rules that detected
> headers added by common e-mail clients.  Spammers added these headers
> deliberately to cause SpamAssassin to misclassify their messages.
If that is the case, the rule will obviously end up with a score of 0,
provided this behaviour is reflected in the training data.


> Constraint 2 is required by SpamAssassin's (unofficial?) development
> policy to reduce the risk of false positives.  If no one rule can cause
> a message to be marked as spam then at least two rules must fire on a
> legitimate message in order for it to be incorrectly marked.  This also
> reduces the degradation of the true positive rate over time as
> developers are forced to look for multiple features to identify spam
> instead of relying on one brittle feature.
So you additionally force the bias of the linear function to be
exactly 5. This adds a lot of constraints that are strictly
unnecessary for solving the learning problem. IMHO the false positive
rate should rather be tested via CV on a large corpus of recent ham
and spam mails instead of imposing these synthetic constraints.

For example, the more rules you have, the more likely it is that two
rules fire on the same legitimate message; so as the rule set grows,
there will come a point where you might need to change this
constraint.
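
Just to make that concrete with a toy model: if each rule fired
independently on a ham mail with the same small probability p (an
unrealistic assumption, but enough to show the trend), the chance
that at least two fire grows quickly with the number of rules:

  # Toy model: probability that at least two of n independent rules fire
  # on a ham mail, when each fires with (assumed) probability p.
  def p_two_or_more(n, p):
      return 1.0 - (1 - p) ** n - n * p * (1 - p) ** (n - 1)

  for n in (100, 500, 1000):
      print(n, round(p_two_or_more(n, 0.001), 4))

With p = 0.001 this goes from well under 1% at 100 rules to roughly a
quarter at 1000 rules, so the protection constraint 2 is meant to
provide erodes as the rule set grows.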


> Lastly, something that most people do not take into consideration is
> that the accuracy numbers measured in the lab do not at all resemble
> those that will be observed in the real world.  SpamAssassin's
> real-world accuracy depends more on the efficacy of the rules in use
> than on the scores assigned to them:  almost every false negative that
> you will see in the real world has a score close to 0 because the
> spammer is going to great efforts to avoid triggering any rules.
Perhaps I did not make myself clear: systems trained in a similar
way were tested over the last 18 months at the Austrian Research
Institute for Artificial Intelligence, by myself and seven test
users. The final institute-wide model, trained on 60,000 mails, is in
the process of being deployed for all users (around 40 people).
In all that time, only a single FP was reported, and it was caused by
the auto-whitelist (on an open mailing list which occasionally
receives spam mails), which was subsequently switched off for all
users.

The reported FN rate ranges from 1% to 0.1%. A closer look of my own,
based on an eight-month real-time record of received mails, indicates
that the FP rate is underreported and should be on the order of 0.5%
for the most recent system. This is similar to human performance on
the same task. You can find the results in my principal paper, which
can be downloaded from alex.seewald.at/spam

The default score set of SA performs poorly: over eighteen weeks,
the FN rate increased from 30% to 70%. See TR-2005-04, A Close Look
at Current Approaches in Spam Filtering, Fig. 2. For this specific
experiment I stored only the classification, so I cannot check what
the highest score of a misclassified spam mail was; I would expect
it to be much higher than 0.


> In the case of textual rules, the problem spammers test their messages
> against SpamAssassin before they are sent out.  As an example, Send
> Safe, a popular spam tool, has an embedded copy of SpamAssassin that the
> spammer can use to test their message before it's ever sent out.
As I mentioned in a previous mail: one score set for the whole world
may not be sufficient. If everybody used their own score set,
spammers would find it much harder to circumvent. E.g. if spammers
became very proficient at triggering a nice rule, that rule would over
time lose its negative score during retraining - exactly what _should_ happen.


> In the case of network rules (RBLs and URIBLs), they chose attack
> patterns that attempt to out-pace the update frequency of the
> blacklists.  There are two notable spam gangs that rapidly rotate the
> links to their landing pages.  Two weeks ago, one of those gangs was
> abusing tripod.com's free hosting service and was using each link for
> less than 5 minutes each!
This is exactly why I use only local tests in my experiments.


> If you are interested in improving the state of the art of machine
> learning and spam filtering, the area that I think needs the most work
> is how to evaluate a model.  Traditional methods such as bootstrap or
> cross-validation are somewhat meaningless because they do not take into
> account the fact that spam evolves along a timeline and that one
> generally receives multiple copies of the same spam.  A better
> evaluation method would model how responsive a learning algorithm is to
> spam as it changes over time.  In my opinion, because of the high degree
> of duplication in your average corpus, current evaluation methods test
> to see how well a classifier can detect duplicate (with some fuzz) entries.
Actually, the degree of duplication in SA's representation
(the set of triggered rules) is much higher than in the original
mails. sa-learn, for example, checks for duplicate mails, and up to
5-10% of the spam mails are indeed exactly identical by its measure.
However, if we look at SA's representation, about 50% of the mails
are duplicates. I actually used this to improve the runtime of my
script.
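
For what it's worth, the speedup amounts to caching the result per
distinct rule-hit set; a rough sketch of the idea, where rule_hits()
and score_fn() are hypothetical stand-ins rather than actual SA or
script interfaces:

  # Sketch: cache the classification per distinct set of triggered rules.
  # rule_hits() and score_fn() are hypothetical stand-ins, not SA APIs.
  def classify_all(mails, rule_hits, score_fn):
      cache, results = {}, []
      for mail in mails:
          key = frozenset(rule_hits(mail))  # order-independent, hashable
          if key not in cache:
              cache[key] = score_fn(key)    # runs for ~half the mails only
          results.append(cache[key])
      return results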

According to my rather extensive experiments, CV (when used with
spam mails from a short time period - mailbox collection is probably
the key) can indeed be used to get a reasonable estimate of FP and
FN rates. The same data can be reused to get a reasonably good model,
and mailbox collection is far less costly than recording all your ham
and spam mails over eight months (as I have done, to be able to check
for time effects - another section in the TR-2005-20 paper). Updating
every 4-6 months yields acceptable performance and focuses the
collection effort on a short timespan of 1-2 days, which may be
more acceptable to users.
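
Concretely, that kind of estimate can come from any off-the-shelf
cross-validation loop; here is a minimal sketch using scikit-learn,
where the rule-hit matrix X and labels y are placeholders and
logistic regression merely stands in for whichever learner one
actually uses:

  # Sketch: estimate FP/FN rates by stratified CV on a short-period
  # mailbox.  X is a (mails x rules) rule-hit matrix, y is 1 for spam
  # and 0 for ham -- placeholder names, not real data.
  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import StratifiedKFold, cross_val_predict

  def cv_error_rates(X, y, folds=10):
      clf = LogisticRegression(max_iter=1000)   # stand-in learner
      cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
      pred = cross_val_predict(clf, X, y, cv=cv)
      fp_rate = np.mean(pred[y == 0] == 1)      # ham flagged as spam
      fn_rate = np.mean(pred[y == 1] == 0)      # spam let through
      return fp_rate, fn_rate

One caveat, related to the duplication issue above: near-identical
mails should not end up on both sides of a train/test split, or the
estimate becomes optimistic.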

Clearly, having large sets of ham and spam mails that exactly
reflect the order and timing of the received mails would be
preferable, but there are large practical problems in getting these:
* Humans tend to overlook FPs, especially at high spam:ham ratios.
  At our institute we have around 20:1, which means that to find a
  single FP mail (at an FP rate of 0.5%), people have to look through
  4000(!) spam mails on average - see the short calculation after
  this list. This explains, for example, the discrepancy between
  BrightMail's reported 1:10^6 FP rate and my measured 1:2000.
* It is hard to motivate people to put so much effort into any data
  collection process. Even just forwarding all spam mails that came
  through to a central mail address may seem like too much work to
  them. Even the FN rate can thus not be established reliably. I did
  this for eight months, but I really don't want to do it again
  anytime soon. Especially looking through the spams a second time
  was a real effort, although I did find a few false positives that
  I had overlooked in the first round.
* A good spam filter makes the problem worse, as it increases the
  work needed to find FPs, and many people simply stop once it seems
  to work well enough. This is rational behaviour: is it really worth
  trying to beat human performance here, when human performance
  itself hampers your data collection and needs to be accounted for?
* Lastly, the spams people get differ: in initial experiments, I
  used my own model for three users, and one of them had an FN rate
  of 50% - i.e. 50% of their spam was unknown to my model. Again,
  this emphasizes that a single score set will most certainly not
  work for all users, and makes it easier for spammers to circumvent
  it.
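
The calculation behind the 4000 figure in the first point, for anyone
who wants to plug in their own ratio and FP rate (plain arithmetic,
nothing SA-specific):

  # Average number of spam-folder mails to check per false positive,
  # given the spam:ham ratio and the FP rate.
  def spam_mails_per_fp(spam_ham_ratio, fp_rate):
      ham_per_fp = 1.0 / fp_rate           # ham mails per false positive
      return ham_per_fp * spam_ham_ratio   # spam arriving alongside them

  print(spam_mails_per_fp(20, 0.005))      # -> 4000.0, as claimed above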

IMHO the problem is _not_ the evaluation: there are enough ways to
check it, and CV / bootstrap has its uses, as has the utilization of
past ham mails to create large mail collections in relatively short
time. The problem is that to go further, we would need both ham and
spam data from a large number of very motivated users, practically in
real time, and we would also have to adapt the model in real time,
faster than the spammers adapt their mails. IMHO this can only be
done by automating the adaptation process itself, which naturally
leads to incremental learning as an important prerequisite. Symantec
BrightMail has done this to excess: around 700MB of updates were sent
to my machine in the month when I tested their evaluation kit - and
_still_ the results are not significantly better than using a static
model, trained similarly to SA-Train.pl, updated every 4-6 months,
with an appropriate threshold. This might change if a lot of people
were using it, as it would then become a target for spammers to
attack. I am not sure that would be a good idea...

And here is another problem: SA is not up to this task. In all my
experiments, SA - even when trained with SVMs - is just as good as a
pure Naive Bayes learner (I used SpamBayes), so the additional
background knowledge of human-created rules does not help it achieve
better performance. This is especially apparent when considering
performance as a function of threshold: the trade-off curves are
almost exactly the same! SA's only advantage is a smaller model
memory footprint; incremental learning would be possible, but far
from trivial, for the combined SVM+NB system.
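
The comparison I mean is nothing more than sweeping the threshold for
both systems and overlaying the resulting FP/FN curves; a minimal
sketch, where the score arrays and labels are names of my choosing,
not output of any particular tool:

  # Sketch: FP/FN trade-off of a filter over a sweep of thresholds.
  # scores are per-mail scores (higher = more spammy), y is 1 for spam,
  # 0 for ham; all names are placeholders.
  import numpy as np

  def tradeoff_curve(scores, y, thresholds):
      curve = []
      for t in thresholds:
          pred = scores >= t
          fp = np.mean(pred[y == 0])    # ham flagged as spam
          fn = np.mean(~pred[y == 1])   # spam let through
          curve.append((t, fp, fn))
      return curve

Overlaying the curves for the SVM-trained SA model and for the Naive
Bayes scores makes the "almost exactly the same" observation directly
visible.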

So for my personal filter I have been using SpamBayes for several
months now, since it facilitates incremental training and might even
be slightly faster. Performance, as expected, is very similar to SA
at a 0.5 threshold. I use 0.1 to get a much lower FN rate at a
slightly increased FP rate, which still seems to be on the order of
0.5%.

The spam problem will not be solved by better filters, but by
automating defensive and aggressive measures to hit back at the
computers that send out the spam. This is what I call proactive spam
filtering, and it is what I am currently working on. A well-performing
spam filter remains an important part of such a system, but really
improving on today's spam filters is likely to be a herculean task -
something I could only do if I had a research project and could focus
exclusively on this topic for several years. We will see about that...

Best,
  Alex
-- 
Dr.techn. Alexander K. Seewald

Solutions for the 21st century   +43(664)1106886
------------------------------------------------
         Information wants to be free;
Information also wants to be expensive (S.Brant)
--------------- alex.seewald.at ----------------
