https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #78 from Mark Martinec <mark.marti...@ijs.si> 2009-10-07 09:56:41 PDT ---
> I cleaned up my few FPs and some other stuff, new logs sent..

Thanks to Daryl and Henrik. I'm still waiting for the bluestreak, but
meanwhile I am running garescorer on what I have (including the recent
updates).

Btw, Daryl, you haven't commented on:

/home/dos/SA-corpus/ham/leah/
  INBOX-Inbox-2007/1196258008.P18803Q16.dilbert.dostech.net:2,S

/home/dos/SA-corpus/ham/leah/
  INBOX-Inbox-2007/1199765108.P20983Q90.dilbert.dostech.net:2,RS

/home/dos/SA-corpus/ham/dos/
  Inbox-2008/1221834769.M749008P21562V0000000000000302I00414902_237.\
  cyan.dostech.net,S=26243:2,S


> Talking about weights, does anyone have an academic answer on how results are
> affected when some corpora are uniqued (at least mine is) and some are not?

Don't know. I removed exact duplicates (by mail body) from my corpus, but
because 'personalized' spam is becoming prevalent nowadays, thanks to the
free CPU resources on botnets, plenty of very similar yet different
messages remain in the corpus. I removed some of these manually, but
being thorough is impractical.
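
As an illustration only (not the script actually used here), exact-duplicate
removal by mail body can be sketched as below. The assumptions are that each
message is available as raw bytes, that the body is everything after the
first blank line, and that the names body_digest and unique_by_body are
hypothetical helpers. Near-duplicates such as personalized spam are
deliberately not caught by this approach.

```python
import hashlib


def body_digest(raw: bytes) -> str:
    """Return a SHA-256 digest of the message body only (headers ignored)."""
    # Normalize CRLF to LF so the same body with different line endings matches.
    text = raw.replace(b"\r\n", b"\n")
    # The body starts after the first empty line; no empty line means no body.
    _headers, found, body = text.partition(b"\n\n")
    return hashlib.sha256(body if found else b"").hexdigest()


def unique_by_body(messages):
    """Keep the first message seen for each distinct body.

    `messages` is an iterable of (name, raw_bytes) pairs; the return value
    is the list of names that survive deduplication, in input order.
    """
    seen = set()
    kept = []
    for name, raw in messages:
        digest = body_digest(raw)
        if digest not in seen:
            seen.add(digest)
            kept.append(name)
    return kept
```

Only byte-identical bodies (after line-ending normalization) are merged, so
a message that differs by a single personalized word is kept.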


> Might we consider assigning different confidence weights to ham corpora?
>
> For example, my ham corpora are relatively small in number, but I have strong
> confidence that they are thoroughly cleaned.  Furthermore they are extremely
> varied in sources and likely to be different from other masscheck
> participants.
> I have also filtered out all discussion mailing lists and automated report

I do recognize that corpora are quite different in several aspects, although
I don't know how one could weight them more fairly and incorporate that
into the current procedure.

Let me just document here what I'm doing now with a local copy of all
submitted logs.

Because spam-bayes-net-dos.log and spam-bayes-net-jm.log are
disproportionately large compared to the rest, I'm taking a random sample
of each of these files, restricted to scoreset 3 and age below 6 months,
decimated to 150,000 entries each (I initially used 100,000, but have now
bumped it up).
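
A rough sketch of this kind of sampling follows; it is not the script
actually used. It assumes that each log line ends with a comma-separated
key=value field containing 'time=<epoch seconds>' and 'set=<n>', that
comment lines start with '#', and that six months can be approximated as
183 days. The names parse_meta and sample_log are hypothetical.

```python
import random

SIX_MONTHS = 183 * 24 * 3600  # approximate six months, in seconds


def parse_meta(line):
    """Parse the last whitespace-separated field as key=value pairs."""
    fields = line.split()
    meta = {}
    if fields:
        for item in fields[-1].split(","):
            key, sep, value = item.partition("=")
            if sep:
                meta[key] = value
    return meta


def sample_log(lines, now, limit=150_000, seed=None):
    """Keep set=3 entries younger than six months, capped at `limit` lines."""
    eligible = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment lines
        meta = parse_meta(line)
        if meta.get("set") != "3":
            continue
        try:
            timestamp = int(meta["time"])
        except (KeyError, ValueError):
            continue  # no usable timestamp, so the age cannot be checked
        if now - timestamp < SIX_MONTHS:
            eligible.append(line)
    if len(eligible) <= limit:
        return eligible
    # Uniform random sample without replacement; a seed makes it repeatable.
    return random.Random(seed).sample(eligible, limit)
```

Passing a fixed seed makes the decimation reproducible, which helps when the
same logs are rescored several times.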

The other spam logs contain some entries older than 6 months, though not
many (mostly in the 'hege' collection). As these seem to be mainly
hand-selected fraud samples, I'm keeping them regardless of age.

Due to a shortage of ham, I'm keeping all of it regardless of age. This
mainly affects JM's ham collection, which contains a (smaller) share of
older ham; the remaining collections are fairly recent.

There are no scoreset 0 and 2 entries in any of the logs. So for the
scoreset 3 and 2 runs I'm using a selection from the logs with 'set=3'.
For the scoreset 0 and 1 runs I'm using all entries (set=1 and set=3).
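
The selection rule above could be expressed roughly as follows. This is only
a sketch under the assumption that the 'set=<n>' marker sits in the last
comma-separated field of each log line; log_set and select_for_scoreset are
hypothetical names.

```python
def log_set(line):
    """Return the value of 'set=' from the last field of a log line, or None."""
    fields = line.split()
    if not fields:
        return None
    for item in fields[-1].split(","):
        if item.startswith("set="):
            return item[len("set="):]
    return None


def select_for_scoreset(lines, scoreset):
    """Pick the log entries used for a rescoring run of the given scoreset."""
    if scoreset not in (0, 1, 2, 3):
        raise ValueError("scoreset must be 0, 1, 2 or 3")
    # Runs for scoresets 2 and 3 use only entries logged with set=3;
    # runs for scoresets 0 and 1 use every entry (set=1 and set=3).
    wanted = ("3",) if scoreset in (2, 3) else ("1", "3")
    return [line for line in lines if log_set(line) in wanted]
```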

This all amounts to the following 'wc -l' counts:

  463957 ham-full-set1.log
  483402 spam-full-set1.log

  293637 ham-full-set3.log
  443635 spam-full-set3.log

This seems reasonably fair and balanced to me.
