https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155
--- Comment #78 from Mark Martinec <mark.marti...@ijs.si> 2009-10-07 09:56:41 PDT ---

> I cleaned up my few FPs and some other stuff, new logs sent..

Thanks to Daryl and Henrik, I'm still waiting for the bluestreak, but meanwhile am running garescorer on what I have (including the recent updates).

Btw, Daryl, you haven't commented on:

  /home/dos/SA-corpus/ham/leah/
    INBOX-Inbox-2007/1196258008.P18803Q16.dilbert.dostech.net:2,S
  /home/dos/SA-corpus/ham/leah/
    INBOX-Inbox-2007/1199765108.P20983Q90.dilbert.dostech.net:2,RS
  /home/dos/SA-corpus/ham/dos/
    Inbox-2008/1221834769.M749008P21562V0000000000000302I00414902_237.\
    cyan.dostech.net,S=26243:2,S

> Talking about weights, does anyone have an academic answer on how results are
> affected when some corpuses are uniqued (atleast mine is) and some are not?

Don't know. I removed exact duplicates on mail body from my corpus, although due to 'personalized' spam which is becoming prevalent nowadays thanks to the free CPU resources on botnets, there are still plenty of very similar yet different messages left in the corpus. I did some manual removal on these, but it is very impractical to be thorough.

> Might we consider assigning different confidence weights to ham corpa?
>
> For example, my ham corpa are relatively small in number, but I have strong
> confidence that they are thoroughly cleaned. Furthermore they are extremely
> varied in sources and likely to be different from other masscheck
> participants.
> I have also filtered out all discussion mailing lists and automated report

I do recognize that corpora are quite different in several aspects, although I don't know how one can weight them more fairly and incorporate it into the current procedure.

Let me just document here what I'm doing now with a local copy of all submitted logs.

Because spam-bayes-net-dos.log and spam-bayes-net-jm.log are disproportionately large compared to the rest, I'm taking a random sample of each of these two files, restricted to scoreset 3 and age below 6 months, and decimated to 150,000 entries each (I initially used 100,000, but have now bumped it up).

The other spam logs contain some entries older than 6 months, but not too many (mostly in the 'hege' collection). As these seem to be mainly hand-selected fraud samples, I'm keeping them regardless of age.

Due to the shortage of ham, I'm keeping all of it regardless of age. This mainly concerns JM's ham collection, which contains a (smaller) share of older ham; the remaining collections are fairly recent.

There are no scoreset 0 or scoreset 2 entries in any of the logs. So for the scoreset 2 and 3 runs I'm using only the log entries with 'set=3'. For the scoreset 0 and 1 runs I'm using all entries (set=1 and set=3).

This all amounts to the following 'wc -l' counts:

  463957 ham-full-set1.log
  483402 spam-full-set1.log
  293637 ham-full-set3.log
  443635 spam-full-set3.log

This seems reasonably fair and balanced to me.
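
For readers who want to reproduce a similar selection, here is a minimal Python sketch of the filter-then-decimate step described above. It is not the script actually used for this comment. It assumes that each log line ends with a comma-separated `key=value` field containing `time=<epoch seconds>` and `set=<scoreset>`, and the function names (`parse_meta`, `select_entries`, `decimate`) and the 183-day approximation of six months are hypothetical choices made for illustration only.

```python
import random

# Rough six-month window in seconds (an assumption; the comment only says "6 months").
SIX_MONTHS_SECONDS = 183 * 24 * 3600


def parse_meta(line):
    """Parse the trailing 'key=value,key=value' field of a log line.

    The layout (metadata in the last whitespace-separated field) is assumed.
    """
    fields = line.split()
    if not fields:
        return {}
    meta = {}
    for pair in fields[-1].split(","):
        key, sep, value = pair.partition("=")
        if sep:
            meta[key] = value
    return meta


def select_entries(lines, now, scoreset="3", max_age=SIX_MONTHS_SECONDS):
    """Keep entries from the given scoreset that are younger than max_age."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment lines
        meta = parse_meta(line)
        if meta.get("set") != scoreset:
            continue
        try:
            timestamp = int(meta.get("time", ""))
        except ValueError:
            continue  # no usable timestamp, so age cannot be checked
        if now - timestamp < max_age:
            kept.append(line)
    return kept


def decimate(lines, limit, seed=None):
    """Return a uniform random sample of at most `limit` entries."""
    if len(lines) <= limit:
        return list(lines)
    return random.Random(seed).sample(lines, limit)
```

Under those assumptions, the two oversized spam logs would each go through something like `decimate(select_entries(lines, now), 150000)`, while the ham logs and the smaller spam logs would skip the age filter.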