On Wed, 14 Nov 2007 16:51:42 +0100, Thomas Viehmann <[EMAIL PROTECTED]> said:
> Hi Manoj,

> If you have suggestions how automatic testing can be incorporated into
> a spam-removal process in a way that is acceptable to the project,
> I'd be very happy to see them discussed here. However I'm not sure
> that the bias that we (there are six people currently seeing how
> things work) currently impose in our manual review can be very well
> implemented in software. What to do with "sponsorship request spam"
> from people claiming to be students or clans, what to do with foreign
> language spam that people reply to with translation and the
> explanation "ignore, this is spam", what to do with the reply?

        If you can get a corpus of past messages from people claiming
 to be students or clans, we can fine-tune a crm114 filter to identify
 these mails. Given the narrow range of the messages we are
 classifying, I am pretty sure that the number of "unsure/retrain"
 messages that humans need to ponder over can be reduced by 2-3 orders
 of magnitude.

        There is no reason to have only one filtering pass, especially
 since we are not dealing with a stream of incoming mail.

        My take on this is that we automate the process by passing the
 unknown mbox through the crm114+SA filter, and classify the mail into
 Ham, Unsure, and Spam. The Unsure mail would be manually inspected and
 used to further train the filters, as would any erroneous
 classification (TOE). Periodically, we TUNE (Train Until No Error)
 the Corpus.

        Hey, if this works as well for list mail as it does for me (and
 my email is mostly Debian list mail), it might even work to filter
 incoming mail. But we'll see.

>> It would be interesting to see how many messages escape my filters,
>> and give me an opportunity to further train them. All I need would be
>> the mbox file; and for me to set up a process to feed the email to the
>> filters, and classify the result -- and then send the message
>> IDs of Ham and Spam back to Debian.

> There are a couple of almost-mboxes linked from [1]. Before the first
> "From " there is a mbox-like header but from there on it is a regular
> mbox archive consisting of the nominations. Preliminary results
> indicate that around 2/3 of the submissions for debian-project are
> actually removal candidates (based on review by pabs and me, there are
> others looking at the same things). The information in the initial
> headers should be fairly self-explanatory, the number besides year,
> month, and message number is the number of times a message was
> reported as spam.

> I can easily put up more of these, of course, just tell me what you
> want. (There are ca. 90000 nominated messages, it is unclear to me
> whether old data is equally usable as newer.)

        I can try setting up infrastructure to classify an mbox (create
 a new user, write a simple script to parse the mbox, feed the mails to
 crm114+SA, and use mailagent to filter into Ham, Spam, and Unsure).

        I have a dog-and-pony show coming up Dec 3rd, so I might not be
 very responsive, at least until I am sure my software is working, but
 I'll grab an mbox and see what the results look like. If the setup is
 mostly working, future mboxes can be handled mostly automatically.

        If you have a human-scanned set of list mails known to be Spam,
 or known to be Ham, etc., I can use those either to augment my Corpus
 or to use in place of my personal Corpus, to better reflect your
 judgement of what is or is not Spam.

        manoj
-- 
Time flies like an arrow, fruit flies like a banana. Frequently
attributed to Groucho Marx
Manoj Srivastava <[EMAIL PROTECTED]> <http://www.debian.org/~srivasta/>
1024D/BF24424C print 4966 F272 D093 B493 410B 924B 21BA DABB BF24 424C
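
A minimal sketch of the three-way split discussed in this message (parse
an mbox, score each mail, sort the Message-IDs into Ham, Unsure, and
Spam), written in Python with the standard mailbox module. It is an
illustration only, not the setup described above: the thresholds HAM_AT
and SPAM_AT, the function classify_mbox, and the keyword-based toy_score
are all hypothetical, and toy_score merely stands in for whatever score a
real crm114+SA pass would return.

```python
import mailbox
from typing import Callable, Dict, List

# Hypothetical cut-offs: at or above SPAM_AT is Spam, at or below HAM_AT
# is Ham, anything in between is left for a human to inspect.
HAM_AT = 0.2
SPAM_AT = 0.8


def classify_mbox(path: str,
                  score: Callable[[str], float]) -> Dict[str, List[str]]:
    """Sort the Message-IDs found in an mbox into ham/unsure/spam lists."""
    buckets: Dict[str, List[str]] = {"ham": [], "unsure": [], "spam": []}
    box = mailbox.mbox(path, create=False)
    try:
        for msg in box:
            msg_id = msg.get("Message-ID", "<unknown>")
            value = score(msg.as_string())
            if value >= SPAM_AT:
                buckets["spam"].append(msg_id)
            elif value <= HAM_AT:
                buckets["ham"].append(msg_id)
            else:
                buckets["unsure"].append(msg_id)
    finally:
        box.close()
    return buckets


def toy_score(text: str) -> float:
    """Placeholder scorer (assumption): a real filter would go here."""
    lowered = text.lower()
    if "sponsorship" in lowered:
        return 0.9
    if "debian" in lowered:
        return 0.1
    return 0.5
```

Calling classify_mbox("nominations.mbox", toy_score) (the file name is
made up) returns a dict of three lists of Message-IDs; only the "unsure"
list would need manual review, and its verdicts could then be fed back
as training data.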