On 21.01.2016 at 07:21, Marc Perkel wrote:
OK - Just to show you this isn't Bayesian - see if you can do this.

Here is a list of 5505874 words and phrases used in the subject line of HAM and never seen in the subject line of SPAM
http://www.junkemailfilter.com/data/subject-ham.txt

Here is a list of 3494938 words and phrases used in the subject line of SPAM and never seen in the subject line of HAM
http://www.junkemailfilter.com/data/subject-spam.txt

Hope you understand it now. Not Bayesian!!!!
Don't get me wrong, but I don't take anybody seriously who needs "!!!!", and if you don't stop advertising that aggressively you will be classified as a spammer too.
177 MB of subjects only? Well, that's not really impressive, given that I easily get the same results with an 81 MB bayes-db containing the *complete* junk of 1.5 years but only selected ham (mail reported as wrongly classified, my personal mail and a few inboxes from nice users).
If I can get the same results with a 600 MB corpus containing around 81000 messages, the only thing I understand now is that your approach is not really efficient and needs access to all mail for training, which is a no-go.
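For reference, lists like the two quoted above (phrases seen in ham subjects and never in spam subjects, and the reverse) amount to a set difference over the whole training corpus. The following is only a minimal sketch of that idea, not the actual implementation behind those files: the function names, the whitespace tokenization, the lowercasing and the three-word phrase limit are all assumptions made for illustration.

```python
def phrases(subject, max_words=3):
    """Return every run of 1..max_words consecutive words in a subject line.

    Whitespace splitting, lowercasing and max_words=3 are assumptions.
    """
    words = subject.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_words + 1)
        for i in range(len(words) - n + 1)
    }


def exclusive_phrases(ham_subjects, spam_subjects):
    """Split phrases into those seen only in ham and those seen only in spam."""
    ham = set().union(*(phrases(s) for s in ham_subjects))
    spam = set().union(*(phrases(s) for s in spam_subjects))
    # Anything that occurs in both classes is dropped from both lists.
    return ham - spam, spam - ham


if __name__ == "__main__":
    ham_only, spam_only = exclusive_phrases(
        ["Meeting notes for Monday"],
        ["Cheap pills for you"],
    )
    print(sorted(ham_only))
    print(sorted(spam_only))
```

Both sets can only be computed from every subject line of both classes, which is why such lists need access to all mail rather than a selected sample.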
[harry@srv-rhsoft:~]$ curl --head http://www.junkemailfilter.com/data/subject-spam.txt
HTTP/1.1 200 OK
Date: Thu, 21 Jan 2016 08:12:15 GMT
Server: Apache/2.2.15 (CentOS)
Last-Modified: Thu, 21 Jan 2016 06:11:41 GMT
ETag: "340315d-446e47c-529d1f9f0676b"
Accept-Ranges: bytes
Content-Length: 71754876
Connection: close
Content-Type: text/plain

[harry@srv-rhsoft:~]$ curl --head http://www.junkemailfilter.com/data/subject-ham.txt
HTTP/1.1 200 OK
Date: Thu, 21 Jan 2016 08:12:25 GMT
Server: Apache/2.2.15 (CentOS)
Last-Modified: Thu, 21 Jan 2016 06:09:18 GMT
ETag: "340309c-645b7a1-529d1f16ad5db"
Accept-Ranges: bytes
Content-Length: 105232289
Connection: close
Content-Type: text/plain
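The 177 MB figure is simply the sum of the two Content-Length headers. A small sketch of that arithmetic, assuming MB means 10^6 bytes; the dictionary keys are only labels taken from the URLs:

```python
# Content-Length values reported by the two HEAD requests, in bytes.
sizes = {
    "subject-spam.txt": 71754876,
    "subject-ham.txt": 105232289,
}

total_bytes = sum(sizes.values())
total_mb = total_bytes / 1_000_000   # decimal megabytes (assumed unit)
total_mib = total_bytes / 2**20      # binary mebibytes, for comparison

print(f"{total_bytes} bytes = {total_mb:.1f} MB = {total_mib:.1f} MiB")
```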