On 21.01.2016 at 07:21, Marc Perkel wrote:
OK - Just to show you this isn't Bayesian - see if you can do this.

Here is a list of 5505874 words and phrases used in the subject line of
HAM and never seen in the subject line of SPAM

http://www.junkemailfilter.com/data/subject-ham.txt

Here is a list of 3494938 words and phrases used in the subject line of
SPAM and never seen in the subject line of HAM

http://www.junkemailfilter.com/data/subject-spam.txt

Hope you understand it now. Not Bayesian!!!!
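
For readers following along: the two lists described above amount to a pair of set differences over subject-line phrases. The following is only a minimal sketch of that idea, not the actual junkemailfilter.com implementation. The function names, the word-based tokenization, and the three-word phrase limit are all assumptions made for illustration.

```python
def phrases(subject, max_words=3):
    """Return every run of 1..max_words consecutive words in a subject line.

    Lowercasing, whitespace splitting and the max_words limit are
    illustrative assumptions, not taken from the original posting.
    """
    words = subject.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_words + 1)
        for i in range(len(words) - n + 1)
    }


def exclusive_lists(ham_subjects, spam_subjects, max_words=3):
    """Build the two lists: phrases seen only in ham and only in spam."""
    ham = set().union(*(phrases(s, max_words) for s in ham_subjects))
    spam = set().union(*(phrases(s, max_words) for s in spam_subjects))
    # A phrase that occurs on both sides is dropped from both lists.
    return ham - spam, spam - ham


ham_only, spam_only = exclusive_lists(
    ["meeting notes for monday", "cheap flights to berlin"],
    ["cheap pills for you"],
)
```

In this toy input, "cheap" and "for" occur on both sides and therefore land in neither list, while "cheap flights" stays ham-only and "cheap pills" stays spam-only.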

don't get me wrong, but i don't take anybody seriously who needs "!!!!", and if you don't stop advertising that aggressively you will be classified as a spammer too

177 MB for subjects alone?

well, not really impressive given that i easily get the same results with an 81 MB bayes-db trained on the *complete* junk of 1.5 years but only selected ham (mail reported as wrongly classified, my personal mail and a few inboxes from nice users)

when i can get the same results with a 600 MB corpus containing around 81000 messages, the only thing i understand now is that your approach is not really efficient and needs access to all mail for training, which is a no-go

[harry@srv-rhsoft:~]$ curl --head http://www.junkemailfilter.com/data/subject-spam.txt
HTTP/1.1 200 OK
Date: Thu, 21 Jan 2016 08:12:15 GMT
Server: Apache/2.2.15 (CentOS)
Last-Modified: Thu, 21 Jan 2016 06:11:41 GMT
ETag: "340315d-446e47c-529d1f9f0676b"
Accept-Ranges: bytes
Content-Length: 71754876
Connection: close
Content-Type: text/plain

[harry@srv-rhsoft:~]$ curl --head http://www.junkemailfilter.com/data/subject-ham.txt
HTTP/1.1 200 OK
Date: Thu, 21 Jan 2016 08:12:25 GMT
Server: Apache/2.2.15 (CentOS)
Last-Modified: Thu, 21 Jan 2016 06:09:18 GMT
ETag: "340309c-645b7a1-529d1f16ad5db"
Accept-Ranges: bytes
Content-Length: 105232289
Connection: close
Content-Type: text/plain
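
The 177 MB figure follows from adding the two Content-Length values in the responses above. A small sketch of that arithmetic, assuming decimal megabytes (10^6 bytes); the variable names are only illustrative:

```python
# Content-Length values copied from the two curl --head responses above.
spam_bytes = 71754876    # subject-spam.txt
ham_bytes = 105232289    # subject-ham.txt

total_bytes = spam_bytes + ham_bytes

# Decimal megabytes (1 MB = 10**6 bytes) is an assumption about the unit.
total_mb = total_bytes / 10**6

print(f"{total_bytes} bytes, roughly {total_mb:.0f} MB")
```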
