Re[2]: Help with BayesIt tuning
Hello DZ-Jay,

Sunday, August 15, 2004, 1:47:45 PM, you wrote:

>> It might be coincidence, but Paul Graham has written much about
>> Bayesian filtering. I'd guess it has something to do with his
>> methodology. Even if I'm wrong, there's some interesting reading at:
>> http://www.paulgraham.com/antispam.html

DJ> Thanx for the info... that would make more sense, although
DJ> how come the spam-grade and graham values coincide in all messages
DJ> without exception? I guess I'll ask Alexey about it. In the
DJ> meantime, I'll check out the link you sent :)

Does Alexey not frequent this list? It would sure be helpful if he
could answer directly. Does anyone know how we can continue this
conversation directly with him?

--
Stuart mailto:[EMAIL PROTECTED]
Using The Bat! v2.13 "Lucky" Beta/5 on Windows 98 4.10 Build A

Current version is 2.12.00 | 'Using TBUDL' information:
http://www.silverstones.com/thebat/TBUDLInfo.html
Re[2]: Help with BayesIt tuning
Hello George,

Sunday, August 15, 2004, 11:35:29 AM, you wrote:

GM> DZ-Jay wrote:
DJ>> Some time around 08/15/2004 11:13:49, I think I heard Stuart Cuddy say:
>>>What is Graham?
>>>What is Spam-grade?

DJ>> AFAIK, spam-grade would be the probability of it being spam, and
DJ>> Graham, I suppose, means the probability of it being not-spam (I
DJ>> suppose, non-spam-grade > ham-grade > graham ?)

GM> It might be coincidence, but Paul Graham has written much about
GM> Bayesian filtering. I'd guess it has something to do with his
GM> methodology. Even if I'm wrong, there's some interesting reading at:
GM> http://www.paulgraham.com/antispam.html

Yes, Paul uses a slightly modified algorithm from the original Bayes.
So does that mean it is calculating using both algorithms to create
two values?

--
Best regards,
MikeD mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build 3000
Re[2]: Help with BayesIt tuning
Hello DZ-Jay,

Sunday, August 15, 2004, 10:20:52 AM, you wrote:

DJ> Some time around 08/15/2004 09:24:56, I think I heard MikeD (3) say:

DJ>>> I was too. I just upgraded yesterday to 0.5.9 and I haven't
DJ>>> noticed a difference. It does provide a white/black list, which I
DJ>>> don't care to use because it defeats the purpose of a Bayesian
DJ>>> filter (there's huge discussion -- more like religious wars --
DJ>>> about this on the POPFile list hehe). Also, the kludges.txt file
DJ>>> doesn't seem to be implemented either (ignore list for headers).

>> That's too bad

DJ> I just learned (by re-reading a babelfished translation of
DJ> the Russian BayesIt page) that the "kludges" file (whitelist of
DJ> kludges) does seem to work, except I misunderstood it. I thought
DJ> it worked like POPFile's "ignore" list, which ignores the
DJ> specified tokens when computing the probability of a message. But
DJ> it is not a list of just "tokens", it is a list of header names
DJ> that will be ignored, for example, if you put in the list:
DJ> message-id
DJ> x-mailer
DJ> subject
DJ> It will ignore the values of headers that start with those
DJ> strings. This is very useful, though.

DJ> I wonder, is the "ignore" list in the black/white list rules
DJ> window what I confused the kludges list for? i.e. is it akin to
DJ> the POPFile ignore list? Anybody know?

Hmmm ... does it just ignore those 'lines' in the header? If so, I
don't think that will be a problem for me. My Kludges contains:

x-spam-checker-version
x-spam-level
x-spam-report
x-spam-status
x-uidl

And I don't think any of those are causing a problem.

--
Best regards,
MikeD mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build 3000
Re[3]: Help with BayesIt tuning
Hello Pete,

Sunday, August 15, 2004, 9:52:14 AM, you wrote:

PH> Sunday, August 15, 2004, 7:44:17 AM, you wrote:

AW>> Hello MikeD,
AW>> On 14 Aug 2004 at 14:47:24 -0500 GMT [21:47 CEST] you wrote:

AW>> Have you deleted your spam and non-spam dictionary files when you
AW>> upgraded?

PH> What are their names and where are they?

Originally I had two sets of dictionaries. One (I assume for the old
version) was in c:\Program Files\TheBat\bayesit\base.

The current version is creating the following files here ...
c:\My Documents\BatMail\bayesit\base

transact
spamdict.idx
nspamdict.idx
spamdict.lst
spamdict.bye
nspamdict.lst
selective.txt
nspamdict.bye

--
Best regards,
MikeD mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build 3000
Re[2]: Help with BayesIt tuning
Hello DZ-Jay,

Sunday, August 15, 2004, 9:31:38 AM, you wrote:

DJ> Some time around 08/15/2004 09:23:46, I think I heard MikeD (3) say:
>> Hello Andre,
>> Sunday, August 15, 2004, 6:44:17 AM, you wrote:

AW>>> Have you deleted your spam and non-spam dictionary files when you
AW>>> upgraded?

>> Funny, that. When I first upgraded I did not and it seemed to work
>> fine ... until I rebooted.

DJ> Strange... rebooting shouldn't affect anything...

Well, I am guessing that because I had been running the old version of
BayesIt earlier in the day, it continued to use that until I
rebooted. It is the only thing that I can think of that makes sense.

>> After that, yes, I deleted all the dict files I could find.
>> Apparently there were two sets, one from the old version and one set
>> from the new.

DJ> I had to do the same thing when upgrading from v0.4gm to
DJ> v0.5.4 because I was having problems.

>> I then re-trained it on the accumulated spam and ham folders I have
>> with about 2,000 messages each. BTW, if I gave BayesIt all 2,000
>> messages at once to "chew on", it would hang. If I gave it in
>> "chunks" it seemed to work OK

DJ> Hum... after deleting the dict files, I trained normally with
DJ> lots of spam/non-spam messages (I'm pretty sure it was more than
DJ> 2,000) without a problem. So I don't know what could have
DJ> happened in your case (?)

DJ> I personally find BayesIt extremely powerful, accurate, and
DJ> fast (I come from POPFile, with an accuracy of 99.6 % which
DJ> required a LOT of manual tuning, had quite some false positives,
DJ> and was VERY slow...), but what it misses it *really* misses (0%,
DJ> as opposed to some mid-way value).

I have used several 0.4 versions and they worked great, so I am
guessing that I just need to 'fix' a setting somewhere ... or at least
I hope that is it

--
Best regards,
MikeD mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build 3000
Re: Help with BayesIt tuning
Some time around 08/15/2004 12:35:29, I think I heard George Mitchell say:

> It might be coincidence, but Paul Graham has written much about
> Bayesian filtering. I'd guess it has something to do with his
> methodology. Even if I'm wrong, there's some interesting reading at:
> http://www.paulgraham.com/antispam.html

Thanx for the info... that would make more sense, although how come
the spam-grade and graham values coincide in all messages without
exception? I guess I'll ask Alexey about it. In the meantime, I'll
check out the link you sent :)

dZ.

--
Powered by The Bat! v.2.12.00,
Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4
Re: Help with BayesIt tuning
DZ-Jay wrote:

DJ> Some time around 08/15/2004 11:13:49, I think I heard Stuart Cuddy say:
>>What is Graham?
>>What is Spam-grade?

DJ> AFAIK, spam-grade would be the probability of it being spam, and
DJ> Graham, I suppose, means the probability of it being not-spam (I
DJ> suppose, non-spam-grade > ham-grade > graham ?)

It might be coincidence, but Paul Graham has written much about
Bayesian filtering. I'd guess it has something to do with his
methodology. Even if I'm wrong, there's some interesting reading at:
http://www.paulgraham.com/antispam.html

--
George
Using The Bat! 2.12.00 on Windows XP Pro 5.1, Build 2600, Service Pack 1.
Re: Help with BayesIt tuning
Some time around 08/15/2004 11:57:15, I think I heard Pete Holsberg say:

> Sunday, August 15, 2004, 11:11:00 AM, you wrote:

DJ>> Some time around 08/15/2004 10:52:14, I think I heard Pete Holsberg say:

DJ>>\BayesIt\base
DJ>> or
DJ>>\MAIL\BayesIt\base

> ??? Mine are in C:\Documents and Settings\pjh\Application Data\BayesIt\base
> TB is in C:\Program Files\The Bat!\thebat.exe and BayesIt is in C:\Program
> Files\BayesIt
> under Windows 2000.

Well, I guess those are the default installation paths: the
application in the Program Files directory and the BayesIt files in
your profile directory. Since I have TB! installed in a non-standard
directory (i.e. outside the Program Files directory), BayesIt was
installed within that directory. I guess then I should have said:

\BayesIt\base
or
\BayesIt\base

Sorry about that. I guess that since I don't use the default
installation paths I don't know where things normally fall. In any
case, the dict files fall within the BayesIt working directory, which
is specified in the BayesIt options window.

> Is this significant?

Not at all.

dZ.

--
Powered by The Bat! v.2.12.00,
Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4
Re[2]: Help with BayesIt tuning
Sunday, August 15, 2004, 11:11:00 AM, you wrote:

DJ> Some time around 08/15/2004 10:52:14, I think I heard Pete Holsberg say:
>> Sunday, August 15, 2004, 7:44:17 AM, you wrote:

AW>>> Hello MikeD,
AW>>> On 14 Aug 2004 at 14:47:24 -0500 GMT [21:47 CEST] you wrote:

AW>>> Have you deleted your spam and non-spam dictionary files when you
AW>>> upgraded?

>> What are their names and where are they?

DJ> Their names are spamdict.* and nspamdict.* and they are located in a directory
DJ> called "base" within the BayesIt working directory, which is normally either:

DJ> \BayesIt\base
DJ> or
DJ> \MAIL\BayesIt\base

??? Mine are in C:\Documents and Settings\pjh\Application Data\BayesIt\base

TB is in C:\Program Files\The Bat!\thebat.exe and BayesIt is in
C:\Program Files\BayesIt

under Windows 2000.

Is this significant?
Re: Help with BayesIt tuning
Some time around 08/15/2004 11:13:49, I think I heard Stuart Cuddy say:

> I am not seeing the "empty" tokens, but the following message is being
> received without being caught. I sent it again to myself about 5 or
> 6 times and marked it as junk each time. The values do not seem to
> change at all.

Maybe this is because of your value in the "recalculating strategy"
parameter, which governs how often automatic "retraining" is done.
Try lowering this value and re-marking the message as spam and see if
the values change.

dZ.

--
Powered by The Bat! v.2.12.00,
Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4
Re: Help with BayesIt tuning
Some time around 08/15/2004 11:13:49, I think I heard Stuart Cuddy say:

>What is Graham?
>What is Spam-grade?

AFAIK, spam-grade would be the probability of it being spam, and
Graham, I suppose, means the probability of it being not-spam (I
suppose, non-spam-grade > ham-grade > graham ?)

But in my log I see exactly what you see in yours: that the graham and
spam-grade values are identical in every case. This keeps getting
fishier and fishier...

dZ.

--
Powered by The Bat! v.2.12.00,
Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4
Re[2]: Help with BayesIt tuning
Hello DZ-Jay,

Sunday, August 15, 2004, 9:25:14 AM, you wrote:

DJ> However, I recently noticed why some obviously spam messages
DJ> are given a probability of 0%: Apparently the analysis engine is
DJ> regarding a few "empty" tokens with a value of 0%, which
DJ> "unspamifies" the final value, for example, in my log file, I get
DJ> this in some messages:

I am not seeing the "empty" tokens, but the following message is being
received without being caught. I sent it again to myself about 5 or
6 times and marked it as junk each time. The values do not seem to
change at all.

What is Graham?
What is Spam-grade?

<[EMAIL PROTECTED]>
Graham: 7.59688e-029
Spam-grade: 7.59688e-029
Value for The Bat!: 0
: ---
biz: 0.01
--: 0.0212766
size: 0.01
Advance: 0.01
H this: 0.058463
partners: 0.01
Today: 0.01
H PLease: 0.01
H de: 0.0359281
Career: 0.01
text: 0.01
experience: 0.0133407
aol: 0.01
Verdana: 0.01
past: 0.01

--
Stuart mailto:[EMAIL PROTECTED]
Using The Bat! v2.13 "Lucky" Beta/5 on Windows 98 4.10 Build A
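[Editor's note] One way to probe what "Graham" means in this log is to feed the fifteen token values listed above into the combining formula from Paul Graham's "A Plan for Spam" essay. The sketch below is only an illustration: the function name `graham_combine` is made up here, and nothing in the thread confirms that BayesIt computes its value this way. The result comes out near 7.6e-29, which is at least consistent with the logged figure.

```python
import math

def graham_combine(probs):
    """Combine per-token spam probabilities the way Graham's essay does:
    prod(p) / (prod(p) + prod(1 - p)).  Assumed, not confirmed, for BayesIt."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)

# Token values copied from the log in the message above, in order.
logged_tokens = [
    0.01, 0.0212766, 0.01, 0.01, 0.058463,
    0.01, 0.01, 0.01, 0.0359281, 0.01,
    0.01, 0.0133407, 0.01, 0.01, 0.01,
]

combined = graham_combine(logged_tokens)
print(f"{combined:.3e}")
```

If that reading is right, it would also explain why the message is not caught: every token is rated as strongly non-spam, so re-marking it as junk has no visible effect until the token values themselves are recalculated.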
Re: Help with BayesIt tuning
Some time around 08/15/2004 09:24:56, I think I heard MikeD (3) say:

DJ>> I was too. I just upgraded yesterday to 0.5.9 and I haven't
DJ>> noticed a difference. It does provide a white/black list, which I
DJ>> don't care to use because it defeats the purpose of a Bayesian
DJ>> filter (there's huge discussion -- more like religious wars --
DJ>> about this on the POPFile list hehe). Also, the kludges.txt file
DJ>> doesn't seem to be implemented either (ignore list for headers).

> That's too bad

I just learned (by re-reading a babelfished translation of the
Russian BayesIt page) that the "kludges" file (whitelist of kludges)
does seem to work, except I misunderstood it. I thought it worked
like POPFile's "ignore" list, which ignores the specified tokens when
computing the probability of a message. But it is not a list of just
"tokens", it is a list of header names that will be ignored, for
example, if you put in the list:

message-id
x-mailer
subject

It will ignore the values of headers that start with those strings.
This is very useful, though.

I wonder, is the "ignore" list in the black/white list rules window
what I confused the kludges list for? i.e. is it akin to the POPFile
ignore list? Anybody know?

dZ.

--
Powered by The Bat! v.2.12.00,
Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4
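[Editor's note] The kludges behaviour described above (skip any header whose name starts with a listed string) can be sketched as follows. This is a guess at the semantics based only on the description in the message; `strip_kludges` is a hypothetical name, and the case-insensitive comparison is an assumption, not something BayesIt is known to do.

```python
def strip_kludges(headers, kludges):
    """Return the headers that would still be tokenized.

    headers: list of (name, value) pairs from a message.
    kludges: header-name prefixes to ignore (assumed case-insensitive).
    """
    prefixes = tuple(k.lower() for k in kludges)
    return [
        (name, value)
        for name, value in headers
        if not name.lower().startswith(prefixes)
    ]

headers = [
    ("Message-ID", "<abc@example.invalid>"),
    ("X-Mailer", "The Bat! 2.12.00"),
    ("Subject", "Help with BayesIt tuning"),
    ("From", "someone@example.invalid"),
]
kept = strip_kludges(headers, ["message-id", "x-mailer", "subject"])
print(kept)
```

Note the difference from a token ignore list: here whole header lines are dropped before tokenizing, rather than individual words being skipped wherever they occur.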
Re: Help with BayesIt tuning
Some time around 08/15/2004 10:52:14, I think I heard Pete Holsberg say:

> Sunday, August 15, 2004, 7:44:17 AM, you wrote:

AW>> Hello MikeD,
AW>> On 14 Aug 2004 at 14:47:24 -0500 GMT [21:47 CEST] you wrote:

AW>> Have you deleted your spam and non-spam dictionary files when you
AW>> upgraded?

> What are their names and where are they?

Their names are spamdict.* and nspamdict.* and they are located in a
directory called "base" within the BayesIt working directory, which is
normally either:

\BayesIt\base
or
\MAIL\BayesIt\base

dZ.

--
Powered by The Bat! v.2.12.00,
Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4
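[Editor's note] For anyone who wants to check which dictionary files exist before deleting anything, here is a small sketch that lists spamdict.* and nspamdict.* under a working directory's "base" folder. The helper name `find_dictionary_files` is hypothetical, and the demo uses a throwaway temporary directory rather than a real BayesIt path, since the actual location varies by installation.

```python
import tempfile
from pathlib import Path

def find_dictionary_files(working_dir):
    """List spamdict.* and nspamdict.* inside <working_dir>/base (read-only)."""
    base = Path(working_dir) / "base"
    found = []
    for pattern in ("spamdict.*", "nspamdict.*"):
        found.extend(base.glob(pattern))
    return sorted(found)

# Demo on a temporary directory populated with the file names seen in the thread.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "base"
    base.mkdir()
    for name in ("transact", "selective.txt",
                 "spamdict.idx", "spamdict.lst", "spamdict.bye",
                 "nspamdict.idx", "nspamdict.lst", "nspamdict.bye"):
        (base / name).touch()
    names = [p.name for p in find_dictionary_files(tmp)]

print(names)
```

The function only lists files; whether to delete them, and whether to close The Bat! first, is left to the reader.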
Re[2]: Help with BayesIt tuning
Sunday, August 15, 2004, 7:44:17 AM, you wrote:

AW> Hello MikeD,
AW> On 14 Aug 2004 at 14:47:24 -0500 GMT [21:47 CEST] you wrote:

AW> Have you deleted your spam and non-spam dictionary files when you
AW> upgraded?

What are their names and where are they?
Re: Help with BayesIt tuning
Some time around 08/15/2004 10:20:47, I think I heard Alexander S. Kunz say:

> I just checked my POPfile bucket pages and found it very interesting that,
> despite spam being only 5.8% of my messages (lucky me, huh?), the "distinct
> word count" for those spam messages is by far the highest (only messages
> marked as "genuine/english" come close). I'd interpret that as "spam is
> *very* recognizable" after a certain training period. That could explain
> your results with BayesIt - maybe.

Yes, I agree that that could be the reason. However, the messages
that are missed (roughly 4% of total spam traffic) are marked with a
0%, which would qualify them as "unambiguously genuine (non-spam)",
but they obviously are not, as a lot of spam tokens are found in them.
This is why I think there might be a problem with the filter itself,
or with my settings.

> In practice, I had similar (odd) results with BayesIt. :-) ...part of the
> reason that made me switch to POPfile...

Funny, I went the other way... POPfile was very reliable for me
(99.6%) but required constant manual hacking of the corpus to
maintain this accuracy. Plus, with a sufficiently large corpus, it was
really slow (it took almost a couple of seconds to download each
message, even very small ones), which with a dial-up connection and
hundreds of messages a day is almost unbearable. Plus, there was no
way to offer some extra weight to non-spam messages (like with
"regarding threshold" in BayesIt), which almost completely eradicates
false positives.

With POPfile I had to scan my spam box once in a while in order to
make sure. With BayesIt, after doing so for a few months without even
a single false positive, I concluded that it was not necessary anymore
to scan the spam folder often. I like that :)

dZ.

--
Powered by The Bat! v.2.12.00,
Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4
Re: Help with BayesIt tuning
Hello DZ-Jay,

15-Aug-2004 15:12, you wrote:

> I checked the BAYESIT.LOG file and realized that all messages are
> marked with either 100/99 % or 0% probability, which means that no matter
> how low I set the parameter, it will continue working the same. I don't
> understand how come there is no "gray area", with messages marked with a,
> say, 30% probability, etc. I do not get any false positives at all, but
> I do get about 4% of false negatives...

I just checked my POPfile bucket pages and found it very interesting
that, despite spam being only 5.8% of my messages (lucky me, huh?), the
"distinct word count" for those spam messages is by far the highest
(only messages marked as "genuine/english" come close). I'd interpret
that as "spam is *very* recognizable" after a certain training period.
That could explain your results with BayesIt - maybe.

In practice, I had similar (odd) results with BayesIt. :-) ...part of
the reason that made me switch to POPfile...

--
Best regards,
Alexander

Bradley's Bromide: If computers get too powerful, we can organize
them into a committee... that will do them in.
Re: Help with BayesIt tuning
Some time around 08/15/2004 09:23:46, I think I heard MikeD (3) say:

> Hello Andre,
> Sunday, August 15, 2004, 6:44:17 AM, you wrote:

AW>> Have you deleted your spam and non-spam dictionary files when you
AW>> upgraded?

> Funny, that. When I first upgraded I did not and it seemed to work
> fine ... until I rebooted.

Strange... rebooting shouldn't affect anything...

> After that, yes, I deleted all the dict files I could find.
> Apparently there were two sets, one from the old version and one set
> from the new.

I had to do the same thing when upgrading from v0.4gm to v0.5.4
because I was having problems.

> I then re-trained it on the accumulated spam and ham folders I have
> with about 2,000 messages each. BTW, if I gave BayesIt all 2,000
> messages at once to "chew on", it would hang. If I gave it in
> "chunks" it seemed to work OK

Hum... after deleting the dict files, I trained normally with lots of
spam/non-spam messages (I'm pretty sure it was more than 2,000)
without a problem. So I don't know what could have happened in your
case (?)

I personally find BayesIt extremely powerful, accurate, and fast (I
come from POPFile, with an accuracy of 99.6 % which required a LOT of
manual tuning, had quite some false positives, and was VERY slow...),
but what it misses it *really* misses (0%, as opposed to some mid-way
value).

dZ.

--
Powered by The Bat! v.2.12.00,
Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4
Re: Help with BayesIt tuning
Some time around 08/15/2004 09:24:56, I think I heard MikeD (3) say:

DJ>> I started with the "move message" setting at 40 and continued
DJ>> to lower it without noticing any effect. That's when I checked
DJ>> the BAYESIT.LOG file and realized that all messages are marked
DJ>> with either 100/99 % or 0% probability, which means that no matter
DJ>> how low I set the parameter, it will continue working the same. I
DJ>> don't understand how come there is no "gray area", with messages
DJ>> marked with a, say, 30% probability, etc. I do not get any false
DJ>> positives at all, but I do get about 4% of false negatives...

> At the moment, everything in the log is .99. Nothing has any other
> value. Does that sound right?

That's more or less what I get, and in my opinion, it doesn't seem to
be right.

However, I recently noticed why some obviously spam messages are
given a probability of 0%: apparently the analysis engine is
regarding a few "empty" tokens with a value of 0%, which "unspamifies"
the final value. For example, in my log file, I get this in some
messages:

: --- 15.08.2004 08:13:41
<[EMAIL PROTECTED]>
Graham: 0
Spam-grade: 0
Value for The Bat!: 0
: ---
<...>
: 0
: 0
: 0
: 0
: 0
: 0
: 0
: 0

As you can see, no matter how many spam tokens are found, all those
0's will end up clearing the final probability value. This seems to
me a bug in the tokenizer. I haven't been able to find a common
denominator for messages that cause this.

Does anybody else get "empty" tokens in their log files?

dZ.

--
Powered by The Bat! v.2.12.00,
Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4
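[Editor's note] The effect described above, where a single token rated 0 wipes out any amount of spam evidence, follows directly from a product-based combining rule. The sketch below uses the formula from Paul Graham's essay as a stand-in; whether BayesIt uses exactly this rule is an assumption. The `clamp` helper is likewise illustrative: Graham's essay keeps token probabilities within 0.01 to 0.99, and the same bounds are reused here.

```python
import math

def graham_combine(probs):
    """prod(p) / (prod(p) + prod(1 - p)), as in Graham's essay (assumed here)."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)

def clamp(p, low=0.01, high=0.99):
    """Keep a token probability away from the absorbing values 0 and 1."""
    return min(high, max(low, p))

spammy = [0.99] * 10            # ten strongly spammy tokens
with_empty = spammy + [0.0]     # plus one "empty" token rated 0

score_spammy = graham_combine(spammy)
score_with_empty = graham_combine(with_empty)
score_clamped = graham_combine([clamp(p) for p in with_empty])

print(score_spammy, score_with_empty, score_clamped)
```

With the unclamped zero, the numerator is exactly 0, so the combined value is 0 no matter how many spammy tokens are present. Clamping the stray token to 0.01 leaves the message rated as almost certainly spam.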
Re: Help with BayesIt tuning
Some time around 08/15/2004 09:28:42, I think I heard Thomas Fernandez say:

DJ>> What is not that simple? The bayesian algorithm or how the
DJ>> "regarding threshold" is used by the plugin?

> The Bayesian algorithms. Your question, to which I answered, could be
> understood this way, so I don't feel I have to apologise.

I guess some people in this list just have to offer an answer -- any
answer -- just because. Well then, thank you for your wonderfully
insightful answer of "Check out for a mathematician called Bayes.
19th century, IIRC." No need to apologize at all, I have such a better
grasp on the subject now, thanks!

dZ.

--
Powered by The Bat! v.2.12.00,
Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4
Re: Help with BayesIt tuning
Hello DZ-Jay,

On Sun, 15 Aug 2004 09:20:41 -0400 GMT (15/08/2004, 20:20 +0700 GMT),
DZ-Jay wrote:

DJ>>> That makes sense. But do you know how the weight is calculated?

>> Check out for a mathematician called Bayes. 19th century, IIRC.

DJ> Have you read the entire thread at all, or did you just
DJ> decide to come in and offer your insightful comments at just this
DJ> point?

I've read the thread, but nowhere was it mentioned how a Bayesian filter
works. I thought that was your question. Apparently it wasn't, so
sorry for having wasted bandwidth.

>> It's not that simple.

DJ> What is not that simple? The bayesian algorithm or how the
DJ> "regarding threshold" is used by the plugin?

The Bayesian algorithms. Your question, to which I answered, could be
understood this way, so I don't feel I have to apologise.

--
Regards,
Thomas.

"Sorry, Officer, I didn't realize my radar detector wasn't plugged in."

Message reply created with The Bat! 2.12.02
under Chinese Windows 98 4.10 Build A
Re[2]: Help with BayesIt tuning
Hello DZ-Jay,

Sunday, August 15, 2004, 8:12:23 AM, you wrote:

DJ> Some time around 08/14/2004 22:24:58, I think I heard MikeD (3) say:

>> What settings are you using? Under the old version (0.4gm) I had it
>> trained and was getting most spam caught, no false positives with a
>> "Move message" setting of 10. Now I have gone down as low as 1 and as
>> high as 99 without success.

DJ> I started with the "move message" setting at 40 and continued
DJ> to lower it without noticing any effect. That's when I checked
DJ> the BAYESIT.LOG file and realized that all messages are marked
DJ> with either 100/99 % or 0% probability, which means that no matter
DJ> how low I set the parameter, it will continue working the same. I
DJ> don't understand how come there is no "gray area", with messages
DJ> marked with a, say, 30% probability, etc. I do not get any false
DJ> positives at all, but I do get about 4% of false negatives...

At the moment, everything in the log is .99. Nothing has any other
value. Does that sound right?

>> BTW, I am using the 0.5.5 version that came with 2.12. Should I be
>> using the newer version that I saw mentioned?

DJ> I was too. I just upgraded yesterday to 0.5.9 and I haven't
DJ> noticed a difference. It does provide a white/black list, which I
DJ> don't care to use because it defeats the purpose of a Bayesian
DJ> filter (there's huge discussion -- more like religious wars --
DJ> about this on the POPFile list hehe). Also, the kludges.txt file
DJ> doesn't seem to be implemented either (ignore list for headers).

That's too bad

--
Best regards,
MikeD mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build 3000
Re[2]: Help with BayesIt tuning
Hello Andre,

Sunday, August 15, 2004, 6:44:17 AM, you wrote:

AW> Hello MikeD,
AW> On 14 Aug 2004 at 14:47:24 -0500 GMT [21:47 CEST] you wrote:

AW> Have you deleted your spam and non-spam dictionary files when you
AW> upgraded?

Funny, that. When I first upgraded I did not and it seemed to work
fine ... until I rebooted.

After that, yes, I deleted all the dict files I could find.
Apparently there were two sets, one from the old version and one set
from the new.

I then re-trained it on the accumulated spam and ham folders I have
with about 2,000 messages each. BTW, if I gave BayesIt all 2,000
messages at once to "chew on", it would hang. If I gave it in
"chunks" it seemed to work OK

--
Best regards,
MikeD mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build 3000
Re: Help with BayesIt tuning
Some time around 08/15/2004 07:43:05, I think I heard Andre Wichartz say:

> Hello DZ-Jay,
> On 14 Aug 2004 at 14:42:17 -0400 GMT [20:42 CEST] you wrote:

DJ>> That makes sense. But do you know how the weight is
DJ>> calculated? I can assume it is the product of its initial
DJ>> probability by the "regarding threshold" value, is that true?

> I don't program the thing. For specific questions you really should ask
> Alexey.

I thought that with so much traffic in this list there would be
someone who knew. Oh well...

DJ>> And is it only for tokens that have the same occurrence in spam and
DJ>> non-spam messages, or is the weight skewed by this threshold on all
DJ>> tokens to give them an extra "non-spamy" umph in order to avoid
DJ>> false positives?

> I just made an example. It would of course work regardless how often a
> word occurs.

So you don't know... Ok. I'll continue looking for info, probably
contacting Alexey.

Thanx

dZ.

--
Powered by The Bat! v.2.12.00,
Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4
Re: Help with BayesIt tuning
Some time around 08/14/2004 23:28:14, I think I heard Thomas Fernandez say:

DJ>> That makes sense. But do you know how the weight is calculated?

> Check out for a mathematician called Bayes. 19th century, IIRC.

Have you read the entire thread at all, or did you just decide to
come in and offer your insightful comments at just this point?

I'm talking about the "regarding threshold" value and how it is used,
i.e. given the bayesian probability of a message, what *ADDITIONAL*
computation occurs with that parameter. Do you know? Do you think
Mr. Bayes would have had enough visionary insight to see how this
BayesIt-specific parameter was used by Alexey in his plugin?

DJ>> I can assume it is the product of its initial probability by the
DJ>> "regarding threshold" value, is that true?

> It's not that simple.

What is not that simple? The bayesian algorithm or how the
"regarding threshold" is used by the plugin? Because, if you have
noticed from the context of the comment, I am talking about the
parameters in the ADVANCED.INI file and how they are implemented.

dZ.

--
Powered by The Bat! v.2.12.00,
Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4
Re: Help with BayesIt tuning
Some time around 08/14/2004 22:24:58, I think I heard MikeD (3) say:

> What settings are you using? Under the old version (0.4gm) I had it
> trained and was getting most spam caught, no false positives with a
> "Move message" setting of 10. Now I have gone down as low as 1 and as
> high as 99 without success.

I started with the "move message" setting at 40 and continued to
lower it without noticing any effect. That's when I checked the
BAYESIT.LOG file and realized that all messages are marked with either
100/99 % or 0% probability, which means that no matter how low I set
the parameter, it will continue working the same. I don't understand
how come there is no "gray area", with messages marked with a, say,
30% probability, etc. I do not get any false positives at all, but I
do get about 4% of false negatives...

> BTW, I am using the 0.5.5 version that came with 2.12. Should I be
> using the newer version that I saw mentioned?

I was too. I just upgraded yesterday to 0.5.9 and I haven't noticed
a difference. It does provide a white/black list, which I don't care
to use because it defeats the purpose of a Bayesian filter (there's
huge discussion -- more like religious wars -- about this on the
POPFile list hehe). Also, the kludges.txt file doesn't seem to be
implemented either (ignore list for headers).

dZ.

--
Powered by The Bat! v.2.12.00,
Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4
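[Editor's note] The observation above, that the "move message" setting stops mattering when every score is either 0 or 99/100, can be shown with a toy example. The rule "move a message when its score is at or above the threshold" is an assumption about how the setting works, and the scores are made-up percentages in the spirit of the log described in the message.

```python
def moved_to_spam(scores_percent, threshold_percent):
    """Messages that would be moved, assuming 'score >= threshold' is the rule."""
    return [s for s in scores_percent if s >= threshold_percent]

# Hypothetical all-or-nothing scores: three spam hits and three clean messages.
scores = [0, 0, 99, 100, 99, 0]

counts = {t: len(moved_to_spam(scores, t)) for t in (1, 10, 40, 99)}
print(counts)
```

Every threshold from 1 to 99 selects the same three messages, so tuning the setting cannot recover the spam that was scored 0.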
Re: Help with BayesIt tuning
Hello MikeD,

On 14 Aug 2004 at 14:47:24 -0500 GMT [21:47 CEST] you wrote:

M> I have been following this thread since I have been having some
M> problems too. I was using the old version (0.4gm) until I upgraded to
M> the current version of TB.

M> The settings I used to use don't seem to work any more and I either
M> get everything filtered as junk or nothing is filtered as junk. I
M> trained it with about 2000 spam and 2000 ham messages and still no
M> joy. I have tried low "threshold" numbers and high without much
M> difference.

Have you deleted your spam and non-spam dictionary files when you
upgraded?

--
Cheers,
Andre

"I don't suffer from insanity. I enjoy every minute of it."
Re: Help with BayesIt tuning
Hello DZ-Jay,

On 14 Aug 2004 at 14:42:17 -0400 GMT [20:42 CEST] you wrote:

DJ> That makes sense. But do you know how the weight is
DJ> calculated? I can assume it is the product of its initial
DJ> probability by the "regarding threshold" value, is that true?

I don't program the thing. For specific questions you really should
ask Alexey.

DJ> And is it only for tokens that have the same occurrence in spam and
DJ> non-spam messages, or is the weight skewed by this threshold on all
DJ> tokens to give them an extra "non-spamy" umph in order to avoid
DJ> false positives?

I just made an example. It would of course work regardless of how
often a word occurs.

--
Cheers,
Andre

"Don't just walk the smooth roads: walk paths that no one has walked
before you, so that you leave tracks and not just dust."
Re: Help with BayesIt tuning
Hello DZ-Jay,

On Sat, 14 Aug 2004 14:42:17 -0400 GMT (15/08/2004, 01:42 +0700 GMT), DZ-Jay wrote:

>> Assume a word occurs equally often in spam and non-spam mails. If you
>> set the value to 1 the word will get a spam probability of 0.5. If you
>> set it to a higher value the word will get something lower than 0.5.
>> Words in non-spam mails just count more and you can set just how much
>> more.
>> At least that's my take on it.

DJ> That makes sense. But do you know how the weight is calculated?

Look up a mathematician called Bayes. 18th century, IIRC.

DJ> I can assume it is the product of its initial probability by the
DJ> "regarding threshold" value, is that true?

It's not that simple.

--
Cheers, Thomas.
24 things you shouldn't say during sex: 8. You're almost as good as my ex!
Message reply created with The Bat! 2.12.02 under Chinese Windows 98 4.10 Build A
Re[2]: Help with BayesIt tuning
Hello DZ-Jay,

Saturday, August 14, 2004, 5:31:59 PM, you wrote:

DJ> Some time around 08/14/2004 15:47:24, I think I heard MikeD say:
>> The settings I used to use don't seem to work any more and I either
>> get everything filtered as junk or nothing is filtered as junk. I
>> trained it with about 2000 spam and 2000 ham messages and still no
>> joy. I have tried low "threshold" numbers and high without much
>> difference.
DJ> That's pretty much what I get: messages are either
DJ> COMPLETELY spam (99 or 100% probability) or COMPLETELY not-spam
DJ> (0% probability). Although mine seems to catch most (~97%) of
DJ> spam, out of a few hundred emails daily, so it's not that bad. And
DJ> that's with the default settings. I'm trying to tune it to get it
DJ> a bit higher in accuracy, if possible, but can't seem to get much
DJ> help on this subject :(

What settings are you using? Under the old version (0.4gm) I had it trained and was getting most spam caught, no false positives with a "Move message" setting of 10. Now I have gone down as low as 1 and as high as 99 without success.

BTW, I am using the 0.5.5 version that came with 2.12. Should I be using the newer version that I saw mentioned?

--
Best regards, MikeD mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build 3000
Re: Help with BayesIt tuning
Some time around 08/14/2004 14:54:38, I think I heard Pete Holsberg say:

> Saturday, August 14, 2004, 2:37:03 PM, you wrote:
DJ>> Some time around 08/14/2004 12:34:07, I think I heard Pete Holsberg say:
>>> Where do you do the setting???
DJ>> In a file called ADVANCED.INI in the BayesIt working directory, or
DJ>> in the TB! installation directory.
> Not found anywhere on either HD!
> Can it be created manually??

Yes, it can... but which version of BayesIt are you using? Maybe you are using an older version... Here's the default ADVANCED.INI file that came with BayesIt 0.5.9:

    working thread priority = 2;
    onexit thread priority = 3;
    export selective download = 1;
    selective download spam threshold = 10;
    simple digits spam marks = 1;
    no spaces spam marks = 1;
    limit size to hash = 19;
    limit size to hash header = 96;
    temporary dictionary = "c:\\temp";
    use expiration = 0;
    age to expirate = 100;
    learn from zero = 1;
    max size of log file = 131072;
    recalculating strategy = 3;
    regarding threshold = 1.5;
    use autotrain = 1;
    use degeneration = 1;
    number of exclamations = 5;

dZ.

--
Powered by The Bat! v.2.12.00, Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4
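[Editor's sketch] For anyone who wants to compare two ADVANCED.INI files (say, the 0.5.5 and 0.5.9 defaults), a small reader for the `key = value;` layout shown above might look like this. The function name is hypothetical, and the parsing rules are inferred only from the settings quoted in this thread, not from any BayesIt documentation.

```python
def parse_advanced_ini(text):
    """Parse 'key = value;' lines into a dict of strings.

    Assumptions (inferred from the thread, not from BayesIt docs):
    - a ';' ends the setting; anything after it is a comment
    - lines that start with ';' are comments
    - values may be wrapped in double quotes
    Values stay as strings; backslash escapes are left untouched.
    """
    settings = {}
    for raw_line in text.splitlines():
        # Drop the terminating ';' and any trailing comment.
        line = raw_line.split(";", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


sample = """\
; this line is a comment
regarding threshold = 1.5;
use autotrain = 1;
selective download spam threshold = 10;
"""
config = parse_advanced_ini(sample)
print(config["regarding threshold"])
```

The same function also copes with the quoted `key="value" ; comment` style that appears elsewhere in this thread. A value that itself contains a semicolon would be cut short, so treat this as a quick comparison aid rather than a faithful parser.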
Re: Help with BayesIt tuning
Some time around 08/14/2004 15:47:24, I think I heard MikeD say:

> The settings I used to use don't seem to work any more and I either
> get everything filtered as junk or nothing is filtered as junk. I
> trained it with about 2000 spam and 2000 ham messages and still no
> joy. I have tried low "threshold" numbers and high without much
> difference.

That's pretty much what I get: messages are either COMPLETELY spam (99 or 100% probability) or COMPLETELY not-spam (0% probability). Although mine seems to catch most (~97%) of spam, out of a few hundred emails daily, so it's not that bad. And that's with the default settings. I'm trying to tune it to get it a bit higher in accuracy, if possible, but can't seem to get much help on this subject :(

dZ.

> Is there a good "getting started" file somewhere that I have
> just missed?

--
Powered by The Bat! v.2.12.00, Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4
Re[2]: Help with BayesIt tuning
Hello All,

I have been following this thread since I have been having some problems too. I was using the old version (0.4gm) until I upgraded to the current version of TB.

The settings I used to use don't seem to work any more and I either get everything filtered as junk or nothing is filtered as junk. I trained it with about 2000 spam and 2000 ham messages and still no joy. I have tried low "threshold" numbers and high without much difference.

Is there a good "getting started" file somewhere that I have just missed?

--
Best regards, MikeD mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build 3000
Re[2]: Help with BayesIt tuning
Saturday, August 14, 2004, 2:37:03 PM, you wrote:

DJ> Some time around 08/14/2004 12:34:07, I think I heard Pete Holsberg say:
>> Where do you do the setting???
DJ> In a file called ADVANCED.INI in the BayesIt working directory, or
DJ> in the TB! installation directory.

Not found anywhere on either HD! Can it be created manually??
Re: Help with BayesIt tuning
Some time around 08/14/2004 12:27:41, I think I heard Andre Wichartz say:

> Assume a word occurs equally often in spam and non-spam mails. If you
> set the value to 1 the word will get a spam probability of 0.5. If you
> set it to a higher value the word will get something lower than 0.5.
> Words in non-spam mails just count more and you can set just how much
> more.
> At least that's my take on it.

That makes sense. But do you know how the weight is calculated? I can assume it is the product of its initial probability by the "regarding threshold" value, is that true? And is it only for tokens that have the same occurrence in spam and non-spam messages, or is the weight skewed by this threshold on all tokens to give them an extra "non-spammy" oomph in order to avoid false positives?

Thanx
dZ.

--
Powered by The Bat! v.2.12.00, Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4
Re: Help with BayesIt tuning
Some time around 08/14/2004 12:34:07, I think I heard Pete Holsberg say:

> Where do you do the setting???

In a file called ADVANCED.INI in the BayesIt working directory, or in the TB! installation directory.

--
Powered by The Bat! v.2.12.00, Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4
Re[2]: Help with BayesIt tuning
Saturday, August 14, 2004, 12:27:41 PM, you wrote:

AW> Hello DZ-Jay,
AW> On 14 Aug 2004 at 11:30:32 -0400 GMT [17:30 CEST] you wrote:
DJ>> Yes, I am aware of its definition, but what I don't understand
DJ>> is what would be the effect of changing it to, say, 1.2 from 1.5
DJ>> (apart from the academic answer of making non-spam tokens a bit less
DJ>> heavy). How does the plugin use this value?
AW> Assume a word occurs equally often in spam and non-spam mails. If you
AW> set the value to 1 the word will get a spam probability of 0.5. If you
AW> set it to a higher value the word will get something lower than 0.5.
AW> Words in non-spam mails just count more and you can set just how much
AW> more.

Where do you do the setting???
Re: Help with BayesIt tuning
Hello DZ-Jay,

On 14 Aug 2004 at 11:30:32 -0400 GMT [17:30 CEST] you wrote:

DJ> Yes, I am aware of its definition, but what I don't understand
DJ> is what would be the effect of changing it to, say, 1.2 from 1.5
DJ> (apart from the academic answer of making non-spam tokens a bit less
DJ> heavy). How does the plugin use this value?

Assume a word occurs equally often in spam and non-spam mails. If you set the value to 1, the word will get a spam probability of 0.5. If you set it to a higher value, the word will get something lower than 0.5. Words in non-spam mails just count more, and you can set just how much more. At least that's my take on it.

--
Cheers, Andre
"1. If it's green or it wiggles, it's biology. 2. If it stinks, it's chemistry. 3. If it doesn't work, it's physics."
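[Editor's sketch] Andre's description above lines up with the per-token formula in Paul Graham's "A Plan for Spam", where counts from good mail are multiplied by a fixed factor (2 in the essay) before being compared with counts from spam. The sketch below assumes that "regarding threshold" plays the role of that factor. This is a guess: BayesIt's real calculation is not confirmed anywhere in this thread, and the function and parameter names are hypothetical.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham, ham_weight=1.5):
    """Per-token spam probability in the style of Graham's essay.

    `ham_weight` stands in for BayesIt's "regarding threshold"
    (an assumption, not taken from BayesIt's source).
    """
    spam_freq = min(1.0, spam_hits / n_spam)
    ham_freq = min(1.0, ham_weight * ham_hits / n_ham)
    if spam_freq + ham_freq == 0:
        return 0.4  # Graham's value for tokens never seen before
    p = spam_freq / (spam_freq + ham_freq)
    # Clamp so that no single token is ever treated as absolute proof.
    return max(0.01, min(0.99, p))


# A word seen 100 times in each of 2000 spam and 2000 non-spam mails:
for weight in (1.0, 1.2, 1.5, 2.0):
    print(weight, token_spam_probability(100, 100, 2000, 2000, ham_weight=weight))
```

With a weight of 1 the evenly split word scores 0.5; any larger weight pulls it below 0.5, which is the "guard" against false positives that the advanced.ini comment describes. Under this reading, going from 1.5 to 1.2 would nudge every token that appears in good mail slightly back toward spam, not only the evenly split ones.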
Re: Help with BayesIt tuning
Some time around 08/14/2004 10:34:25, I think I heard Andre Wichartz say:

> Hello DZ-Jay,
> On 14 Aug 2004 at 09:28:34 -0400 GMT [15:28 CEST] you wrote:
DJ>> BTW, I do not understand very well the "regarding threshold"
DJ>> parameter, can someone explain it please?
> From advanced.ini:
> ; this number shows, how much "heavier" non-spam tokens than spam. It
> makes some kind of "guard" and keeps from false positives. Usual value
> is 2, but you can also try others...

Yes, I am aware of its definition, but what I don't understand is what the effect would be of changing it to, say, 1.2 from 1.5 (apart from the academic answer of making non-spam tokens a bit less heavy). How does the plugin use this value?

Thanx
dZ.

--
Powered by The Bat! v.2.12.00, Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4
Re: Help with BayesIt tuning
Hello DZ-Jay,

On 14 Aug 2004 at 09:28:34 -0400 GMT [15:28 CEST] you wrote:

DJ> BTW, I do not understand very well the "regarding threshold"
DJ> parameter, can someone explain it please?

From advanced.ini:

; this number shows, how much "heavier" non-spam tokens than spam. It makes some kind of "guard" and keeps from false positives. Usual value is 2, but you can also try others...

--
Cheers, Andre
"Charlie was a Chemist, but Charlie is no more. What Charlie thought was H2O was H2SO4."
Help with BayesIt tuning
Hello:

I've been running BayesIt for a while and it works beautifully. My accuracy right now is at 96.75%, so I guess I shouldn't complain. But out of the few hundred messages I get a day, it misses about 10 that look like obvious spam but were marked as not-spam.

I checked the BAYESIT.LOG file and found that almost all messages are valued by BayesIt at either 99, 100 or 0. It's as if BayesIt thinks all messages are absolutely spam or absolutely not-spam. In a sense I think this is good, and it's because I started training it with a large collection of spam/not-spam messages. But I cannot help but think that there should be more of a gray area for some messages... For example, the 10 messages that it misses daily are valued at 0. I think there should be a way for me to tune the configuration in order to make it more accurate. On the other hand, I do not get ANY false positives, so that is a very good thing.

This is what I have in my ADVANCED.INI:

    working thread priority="2"
    onexit thread priority="3"
    selective download spam threshold="50"
    export selective download="1"
    simple digits spam marks="1"
    no spaces spam marks="1"
    limit size to hash="19"
    limit size to hash header="96"
    temporary dictionary="C:\DOCUME~1\dz\LOCALS~1\Temp"
    use expiration="0"
    age to expirate="90"
    learn from zero="0" ; I changed this one today, was "1"
    max size of log file="5242880"
    recalculating strategy="0.0002" ; I changed this one today, was "5"
    regarding threshold="1.5" ; I changed this today, was "1.8"
    use autotrain="1"
    use degeneration="1"
    number of exclamations="5"

Any recommendations?

BTW, I do not understand very well the "regarding threshold" parameter, can someone explain it please?

I use BayesIt 0.5.5

Thanx! :)
-dZ.

--
Powered by The Bat! v.2.12.00 times BayesIt v.0.5.5
Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4