Re[2]: Help with BayesIt tuning

2004-08-16 Thread Stuart Cuddy
Hello DZ-Jay,
Sunday, August 15, 2004, 1:47:45 PM, you wrote:

>> It might be coincidence, but Paul Graham has written much about
>> Bayesian filtering.  I'd guess it has something to do with his
>> methodology.  Even if I'm wrong, there's some interesting reading at:

>> http://www.paulgraham.com/antispam.html

DJ> Thanx for the info... that would make more sense, although
DJ> how come the spam-grade and graham values coincide in all messages
DJ> without exception?  I guess I'll ask Alexey about it.  In the
DJ> meantime, I'll check out the link you sent :)

Does Alexey not frequent this list?  It would sure be helpful if he
could answer directly.

Does anyone know how we can continue this conversation directly with
him?


-- 
 Stuart  mailto:[EMAIL PROTECTED]
Using The Bat! v2.13 "Lucky" Beta/5 on Windows 98 4.10 Build   A 



Current version is 2.12.00 | 'Using TBUDL' information:
http://www.silverstones.com/thebat/TBUDLInfo.html


Re[2]: Help with BayesIt tuning

2004-08-16 Thread MikeD (3)
Hello George,

Sunday, August 15, 2004, 11:35:29 AM, you wrote:

GM> DZ-Jay wrote:

DJ>> Some time around 08/15/2004 11:13:49, I think I heard Stuart Cuddy say:

>>>What is Graham?
>>>What is Spam-grade?

DJ>> AFAIK, spam-grade would be the probability of it being spam, and
DJ>> Graham, I suppose, means the probability of it being not-spam (I
DJ>> suppose, non-spam-grade > ham-grade > graham ?)

GM> It might be coincidence, but Paul Graham has written much about
GM> Bayesian filtering.  I'd guess it has something to do with his
GM> methodology.  Even if I'm wrong, there's some interesting reading at:

GM> http://www.paulgraham.com/antispam.html


Yes, Paul uses a slightly modified algorithm from the original Bayes.
So does that mean it is calculating using both algorithms to create
two values?

-- 
Best regards,
 MikeD  mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build  3000
 





Re[2]: Help with BayesIt tuning

2004-08-16 Thread MikeD (3)
Hello DZ-Jay,

Sunday, August 15, 2004, 10:20:52 AM, you wrote:

DJ> Some time around 08/15/2004 09:24:56, I think I heard MikeD (3) say:
DJ>>> I was too.  I just upgraded yesterday to 0.5.9 and I haven't
DJ>>> noticed a difference.  It does provide a white/black list, which I
DJ>>> don't care to use because it defeats the purpose of a Bayesian
DJ>>> filter (there's huge discussion -- more like religious wars --
DJ>>> about this on the POPFile list hehe).  Also, the kludges.txt file
DJ>>> doesn't seem to be implemented either (ignore list for headers).

>> That's too bad 

DJ> I just learned (by re-reading a babelfished translation of
DJ> the russian BayesIt page) that the "kludges" file (whitelist of
DJ> kludges) does seem to work, except I misunderstood it.  I thought
DJ> it worked like POPFile's "ignore" list, which ignores the
DJ> specified tokens when computing the probability of a message.  But
DJ> it is not a list of just "tokens", it is a list of header names
DJ> that will be ignored, for example, if you put in the list:

DJ> message-id
DJ> x-mailer
DJ> subject

DJ> It will ignore the values of headers that start with those
DJ> strings.  This is very useful, though.

DJ> I wonder, is the "ignore" list in the black/white list rules
DJ> window what I confused the kludges list for? i.e. is it akin to
DJ> the POPFile ignore list?  Anybody know?

Hmmm ... does it just ignore those 'lines' in the header?  If so, I
don't think that will be a problem for me.  My Kludges contains:

x-spam-checker-version
x-spam-level
x-spam-report
x-spam-status
x-uidl

And I don't think any of those are causing a problem.

-- 
Best regards,
 MikeD  mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build  3000
 





Re[3]: Help with BayesIt tuning

2004-08-16 Thread MikeD (3)
Hello Pete,

Sunday, August 15, 2004, 9:52:14 AM, you wrote:

PH> Sunday, August 15, 2004, 7:44:17 AM, you wrote:

AW>> Hello MikeD,

AW>> On 14 Aug 2004 at 14:47:24 -0500 GMT [21:47 CEST] you wrote:

AW>> Have you deleted your spam and non-spam dictionary files when you
AW>> upgraded?

PH> What are their names and where are they?

Originally I had two sets of dictionaries.  One (I assume for the old
version) was in c:\Program Files\TheBat\bayesit\base.  The current
version is creating the following files here ...

c:\My Documents\BatMail\bayesit\base
transact
spamdict.idx
nspamdict.idx
spamdict.lst
spamdict.bye
nspamdict.lst
selective.txt
nspamdict.bye

-- 
Best regards,
 MikeD  mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build  3000
 





Re[2]: Help with BayesIt tuning

2004-08-16 Thread MikeD (3)
Hello DZ-Jay,

Sunday, August 15, 2004, 9:31:38 AM, you wrote:

DJ> Some time around 08/15/2004 09:23:46, I think I heard MikeD (3) say:
>> Hello Andre,

>> Sunday, August 15, 2004, 6:44:17 AM, you wrote:

AW>>> Have you deleted your spam and non-spam dictionary files when you
AW>>> upgraded?

>> Funny, that.  When I first upgraded I did not and it seemed to work
>> fine ... until I rebooted.

DJ> Strange... rebooting shouldn't affect anything...

Well, I am guessing that, because I had been running the old version of
BayesIt earlier in the day, it continued to use that until I
rebooted.  It is the only thing that I can think of that makes sense.

>> After that, yes, I deleted all the dict files I could find.
>> Apparently there were two sets, one from the old version and one set
>> from the new.

DJ> I had to do the same thing when upgrading from v0.4gm to
DJ> v0.5.4 because I was having problems.

>> I then re-trained it on the accumulated spam and ham folders I have
>> with about 2,000 messages each.  BTW, If I give Bayesit all 2,000
>> messages at once to "chew on", it would hang.  If I gave it in
>> "chunks" it seemed to work OK 

DJ> Hum... after deleting the dict files, I trained normally with
DJ> lots of spam/non-spam messages (I'm pretty sure it was more than
DJ> 2,000) without a problem.  So I don't know what could have
DJ> happened in your case (?)

DJ> I personally find BayesIt extremely powerful, accurate, and
DJ> fast (I come from POPFile, with an accuracy of 99.6 % which
DJ> required a LOT of manual tuning, had quite some false positives,
DJ> and was VERY slow...), but what it misses it *really* misses (0%,
DJ> as opposed to some mid-way value).

I have used several 0.4 versions and they worked great, so I am
guessing that I just need to 'fix' a setting somewhere ... or at least
I hope that is it 

-- 
Best regards,
 MikeD  mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build  3000
 





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/15/2004 12:35:29, I think I heard George Mitchell say:
> It might be coincidence, but Paul Graham has written much about
> Bayesian filtering.  I'd guess it has something to do with his
> methodology.  Even if I'm wrong, there's some interesting reading at:

> http://www.paulgraham.com/antispam.html

Thanx for the info... that would make more sense, although how come the spam-grade and 
graham values coincide in all messages without exception?  I guess I'll ask Alexey 
about it.  In the meantime, I'll check out the link you sent :)

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-15 Thread George Mitchell
DZ-Jay wrote:

DJ> Some time around 08/15/2004 11:13:49, I think I heard Stuart Cuddy say:

>>What is Graham?
>>What is Spam-grade?

DJ> AFAIK, spam-grade would be the probability of it being spam, and
DJ> Graham, I suppose, means the probability of it being not-spam (I
DJ> suppose, non-spam-grade > ham-grade > graham ?)

It might be coincidence, but Paul Graham has written much about
Bayesian filtering.  I'd guess it has something to do with his
methodology.  Even if I'm wrong, there's some interesting reading at:

http://www.paulgraham.com/antispam.html

-- 
George

Using The Bat! 2.12.00 on Windows XP Pro 5.1, Build 2600, Service Pack 1.





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/15/2004 11:57:15, I think I heard Pete Holsberg say:
> Sunday, August 15, 2004, 11:11:00 AM, you wrote:

DJ>> Some time around 08/15/2004 10:52:14, I think I heard Pete Holsberg say:

DJ>>\BayesIt\base
DJ>> or
DJ>>\MAIL\BayesIt\base


> ??? Mine are in C:\Documents and Settings\pjh\Application Data\BayesIt\base

> TB is in C:\Program Files\The Bat!\thebat.exe and BayesIt is in C:\Program 
> Files\BayesIt
> under Windows 2000.

Well, I guess those are the default installation paths:  The application in the 
Program Files directory and the BayesIt files in your profile directory.  Since I have 
TB! installed in a non-standard directory (i.e. outside the Program Files directory), 
BayesIt was installed within that directory.  I guess then I should have said:

\BayesIt\base
or
\BayesIt\base

Sorry about that.  I guess that since I don't use the default installation paths I 
don't know where things normally fall.

In any case, the dict files fall within the BayesIt working directory, which is 
specified in BayesIt options window.

> Is this significant?

Not at all.

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re[2]: Help with BayesIt tuning

2004-08-15 Thread Pete Holsberg
Sunday, August 15, 2004, 11:11:00 AM, you wrote:

DJ> Some time around 08/15/2004 10:52:14, I think I heard Pete Holsberg say:
>> Sunday, August 15, 2004, 7:44:17 AM, you wrote:

AW>>> Hello MikeD,

AW>>> On 14 Aug 2004 at 14:47:24 -0500 GMT [21:47 CEST] you wrote:

AW>>> Have you deleted your spam and non-spam dictionary files when you
AW>>> upgraded?

>> What are their names and where are they?

DJ> Their names are spamdict.* and nspamdict.* and they are located in a directory
DJ> called "base" within the BayesIt working directory, which is normally either:

DJ> \BayesIt\base
DJ> or
DJ> \MAIL\BayesIt\base


??? Mine are in C:\Documents and Settings\pjh\Application Data\BayesIt\base

TB is in C:\Program Files\The Bat!\thebat.exe and BayesIt is in C:\Program 
Files\BayesIt
under Windows 2000.

Is this significant?

-- 





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/15/2004 11:13:49, I think I heard Stuart Cuddy say:
> I am not seeing the "empty" tokens, but the following message is being
>received without being caught. I sent it again to myself about 5 or
>6 times and marked it as junk each time. The values do not seem to
>change at all.

Maybe this is because of your value in the "recalculating strategy" parameter, which 
governs how often automatic "retraining" is done.  Try lowering this value and 
re-marking the message as spam and see if the values change.

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/15/2004 11:13:49, I think I heard Stuart Cuddy say:

>What is Graham?
>What is Spam-grade?

AFAIK, spam-grade would be the probability of it being spam, and Graham, I suppose, 
means the probability of it being not-spam (I suppose, non-spam-grade > ham-grade > 
graham ?)

But in my log I see exactly what you see in yours: that the graham and spam-grade 
values are identical in every case.  This keeps getting fishier and fishier...

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re[2]: Help with BayesIt tuning

2004-08-15 Thread Stuart Cuddy
Hello DZ-Jay,
Sunday, August 15, 2004, 9:25:14 AM, you wrote:

DJ> However, I recently noticed why some obviously spam messages
DJ> are given a probability of 0%:  Apparently the analysis engine is
DJ> regarding a few "empty" tokens with a value of 0%, which
DJ> "unspamifies" the final value, for example, in my log file, I get
DJ> this in some messages:

I am not seeing the "empty" tokens, but the following message is being
   received without being caught. I sent it again to myself about 5 or
   6 times and marked it as junk each time. The values do not seem to
   change at all.

   What is Graham?
   What is Spam-grade?
   

<[EMAIL PROTECTED]>
Graham:  7.59688e-029
Spam-grade:  7.59688e-029
Value for The Bat!: 0
: ---
biz:  0.01
--:  0.0212766
size:  0.01
Advance:  0.01
H this:  0.058463
partners:  0.01
Today:  0.01
H PLease:  0.01
H de:  0.0359281
Career:  0.01
text:  0.01
experience:  0.0133407
aol:  0.01
Verdana:  0.01
past:  0.01

-- 
 Stuart  mailto:[EMAIL PROTECTED]
Using The Bat! v2.13 "Lucky" Beta/5 on Windows 98 4.10 Build   A 





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/15/2004 09:24:56, I think I heard MikeD (3) say:
DJ>> I was too.  I just upgraded yesterday to 0.5.9 and I haven't
DJ>> noticed a difference.  It does provide a white/black list, which I
DJ>> don't care to use because it defeats the purpose of a Bayesian
DJ>> filter (there's huge discussion -- more like religious wars --
DJ>> about this on the POPFile list hehe).  Also, the kludges.txt file
DJ>> doesn't seem to be implemented either (ignore list for headers).

> That's too bad 

I just learned (by re-reading a babelfished translation of the russian BayesIt page) 
that the "kludges" file (whitelist of kludges) does seem to work, except I 
misunderstood it.  I thought it worked like POPFile's "ignore" list, which ignores the 
specified tokens when computing the probability of a message.  But it is not a list of 
just "tokens", it is a list of header names that will be ignored, for example, if you 
put in the list:

message-id
x-mailer
subject

It will ignore the values of headers that start with those strings.  This is very 
useful, though.
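
As an illustration of the behaviour described above, here is a small Python sketch of an ignore list that matches on header-name prefixes. It only illustrates the idea; the function name and the sample headers are made up, and it is not BayesIt's actual code.

```python
def strip_kludges(headers, kludges):
    """Return the headers whose names do not start with any listed prefix."""
    prefixes = tuple(k.strip().lower() for k in kludges if k.strip())
    return [(name, value) for name, value in headers
            if not name.lower().startswith(prefixes)]

headers = [
    ("Message-ID", "<abc@example.com>"),
    ("X-Mailer", "The Bat! 2.12"),
    ("Subject", "Advance your career today"),
    ("From", "someone@example.com"),
]
kludges = ["message-id", "x-mailer", "subject"]

kept = strip_kludges(headers, kludges)
print(kept)  # only the From header is left for the tokenizer
```

Matching is done case-insensitively here because header names are not case-sensitive; whether the plugin does the same is an assumption.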

I wonder, is the "ignore" list in the black/white list rules window what I confused 
the kludges list for? i.e. is it akin to the POPFile ignore list?  Anybody know?

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/15/2004 10:52:14, I think I heard Pete Holsberg say:
> Sunday, August 15, 2004, 7:44:17 AM, you wrote:

AW>> Hello MikeD,

AW>> On 14 Aug 2004 at 14:47:24 -0500 GMT [21:47 CEST] you wrote:

AW>> Have you deleted your spam and non-spam dictionary files when you
AW>> upgraded?

> What are their names and where are they?

Their names are spamdict.* and nspamdict.* and they are located in a directory called 
"base" within the BayesIt working directory, which is normally either:

\BayesIt\base
or
\MAIL\BayesIt\base
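
One way to see which dictionary files are present before deleting anything is to list everything that matches those two patterns. The Python sketch below assumes you pass in your own BayesIt working directory; the path in the usage comment is a placeholder, not a known default.

```python
from pathlib import Path

def find_dict_files(working_dir):
    """List spamdict.* and nspamdict.* files in <working_dir>/base."""
    base = Path(working_dir) / "base"
    found = []
    for pattern in ("spamdict.*", "nspamdict.*"):
        found.extend(base.glob(pattern))
    return sorted(found)

# Example (placeholder path; substitute your own working directory):
# for f in find_dict_files(r"C:\path\to\BayesIt"):
#     print(f)
```

Other files in the same directory are left out of the listing because they match neither pattern.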

dZ.



-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re[2]: Help with BayesIt tuning

2004-08-15 Thread Pete Holsberg
Sunday, August 15, 2004, 7:44:17 AM, you wrote:

AW> Hello MikeD,

AW> On 14 Aug 2004 at 14:47:24 -0500 GMT [21:47 CEST] you wrote:

AW> Have you deleted your spam and non-spam dictionary files when you
AW> upgraded?

What are their names and where are they?


-- 





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/15/2004 10:20:47, I think I heard Alexander S. Kunz say:
> I just checked my POPfile bucket pages and found it very interesting that,
> although spam is only 5.8% of my messages (lucky me, huh?), the "distinct
> word count" for those spam messages is by far the highest (only messages
> marked as "genuine/english" come close). I'd interpret that as "spam is
> *very* recognizable" after a certain training period. That could explain
> your results with BayesIt - maybe.


Yes, I agree that that could be the reason.  However, the messages that are missed 
(roughly 4% of total spam traffic) are marked with a 0%, which would qualify them as 
"unambiguously genuine (non-spam)", but they obviously are not, as a lot of spam tokens 
are found in them.  This is why I think there might be a problem with the filter 
itself, or with my settings.

> In practice, I had similar (odd) results with BayesIt. :-) ...part of the
> reason that made me switch to POPfile...

Funny, I went the other way... POPfile was very reliable for me (99.6%) but required 
constant manual hacking of the corpus to maintain this accuracy, plus with a 
sufficiently large corpus, it was really slow (took almost a couple of seconds to 
download each message, even very small ones), which with a dial-up connection and 
hundreds of messages a day is almost unbearable.

Plus, there was no way to offer some extra weight to non-spam messages (like with 
"regarding threshold" in BayesIt), which almost completely eradicates false 
positives.  With POPfile I had to scan my spam box once in a while in order to make 
sure.  With BayesIt, after doing so for a few months without even a single false 
positive, I concluded that it was not necessary anymore to scan the spam folder often. 
 I like that :)

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-15 Thread Alexander S. Kunz
Hello DZ-Jay,

15-Aug-2004 15:12, you wrote:

> I checked the BAYESIT.LOG file and realized that all messages are
> marked with either 100/99 % or 0% probability, which means that no matter
> how low I set the parameter, it will continue working the same.  I don't
> understand how come there is no "gray area", with messages marked with a,
> say, 30% probability, etc.  I do not get any false positives at all, but
> I do get about 4%  of false negatives...

I just checked my POPfile bucket pages and found it very interesting that,
although spam is only 5.8% of my messages (lucky me, huh?), the "distinct
word count" for those spam messages is by far the highest (only messages
marked as "genuine/english" come close). I'd interpret that as "spam is
*very* recognizable" after a certain training period. That could explain
your results with BayesIt - maybe.

In practice, I had similar (odd) results with BayesIt. :-) ...part of the
reason that made me switch to POPfile...

-- 
Best regards,
 Alexander

Bradley's Bromide: If computers get too powerful, we can organize them into
a committee... that will do them in.





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/15/2004 09:23:46, I think I heard MikeD (3) say:
> Hello Andre,

> Sunday, August 15, 2004, 6:44:17 AM, you wrote:

AW>> Have you deleted your spam and non-spam dictionary files when you
AW>> upgraded?

> Funny, that.  When I first upgraded I did not and it seemed to work
> fine ... until I rebooted.

Strange... rebooting shouldn't affect anything...

> After that, yes, I deleted all the dict files I could find.
> Apparently there were two sets, one from the old version and one set
> from the new.

I had to do the same thing when upgrading from v0.4gm to v0.5.4 because I was having 
problems.

> I then re-trained it on the accumulated spam and ham folders I have
> with about 2,000 messages each.  BTW, If I give Bayesit all 2,000
> messages at once to "chew on", it would hang.  If I gave it in
> "chunks" it seemed to work OK 

Hum... after deleting the dict files, I trained normally with lots of spam/non-spam 
messages (I'm pretty sure it was more than 2,000) without a problem.  So I don't know 
what could have happened in your case (?)

I personally find BayesIt extremely powerful, accurate, and fast (I come from POPFile, 
with an accuracy of 99.6 % which required a LOT of manual tuning, had quite some false 
positives, and was VERY slow...), but what it misses it *really* misses (0%, as 
opposed to some mid-way value).

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/15/2004 09:24:56, I think I heard MikeD (3) say:
DJ>> I started with the "move message" setting at 40 and continued
DJ>> to lower it without noticing any effect.  That's when I checked
DJ>> the BAYESIT.LOG file and realized that all messages are marked
DJ>> with either 100/99 % or 0% probability, which means that no matter
DJ>> how low I set the parameter, it will continue working the same.  I
DJ>> don't understand how come there is no "gray area", with messages
DJ>> marked with a, say, 30% probability, etc.  I do not get any false
DJ>> positives at all, but I do get about 4%  of false negatives...

> At the moment, everything in the log is .99.  Nothing has any other
> value.  Does that sound right?

That's more or less what I get, and in my opinion, it doesn't seem to be right.

However, I recently noticed why some obviously spam messages are given a probability 
of 0%:  Apparently the analysis engine is regarding a few "empty" tokens with a value 
of 0%, which "unspamifies" the final value, for example, in my log file, I get this in 
some messages:

: ---
15.08.2004 08:13:41 <[EMAIL PROTECTED]>
Graham:  0
Spam-grade:  0
Value for The Bat!: 0
: ---

<...>
:  0
:  0
:  0
:  0
:  0
:  0
:  0
:  0

As you can see, no matter how many spam tokens are found, all those 0's will end up 
clearing the final probability value.  This seems to me a bug in the tokenizer.  I 
haven't been able to find a common denominator for messages that cause this.
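
The effect is easy to reproduce if one assumes the plugin combines token values the way Paul Graham's essay "A Plan for Spam" does, as prod(p) / (prod(p) + prod(1 - p)): a single token with value 0 forces the numerator to 0. A minimal Python sketch under that assumption follows. The clamping bounds 0.01 and 0.99 are the ones used in that essay, and the function names are illustrative only.

```python
from math import prod

def combine(probs):
    """Graham-style combination of per-token spam probabilities."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)

def clamp(p, lo=0.01, hi=0.99):
    """Keep a token value away from the absolute values 0 and 1."""
    return max(lo, min(hi, p))

spammy = [0.99] * 10          # ten strongly spammy tokens
with_empty = spammy + [0.0]   # plus one "empty" token scored 0

print(combine(spammy))                           # very close to 1
print(combine(with_empty))                       # exactly 0.0
print(combine([clamp(p) for p in with_empty]))   # close to 1 again
```

If stored values were always clamped as in the essay, a literal 0 should never reach the combining step, which fits the suspicion that the empty tokens come from the tokenizer and not from training.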

Does anybody else get "empty" tokens in their log files?

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/15/2004 09:28:42, I think I heard Thomas Fernandez say:
DJ>> What is not that simple? The bayesian algorithm or how the
DJ>> "regarding threshold" is used by the plugin?

> The Bayesian algorithms. Your question, to which I answered, could be
> understood this way, so I don't feel I have to apologise.

I guess some people in this list just have to offer an answer -- any answer -- just 
because.

Well then, thank you for your wonderfully insightful answer of "Check out for a 
mathematician called Bayes. 19th century, IIRC."  No need to apologize at all, I have 
such a better grasp on the subject now, thanks!

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-15 Thread Thomas Fernandez
Hello DZ-Jay,

On Sun, 15 Aug 2004 09:20:41 -0400 GMT (15/08/2004, 20:20 +0700 GMT),
DZ-Jay wrote:

DJ>>> That makes sense. But do you know how the weight is calculated?

>> Check out for a mathematician called Bayes. 19th century, IIRC.

DJ> Have you read at all the entire thread, or did you just
DJ> decide to come in and offer your insightful comments at just this
DJ> point?

I've read the thread, but nowhere was it mentioned how a Bayesian filter
works. I thought that was your question. Apparently it wasn't, so
sorry for having wasted bandwidth.

>> It's not that simple.

DJ> What is not that simple? The bayesian algorithm or how the
DJ> "regarding threshold" is used by the plugin?

The Bayesian algorithms. Your question, to which I answered, could be
understood this way, so I don't feel I have to apologise.

-- 

Regards,
Thomas.

"Sorry, Officer, I didn't realize my radar detector wasn't plugged
in."

Message reply created with The Bat! 2.12.02
under Chinese Windows 98 4.10 Build  A 







Re[2]: Help with BayesIt tuning

2004-08-15 Thread MikeD (3)
Hello DZ-Jay,

Sunday, August 15, 2004, 8:12:23 AM, you wrote:

DJ> Some time around 08/14/2004 22:24:58, I think I heard MikeD (3) say:
>> What settings are you using?  Under the old version (0.4gm) I had it
>> trained and was getting most spam caught, no false positives with a
>> "Move message" setting of 10.  Now I have gone down as low as 1 and as
>> high as 99 without success.

DJ> I started with the "move message" setting at 40 and continued
DJ> to lower it without noticing any effect.  That's when I checked
DJ> the BAYESIT.LOG file and realized that all messages are marked
DJ> with either 100/99 % or 0% probability, which means that no matter
DJ> how low I set the parameter, it will continue working the same.  I
DJ> don't understand how come there is no "gray area", with messages
DJ> marked with a, say, 30% probability, etc.  I do not get any false
DJ> positives at all, but I do get about 4%  of false negatives...

At the moment, everything in the log is .99.  Nothing has any other
value.  Does that sound right?

>> BTW, I am using the 0.5.5 version that came with 2.12.  Should I be
>> using the newer version that I saw mentioned?

DJ> I was too.  I just upgraded yesterday to 0.5.9 and I haven't
DJ> noticed a difference.  It does provide a white/black list, which I
DJ> don't care to use because it defeats the purpose of a Bayesian
DJ> filter (there's huge discussion -- more like religious wars --
DJ> about this on the POPFile list hehe).  Also, the kludges.txt file
DJ> doesn't seem to be implemented either (ignore list for headers).

That's too bad 


-- 
Best regards,
 MikeD  mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build  3000
 





Re[2]: Help with BayesIt tuning

2004-08-15 Thread MikeD (3)
Hello Andre,

Sunday, August 15, 2004, 6:44:17 AM, you wrote:

AW> Hello MikeD,

AW> On 14 Aug 2004 at 14:47:24 -0500 GMT [21:47 CEST] you wrote:

AW> Have you deleted your spam and non-spam dictionary files when you
AW> upgraded?

Funny, that.  When I first upgraded I did not and it seemed to work
fine ... until I rebooted.

After that, yes, I deleted all the dict files I could find.
Apparently there were two sets, one from the old version and one set
from the new.

I then re-trained it on the accumulated spam and ham folders I have
with about 2,000 messages each.  BTW, If I give Bayesit all 2,000
messages at once to "chew on", it would hang.  If I gave it in
"chunks" it seemed to work OK 

-- 
Best regards,
 MikeD  mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build  3000
 





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/15/2004 07:43:05, I think I heard Andre Wichartz say:
> Hello DZ-Jay,

> On 14 Aug 2004 at 14:42:17 -0400 GMT [20:42 CEST] you wrote:

DJ>> That makes sense.  But do you know how the weight is
DJ>> calculated? I can assume it is the product of its initial
DJ>> probability by the "regarding threshold" value, is that true?

> I don't program the thing. For specific questions you really should ask
> Alexey.

I thought that with so much traffic in this list there would be someone who knew.  Oh 
well...

DJ>> And is it only for tokens that have the same occurrence in spam and
DJ>> non-spam messages, or is the weight skewed by this threshold on all
DJ>> tokens to give them an extra "non-spamy" umph in order to avoid
DJ>> false positives?

> I just made an example. It would of course work regardless how often a
> word occurs.

So you don't know... Ok.  I'll continue looking for info, probably contacting Alexey.

Thanx
dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/14/2004 23:28:14, I think I heard Thomas Fernandez say:
DJ>> That makes sense. But do you know how the weight is calculated?

> Check out for a mathematician called Bayes. 19th century, IIRC.

Have you read the entire thread at all, or did you just decide to come in and offer 
your insightful comments at just this point?  I'm talking about the "regarding 
threshold" value and how it is used, i.e. given the bayesian probability of a message 
what *ADDITIONAL* computation occurs with that parameter.  Do you know?  Do you think 
Mr. Bayes would have had enough visionary insight to see how this BayesIt-specific 
parameter was used by Alexey in his plugin?

DJ>> I can assume it is the product of its initial probability by the
DJ>> "regarding threshold" value, is that true?

> It's not that simple.

What is not that simple? The bayesian algorithm or how the "regarding threshold" is 
used by the plugin?  Because, if you have noticed from the context of the comment, I 
am talking about the parameters in the ADVANCED.INI file and how they are implemented.
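
For comparison, Paul Graham's essay "A Plan for Spam" biases the per-token probability toward non-spam by doubling each token's count in the good corpus before taking the ratio. Whether BayesIt's "regarding threshold" plays the same role is not established anywhere in this thread. The Python sketch below only shows the rule from that essay, with a hypothetical ham_weight parameter standing in for its fixed factor of 2.

```python
def token_probability(bad, good, nbad, ngood, ham_weight=2.0):
    """Per-token spam probability following Graham's 'A Plan for Spam'.

    bad, good   : occurrences of the token in spam / non-spam mail
    nbad, ngood : number of spam / non-spam messages trained on
    ham_weight  : bias toward non-spam (2 in the essay; hypothetical knob here)
    """
    g = ham_weight * good
    b = bad
    if g + b < 5:
        return None  # token seen too rarely to be scored
    b_ratio = min(1.0, b / nbad)
    g_ratio = min(1.0, g / ngood)
    # Clamp so that no single token is ever treated as absolute proof.
    return max(0.01, min(0.99, b_ratio / (g_ratio + b_ratio)))

# A token seen 50 times in each of two 2,000-message corpora:
print(token_probability(50, 50, 2000, 2000))                  # about 1/3 with weight 2
print(token_probability(50, 50, 2000, 2000, ham_weight=1.0))  # 0.5 unweighted
```

In this scheme the weight applies to every token, not only to those that occur equally often in both corpora, so all token values are nudged toward the non-spam side.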

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/14/2004 22:24:58, I think I heard MikeD (3) say:
> What settings are you using?  Under the old version (0.4gm) I had it
> trained and was getting most spam caught, no false positives with a
> "Move message" setting of 10.  Now I have gone down as low as 1 and as
> high as 99 without success.

I started with the "move message" setting at 40 and continued to lower it without 
noticing any effect.  That's when I checked the BAYESIT.LOG file and realized that all 
messages are marked with either 100/99 % or 0% probability, which means that no matter 
how low I set the parameter, it will continue working the same.  I don't understand 
how come there is no "gray area", with messages marked with a, say, 30% probability, 
etc.  I do not get any false positives at all, but I do get about 4% of false 
negatives...

> BTW, I am using the 0.5.5 version that came with 2.12.  Should I be
> using the newer version that I saw mentioned?

I was too.  I just upgraded yesterday to 0.5.9 and I haven't noticed a difference.  It 
does provide a white/black list, which I don't care to use because it defeats the 
purpose of a Bayesian filter (there's a huge discussion -- more like religious wars -- 
about this on the POPFile list hehe).  Also, the kludges.txt file doesn't seem to be 
implemented either (ignore list for headers).

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-15 Thread Andre Wichartz
Hello MikeD,

On 14 Aug 2004 at 14:47:24 -0500 GMT [21:47 CEST] you wrote:

M> I have been following this thread since I have been having some
M> problems too.  I was using the old version (0.4gm) until I upgraded to
M> the current version of TB.

M> The settings I used to use don't seem to work any more and I either
M> get everything filtered as junk or nothing is filtered as junk.  I
M> trained it with about 2000 spam and 2000 ham messages and still no
M> joy.  I have tried low "threshold" numbers and high without much
M> difference.

Have you deleted your spam and non-spam dictionary files when you
upgraded?

-- 
Cheers,
 Andre

"I don't suffer from insanity.
 I enjoy every minute of it."  





Re: Help with BayesIt tuning

2004-08-15 Thread Andre Wichartz
Hello DZ-Jay,

On 14 Aug 2004 at 14:42:17 -0400 GMT [20:42 CEST] you wrote:

DJ> That makes sense.  But do you know how the weight is
DJ> calculated? I can assume it is the product of its initial
DJ> probability by the "regarding threshold" value, is that true?

I don't program the thing. For specific questions you really should ask
Alexey.

DJ> And is it only for tokens that have the same occurrence in spam and
DJ> non-spam messages, or is the weight skewed by this threshold on all
DJ> tokens to give them an extra "non-spamy" umph in order to avoid
DJ> false positives?

I just made an example. It would of course work regardless how often a
word occurs.

-- 
Cheers,
 Andre

"Don't just walk the smooth roads:
 walk paths no one has walked before you,
 so that you leave tracks, and not just dust."  





Re: Help with BayesIt tuning

2004-08-14 Thread Thomas Fernandez
Hello DZ-Jay,

On Sat, 14 Aug 2004 14:42:17 -0400 GMT (15/08/2004, 01:42 +0700 GMT),
DZ-Jay wrote:

>> Assume a word occurs equally often in spam and non-spam mails. If you
>> set the value to 1 the word will get a spam probability of 0.5. If you
>> set it to a higher value the word will get something lower than 0.5.
>> Words in non-spam mails just count more and you can set just how much
>> more.

>> At least that's my take on it.

DJ> That makes sense. But do you know how the weight is calculated?

Check out a mathematician called Bayes. 18th century, IIRC.

DJ> I can assume it is the product of its initial probability by the
DJ> "regarding threshold" value, is that true?

It's not that simple.

-- 

Cheers,
Thomas.

24 things you shouldn't say during sex: 8. You're almost as good
as my ex!

Message reply created with The Bat! 2.12.02
under Chinese Windows 98 4.10 Build  A 







Re[2]: Help with BayesIt tuning

2004-08-14 Thread MikeD (3)
Hello DZ-Jay,

Saturday, August 14, 2004, 5:31:59 PM, you wrote:

DJ> Some time around 08/14/2004 15:47:24, I think I heard MikeD say:
>> The settings I used to use don't seem to work any more and I either
>> get everything filtered as junk or nothing is filtered as junk.  I
>> trained it with about 2000 spam and 2000 ham messages and still no
>> joy.  I have tried low "threshold" numbers and high without much
>> difference.

DJ> That's pretty much what I get:  messages are either
DJ> COMPLETELY spam (99 or 100 % probability) or COMPLETELY not-spam
DJ> (0% probability).  Although mine seems to catch most (~97%) of
DJ> spam, out of a few hundred emails daily, so it's not that bad.  And
DJ> that's with the default settings.  I'm trying to tune it to get it
DJ> a bit higher in accuracy, if possible, but can't seem to get much
DJ> help on this subject :(

What settings are you using?  Under the old version (0.4gm) I had it
trained and was getting most spam caught, no false positives with a
"Move message" setting of 10.  Now I have gone down as low as 1 and as
high as 99 without success.

BTW, I am using the 0.5.5 version that came with 2.12.  Should I be
using the newer version that I saw mentioned?

-- 
Best regards,
 MikeD mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build  3000
 





Re: Help with BayesIt tuning

2004-08-14 Thread DZ-Jay
Some time around 08/14/2004 14:54:38, I think I heard Pete Holsberg say:
> Saturday, August 14, 2004, 2:37:03 PM, you wrote:

DJ>> Some time around 08/14/2004 12:34:07, I think I heard Pete Holsberg say:

>>> Where do you do the setting???

DJ>> In a file called ADVANCED.INI in the BayesIt working directory, or 
DJ>> in the TB! installation directory.

> Not found anywhere on either HD!

> Can it be created manually??

Yes you can... but which version of BayesIt are you using?  Maybe you are using an 
older version...  Here's the default ADVANCED.INI file that came with BayesIt 0.5.9:

working thread priority = 2;
onexit thread priority = 3;
export selective download = 1;
selective download spam threshold = 10;
simple digits spam marks = 1;
no spaces spam marks = 1;
limit size to hash = 19;
limit size to hash header = 96;
temporary dictionary = "c:\\temp";
use expiration = 0;
age to expirate = 100;
learn from zero = 1;
max size of log file = 131072;
recalculating strategy = 3;
regarding threshold = 1.5;
use autotrain = 1;
use degeneration = 1;
number of exclamations = 5;

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-14 Thread DZ-Jay
Some time around 08/14/2004 15:47:24, I think I heard MikeD say:
> The settings I used to use don't seem to work any more and I either
> get everything filtered as junk or nothing is filtered as junk.  I
> trained it with about 2000 spam and 2000 ham messages and still no
> joy.  I have tried low "threshold" numbers and high without much
> difference.

That's pretty much what I get:  messages are either COMPLETELY spam (99 or 100 % 
probability) or COMPLETELY not-spam (0% probability).  Although mine seems to catch 
most (~97%) of spam, out of a few hundred emails daily, so it's not that bad.  And 
that's with the default settings.  I'm trying to tune it to get it a bit higher in 
accuracy, if possible, but can't seem to get much help on this subject :(

dZ.

> Is there a good "getting started" file somewhere that I have
> just missed?




-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re[2]: Help with BayesIt tuning

2004-08-14 Thread MikeD
Hello All,

I have been following this thread since I have been having some
problems too.  I was using the old version (0.4gm) until I upgraded to
the current version of TB.

The settings I used to use don't seem to work any more and I either
get everything filtered as junk or nothing is filtered as junk.  I
trained it with about 2000 spam and 2000 ham messages and still no
joy.  I have tried low "threshold" numbers and high without much
difference.

Is there a good "getting started" file somewhere that I have
just missed?

-- 
Best regards,
 MikeD mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build  3000





Re[2]: Help with BayesIt tuning

2004-08-14 Thread Pete Holsberg
Saturday, August 14, 2004, 2:37:03 PM, you wrote:

DJ> Some time around 08/14/2004 12:34:07, I think I heard Pete Holsberg say:

>> Where do you do the setting???

DJ> In a file called ADVANCED.INI in the BayesIt working directory, or 
DJ> in the TB! installation directory.

Not found anywhere on either HD!

Can it be created manually??

-- 





Re: Help with BayesIt tuning

2004-08-14 Thread DZ-Jay
Some time around 08/14/2004 12:27:41, I think I heard Andre Wichartz say:
> Assume a word occurs equally often in spam and non-spam mails. If you
> set the value to 1 the word will get a spam probability of 0.5. If you
> set it to a higher value the word will get something lower than 0.5.
> Words in non-spam mails just count more and you can set just how much
> more.

> At least that's my take on it.

That makes sense.  But do you know how the weight is calculated? I can assume it is 
the product of its initial probability by the "regarding threshold" value, is that 
true?  And is it only for tokens that have the same occurrence in spam and non-spam 
messages, or is the weight skewed by this threshold on all tokens to give them an 
extra "non-spamy" umph in order to avoid false positives?

Thanx
dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-14 Thread DZ-Jay
Some time around 08/14/2004 12:34:07, I think I heard Pete Holsberg say:

> Where do you do the setting???

In a file called ADVANCED.INI in the BayesIt working directory, or in the TB! 
installation directory.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re[2]: Help with BayesIt tuning

2004-08-14 Thread Pete Holsberg
Saturday, August 14, 2004, 12:27:41 PM, you wrote:

AW> Hello DZ-Jay,

AW> On 14 Aug 2004 at 11:30:32 -0400 GMT [17:30 CEST] you wrote:

DJ>> Yes, I am aware of its definition, but what I don't understand
DJ>> is what would be the effect of changing it to, say, 1.2 from 1.5
DJ>> (apart from the academic answer of making non-spam tokens a bit less
DJ>> heavy).  How does the plugin use this value?

AW> Assume a word occurs equally often in spam and non-spam mails. If you
AW> set the value to 1 the word will get a spam probability of 0.5. If you
AW> set it to a higher value the word will get something lower than 0.5.
AW> Words in non-spam mails just count more and you can set just how much
AW> more.

Where do you do the setting???



-- 





Re: Help with BayesIt tuning

2004-08-14 Thread Andre Wichartz
Hello DZ-Jay,

On 14 Aug 2004 at 11:30:32 -0400 GMT [17:30 CEST] you wrote:

DJ> Yes, I am aware of its definition, but what I don't understand
DJ> is what would be the effect of changing it to, say, 1.2 from 1.5
DJ> (apart from the academic answer of making non-spam tokens a bit less
DJ> heavy).  How does the plugin use this value?

Assume a word occurs equally often in spam and non-spam mails. If you
set the value to 1 the word will get a spam probability of 0.5. If you
set it to a higher value the word will get something lower than 0.5.
Words in non-spam mails just count more and you can set just how much
more.

At least that's my take on it.
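
To put rough numbers on that, here is a small Python sketch. It assumes the plugin weights non-spam occurrences the way Paul Graham's "A Plan for Spam" does, where the good counts are multiplied by a fixed factor (2 in his essay) before the ratio is taken. Whether BayesIt really computes it like this is only a guess; the function and its parameter names are invented for the example.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham, ham_weight=1.5):
    """Spam probability of one token, with non-spam occurrences counted ham_weight times."""
    spam_freq = spam_hits / n_spam
    ham_freq = ham_weight * ham_hits / n_ham
    return spam_freq / (spam_freq + ham_freq)

# A word seen in 100 of 2000 spam messages and 100 of 2000 non-spam messages.
for weight in (1.0, 1.2, 1.5, 2.0):
    p = token_spam_probability(100, 100, 2000, 2000, ham_weight=weight)
    print(f"regarding threshold {weight}: p = {p:.3f}")
```

Under this assumption the same word scores 0.5 at a value of 1, about 0.45 at 1.2, 0.4 at 1.5 and about 0.33 at 2, so going from 1.5 to 1.2 would move every token slightly toward "spam".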

-- 
Cheers,
 Andre

"1. If it's green or it wiggles, it's biology.
 2. If it stinks, it's chemistry.
 3. If it doesn't work, it's physics."  





Re: Help with BayesIt tuning

2004-08-14 Thread DZ-Jay
Some time around 08/14/2004 10:34:25, I think I heard Andre Wichartz say:
> Hello DZ-Jay,

> On 14 Aug 2004 at 09:28:34 -0400 GMT [15:28 CEST] you wrote:

DJ>> BTW, I do not understand very well the "regarding threshold"
DJ>> parameter, can someone explain it please?

> From advanced.ini:

> ; this number shows, how much "heavier" non-spam tokens than spam. It
> makes some kind of "guard" and keeps from false positives. Usual value
> is 2, but you can also try others...

Yes, I am aware of its definition, but what I don't understand is what would be the 
effect of changing it to, say, 1.2 from 1.5 (apart from the academic answer of making 
non-spam tokens a bit less heavy).  How does the plugin use this value?

Thanx
dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-14 Thread Andre Wichartz
Hello DZ-Jay,

On 14 Aug 2004 at 09:28:34 -0400 GMT [15:28 CEST] you wrote:

DJ> BTW, I do not understand very well the "regarding threshold"
DJ> parameter, can someone explain it please?

From advanced.ini:

; this number shows, how much "heavier" non-spam tokens than spam. It
makes some kind of "guard" and keeps from false positives. Usual value
is 2, but you can also try others...

-- 
Cheers,
 Andre

"Charlie was a Chemist, but Charlie is no more.
 What Charlie thought was H2O was H2SO4."  





Help with BayesIt tuning

2004-08-14 Thread DZ-Jay
Hello:

I've been running BayesIt for a while and it works beautifully.  My accuracy 
right now is at 96.75%, so I guess I shouldn't complain.  But out of a few hundred 
messages I get a day, it misses about 10 that look like obvious spam but were marked 
as not-spam.  I checked the BAYESIT.LOG file and found that almost all messages are 
valued by BayesIt at either 99, 100 or 0.  It's as if BayesIt thinks all messages are 
absolutely spam, or absolutely not-spam.  In a sense I think this is good, and it's 
because I started training it with a large collection of spam/not-spam messages.  But 
I cannot help but think that there should be more of a gray area for some messages... 
For example, the 10 messages that it misses daily are valued at 0.  I think there 
should be a way for me to tune the configurations in order to make it more accurate.  
On the other hand, I do not get ANY false positives, so that is a very good thing.

This is what I have in my ADVANCED.INI:

working thread priority="2"
onexit thread priority="3"
selective download spam threshold="50"
export selective download="1"
simple digits spam marks="1"
no spaces spam marks="1"
limit size to hash="19"
limit size to hash header="96"
temporary dictionary="C:\DOCUME~1\dz\LOCALS~1\Temp"
use expiration="0"
age to expirate="90"
learn from zero="0"   ; I changed this one today, was "1"
max size of log file="5242880"
recalculating strategy="0.0002"   ; I changed this one today, was "5"
regarding threshold="1.5"  ; I changed this today, was "1.8"
use autotrain="1"
use degeneration="1"
number of exclamations="5"

Any recommendations?  BTW, I do not understand very well the "regarding threshold" 
parameter, can someone explain it please?  I use BayesIt 0.5.5

Thanx! :)

-dZ.

-- 
Powered by The Bat! v.2.12.00 times BayesIt v.0.5.5
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4


