On Fri, 2013-05-10 at 17:58 -0400, David F. Skoll wrote:
> On Fri, 10 May 2013 23:14:36 +0200 Karsten Bräckelmann wrote:

> We (probably) have a much larger sample population, so this tends not
> to be as much of a problem for us.

This thread is about a default Bayes database, suitable for distribution.
Not a humongous database with millions of tokens.

It also would have to be usable on small sites, as well as company-wide.
Train on error should not be overruled by the sheer number of tokens and
occurrences of them.

> Again, the key is a large sample size.

Yup. In the outlined case, the large sample size would most likely push
that token towards no man's land. It is, after all, a totally valid and
actually used word.

You asked for cases of "your ham is someone else's spam". That is
precisely one such case.

Your repeated counter-argument / solution of a large sample size
translates to "neither ham nor spam". Not helpful.
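
To put rough numbers on that, here is a minimal sketch. It assumes a
simplified per-token probability (a plain ratio of relative frequencies),
which is not the formula SpamAssassin actually uses, and all counts are
made up for illustration.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Simplified per-token spam probability (ratio of relative frequencies).

    This is an illustrative formula, not SpamAssassin's actual computation.
    """
    spam_freq = spam_hits / n_spam if n_spam else 0.0
    ham_freq = ham_hits / n_ham if n_ham else 0.0
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_freq / (spam_freq + ham_freq)


# Hypothetical counts: the same token, seen 40 times by each of two users.
# For me it only ever shows up in spam, for my neighbor only in ham.
mine = token_spam_probability(spam_hits=40, ham_hits=0, n_spam=1000, n_ham=1000)
neighbor = token_spam_probability(spam_hits=0, ham_hits=40, n_spam=1000, n_ham=1000)

# Pool both corpora into one shared database: the evidence cancels out.
pooled = token_spam_probability(spam_hits=40, ham_hits=40, n_spam=2000, n_ham=2000)

print(mine, neighbor, pooled)
```

With separate databases the token is a strong indicator for each of us, in
opposite directions. In the pooled database it sits in the middle and tells
neither of us anything.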

We're talking Bayes, thus in tokens. Spam for me, ham for my neighbor
(yes, literally).

> These are edge cases that are pretty easily handled with personal
> Bayes databases or whitelisting if the system keeps getting it wrong.

Exactly. Personal Bayes databases. The opposite of a default database.


> > Paypal. And them notifying their customers about changes in the terms
> > of use. And actually sending out the full terms of use in the same
> > mail. In this case, again, German -- but they managed to score a
> > whopping 12.2 once for me. Yes, of course, BAYES_99.
> 
> Was this with your personal Bayes data?  Even that can be wrong sometimes...

Yes, it was. And yes, it can. :)


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
