On Nov 30, 2018, at 7:00 AM, Bill Cole 
<sausers-20150...@billmail.scconsult.com> wrote:
> 
>> Since HTML is already getting rendered to text, then perhaps the conversion 
>> code should strip (literally, just delete) any zero-width characters during 
>> this conversion? That should make normal body rules, and Bayes, function 
>> properly, no?
> 
> Not if they are *looking for* those characters.

But AFAIK we're only looking for those characters with rawbody rules, because 
it's really hard to search for them in regular body rules... no?  I'm not 
trying to advocate for removing rawbody rules, but rather for making it easier 
for normal body rules to work.

But RW's suggestion is probably a good one: offer both paths:

On Nov 30, 2018, at 7:46 AM, RW <rwmailli...@googlemail.com> wrote:
> 
> It makes it harder to write rules detecting these tricks, but it may
> happen eventually. As far as Bayes is concerned, it would be a shame to
> lose the information.

I'm not sure I see how Bayes can take decent advantage of these zero-width 
chars.  If they are interspersed randomly within words, then Bayes has to 
tokenize each and every permutation (or at least very many permutations) of 
each word in order to be decently effective.  But if the zero-width chars are 
stripped out, then Bayes only has to tokenize the regular, displayable word.  
Am I missing something?
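To make the combinatorics concrete, here's a toy Python illustration (not
SpamAssassin's actual tokenizer): inserting a single zero-width space at any
subset of the interior positions of one six-letter word already produces 32
distinct raw tokens, all rendering identically, versus one token after stripping.

    from itertools import combinations

    word = "viagra"
    zwsp = "\u200b"
    gaps = range(1, len(word))           # interior insertion points

    variants = set()
    for k in range(len(word)):           # choose 0..5 positions to pollute
        for chosen in combinations(gaps, k):
            out = ""
            for i, ch in enumerate(word):
                out += ch
                if i + 1 in chosen:
                    out += zwsp
            variants.add(out)

    print(len(variants))                                   # 32 distinct raw tokens
    print(len({v.replace(zwsp, "") for v in variants}))    # 1 token after stripping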

But offering both the converted and non-converted text is likely the best 
option, and having Bayes work on the normalized version resolves the concern above.
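Roughly along these lines (my own naming, and a Python sketch rather than
SpamAssassin's Perl): keep the raw decoded text for rules that hunt for the
obfuscation itself, and a normalized copy for ordinary body rules and Bayes,
so the signal isn't lost either way.

    from dataclasses import dataclass

    @dataclass
    class RenderedPart:
        raw: str           # zero-width characters intact (rawbody-style rules)
        normalized: str    # zero-width characters stripped (body rules, Bayes)

    def render(decoded_text: str) -> RenderedPart:
        zero_width = (0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF)   # assumed set
        return RenderedPart(
            raw=decoded_text,
            normalized=decoded_text.translate({c: None for c in zero_width}))

    part = render("fr\u200dee m\u200boney")
    # Bayes trains on part.normalized ("free money"); a trick-detection rule
    # can still match part.raw, so the information RW mentions isn't thrown away.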

--- Amir
