On Tue, 14 Jun 2016 08:56:50 -0400, Joe Quinn wrote:

> On 6/14/2016 8:33 AM, Matus UHLAR - fantomas wrote:
> > that is just what I would like to know: if OCR produces results
> > good enough for BAYES and other rules.
> >
> > I don't think there's a difference between bayes and other rules.
> > It's also possible that BAYES would have better results with misread
> > characters than other rules.
>
> I've dealt with OCR in the past, and have always had to go back
> afterwards and manually proofread the results. I expect the impact on
> Bayes would be a massively increased dictionary of rare words that
> result from poor "keming" in the image.
Personally I find that a typical spam adds ~30 new tokens, most of
which will be ephemeral. If image spam is a small minority of spam,
it's not likely to make a huge difference.

It's also not the worst offender. A few weeks ago I was getting spam
that placed an Asian character between each letter, and that was
averaging ~600 new tokens per spam.

I stopped using OCR a long time ago because I didn't find that image
spam was particularly hard to catch. These days I find that spams with
images are mostly either pictures of Russian girls or spoofed
corporate logos.

> Is OCR really all that useful? Some PDFs are written in
> extractable text instead of images, but those tend to use
> fractional-width spaces for kerning so it's not always easy to figure
> out what's a real word there either.
>
> That said, Google seems to use OCR on images in their filtering
> (quoth Wikipedia), so maybe it works when you have a sufficiently
> enormous data set that the OCR glitches are no longer rare and a
> decent inference can be made from them.
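
For what it's worth, the token-inflation effect from that interleaved
spam is easy to see with a toy tokenizer. The sketch below is only an
illustration under my own assumptions (a naive whitespace tokenizer and
made-up sample text, not SpamAssassin's actual Bayes tokenizer):

    # Toy demonstration: interleaving a filler character between letters
    # turns every word into a token the filter has never seen before.

    def tokenize(text):
        """Naive whitespace tokenizer, standing in for a Bayes tokenizer."""
        return text.lower().split()

    def interleave(text, filler="\u4e2d"):
        """Insert a filler character between the letters of each word,
        mimicking the obfuscated spam described above."""
        return " ".join(filler.join(word) for word in text.split())

    plain = "cheap meds shipped overnight no prescription needed"
    obfuscated = interleave(plain)

    known = set(tokenize(plain))   # tokens already learned from plain text
    new = [t for t in tokenize(obfuscated) if t not in known]

    print(f"{len(new)} of {len(tokenize(obfuscated))} tokens are new")
    # -> 7 of 7 tokens are new; each such spam feeds the database tokens
    #    that are unlikely to recur once the obfuscation pattern changes.

The exact count per message obviously depends on how the real tokenizer
handles the CJK characters; the point is only that obfuscated text
contributes mostly one-off tokens.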