On Tue, 14 Jun 2016 08:56:50 -0400, Joe Quinn wrote:

> On 6/14/2016 8:33 AM, Matus UHLAR - fantomas wrote:
> > that is just what I would like to know: if OCR produces results
> > good enough for BAYES and other rules.
> >
> > I don't think there's a difference between bayes and other rules.
> > It's also possible that BAYES would have better results with misread
> > characters than other rules.
>
> I've dealt with OCR in the past, and have always had to go back
> afterwards and manually proofread the results. I expect the impact on
> Bayes would be a massively increased dictionary of rare words that
> result from poor "keming" in the image.
Personally I find that a typical spam adds ~30 new tokens, most of
which will be ephemeral. If image spam is a small minority of spam,
it's not likely to make a huge difference.

It's also not the worst offender. A few weeks ago I was getting spam
that placed an Asian character between each letter, and that was
averaging ~600 new tokens per spam.

I stopped using OCR a long time ago because I didn't find that image
spam was particularly hard to catch. These days I find that spams with
images are mostly either pictures of Russian girls or spoofed
corporate logos.

> Is OCR really all that useful? Some PDFs are written in
> extractable text instead of images, but those tend to use
> fractional-width spaces for kerning so it's not always easy to figure
> out what's a real word there either.
>
> That said, Google seems to use OCR on images in their filtering
> (quoth Wikipedia), so maybe it works when you have a sufficiently
> enormous data set that the OCR glitches are no longer rare and a
> decent inference can be made from them.
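
For what it's worth, the token-inflation effect from that interleaved
spam is easy to see with a toy tokenizer. The sketch below is only an
illustration under my own assumptions (a naive whitespace tokenizer and
made-up sample text, not SpamAssassin's actual Bayes tokenizer):

    # Toy demonstration: interleaving a filler character between letters
    # turns every word into a token the filter has never seen before.

    def tokenize(text):
        """Naive whitespace tokenizer, standing in for a Bayes tokenizer."""
        return text.lower().split()

    def interleave(text, filler="\u4e2d"):
        """Insert a filler character between the letters of each word,
        mimicking the obfuscated spam described above."""
        return " ".join(filler.join(word) for word in text.split())

    plain = "cheap meds shipped overnight no prescription needed"
    obfuscated = interleave(plain)

    known = set(tokenize(plain))   # tokens already learned from plain text
    new = [t for t in tokenize(obfuscated) if t not in known]

    print(f"{len(new)} of {len(tokenize(obfuscated))} tokens are new")
    # -> 7 of 7 tokens are new; each such spam feeds the database tokens
    #    that are unlikely to recur once the obfuscation pattern changes.

The exact count per message obviously depends on how the real tokenizer
handles the CJK characters; the point is only that obfuscated text
contributes mostly one-off tokens.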