On Thu, 1 Sep 2016 15:16:37 +0200
Matus UHLAR - fantomas wrote:

> >> On Thu, Sep 1, 2016 at 12:27 AM, Olivier
> >> <olivier.nic...@cs.ait.ac.th> wrote:
> >> > I am running it, it does not do a very good job at extracting the
> >> > text from the images. Then it uses its own list of keywords to
> >> > detect spam: to me it's the biggest problem, it should push back
> >> > the text to SpamAssassin and let SA rules decide what to do with
> >> > it.
>
> >> I do agree that the OCR program should be doing the OCR'ing
> >> and the text filtering should be left to a program that does that
> >> for a living.
>
> On 01.09.16 13:59, RW wrote:
> >It's a long time since I've used it, but IIRC the point of FuzzyOCR
> >is that it does fuzzy matching on a dictionary of "bad" words -
> >similar to the way that spelling checkers find the most likely
> >suggestions. This gives it a very limited ability to deal with
> >imperfectly read words.
>
> it's the same as Olivier wrote above :-)

Not really, he just said it matches against a word list. My point is
that out of the several SA OCR plugins that have been written, FuzzyOCR
is the one that's specifically designed for doing fuzzy matching on a
finite word list. If you just pass the OCR output to Bayes or add it to
the body, it's not "fuzzy OCR" anymore.

> >Putting garbled OCR text through SA body rules may be more trouble
> >than it's worth.
>
> garbled, yes. I've had this discussion some years back and tesseract
> has currently much much better results than it had those years ago.

Unless it can cope with current CAPTCHAs the spammer has a reserve.

The first OCR plugin came towards the end of a period where people were
being hammered by image spam. There's been nothing like that since,
probably because it doesn't work well as spam. As I've said I find it
can be caught by other means. I must have put about 50k spams through
SA since I last had an FN that was an image spam.
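
For anyone unfamiliar with the idea, fuzzy matching of OCR output
against a finite word list can be sketched roughly as below. This is
not FuzzyOCR's code, just a minimal Python illustration using the
standard library's difflib; the word list, the 0.8 cutoff and the
function name fuzzy_hits are all assumptions made up for the example.

```python
import difflib
import re

# Hypothetical word list, for illustration only (not FuzzyOCR's list).
BAD_WORDS = ["viagra", "pharmacy", "mortgage", "refinance"]


def fuzzy_hits(ocr_text, words=BAD_WORDS, cutoff=0.8):
    """Return (token, listed_word) pairs for OCR tokens close to a listed word.

    cutoff is an assumed similarity threshold between 0 and 1; lower
    values tolerate more OCR damage but produce more false matches.
    """
    hits = []
    # Split the OCR output into lowercase alphanumeric tokens.
    for token in re.findall(r"[a-z0-9]+", ocr_text.lower()):
        # Ask for the single closest listed word, if any clears the cutoff.
        match = difflib.get_close_matches(token, words, n=1, cutoff=cutoff)
        if match:
            hits.append((token, match[0]))
    return hits


if __name__ == "__main__":
    print(fuzzy_hits("Cheap v1agra from our pharrnacy"))
```

With this cutoff a misread token such as "v1agra" still matches
"viagra", which is the limited tolerance for imperfectly read words
mentioned above. Anything not near a listed word is ignored, which is
also why the approach only works against a finite dictionary.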