Re: Image spam - FuzzyOCR?

RW Thu, 01 Sep 2016 09:47:37 -0700

On Thu, 1 Sep 2016 15:16:37 +0200
Matus UHLAR - fantomas wrote:

> >> On Thu, Sep 1, 2016 at 12:27 AM, Olivier
> >> <olivier.nic...@cs.ait.ac.th> wrote:  
> >> > I am running it, it does not do a very good job at extracting the
> >> > text from the images. Then it uses its own list of keywords to
> >> > detect spam: to me it's the biggest problem, it should push back
> >> > the text to SpamAssassin and let SA rules decide what to do with
> >> > it. 
> >>       I do agree that the OCR program should be doing the OCR'ing
> >> and the text filtering should be left to a program that does that
> >> for a living.  
> 
> On 01.09.16 13:59, RW wrote:
> >It's a long time since I've used it, but IIRC the point of FuzzyOCR
> >is that it does fuzzy matching on a dictionary of "bad" words -
> >similar to the way that spelling checkers find the most likely
> >suggestions. This gives it a very limited ability to deal with
> >imperfectly read words.  
> 
> it's the same as Olivier wrote above :-)


Not really, he just said it matches against a word list. My point is
that out of the several SA OCR plugins that have been written, FuzzyOCR
is the one that's specifically designed for doing fuzzy matching on a
finite word list. If you just pass the OCR output to Bayes or add it to
the body, it's not "fuzzy OCR" anymore.
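
To illustrate what I mean by fuzzy matching on a finite word list, here
is a minimal Python sketch. It is not FuzzyOCR's actual code or
algorithm; it uses the standard library's difflib as a stand-in, and
the word list, the function name fuzzy_hits and the 0.8 cutoff are all
made up for the example.

```python
import difflib

# Hypothetical word list for illustration; not FuzzyOCR's real dictionary.
BAD_WORDS = ["viagra", "cialis", "pharmacy", "mortgage"]


def fuzzy_hits(ocr_text, words=BAD_WORDS, cutoff=0.8):
    """Return the listed words that some OCR token approximately matches.

    cutoff is the minimum difflib similarity ratio (0..1); the value 0.8
    is an assumption chosen for this example, not a tuned setting.
    """
    hits = set()
    for token in ocr_text.lower().split():
        # Ask for the single closest listed word at or above the cutoff.
        match = difflib.get_close_matches(token, words, n=1, cutoff=cutoff)
        if match:
            hits.add(match[0])
    return sorted(hits)


if __name__ == "__main__":
    # OCR-style misreads: "1" for "i", "rn" for "m".
    print(fuzzy_hits("V1agra from our pharrnacy"))
```

A misread token like "v1agra" scores about 0.83 against "viagra", so it
still counts as a hit, whereas an exact-match body rule or a Bayes token
would see an unrelated string.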


> >Putting garbled OCR text through SA body rules may be more trouble
> >than it's worth.  
> 
> garbled, yes. I had this discussion some years back, and tesseract
> now gives much better results than it did then.


Unless it can cope with current CAPTCHAs, the spammers still have
something in reserve.

The first OCR plugin came towards the end of a period when people were
being hammered by image spam. There's been nothing like that since,
probably because it doesn't work well as spam. As I've said, I find it
can be caught by other means. I must have put about 50k spams through
SA since I last had an FN that was an image spam.
