Jason Haar wrote:
I'm having marvelous luck with FuzzyOCR - but the spammers are learning too.

When I first started using it just a couple of months ago, it really
whacked the image-based spam. You could see why when "gocr file.gif"
returned nice text that was easy to match against.

However, it is a different matter now. I just got a "lose weight" spam 10
minutes ago that gocr returns as:

      lI__c_tc)r _rc_hc_rihc_Ll _cnLl .h1c_Llic_;cll_ _u__c_c __ihc LI
              l c htc)hlc_rc)c_c_ B llr_ll l hc r_cp_


        _ t4____ __cc_'un ic) __'ri_c _ hH3s, t_k   _ ,r o_E,y _h K E,_
        _ ,_ics r _ sncu)._r. t.ihk). lhirkrr x_))  '   gg __, r
        _ Krvc)_H t)r r_irk cct .__             _
                             O _' Y O ___ TE_ E
         _Lncl nLnn __ mc)R hnrtb

That tells me to go to "www.realhgh" dot org, but their GIF processing
munged it enough to slip by gocr.

Not much FuzzyOCR can do with that :-(

A few days ago, someone sent me an image that returned garbage with plain 'gocr <file>'. The trick to better detection is to adjust gocr's -l parameter (the grey-level threshold) to get better contrast, and with it better results. By looping over the values 0 to 255 you will find a setting that gives good results for this type of image. If you start getting a lot of these images, add another scanset with that setting; it will not add too many CPU cycles to your scan. The new setting will almost certainly give you better results with other images too, so unless you have a really overloaded system, the extra scanset will not 'break the bank'.
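
For what it's worth, here is a rough sketch of that loop in Python. It assumes gocr is on the PATH and accepts "-l <level>" and "-i <file>". The readability() score and the helper names are made up for illustration; they are not part of gocr or FuzzyOCR, and any measure of "looks like real words" would do in their place.

```python
import re
import subprocess


def run_gocr(path, level):
    """Run gocr on one image at a fixed grey level and return its text output."""
    result = subprocess.run(
        ["gocr", "-l", str(level), "-i", path],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def readability(text):
    """Hypothetical score: share of non-space characters that sit inside
    runs of three or more letters. Garbage like '_c_ l_' scores near 0."""
    chars = sum(1 for c in text if not c.isspace())
    if chars == 0:
        return 0.0
    in_words = sum(len(w) for w in re.findall(r"[A-Za-z]{3,}", text))
    return in_words / chars


def best_level(path, levels=range(256), ocr=run_gocr):
    """Try every grey level and return (level, score) for the most readable
    output. The ocr argument can be swapped out for testing."""
    scored = [(readability(ocr(path, lvl)), lvl) for lvl in levels]
    score, level = max(scored, key=lambda pair: pair[0])
    return level, score
```

Calling best_level("file.gif") on a sample of the new spam should point at a candidate -l value for an extra scanset. It runs gocr 256 times, so it is something to do once by hand on a sample, not on every incoming message.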

--
Jorge Valdes

