Re: Stock spam in images

Jorge Valdes Wed, 04 Oct 2006 10:16:44 -0700

Jason Haar wrote:

I'm having marvelous luck with FuzzyOCR - but the spammers are learning too.


When I first started using it just a couple of months ago, it really
whacked the image-based spam. You could see why when "gocr file.gif"
returned nice text that was easy to match against.

However, now is a different matter. I just got a "lose weight" spam 10
minutes ago that gocr returns as:

      lI__c_tc)r _rc_hc_rihc_Ll _cnLl .h1c_Llic_;cll_ _u__c_c __ihc LI
              l c htc)hlc_rc)c_c_ B llr_ll l hc r_cp_


        _ t4____ __cc_'un ic) __'ri_c _ hH3s, t_k   _ ,r o_E,y _h K E,_
        _ ,_ics r _ sncu)._r. t.ihk). lhirkrr x_))  '   gg __, r
        _ Krvc)_H t)r r_irk cct .__             _
                             O _' Y O ___ TE_ E
         _Lncl nLnn __ mc)R hnrtb

That tells me to go to "www.realhgh" dot org, but their GIF processing
munged it enough to slip by gocr.

Not much FuzzyOCR can do with that :-(

A few days ago, someone provided me with an image that returned garbage
when using plain 'gocr <file>'. The trick to better detection is to
adjust gocr's -l parameter to get better contrast (and better results).

By looping 0...255 you will find a setting which will give you good
results for this type of image, and if you start getting a lot of these
images, adding another scanset will not add too many cpu cycles to your
scan. This new setting will almost certainly give you better results
with other images too, so unless you have a really overloaded system,
adding another scanset will not 'break the bank'.
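
The 0...255 loop could be scripted roughly as below. This is only a
sketch, not part of FuzzyOCR: it assumes gocr is on the PATH and accepts
"-l <level>" and "-i <file>", and the names run_gocr, score_text and
best_threshold, along with the "letters in runs of four or more" score,
are made up for illustration. As far as I understand gocr, a level of 0
means "autodetect", so that value just reproduces the default run.

```python
import re
import subprocess


def run_gocr(image_path, level):
    """Run gocr with grey-level threshold `level` and return its text output.

    Assumes a `gocr` binary on the PATH that accepts -l and -i.
    """
    result = subprocess.run(
        ["gocr", "-l", str(level), "-i", image_path],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


def score_text(text):
    """Hypothetical readability score: letters that sit in runs of 4 or more.

    Garbage such as 'l_c_ _rc_' scores 0; real words score their length.
    """
    return sum(len(word) for word in re.findall(r"[A-Za-z]{4,}", text))


def best_threshold(image_path, levels=range(0, 256), ocr=run_gocr):
    """Try every level and return (level, score, text) for the best one.

    `ocr` is injectable so the sweep can be exercised without gocr installed.
    On a tie, the lowest level wins.
    """
    best = (None, -1, "")
    for level in levels:
        text = ocr(image_path, level)
        score = score_text(text)
        if score > best[1]:
            best = (level, score, text)
    return best


if __name__ == "__main__":
    import sys

    level, score, text = best_threshold(sys.argv[1])
    print(f"best -l value: {level} (score {score})")
    print(text)
```

Run it against one sample image, note the level it reports, and try
that value in an extra scanset. The score is crude; checking the output
by eye for the few top-scoring levels is still worthwhile.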


--
Jorge Valdes
