Hi,

> On 23.05.09 12:43, alex k wrote:
>> It seems that image spam is back. So I wrote a new OCR plugin for
>> spamassassin, which uses convert and ocrad to extract text.
>> For details and download see:
>>
>> http://spielwiese.la-evento.com/facileOCR/
>>
>> We use this plugin on our servers. It kicks out every image-spam, that
>> made it through the other filters and produces not a single false
>> positive.
>
> hmmm, last two images I've checked were much nicer read by gocr, just FYI.
>
> another question I've raised some time ago was the possibility of pushing
> read text to spamassassin so it could be detected by other checks, e.g.
> spamassassin and optionally uribl's...
> The answer was gocr is not reliable enough for doing this stuff, but I
> hope it's worth trying...

Let me explain a bit how this plugin works. It doesn't matter how cleanly the text is read: you can always take the extracted text from debuglog and expand your spamwords list with misread variants such as "Favorl,cllck,Fvorle". The extracted text is filtered, so you can safely use anything you find in debuglog. That means we don't need 100% word recognition, which would be very hard to reach.

I decided to use ocrad rather than gocr because ocrad has some nice features (like text filtering and built-in resizing). I wanted to keep the list of dependencies as small as possible, so I use only ocrad (tests with tesseract or ocropus were discouraging).

By the way, there is already a plugin that extracts words with gocr and feeds them to Bayes. Do you know BayesOCR?

bye,
Xela

> --
> Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
> Warning: I wish NOT to receive e-mail advertising to this address.
> Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
> "To Boot or not to Boot, that's the question." [WD1270 Caviar]
>
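
A minimal Python sketch of the spamwords idea described above: OCR output is filtered, then compared against a word list that deliberately contains misread variants. This is not the plugin's code. The function names, the filtering rule (lowercase, letters and digits only) and the sample text are all assumptions made up for illustration; only the three list entries come from the message.

```python
import re


def parse_spamwords(raw: str) -> list[str]:
    """Split a comma-separated word list and lowercase each entry."""
    return [w.strip().lower() for w in raw.split(",") if w.strip()]


def filter_text(ocr_text: str) -> str:
    """Assumed filter: lowercase, keep letters and digits,
    and collapse every other run of characters into one space."""
    return re.sub(r"[^a-z0-9]+", " ", ocr_text.lower()).strip()


def count_hits(ocr_text: str, spamwords: list[str]) -> int:
    """Count how many list entries occur in the filtered OCR text."""
    text = filter_text(ocr_text)
    return sum(1 for w in spamwords if w in text)


# The list holds OCR misreadings, not correctly spelled words.
words = parse_spamwords("Favorl,cllck,Fvorle")

# Hypothetical OCR output for one image.
sample = "Fvorle St0ck!! cllck here"
hits = count_hits(sample, words)
```

Because the list stores the misreadings themselves, recognition quality matters less: a word that the OCR consistently garbles the same way still matches once its garbled form has been copied into the list.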