Re: How SA reacts to a bunch of garbage characters

Olivier Sun, 12 Jun 2016 20:44:24 -0700

Matus,

Thank you for your reply.

> On 09.06.16 10:43, Olivier wrote:
>>For years I have had the FuzzyOcr plugin running, but it helps little,
>>because it has its own list of words to keep updated.
>>
>>I am wondering what would happen if, instead of using its own list of
>>words, the result were injected back into the body of the main message.
>
> I raised this issue some years ago. The result was that pushing OCR-ed data
> back to SA for evaluation by BAYES and other rules could cause trouble,
> because freely available OCR software was not very precise.

Sure, the OCR results are not very precise. But could we imagine that
they are pushed into a part of the message that will not go through Bayes?

If we inject text extracted from a PDF, for example, that also modifies the
message and influences the Bayes tests. So maybe even PDF extraction
should not be submitted to Bayes, and SA should have a mechanism for that
purpose (other than launching a completely separate SA process on that
extracted part).
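
To make the idea concrete, here is a minimal sketch in Python of the
separation I have in mind. It is not SpamAssassin code and none of these
names exist in SA: ScannedMessage, text_for_rules and text_for_bayes are
hypothetical, made up only to show extracted text being visible to rules
while staying out of Bayes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScannedMessage:
    """Hypothetical container: keeps extracted text apart from the real body."""
    body_text: str
    # Text recovered by OCR or PDF extraction (assumed to be noisy).
    extracted_text: List[str] = field(default_factory=list)

    def text_for_rules(self) -> str:
        # Pattern rules would see the body plus everything extracted.
        return "\n".join([self.body_text, *self.extracted_text])

    def text_for_bayes(self) -> str:
        # Bayes would only see the original body, never the extracted text.
        return self.body_text


msg = ScannedMessage("Hello, see the attached invoice.")
msg.extracted_text.append("CHEAP PILLS w_T___l_e?_")
```

With this split, a noisy OCR result can still trigger a body rule but
cannot pollute the Bayes tokens.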

>>Most of the time, what will be injected back is plain garbage:
>>w_T___l_e?_
>>
>>But other times the result is interesting, like a proper English sentence
>>full of spam.
>
> What exactly do you use for OCR? 10 years ago I made a comparison between
> gocr, ocrad and tesseract, where gocr gave the best results.

I have gocr, ocrad and tesseract configured.

> Now, since Google sponsors tesseract development, the scanning looks much,
> much better, and I started thinking about trying that again.
>
>>So how will SA react if I reinject the garbage? Will it just ignore it?
>
> It would be nice to see the results.
> I'm mostly afraid about FUZZY_* rules...

I changed the config of FuzzyOcr, so I lost the log of extracted data. I
will post that in detail in a few days, once I have collected some
samples.
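
One way to drop the plain garbage before reinjecting anything could be a
crude "does this look like words" check. The following is only a rough
Python sketch under my own assumptions: looks_like_text is a hypothetical
helper, and the thresholds (words of at least 3 letters, at least 3 such
words, at least 60% of the tokens) are guesses that would need tuning
against real samples.

```python
import re

# A "word" here is assumed to be 3 or more ASCII letters and nothing else.
WORD_RE = re.compile(r"[A-Za-z]{3,}")


def looks_like_text(ocr_output: str, min_ratio: float = 0.6,
                    min_words: int = 3) -> bool:
    """Return True if enough of the OCR tokens look like plain words."""
    tokens = ocr_output.split()
    if not tokens:
        return False
    # Strip common punctuation around each token before matching.
    words = [t for t in tokens
             if WORD_RE.fullmatch(t.strip(".,;:!?\"'()"))]
    return len(words) >= min_words and len(words) / len(tokens) >= min_ratio


garbage_ok = looks_like_text("w_T___l_e?_")
sentence_ok = looks_like_text("Buy cheap watches online today")
```

Output like the garbage sample quoted above would be rejected, while a
readable sentence would pass and could be handed on for scoring.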

Best regards,

Olivier
