> > My only restriction is that FuzzyOCR uses its own list of spam words
> > instead of pushing back the decoded text to SA for SA to analyze.
> This is necessary because of the poor quality of the OCR. It's only going to 
> be useful if the number of words you try to match against is very small.

As long as it happens inside a single run of SA, running all the tests
on the text extracted by FuzzyOCR will not take that much time.

Either the text is garbage and SA should not trigger, or the OCR output
is pretty readable and it would be a waste not to fully test that
extracted text.

The problem I see is elsewhere: running OCR is time-consuming. FuzzyOCR
performs several extractions, with different parameters, in order to
catch some obfuscation artifacts, and it stops as soon as one
extraction yields spammy words, which saves computation. If you want to
push the extracted text back to normal SA, you have to run all the
different extractions (which takes time and CPU), and you may end up
with several copies of the same text for SA to parse (I am not sure
whether several instances of the same bad word in a message would
increase the spaminess).
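
To make the trade-off concrete, here is a rough Python sketch of the two
strategies. It is not FuzzyOCR's actual code: the word list, the function
names and the idea of passing each OCR pass in as a callable are all
assumptions for illustration only. The second function also shows one
possible way to avoid handing SA several copies of the same text, by
dropping duplicates after normalizing case and whitespace.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical short word list, standing in for FuzzyOCR's own list.
SPAM_WORDS = {"viagra", "stock", "mortgage"}


def early_stop_scan(extractors: Iterable[Callable[[], str]]) -> Optional[str]:
    """Run the OCR passes one at a time and stop at the first spammy result."""
    for extract in extractors:
        text = extract()  # the expensive OCR pass
        if any(word in SPAM_WORDS for word in text.lower().split()):
            return text  # remaining passes are never run
    return None


def collect_unique_texts(extractors: Iterable[Callable[[], str]]) -> List[str]:
    """Run every OCR pass, keeping each distinct text only once."""
    seen = set()
    unique = []
    for extract in extractors:
        text = extract()  # every pass is paid for here
        key = " ".join(text.lower().split())  # normalize case and whitespace
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

With three passes where the second one is the first to contain a listed
word, early_stop_scan runs two passes, while collect_unique_texts always
runs all three. The deduplication only catches texts that are identical
after normalization; OCR passes that differ by a few characters would
still be handed to SA as separate copies.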

Olivier
