> > My only restriction is that FuzzyOCR uses it's own list of spam words
> > instead of pushing back the decoded text to SA for SA to analyze.
> This is necessary because of the poor quality of the OCR. It's only going to
> be useful if the number of words you try to match against is very small.

Since this happens inside a single run of SA, running all the tests on the text extracted by FuzzyOCR would not take much time. Either the text is garbage, in which case SA should not trigger, or the text is fairly readable because OCR gave good output, and then it would be a waste not to test it fully.

The problem I see is elsewhere: running OCR is time-consuming. FuzzyOCR performs several extractions with different parameters in order to catch some obfuscation artifacts, and it stops as soon as one extraction has produced spammy words, which saves computation. If you want to push the extracted text back to normal SA, you have to run all the different extractions (which takes time and CPU), and you may end up with several copies of the same text for SA to parse. (I am not sure whether having several instances of the same bad word in a message would increase its spamminess.)

Olivier
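
P.S. To make the trade-off concrete, here is a rough Python sketch of the two strategies. It is not FuzzyOCR's actual code: the word list, the function names, and the extractor callables are all hypothetical stand-ins, and the de-duplication step is only my assumption about how one might avoid handing SA several copies of the same text.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical small word list; FuzzyOCR's real list and matching are not shown here.
SPAM_WORDS = {"viagra", "stock", "mortgage"}

# An "extractor" stands in for one OCR pass with one set of parameters.
Extractor = Callable[[object], str]


def count_spam_words(text: str) -> int:
    """Count how many whitespace-separated words are on the small spam list."""
    return sum(1 for word in text.lower().split() if word in SPAM_WORDS)


def scan_early_stop(image: object, extractors: Sequence[Extractor],
                    threshold: int = 1) -> Tuple[bool, int]:
    """Run extractions in order and stop at the first one that yields spammy words.

    Returns (is_spammy, number_of_ocr_runs_performed).
    """
    runs = 0
    for extract in extractors:
        runs += 1
        if count_spam_words(extract(image)) >= threshold:
            return True, runs
    return False, runs


def collect_all_texts(image: object,
                      extractors: Sequence[Extractor]) -> Tuple[List[str], int]:
    """Run every extraction, as pushing the text back to SA would require.

    Identical outputs are dropped (an assumption, not existing behaviour) so
    that the same text is not handed over several times.
    Returns (distinct_texts, number_of_ocr_runs_performed).
    """
    texts: List[str] = []
    runs = 0
    for extract in extractors:
        runs += 1
        text = extract(image).strip()
        if text and text not in texts:
            texts.append(text)
    return texts, runs


if __name__ == "__main__":
    # Fake OCR passes: the first returns garbage, the next two agree.
    passes = [lambda img: "xq zz",
              lambda img: "buy viagra now",
              lambda img: "buy viagra now"]
    print(scan_early_stop(None, passes))    # (True, 2)
    print(collect_all_texts(None, passes))  # (['xq zz', 'buy viagra now'], 3)
```

With these fake passes, the early-stop variant pays for two OCR runs, while the collect-everything variant always pays for all three, which is the cost I am worried about.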