[Bug 7727] New Plugin TesseractOcr

bugzilla-daemon Tue, 25 Jun 2019 08:38:32 -0700

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7727


--- Comment #5 from John Mertz <[email protected]> ---
(In reply to Henrik Krohns from comment #2)

> Do you have any statistics on how well this has performed on your
> mail feeds? 

At the moment the module is installed on a couple of lightly trafficked
machines. I wanted to get all of your lovely feedback first, to make
sure there weren't any glaring errors I had missed that would destroy
a busy machine. All of our machines already had FuzzyOcr enabled, so I
don't have a baseline for how performance compares with no OCR at all.

On the machines where it is running, there has been no significant
change in performance compared to FuzzyOcr (using gocr). The stats
suggest it is actually a little lighter on load, but not by more than
could be explained by fluctuations in traffic. It seems to be somewhat
more efficient than Fuzzy per scan, but because Fuzzy only runs
conditionally, depending on whether the score is already over a
threshold, the overall difference largely balances out.

Very large images can noticeably impact scan times. A 1920x1080 image of
12pt lorem ipsum text takes about 0.8 seconds of actual scan time on my
machine. Obviously this is a worst case. It is going to be much faster
for images with little actual text, and the plugin is configurable to
only scan messages within size and dimension constraints of your choice.
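
As a rough illustration of that kind of gating, here is a minimal
sketch in Python. It is not the plugin's code (the plugin itself is a
SpamAssassin module), and every name and default value below
(OcrLimits, should_ocr, the byte and pixel limits) is hypothetical
rather than one of the plugin's actual options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrLimits:
    # Hypothetical limits; names and defaults are assumptions for
    # illustration only, not the plugin's real configuration options.
    max_bytes: int = 1_000_000
    min_width: int = 50
    min_height: int = 50
    max_width: int = 1920
    max_height: int = 1080


def should_ocr(size_bytes: int, width: int, height: int,
               limits: OcrLimits = OcrLimits()) -> bool:
    """Return True only if the image falls inside every configured limit."""
    if size_bytes > limits.max_bytes:
        return False  # too large on disk; OCR would likely be slow
    if width < limits.min_width or height < limits.min_height:
        return False  # too small to contain readable text
    if width > limits.max_width or height > limits.max_height:
        return False  # pixel dimensions exceed the allowed maximum
    return True
```

Checking cheap metadata (byte size, pixel dimensions) before invoking
OCR keeps the expensive step off the path for images that are unlikely
to be worth scanning.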

> And have you actually verified what rules do hit the
> OCR'd body portion?

I can verify that the OCR'd content hits just as if it were part of the
text in the body of the email.
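
The behaviour described here can be pictured with a small Python
sketch. It assumes, purely for illustration, that the OCR output is
appended to the text that body rules are matched against; it does not
show how the plugin or SpamAssassin does this internally, and the rule
name, pattern, and function are all hypothetical.

```python
import re

# Hypothetical body rule; not a real SpamAssassin rule.
BODY_RULES = {
    "HYPOTHETICAL_PILLS": re.compile(r"cheap\s+pills", re.IGNORECASE),
}


def rules_hit(body_text, ocr_texts):
    """Return the sorted names of rules matching the body plus OCR text."""
    # Treat each OCR result as if it were one more part of the body.
    combined = "\n".join([body_text, *ocr_texts])
    return sorted(name for name, pattern in BODY_RULES.items()
                  if pattern.search(combined))
```

For example, calling rules_hit("Hello", ["Buy CHEAP pills now"]) matches
the hypothetical rule even though the plain-text body alone would not.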

-- 
You are receiving this mail because:
You are the assignee for the bug.
