https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7727
Bug ID: 7727
Summary: New Plugin TesseractOcr
Product: Spamassassin
Version: unspecified
Hardware: PC
OS: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Plugins
Assignee: [email protected]
Reporter: [email protected]
Target Milestone: Undefined
Created attachment 5663
--> https://bz.apache.org/SpamAssassin/attachment.cgi?id=5663&action=edit
TesseractOcr plugin
Greetings,
On behalf of Fastnet SA/MailCleaner Software, Inc., I have developed a new OCR
plugin.
The primary reasons behind this are the poor performance of FuzzyOCR in its
default configuration and the need to maintain a separate set of rules for
that plugin. For those who are not familiar with it, FuzzyOCR keeps a separate
wordlist, along with a value for how approximate a match can be, and this
mechanism is fairly opaque. When it finds a match, it simply passes the rule
hit back.
TesseractOcr performs very well and efficiently. Instead of having a separate
wordlist, the plugin I have written passes the parsed text back to the parent
SpamAssassin process, where the content can be matched by regular body rules.
Fuzzy matching can then be handled through regular expressions, which are
already largely written with false positives in mind and which are much easier
to debug.
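To illustrate the idea of matching OCR output with ordinary regular
expressions, here is a minimal sketch in Python. It is not the plugin's code
(the plugin itself is attached to this bug), and the rule names and patterns
are hypothetical, chosen only to show how character classes can absorb common
OCR misreads such as "1" for "i" or "0" for "o".

```python
import re

# Hypothetical rules for illustration only; these names and patterns are
# assumptions and are not taken from the plugin or from SpamAssassin.
RULES = {
    # Tolerates "1", "l", "|" or "!" for "i", and "@" or "4" for "a".
    "OCR_PHARMA": re.compile(r"\bv[i1l|!][a@4]gr[a@4]\b", re.IGNORECASE),
    # Tolerates "3" for "e" and "0" for "o".
    "OCR_FREE_MONEY": re.compile(r"\bfr[e3]{2}\s+m[o0]n[e3]y\b", re.IGNORECASE),
}


def match_rules(ocr_text):
    """Return the sorted names of all rules whose pattern matches the text."""
    return sorted(
        name for name, pattern in RULES.items() if pattern.search(ocr_text)
    )
```

Because the fuzziness lives in the pattern itself, a false positive can be
traced to a specific expression and tightened there, rather than by tuning a
global approximation value.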
The plugin is currently operating on several MailCleaner machines without
issue. Given the performance overhead of any OCR plugin, I do not propose
enabling it by default.
I look forward to any feedback.
Regards,
John Mertz
[email protected]