The real problem is the potentially fuzzy output from the OCR engine: sure,
all copies of the very same spam would be recognized identically, but what
about slightly different copies? Would the "use the SA force" approach be
feasible? The use of String::Approx in FuzzyOcr surely has a purpose, but is
it well-targeted, or could we give up some detection accuracy (the current
approach) in favor of flexibility (reinjection, or whatever else it would be)?
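
To make the concern concrete, here is a minimal sketch of approximate word
matching on noisy OCR output. It is not how FuzzyOcr or String::Approx
actually work: it uses Python's difflib as a stand-in, and the function name,
the word list and the 0.8 threshold are all assumptions made up for
illustration.

```python
from difflib import SequenceMatcher


def fuzzy_hits(ocr_text, spam_words, threshold=0.8):
    """Return the spam words that approximately match some OCR token.

    Hypothetical helper: the name and the threshold are illustrative only.
    """
    tokens = ocr_text.lower().split()
    hits = set()
    for word in spam_words:
        for token in tokens:
            # ratio() is 1.0 for identical strings and drops as they diverge
            if SequenceMatcher(None, word, token).ratio() >= threshold:
                hits.add(word)
                break
    return hits


if __name__ == "__main__":
    # "v1agra" is the kind of token an OCR engine might emit for "viagra"
    print(fuzzy_hits("Buy v1agra n0w", ["viagra", "stock"]))
```

An exact-match rule would miss "v1agra" entirely, which is why I wonder
whether plain body rules on reinjected text would cope with OCR noise.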

More or less this is what I was asking about two or three messages ago.

Regards,

-----------------------------------
Giampaolo Tomassoni - IT Consultant
Piazza VIII Aprile 1948, 4
I-53044 Chiusi (SI) - Italy
Ph: +39-0578-21100

> Stuart Johnston wrote:
> 
> > Theo Van Dinter wrote:
> >
> >> On Mon, Oct 02, 2006 at 03:18:58PM +0100, Randal, Phil wrote:
> >>
> >>>> undetected). Wouldn't it be better to inject the detected text back
> >>>> to SA? There should be enough variants of spam words to let SA
> >>>> fuzzily catch the ones from images.
> >>>
> >>> I think so.  Some of the words would be perfectly legitimate in the
> >>> text of emails but rarely found in attached legitimate images.
> >>>
> >>> Quite apart from the fact that Spamassassin isn't designed for
> >>> "reinjection".
> >>
> >>
> >> FWIW, 3.2 adds in support to have rendering of non-text parts.  So a
> >> plugin could, for instance, OCR text from an image, and then the
> >> normal body rules and such would be able to use that information.
> >>
> >
> > Would it also be possible to create a rule that matches on text
> > rendered specifically from a non-text part and not the whole body?
> > That way you could get the benefit of Bayes and existing body rules in
> > the general case while still taking advantage of the fact that certain
> > words in an image have more spammy-weight than the same words in text.
> >
> 
> Or perhaps:
> 
> tflags   RULE_NAME   ocr
> 
> 
> /Andreas
> 
