https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8107
Kent Oyer <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #2 from Kent Oyer <[email protected]> ---
There are a few PDF parsing modules already, but they are overkill for what we need. I have written a more streamlined parser that just inspects images and URIs and is configured to stop after page 1. I still rely on ExtractText to pull out the text.

Regarding points 1, 2, & 3, they would not be an indicator of spam/ham by themselves but would be used in conjunction with other rules. The current plugin has rules such as "pdf_image_to_text_ratio" and "pdf_image_size_range", which are also weak indicators when used by themselves. IMHO, it would be better to know the percentage of page area taken up by images rather than the raw number of pixels.

There are only a handful of rules in the stock ruleset that use this plugin. Most of the rules in 20_pdfinfo.cf are commented out, and the remaining ones date from circa 2007. Is there a way to see the effectiveness of the rules in the stock ruleset? IIRC there used to be a way to see the hit frequencies from the nightly mass check somewhere.

In my setup, these rules are not very effective:

GMD_PDF_HORIZ          Contains pdf 100-240 (high) x 450-800 (wide)
GMD_PDF_SQUARE         Contains pdf 180-360 (high) x 180-360 (wide)
GMD_PDF_VERT           Contains pdf 450-800 (high) x 100-240 (wide)
GMD_PRODUCER_GPL       PDF producer was GPL Ghostscript
GMD_PRODUCER_POWERPDF  PDF producer was PowerPDF
GMD_PRODUCER_EASYPDF   PDF producer was BCL easyPDF

They have a low positive score but hit more ham than spam. Is anyone having success with these rules?

The KAM ruleset only includes this one:

describe KAM_BADPDF1 Prevalent Junk PDF SPAMs - EMPTY BODY & ENCRYPTED
score    KAM_BADPDF1 2.5
meta     KAM_BADPDF1 (GMD_PDF_EMPTY_BODY + GMD_PDF_ENCRYPTED >= 2)

I would be curious to know how this rule is working in your environment.

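As a rough illustration of the page-area idea mentioned above (not the plugin's actual code, and in Python rather than Perl purely for brevity): in a PDF, an image is painted into a unit square that is scaled by the current transformation matrix "a b c d e f", so its rendered area in points is |a*d - b*c|, and the page area comes from the MediaBox. The function name and argument names below are hypothetical, and the sketch assumes the MediaBox and the image matrices have already been extracted by a parser. It ignores clipping, overlapping images, and images that extend past the page edge, so it is only an approximation.

```python
from typing import Iterable, Sequence


def image_area_percent(
    media_box: Sequence[float],
    image_matrices: Iterable[Sequence[float]],
) -> float:
    """Approximate percentage of a page covered by images.

    media_box      -- (llx, lly, urx, ury) in points, as in the page's /MediaBox
    image_matrices -- one (a, b, c, d, e, f) matrix per image, i.e. the CTM
                      in effect when the image was painted (assumed input)
    """
    llx, lly, urx, ury = media_box
    page_area = abs(urx - llx) * abs(ury - lly)
    if page_area == 0:
        return 0.0

    # An image fills the unit square, so its rendered area is the absolute
    # value of the determinant of the matrix's linear part.
    covered = sum(abs(a * d - b * c) for a, b, c, d, _e, _f in image_matrices)

    # Overlapping images are counted twice, so clamp at 100%.
    return min(100.0, 100.0 * covered / page_area)


if __name__ == "__main__":
    # US Letter page (612 x 792 pt) with one 306 x 396 pt image.
    print(image_area_percent((0, 0, 612, 792), [(306, 0, 0, 396, 0, 0)]))
```

A value like this could feed a rule in the same way the existing pixel-count rules do, but it would stay comparable across page sizes and image resolutions.
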
Regarding point 5, yes, I have examples of PDFs that are encrypted with a blank password. Most PDF readers will seamlessly open the PDF without prompting for a password, so to the user it seems like a normal PDF. But the data is not visible to SA without decrypting it first.

Thanks
Kent
