[Bug 8107] Change how PDF's are parsed with the PDFInfo plugin

bugzilla-daemon Fri, 20 Jan 2023 15:28:11 -0800

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8107


Kent Oyer <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #2 from Kent Oyer <[email protected]> ---
There are a few PDF parsing modules already, but they are overkill for what we
need. I have written a more streamlined parser that inspects only images and
URIs and is configured to stop after page 1. I still rely on ExtractText to
pull out the text.

Regarding points 1, 2, and 3: they would not be indicators of spam or ham by
themselves but would be used in conjunction with other rules. The current
plugin has rules such as "pdf_image_to_text_ratio" and "pdf_image_size_range",
which are also weak indicators when used by themselves. IMHO, it would be
better to know the percentage of page area taken up by images rather than the
raw number of pixels.
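
To illustrate the difference between raw pixel counts and page coverage, here
is a minimal sketch in Python (not the plugin's actual code). It assumes the
parser can supply the page's MediaBox and, for each image drawn on the page,
the transformation matrix in effect at that point. The function name and the
input layout are hypothetical. A PDF image is painted into a unit square that
the matrix maps onto the page, so its on-page area is the absolute value of
the matrix determinant.

```python
def image_area_fraction(media_box, image_matrices):
    """Return the fraction of the page area covered by images.

    media_box      -- (x0, y0, x1, y1) in PDF user-space units (hypothetical input)
    image_matrices -- iterable of (a, b, c, d, e, f) matrices, one per image

    Simplification: overlapping images and clipping are ignored, so the
    raw sum can overshoot; the result is capped at 1.0.
    """
    x0, y0, x1, y1 = media_box
    page_area = abs(x1 - x0) * abs(y1 - y0)
    if page_area == 0:
        return 0.0
    # Area of the unit square after applying [a b c d e f] is |a*d - b*c|.
    covered = sum(abs(a * d - b * c) for (a, b, c, d, e, f) in image_matrices)
    return min(covered / page_area, 1.0)


# A US Letter page (612 x 792) with one image scaled to 306 x 396 units.
fraction = image_area_fraction((0, 0, 612, 792), [(306, 0, 0, 396, 0, 0)])
```

The same image could have any pixel dimensions, which is why a coverage figure
says more about what the reader sees than the pixel count does.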

There are only a handful of rules in the stock ruleset that use this plugin.
Most of the rules in 20_pdfinfo.cf are commented out, and the remaining ones
date from around 2007. Is there a way to see the effectiveness of the rules in
the stock ruleset? IIRC there used to be a way to see the hit frequencies from
the nightly mass check somewhere. In my setup, these rules are not very
effective:

GMD_PDF_HORIZ          Contains pdf 100-240 (high) x 450-800 (wide)
GMD_PDF_SQUARE         Contains pdf 180-360 (high) x 180-360 (wide)
GMD_PDF_VERT           Contains pdf 450-800 (high) x 100-240 (wide)
GMD_PRODUCER_GPL       PDF producer was GPL Ghostscript
GMD_PRODUCER_POWERPDF  PDF producer was PowerPDF
GMD_PRODUCER_EASYPDF   PDF producer was BCL easyPDF

They have a low positive score but hit more ham than spam. Is anyone having
success with these rules?

The KAM ruleset only includes this one:

describe   KAM_BADPDF1  Prevalent Junk PDF SPAMs - EMPTY BODY & ENCRYPTED
score      KAM_BADPDF1  2.5
meta       KAM_BADPDF1  (GMD_PDF_EMPTY_BODY + GMD_PDF_ENCRYPTED >= 2)
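
For readers unfamiliar with this meta idiom, the sum-and-compare form acts as
a logical AND. A minimal sketch in Python, assuming each sub-rule evaluates to
1 on a hit and 0 otherwise (the function name and the dictionary input are
hypothetical, for illustration only):

```python
def kam_badpdf1(hits):
    """Mimic the meta expression above.

    hits maps rule name -> value, assumed to be 1 on a hit and 0 otherwise.
    """
    total = hits.get("GMD_PDF_EMPTY_BODY", 0) + hits.get("GMD_PDF_ENCRYPTED", 0)
    # Under the 0/1 assumption the sum reaches 2 only when both rules hit.
    return total >= 2


both = kam_badpdf1({"GMD_PDF_EMPTY_BODY": 1, "GMD_PDF_ENCRYPTED": 1})
only_encrypted = kam_badpdf1({"GMD_PDF_ENCRYPTED": 1})
```

Here `both` is true and `only_encrypted` is false, so the 2.5 score applies
only to encrypted PDFs that also arrive with an empty body.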

I would be curious to know how this rule is working in your environment. 

Regarding point 5: yes, I have examples of PDFs that are encrypted with a
blank password. Most PDF readers will open such a PDF seamlessly without
prompting for a password, so to the user it seems like a normal PDF. But the
data is not visible to SA without decrypting it first.
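
As a rough illustration of why such files are opaque to a scanner, an
encrypted PDF announces itself with an /Encrypt entry in its trailer, while
its strings and streams stay unreadable until decrypted. The Python sketch
below only detects that entry with a crude byte-level pattern; it does not
decrypt anything, it is not how the plugin works, and it may misjudge unusual
files. The function name is hypothetical.

```python
import re

# /Encrypt followed by an indirect reference ("9 0 R") or an inline dictionary.
# This does not match the unrelated /EncryptMetadata key.
_ENCRYPT_RE = re.compile(rb"/Encrypt\s*(?:\d+\s+\d+\s+R|<<)")


def looks_encrypted(pdf_bytes):
    """Heuristic: True if the raw bytes appear to declare an /Encrypt entry."""
    return _ENCRYPT_RE.search(pdf_bytes) is not None


# Synthetic trailer fragments, not real PDF files.
encrypted = looks_encrypted(b"trailer\n<< /Size 10 /Root 1 0 R /Encrypt 9 0 R >>")
plain = looks_encrypted(b"trailer\n<< /Size 10 /Root 1 0 R >>")
```

Detecting the entry is the easy part. Reading the content of a file protected
with a blank password still means running the PDF standard security handler
with an empty user password, which is what readers do silently.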

Thanks
Kent

