[Bug 8107] Change how PDF's are parsed with the PDFInfo plugin

bugzilla-daemon Thu, 19 Jan 2023 22:18:19 -0800

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8107


Kevin A. McGrail <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #1 from Kevin A. McGrail <[email protected]> ---
Certainly the image detection failure is a good thing to work on.  Is there a
good module for PDF parsing as you describe?

Re: The additional features, here are my thoughts:

1. mask images

KAM: not sure this will be an indicator of spam/ham

2. scaling 

KAM: not sure this will be an indicator of spam/ham

3. Images used multiple times 

KAM: not sure this will be an indicator of spam/ham


4. We could prioritize content on page 1 (or simply ignore content on all other
pages). Spammers usually put the payload on page 1 and if there are other
pages, it's only there to confuse the filters.

KAM: This sounds like an interesting efficiency trade-off that could be very
useful

5. Access images and URIs located in binary data. 

KAM: Are there PDFs avoiding scanning using this technique?
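
To make item 5 concrete, here is a rough sketch of how a URI can end up
inside binary data and still be recovered. It is written in Python purely for
illustration; it is not the PDFInfo plugin's code, and the function name
find_uris and both regular expressions are assumptions made up for this
example. It assumes the URI sits in a Flate (zlib) compressed stream and is
written as a literal string, and it ignores other filters, object streams,
hex strings, and encryption.

```python
import re
import zlib

# Hypothetical helper patterns, simplified for illustration.
# Body of a PDF stream object, between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A literal-string URI entry such as: /URI (http://example.com/)
URI_RE = re.compile(rb"/URI\s*\((.*?)\)")


def find_uris(pdf_bytes):
    """Return URIs seen in the raw bytes and in zlib-compressed streams."""
    chunks = [pdf_bytes]
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes that follow
            # the compressed data before "endstream".
            chunks.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            # Not Flate data (image data, another filter, ...): skip it.
            continue
    uris = []
    for chunk in chunks:
        for raw in URI_RE.findall(chunk):
            uri = raw.decode("latin-1")
            if uri not in uris:
                uris.append(uri)
    return uris
```

A scanner that only searches the raw file bytes would miss such a URI, since
it appears in readable form only after the stream is decompressed.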


Re: I've already started working on this and I think it's doable but I don't
want to duplicate work if someone else is already working on it. 

I'm not aware of anything in progress and we love new blood.

Re: I would also like feedback on whether this should be a drop-in replacement
or a totally new plugin. 

My main question to help answer that is: how would it affect the stock
ruleset? What changes would people need to make? For example, are there any
affected rules in the KAM Ruleset?

-- 
You are receiving this mail because:
You are the assignee for the bug.
