Dallas Engelken wrote, on 14/07/07 12:17 AM:
James MacLean wrote:
Hi folks,

Regrets if this is the wrong list.

Wanted to be able to score on text found in PDF files. Did not see any obvious route, so I made a plugin that calls Xpdf's pdfinfo and pdftotext to extract the text, which is then scored.
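
A minimal sketch of the external-tool approach described above, written in Python for illustration rather than the plugin's Perl. It assumes that pdftotext accepts "-" as the output file and then writes the extracted text to stdout. The function name pdf_to_text, the timeout, and the injectable run parameter are hypothetical and are not taken from PDFText.pm.

```python
import subprocess

def pdf_to_text(pdf_path, pdftotext_cmd="/usr/local/bin/pdftotext",
                timeout=30, run=subprocess.run):
    """Return the text of pdf_path, or "" if extraction fails.

    Assumption: `pdftotext FILE -` writes the extracted text to stdout.
    """
    try:
        result = run([pdftotext_cmd, pdf_path, "-"],
                     capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary or a hung conversion: treat as "no text".
        return ""
    if result.returncode != 0:
        return ""
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the command as a list avoids shell quoting problems with attachment file names, and the timeout keeps a malformed PDF from stalling the scanner.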

A sample local.cf could be:

pdftotext_cmd /usr/local/bin/pdftotext
pdfinfo_cmd /usr/local/bin/pdfinfo
body PDF_TO_TEXT eval:check_pdftext("^Error","sex","drugs",'Title:\s+stock_tmp.pdf:4','Creator:\s+OpenOffice.org 1.1.4:4')

Note that a :4 suffix gives a match of that regex 4 points.
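
The suffix convention can be sketched as follows, in Python for illustration rather than the plugin's Perl. The names parse_pattern and score_text are hypothetical. The default of 1 point for a pattern without a suffix, summing the points of all matching patterns, and case-sensitive matching are all assumptions; the message above only states what :4 does.

```python
import re

def parse_pattern(spec, default_points=1.0):
    """Split 'REGEX:N' into (compiled regex, points).

    Assumption: a pattern without a numeric ':N' suffix counts
    default_points; the plugin's real default is not stated here.
    """
    m = re.fullmatch(r"(.*):(\d+(?:\.\d+)?)", spec, flags=re.DOTALL)
    if m:
        return re.compile(m.group(1), re.MULTILINE), float(m.group(2))
    return re.compile(spec, re.MULTILINE), default_points

def score_text(text, specs):
    """Sum the points of every pattern that matches text at least once."""
    total = 0.0
    for spec in specs:
        regex, points = parse_pattern(spec)
        if regex.search(text):
            total += points
    return total
```

Only a trailing colon followed by a number is treated as a score, so the colon inside a pattern such as Title:\s+stock_tmp.pdf:4 stays part of the regex.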

Really don't know if this was the right road to follow, as I copied the AntiVirus.pm and came up with this:
http://support.ednet.ns.ca/SpamAssassin/PDFText.pm

So far... it appears to work as expected and didn't take down a pretty busy server ;).

Would enjoy hearing any positive criticism :).

I did this the other day with CAM::PDF, but Theo recommended that this work be done in the post_message_parse() plugin call. Then you could just write body rules against the text, URIs would get checked by the URIDNSBL plugin, etc.
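
A rough sketch of that design idea, in Python for illustration only. It does not use the SpamAssassin plugin API: rendered_text, run_body_rules, find_uris, and the rule names in the usage note are hypothetical. The point it shows is that once the extracted PDF text is folded into the message text, ordinary rules and URI checks apply to it with no PDF-specific rule syntax.

```python
import re

# Deliberately simple URI pattern, good enough for a sketch.
URI_RE = re.compile(r"https?://[^\s<>\"']+")

def rendered_text(body, attachment_texts):
    """Join the message body with text extracted from its attachments."""
    return "\n".join([body, *attachment_texts])

def run_body_rules(text, rules):
    """Return the names of all rules whose regex matches text."""
    return [name for name, pattern in rules.items()
            if re.search(pattern, text)]

def find_uris(text):
    """Collect URIs so a separate blocklist check could inspect them."""
    return URI_RE.findall(text)
```

For example, with text = rendered_text("Hi", ["Buy stock now http://example.com/x"]), a rule set such as {"STOCK": r"\bstock\b"} would match, and find_uris(text) would hand the embedded link to whatever URI check runs next.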

--
Dallas Engelken
[EMAIL PROTECTED]
http://uribl.com

I did start with keeping it all in Perl, but when I tested my first spam with the CAM::PDF utils, it resulted in just a bunch of space-separated letters :(. Interested in getting something working, I switched to the Xpdf utils. Maybe getpdftext.pl is not a good example of how the modules work?

Where do I find information on hooking into post_message_parse()? Tried grepping in the module area with no luck :(. Certainly agree it would be better to get the text out and let everyone at it :). I couldn't see how to do that when I started down this road. I was even first trying to see if Exim would add another attachment to the e-mail which would be the output of pdftotext, but again, wanted to get something running, so opted for what is there now :(.

Thanks,
JES
