pdf spam solution idea

arni Wed, 27 Jun 2007 18:14:26 -0700

Hi,

It has come up several times now that people ask for a way to detect pdf spam directly by the pdf content, and not only through headers or other means (hashes, bayes). I've found a solution that should be pretty easy to realise in a FuzzyOCR-like plugin. Here is what it should do:

Use xpdf (http://www.foolabs.com/xpdf/download.html) to read the pdf document:

export the images to ppm files using `pdfimages`
export the text parts to a simple text using `pdftotext`

This plugin should run as one of the first, so that the raw text is made available (for example by attaching it as an extra MIME part, or somehow internally) and the images are made available to FuzzyOCR or similar by the same means.
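
To make the two extraction steps concrete, here is a rough sketch in Python (not Perl, and not a SpamAssassin plugin). It assumes the xpdf tools `pdfimages` and `pdftotext` are on the PATH. The function names, the `img` image root, the `body.txt` file name and the timeout are made up for illustration, and handing the results on to the scanner or to FuzzyOCR is left out.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path, out_dir):
    """Return the two xpdf command lines for one PDF attachment.

    "img" and "body.txt" are hypothetical names chosen for this sketch.
    """
    out = Path(out_dir)
    # pdfimages PDF-file image-root  -> writes image-root-NNN.ppm / .pbm
    images_cmd = ["pdfimages", str(pdf_path), str(out / "img")]
    # pdftotext PDF-file text-file   -> writes the text layer as plain text
    text_cmd = ["pdftotext", str(pdf_path), str(out / "body.txt")]
    return images_cmd, text_cmd


def extract_pdf(pdf_path, out_dir, timeout=30):
    """Run both tools and return (text, list of extracted image paths)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(pdf_path, out):
        # check=True raises if the tool fails; the timeout (an assumption)
        # guards against PDFs that take very long to process.
        subprocess.run(cmd, check=True, timeout=timeout)
    text_file = out / "body.txt"
    text = text_file.read_text(errors="replace") if text_file.exists() else ""
    images = sorted(out.glob("img-*"))
    return text, images
```

A caller would do something like `text, images = extract_pdf("attachment.pdf", "/tmp/pdfscan")`, then pass `text` to the normal body rules and the image files to the OCR step.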

Unfortunately I won't be able to write such a plugin myself. It should be rather easy to do, but I can't start learning Perl just for this ;-)


Maybe I gave some hints ...

arni
