PDF Decoder - Show of concept

2007-07-11 Thread arni

Hi,

What I'm going to show you is purely a proof of concept - there is 
no way you should use the code in a production environment, because it 
most likely has exploitable bugs as well as inaccuracies that will keep 
it from parsing all mail properly.


I put this together in around an hour to show how it's possible to 
cope with PDF spam - the script completely decodes the PDF attachment 
into text and images and reattaches them. This way the text is fully 
available to all of SA's processing, and the images to 
FuzzyOCR, if installed.

The code is PHP, because that's easiest for me to write.

It also has a nice side effect: you are able to see the text from a 
PDF without having to open it ;-)


If someone could make an SA plugin that does the same thing in a clean 
and safe manner, that would be great,

arni

<?php
// Proof of concept only: not safe for production use.

// Read the message and normalize CRLF line endings to LF.
$mail = str_replace("\r\n", "\n", join('', file("test.eml")));

// Split headers from body and extract the MIME boundary.
list($header, $body) = explode("\n\n", $mail, 2);
preg_match("/boundary=\"([^\"]*)\"/m", $mail, $border);

$border = $border[1];
// preg_quote() keeps regex metacharacters in the boundary from being interpreted.
$parts = preg_split("/-*" . preg_quote($border, "/") . "-*/", $body);

// Drop the preamble before the first boundary and the epilogue after the last.
array_shift($parts);
array_pop($parts);

$mailout = $header . "\n\n";

foreach ($parts as $part) {
    list($phead, $pbody) = explode("\n\n", $part, 2);

    // Pass the original part through unchanged.
    $mailout .= "--$border";
    $mailout .= $part;

    if (strpos($phead, "pdf") !== false) {
        // Write the decoded PDF to a temporary file in the current directory.
        $binary = base64_decode($pbody);
        $tmpname = uniqid("pdfdec");
        $out = fopen("$tmpname.pdf", "w");
        fputs($out, $binary);
        fclose($out);

        // Extract the text as HTML and attach it if it is not empty.
        exec("pdftotext -htmlmeta -nopgbrk $tmpname.pdf $tmpname.txt 2>/dev/null");
        $text = "";
        if (is_file("$tmpname.txt")) {
            $text = join('', file("$tmpname.txt"));
            unlink("$tmpname.txt");
        }
        if (trim(strip_tags($text)) != "") {
            $mailout .= "--$border\n";
            $mailout .= "Content-Type: text/html; charset=\"iso-8859-1\"\nContent-Transfer-Encoding: 8bit\nContent-Disposition: attachment; filename=\"pdftext.htm\"\n\n";
            $mailout .= $text . "\n";
        }

        // Extract the images: JPEG data is saved as .jpg, other images as .ppm
        // (or .pbm, which is deleted below without being attached).
        exec("pdfimages -j $tmpname.pdf $tmpname 2>/dev/null");
        $cnt = 0;
        $handle = opendir('.');
        while (($file = readdir($handle)) !== false) {
            if ($file != "." && $file != ".." && is_file($file)) {
                // Only touch files that belong to this PDF (including the PDF itself).
                if (substr($file, 0, strlen($tmpname)) == $tmpname) {
                    @list($name, $ext) = explode(".", $file);
                    if ($ext == "ppm") {
                        // Convert PPM to GIF before attaching it.
                        exec("ppmtogif $file > $file.gif 2>/dev/null");
                        $binary = join('', file("$file.gif"));
                        unlink("$file.gif");
                        $mailout .= "--$border\n";
                        $mailout .= "Content-Type: image/gif\nContent-Transfer-Encoding: base64\nContent-Disposition: attachment; filename=\"pdfimage$cnt.gif\"\n\n";
                        $cnt++;
                        $mailout .= wordwrap(base64_encode($binary), 76, "\n", true) . "\n";
                    } elseif ($ext == "jpg") {
                        $binary = join('', file($file));
                        $mailout .= "--$border\n";
                        $mailout .= "Content-Type: image/jpeg\nContent-Transfer-Encoding: base64\nContent-Disposition: attachment; filename=\"pdfimage$cnt.jpg\"\n\n";
                        $cnt++;
                        $mailout .= wordwrap(base64_encode($binary), 76, "\n", true) . "\n";
                    }
                    // Clean up the temporary file.
                    unlink($file);
                }
            }
        }
        closedir($handle);
    }
}

// Close the multipart message and write the result.
$mailout .= "--$border--\n";

$out = fopen("out.eml", "w");
fputs($out, $mailout);
fclose($out);
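
The boundary-splitting step near the top of the script can also be sketched in isolation. This is only a minimal sketch under my own assumptions: split_mime_parts() is a hypothetical helper name, and the sample body and boundary are made up for illustration. It anchors the match at the start of a line and passes the boundary through preg_quote(), so that characters like "+" or "." in a boundary are matched literally.

```php
<?php
// Hypothetical helper (not part of the original script): split a multipart
// body into its parts on the given MIME boundary.
function split_mime_parts($body, $boundary)
{
    // A delimiter line is "--boundary", optionally followed by "--" for the
    // closing delimiter. preg_quote() escapes regex metacharacters.
    $pattern = "/^--" . preg_quote($boundary, "/") . "(?:--)?[ \t]*$/m";
    $chunks = preg_split($pattern, $body);

    // The first chunk is the preamble, the last one the epilogue.
    array_shift($chunks);
    array_pop($chunks);

    return $chunks;
}

// Made-up sample body with a boundary that contains regex metacharacters.
$body = "preamble\n"
      . "--=_a+b.c\n"
      . "Content-Type: text/plain\n\nhello\n"
      . "--=_a+b.c\n"
      . "Content-Type: application/pdf\n\nJVBERi0=\n"
      . "--=_a+b.c--\n";

$parts = split_mime_parts($body, "=_a+b.c");
echo count($parts), "\n";
```

As in the script, each returned part still starts with the newline that followed its delimiter line, so splitting a part on the first "\n\n" separates its headers from its content.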



Re: PDF Decoder - Show of concept

2007-07-11 Thread Theo Van Dinter
On Thu, Jul 12, 2007 at 04:00:33AM +0200, arni wrote:
> I put this together in around an hour to show how it's possible to 
> cope with PDF spam - the script completely decodes the PDF attachment 
> into text and images and reattaches them. This way the text is fully 
> available to all of SA's processing, and the images to 
> FuzzyOCR, if installed.

Please don't do that (adding in new message parts), btw.  There's a 3.2
plugin call (post_message_parse, per bug 5069) which was specifically
added such that plugins can manipulate messages after the initial parse
has completed.  This allows for things like OCR of images and PDF-text,
and the rendered text can go right in the message part, and then gets
included automatically by SA as body text and so is available for body
rules, URI parsing, etc.


-- 
Randomly Selected Tagline:
Never go off on tangents, which are lines that intersect a curve at only
 one point and were discovered by Euclid, who live in the 6th century,
 which was an era dominated by the Goths, who lived in what we now know
 as Poland. - Unknown from Nov. 1998 issue of Infosystems Executive.

