Hi,

What I'm going to show you is purely a proof of concept. You should not use this code in a production environment, because it most likely has exploitable bugs as well as inaccuracies that will keep it from parsing all mail properly.

I put this together in around an hour to show how it's possible to cope with PDF spam. The script completely decodes the PDF attachment into text and images and reattaches them. This way the text is fully available to all of SpamAssassin's processing, and the images to FuzzyOCR, if installed.
The code is PHP, because that's easiest for me to write.

It also has a nice side effect: you can see the text from a PDF without having to open it ;-)

If someone could make an SA plugin that does the same thing in a clean and safe manner, that would be great.
arni
<?php

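// Read the raw message and normalize CRLF line endings to LF.
$mail = str_replace("\r\n", "\n", join('', file("test.eml")));

// Split the message into its header block and its body.
list($header, $body) = explode("\n\n", $mail, 2);

// Extract the MIME boundary; preg_quote() keeps its characters literal in the split pattern.
preg_match("/boundary=\"([^\"]*)\"/m", $mail, $border);
$border = $border[1];
$parts = preg_split("/-*" . preg_quote($border, "/") . "-*/", $body);

// Drop the preamble before the first boundary and the epilogue after the last one.
array_shift($parts);
array_pop($parts);

$mailout = $header . "\n\n";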

foreach ($parts as $part) {
   list($phead, $pbody) = explode("\n\n", $part, 2);

   // Copy the original part through unchanged.
   $mailout .= "--$border";
   $mailout .= $part;

   if (strpos($phead, "pdf") !== false) {
      // Decode the attachment and write it to a temporary file.
      $binary = base64_decode($pbody);
      $tmpname = rand(10000, 99999);
      $out = fopen("$tmpname.pdf", "w");
      fputs($out, $binary);
      fclose($out);

      // Extract the text as HTML and attach it if it is not empty.
      exec("pdftotext -htmlmeta -nopgbrk $tmpname.pdf $tmpname.txt 2> /dev/null");
      $text = join('', file("$tmpname.txt"));
      unlink("$tmpname.txt");
      if (trim(strip_tags($text)) != "") {
         $mailout .= "--$border\n";
         $mailout .= "Content-Type: text/html; charset=\"iso-8859-1\"\nContent-Transfer-Encoding: 8bit\nContent-Disposition: attachment; filename=\"pdftext.htm\"\n\n";
         $mailout .= $text."\n";
      }
      
      // Extract the embedded images; with -j, JPEG data is written as .jpg files.
      exec("pdfimages -j $tmpname.pdf $tmpname 2> /dev/null");
      $cnt = 0;

      // Scan the working directory for the files created for this attachment.
      $handle = opendir('.');
      while (($file = readdir($handle)) !== false) {
          if ($file != "." && $file != ".." && is_file($file)) {
             if (substr($file, 0, strlen($tmpname)) == $tmpname) {
                @list($name, $ext) = explode(".", $file);
                if ($ext == "ppm") {
                   // Convert PPM images to GIF before attaching them.
                   exec("ppmtogif $file > $file.gif 2> /dev/null");
                   $binary = join('', file("$file.gif"));
                   unlink("$file.gif");
                   $mailout .= "--$border\n";
                   $mailout .= "Content-Type: image/gif\nContent-Transfer-Encoding: base64\nContent-Disposition: attachment; filename=\"pdfimage$cnt.gif\"\n\n";
                   $cnt++;
                   $mailout .= wordwrap(base64_encode($binary), 76, "\n", 1)."\n";
                }
                elseif ($ext == "jpg") {
                   // JPEG images are attached as they are.
                   $binary = join('', file($file));
                   $mailout .= "--$border\n";
                   $mailout .= "Content-Type: image/jpeg\nContent-Transfer-Encoding: base64\nContent-Disposition: attachment; filename=\"pdfimage$cnt.jpg\"\n\n";
                   $cnt++;
                   $mailout .= wordwrap(base64_encode($binary), 76, "\n", 1)."\n";
                }
                // Remove every temporary file, including the decoded PDF itself.
                unlink($file);
             }
          }
      }
      closedir($handle);
   }
}

// Terminate the multipart body and write out the rewritten message.
$mailout .= "--$border--\n";

$out = fopen("out.eml", "w");
fputs($out, $mailout);
fclose($out);
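
The script above repeats the same pattern for every extracted image: a boundary line, three headers, a blank line, then the base64 data. A minimal sketch of how that step could be factored out follows. The function name build_attachment_part and its parameters are hypothetical and not part of the original script, and chunk_split() is swapped in here for the wordwrap() call used above.

```php
<?php
// Hypothetical helper (an illustration, not the original author's code):
// builds one MIME part that carries $data as a base64 attachment.
function build_attachment_part($boundary, $contentType, $filename, $data)
{
    $head = "--$boundary\n"
          . "Content-Type: $contentType\n"
          . "Content-Transfer-Encoding: base64\n"
          . "Content-Disposition: attachment; filename=\"$filename\"\n\n";

    // chunk_split() breaks the base64 text into 76-character lines and
    // appends the separator after every chunk, including the last one.
    return $head . chunk_split(base64_encode($data), 76, "\n");
}

// Usage with made-up sample values.
$part = build_attachment_part("XYZ", "image/gif", "pdfimage0.gif", "GIF89a");
echo $part;
```

With a helper like this, the ppm and jpg branches would differ only in the content type and file name they pass in.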
