Hi,
what I'm going to show you is purely a proof of concept. You should not
use this code in a production environment, because it most likely has
exploitable bugs as well as inaccuracies that will keep it from parsing
all mail properly.
I put this together in about an hour to show how it's possible to
cope with PDF spam: the script completely decodes the PDF attachment
into text and images and reattaches them. That way the text is fully
available to all of SA's processing, and the images are available to
FuzzyOCR, if installed.
The code is PHP, because that's easiest for me to write.
It also has a nice side effect: you can see the text from a
PDF without having to open it ;-)
If someone could make an SA plugin that does the same thing in a clean
and safe manner, that would be great.
arni
<?php
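// Read the whole message and normalize CRLF line endings to LF.
$mail = str_replace("\r\n", "\n", join('', file("test.eml")));
list($header, $body) = explode("\n\n", $mail, 2);
// Take the MIME boundary from the first boundary="..." parameter.
preg_match("/boundary=\"([^\"]*)\"/m", $mail, $border);
$border = $border[1];
// Split the body at the boundary lines; preg_quote() keeps regex
// metacharacters in the boundary from being interpreted.
$parts = preg_split("/-*" . preg_quote($border, "/") . "-*/", $body);
array_shift($parts); // drop the preamble before the first boundary
array_pop($parts);   // drop whatever follows the closing boundary
$mailout = $header . "\n\n";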
foreach ($parts as $part) {
    list($phead, $pbody) = explode("\n\n", $part, 2);
    // Copy the original part through unchanged.
    $mailout .= "--$border";
    $mailout .= $part;
    if (strpos($phead, "pdf") !== false) {
        $binary = base64_decode($pbody);
        // Crude temporary name: only nine possible values, so collisions are likely.
        $tmpname = rand(1, 9);
        $out = fopen("$tmpname.pdf", "w");
        fputs($out, $binary);
        fclose($out);
        // Extract the text as HTML and discard error output.
        exec("pdftotext -htmlmeta -nopgbrk $tmpname.pdf $tmpname.txt 2>/dev/null");
        $text = join('', file("$tmpname.txt"));
        unlink("$tmpname.txt");
        // Attach the extracted text only if it is not empty.
        if (trim(strip_tags($text)) != "") {
            $mailout .= "--$border\n";
            $mailout .= "Content-Type: text/html; charset=\"iso-8859-1\"\nContent-Transfer-Encoding: 8bit\nContent-Disposition: attachment; filename=\"pdftext.htm\"\n\n";
            $mailout .= $text . "\n";
        }
        // Extract embedded images: -j writes JPEG data as .jpg, everything else as .ppm.
        exec("pdfimages -j $tmpname.pdf $tmpname 2>/dev/null");
        $cnt = 0;
        $handle = opendir('.');
        while (($file = readdir($handle)) !== false) {
            if ($file != "." && $file != ".." && is_file($file)) {
                // Only touch files created for this PDF (this includes $tmpname.pdf itself).
                if (substr($file, 0, strlen($tmpname)) == $tmpname) {
                    @list($name, $ext) = explode(".", $file);
                    if ($ext == "ppm") {
                        // ppmtogif writes the GIF to stdout, so redirect it into a file.
                        exec("ppmtogif $file > $file.gif 2>/dev/null");
                        $binary = join('', file("$file.gif"));
                        unlink("$file.gif");
                        $mailout .= "--$border\n";
                        $mailout .= "Content-Type: image/gif\nContent-Transfer-Encoding: base64\nContent-Disposition: attachment; filename=\"pdfimage$cnt.gif\"\n\n";
                        $cnt++;
                        // chunk_split() breaks the base64 data into 76-character lines.
                        $mailout .= chunk_split(base64_encode($binary), 76, "\n");
                    }
                    elseif ($ext == "jpg") {
                        $binary = join('', file($file));
                        $mailout .= "--$border\n";
                        $mailout .= "Content-Type: image/jpeg\nContent-Transfer-Encoding: base64\nContent-Disposition: attachment; filename=\"pdfimage$cnt.jpg\"\n\n";
                        $cnt++;
                        $mailout .= chunk_split(base64_encode($binary), 76, "\n");
                    }
                    unlink($file);
                }
            }
        }
        closedir($handle);
    }
}
// Write the closing boundary and save the rewritten message.
$mailout .= "--$border--\n";
$out = fopen("out.eml", "w");
fputs($out, $mailout);
fclose($out);
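
The temporary file names and the shell commands are probably the easiest places for the "exploitable bugs" mentioned at the top to creep in. What follows is only a rough sketch of a more careful text-extraction step, not a tested replacement. The helper names make_temp_file(), pdftotext_command() and pdf_to_text() are made up for illustration, and it assumes a PHP version that has sys_get_temp_dir() and file_put_contents() (PHP 5.2.1 or later) and a pdftotext binary on the PATH.

```php
<?php
// Hypothetical helper: let PHP create a uniquely named temporary file
// instead of guessing a name in the current directory.
function make_temp_file($prefix = "pdf")
{
    $path = tempnam(sys_get_temp_dir(), $prefix);
    if ($path === false) {
        die("could not create temporary file\n");
    }
    return $path;
}

// Hypothetical helper: build the pdftotext command line with both paths
// shell-quoted, so odd characters in a path cannot alter the command.
function pdftotext_command($pdfPath, $txtPath)
{
    return "pdftotext -htmlmeta -nopgbrk "
        . escapeshellarg($pdfPath) . " "
        . escapeshellarg($txtPath) . " 2>/dev/null";
}

// Hypothetical helper: write the decoded PDF to a temporary file, run
// pdftotext on it and return the extracted HTML ("" if pdftotext failed).
function pdf_to_text($binary)
{
    $pdf = make_temp_file("pdf");
    $txt = make_temp_file("txt");
    file_put_contents($pdf, $binary);
    exec(pdftotext_command($pdf, $txt), $unused, $status);
    $text = ($status === 0) ? (string) file_get_contents($txt) : "";
    unlink($pdf);
    unlink($txt);
    return $text;
}
```

In the script above this would stand in for the rand()/fopen()/exec() sequence, roughly as $text = pdf_to_text(base64_decode($pbody));. The image extraction would need the same treatment, which is not shown here.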