According to Robert Isaac:
> Some of the search results of pdf files have the following :
>
> CONTENTS 164 eb ild Nona p oblem on m l i al e engine , headligh b lb A
> icle b USA membe 740 inhibi oi ch, c een a he fa l 850 e ice ligh e e P op
> haf 700 c een a he fl id leakage 940 coolan lo 120 ca e 240 ea facing eain
> o a 145 Rp oofing oVol o B akeem o e ha l 850 glo e bo fa l L meni ion o o
> p ...
>
> It appears some of the text is missing and letters and words are jumbled
> up. It is only on a few. Any ideas what causes this?
>
> htdig 3.1.6 on Cobalt RaQ550

I've seen this in one PDF we generated from a Corel Draw file. doc2html.pl and conv_doc.pl run pdftotext with the -raw option, which outputs the text in the order in which the original application spat it out to the PostScript printer driver (which Acrobat Distiller grabs to put into the PDF). Some applications spit out text in a funny order, which doesn't matter for printing as long as the letters wind up at the right coordinates on the page. However, it doesn't produce ideal results when indexing these PDF files.

I've found that running pdftotext without -raw will fix this problem, but it can introduce even worse problems with other PDF files (especially those with multicolumn text). The latest 2.01 release of xpdf is supposed to have a pdftotext utility that works much better for this purpose, without the -raw option, but I haven't tried it yet. If you want to give that a shot, you can get it from http://www.foolabs.com/xpdf/ if you don't have xpdf 2.01 already.

Another possibility is that this PDF file uses a strange (non-standard) font encoding which pdftotext has trouble mapping back to ISO-8859-1 characters. I've seen one PDF where all the 'v' characters in one particular font were mapped to the wrong letter in the text output.

--
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
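
One way to check whether -raw is the culprit for a given PDF is to run pdftotext both ways and compare the output. The following is only a rough sketch in Python, not part of doc2html.pl or conv_doc.pl: the function names are made up for illustration, and the "short token" measure is a hypothetical heuristic based on the fragmented words in the sample above. It assumes pdftotext is on the PATH and that "-" as the output file sends the text to standard output.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, raw=True):
    """Return the argument list for pdftotext, writing the text to stdout ("-")."""
    cmd = ["pdftotext"]
    if raw:
        cmd.append("-raw")  # keep the order in which text appears in the PDF
    cmd += [pdf_path, "-"]
    return cmd


def short_token_ratio(text):
    """Hypothetical heuristic: share of whitespace-separated tokens of 1-2 characters.

    Jumbled output like the sample tends to contain many tiny word fragments.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if len(t) <= 2) / len(tokens)


def compare_modes(pdf_path):
    """Run pdftotext with and without -raw and return {mode: short_token_ratio}."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    results = {}
    for raw in (True, False):
        proc = subprocess.run(
            build_pdftotext_cmd(pdf_path, raw), capture_output=True, check=True
        )
        # Assumption: decode as Latin-1, which accepts any byte sequence;
        # adjust if your pdftotext emits another encoding.
        text = proc.stdout.decode("latin-1")
        results["raw" if raw else "default"] = short_token_ratio(text)
    return results
```

Calling compare_modes("problem.pdf") (a placeholder file name) returns one ratio per mode. A much lower ratio without -raw hints that text ordering is the problem. Similar ratios in both modes point more towards the font-encoding explanation, though the heuristic is crude and no substitute for reading the extracted text.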

