According to Robert Isaac:
> Some of the search results of pdf files have the following :
> 
> CONTENTS 164 eb ild Nona p oblem on m l i al e engine , headligh b lb A 
> icle b USA membe 740 inhibi oi ch, c een a he fa l 850 e ice ligh e e P op 
> haf 700 c een a he fl id leakage 940 coolan lo 120 ca e 240 ea facing eain 
> o a 145 Rp oofing oVol o B akeem o e ha l 850 glo e bo fa l L meni ion o o 
> p ...
> 
> It appears some of the text is missing and letters and words are jumbled 
> up. It is only on a few. Any ideas what causes this?
> 
> htdig 3.1.6 on Cobalt RaQ550

I've seen this in one PDF we generated from a Corel Draw file.
doc2html.pl and conv_doc.pl run pdftotext with the -raw option, which
outputs the text in the order in which the original application spat out
the text to the PostScript printer driver (which Acrobat Distiller grabs
to put into the PDF).  Some applications spit out text in a funny order,
which doesn't matter for printing as long as the letters wind up at
the right coordinates on the page.  However, this doesn't produce ideal
results when indexing these PDF files.

I've found that running pdftotext without -raw will fix this problem, but
can introduce even worse problems with other PDF files (especially those
with multicolumn text).  The latest 2.01 release of xpdf is supposed
to have a pdftotext utility that works much better for this purpose,
without the -raw option, but I haven't tried it yet.  If you want to
give that a shot, you can get it from http://www.foolabs.com/xpdf/
if you don't have xpdf 2.01 already.
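
To see which mode suits a given PDF, one option is to extract it both
ways and compare the results by eye. The following is only a rough sketch,
not what doc2html.pl or conv_doc.pl actually do: the function names are
made up for illustration, it assumes pdftotext is on the PATH, and it
assumes the output can be read as ISO-8859-1 (newer builds may emit a
different encoding by default).

```python
import shutil
import subprocess


def pdftotext_cmd(pdf_path, raw=True):
    """Build a pdftotext command line that writes the text to stdout."""
    cmd = ["pdftotext"]
    if raw:
        # -raw keeps the text in the order it appears in the PDF content stream
        cmd.append("-raw")
    # Giving "-" as the text file name makes pdftotext write to stdout
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, raw=True):
    """Run pdftotext; return the text, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(pdftotext_cmd(pdf_path, raw),
                            capture_output=True, check=True)
    # Assumption: treat the output as ISO-8859-1, which never fails to decode
    return result.stdout.decode("latin-1")
```

Calling extract_text("problem.pdf", raw=True) and
extract_text("problem.pdf", raw=False) on one of the affected files
(the file name here is a placeholder) and reading the first few hundred
characters of each should show fairly quickly which mode gives usable text.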

Another possibility is that this PDF file uses a strange (non-standard)
font encoding which pdftotext has trouble mapping back to ISO-8859-1
characters.  I've seen one PDF where all the 'v' characters in one
particular font were mapped to the wrong letter in the text output.
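
Since only a few PDFs are affected, it may help to flag the suspect ones
automatically before deciding how to convert them. The sketch below is
a crude heuristic of my own, not anything built into ht://Dig or xpdf:
it measures how many alphabetic words in the extracted text are a single
letter long, which is unusually high in the excerpt quoted above. The
function names and the 0.2 threshold are arbitrary assumptions and would
need tuning against real documents.

```python
def single_letter_ratio(text):
    """Return the fraction of purely alphabetic tokens that are one letter long."""
    words = [tok for tok in text.split() if tok.isalpha()]
    if not words:
        return 0.0
    return sum(1 for w in words if len(w) == 1) / len(words)


def looks_garbled(text, threshold=0.2):
    """Guess whether extracted text has dropped letters (threshold is arbitrary)."""
    return single_letter_ratio(text) > threshold


# First line of the garbled excerpt from the original report
garbled = ("CONTENTS 164 eb ild Nona p oblem on m l i al e engine , "
           "headligh b lb A")
clean = "It appears some of the text is missing"

print(looks_garbled(garbled), looks_garbled(clean))
```

Normal English has few one-letter words ("a", "I"), so a high ratio
suggests that letters were dropped or misplaced during extraction. It
will not tell the two causes apart, only point to files worth a closer look.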

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html