> > the unix program pdf2text can convert keeping the text places, but I wanted > > to ask you guys if you know something better, > > AFAIK, PDFBox has a lower-level API that allows you to get hold of text > positions.
In UpLib, I use xpdf-3.02pl2 with a patch which gives me position and font information for each word. You can get the xpdf sources from http://www.foolabs.com/xpdf/, and the patch file is at http://uplib.parc.com/misc/xpdf-3.02-PATCH. To extract the byte positions, use pdftotext with the "-wordboxes" switch, and see the pdftotext man page for more info. This is run automatically in UpLib before the indexing is done. Bill --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]