From: Christian Blume <chr.bl...@gmail.com>

> I've been using the podofotxtextract tool to test PoDoFo but when I
> run it on a PDF I get many words split up into pieces. It seems
> impossible to put the words back together with the information that
> PoDoFo (or the PDF itself?) provides. Is this expected?
Unfortunately, PDF only has a bunch of “go to (10,120), use this font, write the text "xy"” commands. Very often, programs writing PDF will piece words together from several such commands, and will also jump around on the page and, for example, write the italic words after everything else.

That does not make it impossible to extract text from a lot of documents. (There will always be cases where software does not return text in the same order a human would, but you can make that rare.) But it does mean that doing so takes a lot more implementation than podofotxtextract provides: you need to keep track of the position on the page and somehow combine all those little bits and pieces of text into a coherent whole. That does not require a full PDF interpreter, but it does require large parts of one.

HTH,
Christopher

The MathWorks GmbH | Friedlandstr.18 | 52064 Aachen | District Court Aachen | HRB 8082 | Managing Directors: Bertrand Dissler, Steven D. Barbo, Jeanne O’Keefe
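
To illustrate the "keep track of the position and combine the pieces" step, here is a minimal Python sketch. It is not PoDoFo code and not what podofotxtextract does. It assumes some earlier stage has already turned the page's drawing commands into fragments with a page position and a rendered width. The `Fragment` type, the `assemble_lines` function, and both tolerance values are hypothetical names and numbers chosen for the example. Real documents also need font metrics, rotation, and column handling, none of which is attempted here.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    # Hypothetical record for one "write the text" command.
    x: float      # left edge of the fragment on the page
    y: float      # baseline height (PDF origin is bottom-left)
    width: float  # horizontal extent of the rendered text
    text: str


def assemble_lines(fragments, y_tol=2.0, gap_tol=1.5):
    """Group fragments by baseline, then merge each line left to right.

    y_tol and gap_tol are assumed thresholds in page units; a real
    extractor would derive them from the font size in use.
    """
    lines = []  # each entry is [baseline_y, [fragments on that line]]
    # Visit fragments from the top of the page downwards.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= y_tol:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])

    result = []
    for _, frags in lines:
        # Drawing order is unreliable, so restore reading order by x.
        frags.sort(key=lambda f: f.x)
        out = frags[0].text
        prev_end = frags[0].x + frags[0].width
        for frag in frags[1:]:
            # A visible gap means a word break; otherwise the fragment
            # continues the previous word and is glued on directly.
            if frag.x - prev_end > gap_tol:
                out += " "
            out += frag.text
            prev_end = frag.x + frag.width
        result.append(out)
    return result


if __name__ == "__main__":
    page = [
        Fragment(10, 120, 12, "Hel"),
        Fragment(22, 120, 8, "lo"),
        Fragment(60, 120, 25, "world"),
        Fragment(34, 120, 22, "italic"),  # drawn last, belongs in the middle
        Fragment(10, 100, 20, "next"),
    ]
    print(assemble_lines(page))  # ['Hello italic world', 'next']
```

The word "italic" is emitted after everything else on its line, as in the case described above, and still ends up in reading order because the fragments are re-sorted by position rather than trusted in drawing order.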