Re: [Podofo-users] podofotxtextract

Christopher Creutzig Wed, 26 Jan 2022 08:16:33 -0800

From: Christian Blume <chr.bl...@gmail.com> 

> I've been using the podofotxtextract tool to test PoDoFo but when I
> run it on a PDF I get many words split up into pieces. It seems
> impossible to put the words back together with the information that
> PoDoFo (or the PDF itself?) provides. Is this expected?


Unfortunately, PDF only has a bunch of “go to (10,120), use this font,
write the text "xy"” commands. Very often, programs writing PDF will
piece together words with multiple such commands, and will also often
jump around on the page and, for example, write the italic words after
everything else.

That does not make it impossible to extract text from a lot of
documents. (There will always be cases where software does not return
text in the same order a human would, but you can make that rare.) But
it does mean that doing so takes a lot more implementation than
podofotxtextract provides: you need to keep track of the position on the
page and somehow combine all those little bits and pieces of text into a
coherent whole. That does not require a full PDF interpreter, but it
does require large parts of one.
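
To illustrate the kind of bookkeeping meant here, the following is a
minimal Python sketch, not anything PoDoFo or podofotxtextract ships.
It assumes you have already collected each "write text" command as a
fragment with a position and an advance width; the Fragment class, the
merge_fragments function and both tolerance values are hypothetical
names and numbers chosen for the example. Fragments are grouped into
lines by baseline, sorted left to right, and joined, with a space
inserted only where the horizontal gap is noticeable.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One positioned piece of text (hypothetical structure for this sketch)."""
    x: float      # left edge of the fragment on the page
    y: float      # baseline position (PDF y grows upward)
    width: float  # horizontal advance of the fragment
    text: str


def merge_fragments(fragments, line_tol=2.0, gap_tol=1.0):
    """Group fragments into lines by baseline, then join them left to right.

    line_tol and gap_tol are assumed tolerances in page units; real
    documents usually need values derived from the font size.
    """
    lines = []  # list of (baseline_y, [fragments on that line])
    # Sort top to bottom: larger y is higher on the page.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= line_tol:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))

    result = []
    for _, frags in lines:
        frags.sort(key=lambda f: f.x)
        text = frags[0].text
        for prev, cur in zip(frags, frags[1:]):
            # Gap between the end of the previous fragment and the next one.
            gap = cur.x - (prev.x + prev.width)
            text += (" " if gap > gap_tol else "") + cur.text
        result.append(text)
    return result


if __name__ == "__main__":
    # Fragments deliberately given out of reading order.
    page = [
        Fragment(40, 120, 20, "word"),
        Fragment(10, 100, 30, "second"),
        Fragment(22, 120, 6, "z"),
        Fragment(10, 120, 12, "xy"),
    ]
    for line in merge_fragments(page):
        print(line)
```

This ignores rotated text, multiple columns, and font-dependent spacing,
which is where most of the remaining implementation effort goes.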


HTH,
Christopher

The MathWorks GmbH | Friedlandstr.18 | 52064 Aachen | District Court Aachen | 
HRB 8082 | Managing Directors: Bertrand Dissler, Steven D. Barbo, Jeanne O’Keefe



_______________________________________________
Podofo-users mailing list
Podofo-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/podofo-users
