from pdf to some sort of XMLish ODT kind of file ...

Albretch Mueller Mon, 18 Jul 2022 02:04:23 -0700

 It is in its name: https://en.wikipedia.org/wiki/PDF
 but, as a corpus researcher, I have always wondered what exactly the
"portable", "document" and "format" aspects of it are.  PDF is just a
"visually appealing" GUI.


 The process of converting the different kinds of PDF files to text is
not exactly straightforward. It is way too entropic (too much of the
"information" needed for the conversion is lost). Some PDF files
are image-based (no text at all), some are image-based but include
(some of) the text, some of the image-based PDF files also contain
images, ...
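
 To make those categories concrete, here is a minimal triage sketch.
It assumes some extraction tool has already reported, for each page,
how many characters of embedded text and how many raster images it
found. The names PageInfo, classify_page and needs_ocr and the
min_chars threshold are all hypothetical and only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    """Hypothetical per-page summary filled in by some extraction tool."""
    text_chars: int   # characters found in the page's embedded text layer
    image_count: int  # raster images placed on the page


def classify_page(page: PageInfo, min_chars: int = 20) -> str:
    """Sort a page into one of the rough categories described above.

    min_chars is an assumed threshold: fewer characters than this is
    treated as "no usable text" (e.g. a stray page number).
    """
    has_text = page.text_chars >= min_chars
    has_images = page.image_count > 0
    if has_text and has_images:
        # Either a scan with an OCR text layer or real text plus figures;
        # this sketch cannot tell the two apart.
        return "text+images"
    if has_text:
        return "text-only"
    if has_images:
        return "image-only"
    return "empty"


def needs_ocr(page: PageInfo, min_chars: int = 20) -> bool:
    """Only pages with images and no usable text are sent to OCR."""
    return classify_page(page, min_chars) == "image-only"
```

 A page with no usable text and at least one image would be routed to
OCR; everything else could go straight to text extraction. This only
sorts the pages, it does not recover the lost layout information.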

 Do you know of any prior art studying and/or explaining possible
solutions to these kinds of PDF-to-XMLish text conversion problems?
Any suggestions on how you would approach a solution to them?

 Thank you,
 lbrtchx
