It is in its name: https://en.wikipedia.org/wiki/PDF but, as a corpus researcher, I have always wondered what exactly the "portable", "document" and "format" aspects of it are. To me, PDF is just a "visually appealing" GUI.
Converting the different kinds of PDFs to text is not exactly straightforward: the process is far too lossy (too much of the "information" needed for the conversion is lost). Some PDF files are image-based (no text at all); some are image-based but include (some of) the text; some of the image-based PDF files also contain images, ... Do you know of any prior art studying and/or explaining possible solutions to these kinds of PDF to XML-ish text conversion problems? Any suggestions about how you would approach a solution to them? Thank you, lbrtchx
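
P.S. To make the cases above concrete, here is a rough sketch of how one might triage pages before choosing a conversion route (OCR vs. plain text extraction). Everything in it is an assumption for illustration: `PageStats`, `classify_page`, the category names and both thresholds are hypothetical, and the per-page numbers would have to come from whatever PDF library you use.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page summary; a real PDF library would have to fill it in."""
    text_chars: int        # characters found in the page's extractable text layer
    image_count: int       # number of embedded raster images on the page
    image_coverage: float  # fraction of the page area covered by images (0.0 to 1.0)


def classify_page(stats, min_chars=20, full_page=0.9):
    """Sort a page into one of the rough categories from the post.

    min_chars and full_page are guessed thresholds, not established values.
    """
    has_text = stats.text_chars >= min_chars
    # A page counts as "scanned" when images cover (almost) the whole page.
    scanned = stats.image_count > 0 and stats.image_coverage >= full_page

    if scanned and not has_text:
        return "image-only"        # no usable text layer: OCR is needed
    if scanned and has_text:
        return "image-with-text"   # scan that carries (some of) the text
    if has_text and stats.image_count > 0:
        return "text-with-images"  # real text plus embedded figures
    if has_text:
        return "text-only"
    return "empty-or-unknown"      # nothing usable, or images too small to be a scan


if __name__ == "__main__":
    print(classify_page(PageStats(text_chars=0, image_count=1, image_coverage=1.0)))
```

The point of the sketch is only that the decision "which kind of PDF is this?" has to be made per page, before any text or XML-ish output is attempted.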
