Hi!
I'm trying to use poppler to extract text from PDFs, and I've found empirically
that using the "raw order" option gives better results (I can supply example
files where non-raw order returns mangled text, if needed).
This option is only exposed in the C++ bindings, not the GLib ones.
I could use either binding, but I also need something like poppler-glib's
"poppler_page_get_text_attributes".
As far as I can see, I could either:
* hack something so I can extract text in raw order using the GLib bindings
(I'd prefer staying C-only, but I don't see how this would be possible,
except by adding it to the bindings)
* or re-implement poppler_page_get_text_attributes in C++, using poppler's
private API (or take poppler's implementation)
Which of these do you think would be the better way to go about this?
Thanks!
Richard
PS:
My use case, in case there's an even better way to do that: I'm trying to
heuristically extract titles and authors of PDFs without usable metadata.
The backend has a bunch of rules like "the thing with the biggest font size is
probably the title". This works surprisingly well - except for said PDFs
where poppler_page_get_text only returns garbage, obviously.
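
To make the "biggest font size" rule concrete, here is a minimal sketch in
plain C. It does not call poppler at all: TextRun is a hypothetical stand-in
for whatever per-run font information the bindings provide (for example what
poppler_page_get_text_attributes reports), and find_largest_run and
guess_title are names made up for this illustration, not the real backend.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one run of uniformly styled text on a page.
 * This is NOT a poppler type; it only models the fields the rule needs. */
typedef struct {
    size_t start;      /* byte offset of the run within the page text */
    size_t length;     /* length of the run in bytes */
    double font_size;  /* font size of the run */
} TextRun;

/* Return the index of the run with the largest font size, or -1 if there
 * are no runs. On ties, the earliest run wins. */
static int find_largest_run(const TextRun *runs, size_t n_runs)
{
    int best = -1;
    for (size_t i = 0; i < n_runs; i++) {
        if (best < 0 || runs[i].font_size > runs[best].font_size)
            best = (int)i;
    }
    return best;
}

/* Copy the text of the largest-font run into out (always NUL-terminated,
 * truncated to fit out_size). Returns 0 on success, -1 if there is no
 * usable run or no room in out. */
static int guess_title(const char *page_text, const TextRun *runs,
                       size_t n_runs, char *out, size_t out_size)
{
    int best = find_largest_run(runs, n_runs);
    if (best < 0 || out_size == 0)
        return -1;

    size_t text_len = strlen(page_text);
    if (runs[best].start > text_len)
        return -1;

    /* Clamp the run to the page text and to the output buffer. */
    size_t len = runs[best].length;
    if (len > text_len - runs[best].start)
        len = text_len - runs[best].start;
    if (len > out_size - 1)
        len = out_size - 1;

    memcpy(out, page_text + runs[best].start, len);
    out[len] = '\0';
    return 0;
}
```

The sketch indexes runs into the extracted page text, so the rule is only as
good as that text: if the extraction order is mangled, the largest-font run
points at garbage, which is exactly the problem described above.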
_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler