Hi!

I'm trying to use poppler to extract text from PDFs, and I've found empirically
that using the "raw order" option gives better results (I can supply example
files where non-raw order returns mangled text, if needed).

This option is only exposed by the C++ bindings, not the GLib ones.
I could use either binding, but I also need something like poppler-glib's
"poppler_page_get_text_attributes".

As far as I can see, I could either:

* hack something so I can extract text in raw order using the GLib bindings
  (I'd prefer staying C-only, but I don't see how this would be possible
   except by adding the option to the bindings)

* or re-implement poppler_page_get_text_attributes in C++, using poppler's
  private API (or take poppler's implementation)

Which of these do you think would be the better way to go?

Thanks!

Richard

PS:

My use case, in case there's an even better way to do this: I'm trying to
heuristically extract titles and authors of PDFs without usable metadata.
The backend has a bunch of rules like "the thing with the biggest font size is
probably the title". This works surprisingly well, except for those PDFs
where poppler_page_get_text returns only garbage.
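
To make the "biggest font size" rule concrete, here is a minimal sketch in
plain C. It does not call poppler at all: text_run is a hypothetical struct
that only loosely mirrors the start_index, end_index and font_size fields of
PopplerTextAttributes, and it assumes inclusive byte offsets into a plain C
string. largest_run and guess_title are made-up names for illustration, not
part of any poppler API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one run of text with uniform attributes.
 * Assumption: start and end are inclusive byte offsets into the page text. */
typedef struct {
    size_t start;
    size_t end;
    double font_size;
} text_run;

/* Return the index of the run with the largest font size, or -1 if there
 * are no runs. Ties go to the earliest run. */
static int largest_run(const text_run *runs, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (best < 0 || runs[i].font_size > runs[best].font_size)
            best = (int)i;
    }
    return best;
}

/* Copy the text of the largest run into out, truncating if needed and
 * always NUL-terminating. Returns 0 on success, -1 if there is no usable
 * run or the offsets fall outside the text. */
static int guess_title(const char *text, const text_run *runs, size_t n,
                       char *out, size_t out_size)
{
    int best = largest_run(runs, n);
    size_t text_len = strlen(text);

    if (best < 0 || out_size == 0)
        return -1;
    if (runs[best].start > runs[best].end || runs[best].end >= text_len)
        return -1;

    size_t len = runs[best].end - runs[best].start + 1;
    if (len >= out_size)
        len = out_size - 1;
    memcpy(out, text + runs[best].start, len);
    out[len] = '\0';
    return 0;
}
```

With real poppler data the runs would come from
poppler_page_get_text_attributes and the text from poppler_page_get_text, so
the rule is only as good as the text order those two return, which is why the
raw-order question matters here.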

_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
