[poppler] text extraction in raw order + text attributes

Richard Wossal Fri, 06 Dec 2013 08:52:10 -0800

Hi!

I'm trying to use poppler to extract text from PDFs, and I've found
empirically that using the "raw order" option gives better results (I can
supply example files where non-raw order returns mangled text, if needed).


This option is only exposed for the C++ bindings, not the Glib ones.
I could use either binding, but I also need something like poppler-glib's
"poppler_page_get_text_attributes".

As far as I can see, I could either:

* hack something so I can extract text in raw order using the GLib bindings
  (I'd prefer staying C-only, but I don't see how this would be possible,
   except by adding it to the bindings)

* or re-implement poppler_page_get_text_attributes in C++, using poppler's
  private API (or take poppler's implementation)

What do you think would be the best way to go about that?

Thanks!

Richard

PS:

My use case, in case there's an even better way to do that: I'm trying to
heuristically extract titles and authors of PDFs without usable metadata.

The backend has a bunch of rules like "the thing with the biggest fontsize is
probably the title". This works surprisingly well - except for said PDFs
where poppler_page_get_text only returns garbage, obviously.
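
To make the "biggest fontsize" rule concrete, here is a minimal sketch in C.
It is not the actual backend code: the TextRun struct and guess_title_run()
are hypothetical names made up for illustration, and the sketch assumes the
caller has already split the page text into runs with one font size each.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one run of text with a single font size.
 * Assumption: the caller fills this in from whatever the PDF library
 * reports; nothing here calls poppler itself. */
typedef struct {
    const char *text;      /* text of the run (not owned by this struct) */
    double      font_size; /* font size in points */
} TextRun;

/* Return the index of the first non-empty run with the largest font size,
 * or -1 if there is no non-empty run. */
ptrdiff_t guess_title_run(const TextRun *runs, size_t n)
{
    ptrdiff_t best = -1;

    assert(runs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        /* Skip runs that carry no text at all. */
        if (runs[i].text == NULL || runs[i].text[0] == '\0')
            continue;
        /* Strict '>' keeps the earliest run when sizes tie. */
        if (best < 0 || runs[i].font_size > runs[best].font_size)
            best = (ptrdiff_t)i;
    }
    return best;
}
```

With poppler-glib, the runs could presumably be built from the
PopplerTextAttributes entries (font_size, start_index, end_index) returned by
poppler_page_get_text_attributes, which is why that function matters here.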

_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
