Richard Wossal <[email protected]> writes:

> Hi!
>
> I'm trying to use poppler to extract text from PDFs, and I've found
> empirically that using the "raw order" option gives better results (I can
> supply example files where non-raw order returns mangled text, if needed).

Yes, please; it would help to see some of those examples.

> This option is only exposed for the C++ bindings, not the Glib ones.
> I could use either binding, but I also need something like poppler-glib's
> "poppler_page_get_text_attributes".

poppler_page_get_text, get_text_layout and get_text_attributes return
the text in reading order, using heuristics to follow columns and
tables. It's not perfect, of course, since it's based on heuristics.

> As far as I can see, I could either:
>
> * hack something so I can extract text in raw-order using the Glib-bindings
>    (I'd prefer staying C-only, but I don't see how this would be possible,
>     except by adding it to the bindings)
>
> * or re-implement poppler_page_get_text_attributes in C++, using poppler's
>    private API (or take poppler's implementation)
>
> What do you think would be the best way to go about that?

If you really need to get the text in raw order, we can add new methods to
the API for that. I'm thinking that maybe we could add a more generic
text iteration API with options like area, order and even the break
iterator (so that you can iterate over characters, lines and words).
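
To make the break-iterator idea a bit more concrete, here is a rough sketch
in plain C. It is not poppler code and not a proposed signature: BreakMode,
TextIter, text_iter_init and text_iter_next are all hypothetical names, the
area and order options are left out, and the text is assumed to be plain
ASCII so that one byte is one character.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical break modes: what one iteration step returns. */
typedef enum { BREAK_CHAR, BREAK_WORD, BREAK_LINE } BreakMode;

/* Hypothetical iterator state over a NUL-terminated ASCII string. */
typedef struct {
    const char *text;
    size_t      pos;   /* offset of the next unread byte */
    BreakMode   mode;
} TextIter;

static void text_iter_init(TextIter *it, const char *text, BreakMode mode)
{
    it->text = text;
    it->pos  = 0;
    it->mode = mode;
}

/* Stores the start and length of the next unit in *start and *len.
 * Returns 1 if a unit was produced, 0 when the text is exhausted.
 * In word and line mode, delimiters are skipped, so empty lines
 * are not reported. */
static int text_iter_next(TextIter *it, const char **start, size_t *len)
{
    const char *p = it->text + it->pos;
    const char *delims = (it->mode == BREAK_LINE) ? "\n" : " \t\n";

    if (it->mode == BREAK_CHAR) {
        if (*p == '\0')
            return 0;
        *start = p;
        *len   = 1;
        it->pos++;
        return 1;
    }

    p += strspn(p, delims);            /* skip leading delimiters */
    if (*p == '\0') {
        it->pos = (size_t)(p - it->text);
        return 0;
    }
    *start  = p;
    *len    = strcspn(p, delims);      /* run up to the next delimiter */
    it->pos = (size_t)(p - it->text) + *len;
    return 1;
}
```

A caller would initialise the iterator once and then call text_iter_next in
a loop until it returns 0. A real API would also have to deal with UTF-8
and with the page area and ordering, which this sketch ignores.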

> Thanks!
>
> Richard
>
> PS:
>
> My use case, in case there's an even better way to do that: I'm trying to
> heuristically extract titles and authors of PDFs without usable metadata.
> The backend has a bunch of rules like "the thing with the biggest font size
> is probably the title". This works surprisingly well - except for said PDFs
> where poppler_page_get_text only returns garbage, obviously.

What exactly do you mean by garbage?
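
For reference, the "biggest font size is probably the title" rule from the
PS could look roughly like the sketch below. It does not call poppler:
TextRun is a hypothetical stand-in for one attribute run as reported by
poppler_page_get_text_attributes, and the sketch assumes that the indices
are character offsets into the page text, that end_index is inclusive, and
that the text is plain ASCII. find_largest_run and guess_title are made-up
helper names.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one run of text sharing the same attributes. */
typedef struct {
    double font_size;
    int    start_index;  /* first character covered by this run */
    int    end_index;    /* last character covered, inclusive (assumption) */
} TextRun;

/* Returns the index of the run with the largest font size, or -1 if
 * there are no runs. On ties the earliest run wins. */
static int find_largest_run(const TextRun *runs, size_t n_runs)
{
    int best = -1;
    for (size_t i = 0; i < n_runs; i++) {
        if (best < 0 || runs[i].font_size > runs[best].font_size)
            best = (int)i;
    }
    return best;
}

/* Returns a newly allocated copy of the text covered by the largest-font
 * run, or NULL if there are no runs, the indices fall outside the text,
 * or allocation fails. The caller frees the result. */
static char *guess_title(const char *text, const TextRun *runs, size_t n_runs)
{
    int best = find_largest_run(runs, n_runs);
    if (best < 0)
        return NULL;

    size_t text_len = strlen(text);
    int start = runs[best].start_index;
    int end   = runs[best].end_index;
    if (start < 0 || end < start || (size_t)end >= text_len)
        return NULL;

    size_t n = (size_t)(end - start) + 1;
    char *title = malloc(n + 1);
    if (title == NULL)
        return NULL;
    memcpy(title, text + start, n);
    title[n] = '\0';
    return title;
}
```

If the text itself comes back in the wrong order, the run with the largest
font may still be found, but the characters copied out of it will only be
as good as the ordering of the underlying text.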

Regards, 
-- 
Carlos Garcia Campos
PGP key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x523E6462

_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
