Richard Wossal <[email protected]> writes:

> Hi!
>
> I'm trying to use poppler to extract text from PDFs, and I've found
> empirically that using the "raw order" option gives better results
> (I can supply example files where non-raw order returns mangled text,
> if needed).

Yes, please, it would help to see some of those examples.

> This option is only exposed for the C++ bindings, not the Glib ones.
> I could use either binding, but I also need something like poppler-glib's
> "poppler_page_get_text_attributes".

poppler_page_get_text, poppler_page_get_text_layout and
poppler_page_get_text_attributes return the text in reading order, using
heuristics to follow columns and tables. It's not perfect, of course,
since it's based on heuristics.

> As far as I can see, I could either:
>
> * hack something so I can extract text in raw-order using the Glib-bindings
>   (I'd prefer staying C-only, but I don't see how this would be possible,
>   except by adding it to the bindings)
>
> * or re-implement poppler_page_get_text_attributes in C++, using poppler's
>   private API (or take poppler's implementation)
>
> What do you think would be the best way to go about that?

If you really need to get the text in raw order, we can add new methods
to the API for that. I'm thinking that maybe we could add a more generic
text iteration API with options like area, order and even the break
iterator (so that you can iterate over characters, words and lines).

> Thanks!
>
> Richard
>
> PS:
>
> My use case, in case there's an even better way to do that: I'm trying to
> heuristically extract titles and authors of PDFs without usable metadata.
> The backend has a bunch of rules like "the thing with the biggest font
> size is probably the title". This works surprisingly well - except for
> said PDFs where poppler_page_get_text only returns garbage, obviously.

What exactly do you mean by garbage?

Regards,
-- 
Carlos Garcia Campos
PGP key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x523E6462
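
For readers following the thread, the "biggest font size is probably the
title" rule from the PS can be sketched in plain C as below. This is only
an illustration, not Richard's backend and not poppler code: TextRun and
guess_title_run are hypothetical names, and TextRun is a stand-in that
mirrors the font_size, start_index and end_index fields of poppler-glib's
PopplerTextAttributes, so the sketch compiles without poppler installed.
Treating end_index as inclusive is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one run of text with uniform attributes.
 * It mirrors three fields of PopplerTextAttributes but is NOT a
 * poppler type. */
typedef struct {
    double font_size;   /* font size of the run */
    int    start_index; /* index of the first character of the run */
    int    end_index;   /* index of the last character (assumed inclusive) */
} TextRun;

/* Return the position of the run with the largest font size, or -1 if
 * there are no runs. On a tie the earliest run wins, on the assumption
 * that a title tends to appear before other large text. */
int guess_title_run(const TextRun *runs, size_t n)
{
    assert(runs != NULL || n == 0);

    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (best < 0 || runs[i].font_size > runs[best].font_size)
            best = (int)i;
    }
    return best;
}
```

The rule depends on the runs arriving in a sensible order and with correct
character ranges, which is why mangled extraction output defeats it. The
returned position would then be used to slice the page text between
start_index and end_index of the chosen run.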
_______________________________________________ poppler mailing list [email protected] http://lists.freedesktop.org/mailman/listinfo/poppler
