I have been working on PDF extraction. I find that PDF mixes the 'what' (the text itself) with the 'how' (transformations, presentation). The table that we see is often just a collection of lines and rectangles put together in an ad hoc fashion. This could be due to the PDF generator libraries themselves. It feels like the 'C' of this space. IMO we are missing frameworks and higher-level abstractions and/or representations. They may be available somewhere in the Adobe ecosystem, but it is not obvious to an outsider like me what they are.

On Mon, May 12, 2014 at 2:16 PM, Sriram Karra <karra....@gmail.com> wrote:
>
> http://www.thehindu.com/opinion/op-ed/limitations-of-the-pdf/article5998841.ece
>
> == snip ==
>
> The basic format doesn’t include any requirement that text be selectable
> or searchable, while data presented as charts and tables is often
> impossible to export in any useable way.
>
> It’s the standard file format for nearly every academic paper, political
> briefing and research note. But a new report by the World Bank suggests
> that the venerable pdf is keeping valuable information buried in servers,
> unread and unloved.
>
> == /snip ==
>
> --
> For more details about this list
> http://datameet.org/discussions/
> ---
> You received this message because you are subscribed to the Google Groups
> "datameet" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to datameet+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.