I have been working on PDF extraction. I find that PDF
combines the 'what' (the text itself) with the 'how' (transformations,
presentation). The table that we see is often just a collection
of lines and rectangles put together in an ad hoc fashion.
This could be due to the PDF generator libraries themselves. PDF feels
like the 'C' of this space. IMO we are missing the frameworks
and higher-level abstractions and/or representations. They
may be available somewhere in the Adobe ecosystem, but it
is not obvious to an outsider like me what they are.
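
To make the "lines and rectangles" point concrete, here is a minimal
sketch in Python of what an extractor has to do to get a table back.
Everything in it is an assumption for illustration: the ruling-line
coordinates and (x, y, text) fragments are made-up sample data, not the
output of any real PDF library, y is assumed to increase downward, and
the function names are hypothetical.

```python
from bisect import bisect_right


def grid_from_rulings(h_lines, v_lines, tol=1.0):
    """Collapse near-duplicate ruling coordinates into sorted grid edges.

    h_lines: y coordinates of horizontal rulings (row boundaries)
    v_lines: x coordinates of vertical rulings (column boundaries)
    tol:     rulings closer together than this count as one edge
    """
    def dedupe(values):
        edges = []
        for v in sorted(values):
            if not edges or v - edges[-1] > tol:
                edges.append(v)
        return edges

    return dedupe(h_lines), dedupe(v_lines)


def assign_text_to_cells(fragments, row_edges, col_edges):
    """Place each (x, y, text) fragment into the cell that encloses it."""
    n_rows, n_cols = len(row_edges) - 1, len(col_edges) - 1
    table = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for x, y, text in fragments:
        r = bisect_right(row_edges, y) - 1
        c = bisect_right(col_edges, x) - 1
        # Fragments outside the ruled area are ignored in this sketch.
        if 0 <= r < n_rows and 0 <= c < n_cols:
            table[r][c] = (table[r][c] + " " + text).strip()
    return table


# Hypothetical sample data: the page only "knows" about strokes and
# positioned text, not about rows, columns, or cells.
h_lines = [0, 20, 40, 40.3]   # 40 and 40.3 are the same ruling drawn twice
v_lines = [0, 50, 100]
fragments = [(10, 5, "Name"), (60, 5, "Qty"), (10, 25, "Rice"), (60, 25, "12")]

row_edges, col_edges = grid_from_rulings(h_lines, v_lines)
table = assign_text_to_cells(fragments, row_edges, col_edges)
for row in table:
    print(row)
```

Even this toy version needs a tolerance for rulings drawn twice and a
rule for text that falls outside the grid. Real documents add merged
cells, missing rulings and rotated text, which is where a higher-level
representation would help.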



On Mon, May 12, 2014 at 2:16 PM, Sriram Karra <karra....@gmail.com> wrote:

>
>
> http://www.thehindu.com/opinion/op-ed/limitations-of-the-pdf/article5998841.ece
>
> == snip ==
>
> The basic format doesn’t include any requirement that text be selectable
> or searchable, while data presented as charts and tables is often
> impossible to export in any useable way.
>
> It’s the standard file format for nearly every academic paper, political
> briefing and research note. But a new report by the World Bank suggests
> that the venerable pdf is keeping valuable information buried in servers,
> unread and unloved.
>
> == /snip ==
>
> --
> For more details about this list
> http://datameet.org/discussions/
> ---
> You received this message because you are subscribed to the Google Groups
> "datameet" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to datameet+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

