On 10/6/14 2:49 PM, Peter F. Patel-Schneider wrote:


On 10/06/2014 11:03 AM, Kingsley Idehen wrote:
On 10/6/14 12:48 PM, Peter F. Patel-Schneider wrote:
It's not hard to query PDFs with SPARQL. All you have to do is extract the metadata from the document and turn it into RDF, if needed. Lots of programs
extract and display this metadata already.

Peter,

Having had 200+ {some-non-rdf-doc}-to-RDF document transformers built under my
direct guidance, there are issues with your claim above:

Huh? Every single PDF reader that I use can extract the PDF metadata and display it.

Again, this isn't about metadata.

The metadata that I see in PDF documents uses a core set of properties that are easy to transform into RDF.

Metadata isn't the issue at hand.

Of course, this core set is very small (title, author, and a few other things) so you don't get all that much out of the core set.

See my comments above :)




1. The extractors are platform specific -- AWWW is about platform agnosticism (I don't want to mandate an OS for experiencing the power of Linked Open Data
transformers / rdfizers)

Well, the extractors would be specific to PDF, but that's hardly surprising, I think.

2. It isn't solely about metadata -- we also have raw data inside these
documents, confined to tables and paragraphs of sentences

Well, sure, but is extracting information directly from the figures or tables or text being considered here? I sure would like this to be possible. How would it work in an HTML context?

Each table is a Class.
Each table record is an instance of the Class represented by the table.
Each table field is a property of the Class represented by the table.
Each table field value's data type can be used to discern the range of each Class property.
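
To make the table mapping above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the `http://example.org/` namespace, the `table_to_triples` and `xsd_type` helpers, the row-naming scheme, and the sample table are all hypothetical, and the range of each property is guessed from the first record only.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
XSD = "http://www.w3.org/2001/XMLSchema#"
BASE = "http://example.org/"  # hypothetical namespace, not from the thread


def xsd_type(value):
    """Guess an XSD datatype IRI from a Python value."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return XSD + "boolean"
    if isinstance(value, int):
        return XSD + "integer"
    if isinstance(value, float):
        return XSD + "double"
    return XSD + "string"


def table_to_triples(name, header, rows):
    """Return (subject, predicate, object) tuples for one table."""
    cls = BASE + name
    # The table itself becomes a Class.
    triples = [(cls, RDF + "type", RDFS + "Class")]
    for col, field in enumerate(header):
        # Each field becomes a property whose domain is the table's Class.
        prop = f"{cls}#{field}"
        triples.append((prop, RDF + "type", RDF + "Property"))
        triples.append((prop, RDFS + "domain", cls))
        if rows:  # range discerned from the first record's value (assumption)
            triples.append((prop, RDFS + "range", xsd_type(rows[0][col])))
    for i, row in enumerate(rows, start=1):
        # Each record becomes an instance of the Class.
        record = f"{cls}/row{i}"
        triples.append((record, RDF + "type", cls))
        for field, value in zip(header, row):
            # Literal objects are (lexical form, datatype IRI) pairs.
            literal = (str(value), xsd_type(value))
            triples.append((record, f"{cls}#{field}", literal))
    return triples


triples = table_to_triples(
    "Measurement", ["site", "depth"], [["A", 1.5], ["B", 2.0]]
)
for triple in triples:
    print(triple)
```

A real transformer would also need to locate the table inside the document in the first place, which is the hard part being discussed here.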

Depending on what the sentences and paragraphs are about you can make an RDF statement per sentence.

3. If querying a PDF were marginally simple, I would be demonstrating that
using a SPARQL results URL in response to this post :-)

I'm not saying that it is so simple. You do have to find the metadata block in the PDF and then look for the /Title, /Author, ... stuff.
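
As a rough illustration of the "find the metadata block and look for /Title, /Author" step, here is a minimal sketch in Python. It rests on strong assumptions: the Info dictionary is stored uncompressed, values are simple literal strings without escapes or nested parentheses, and bytes can be read as Latin-1. The `info_to_ntriples` helper, the Dublin Core mapping, the document IRI, and the sample bytes are all hypothetical.

```python
import re

DC = "http://purl.org/dc/elements/1.1/"

# Assumed mapping from PDF Info-dictionary keys to Dublin Core properties.
KEY_TO_PROPERTY = {
    "Title": DC + "title",
    "Author": DC + "creator",
    "Subject": DC + "subject",
}

# Matches entries such as "/Title (Some text)".
# Deliberately naive: no escapes, no nested parentheses, no hex strings.
INFO_ENTRY = re.compile(rb"/(Title|Author|Subject)\s*\(([^()\\]*)\)")


def info_to_ntriples(pdf_bytes, doc_iri):
    """Turn simple Info-dictionary entries into N-Triples lines."""
    lines = []
    for key, raw in INFO_ENTRY.findall(pdf_bytes):
        prop = KEY_TO_PROPERTY[key.decode("ascii")]
        # Latin-1 decoding is an assumption; escape quotes for N-Triples.
        value = raw.decode("latin-1").replace('"', '\\"')
        lines.append(f'<{doc_iri}> <{prop}> "{value}" .')
    return lines


# Hypothetical fragment standing in for an uncompressed Info dictionary.
sample = b"1 0 obj\n<< /Title (Linked Data on the Web) /Author (A. Writer) >>\nendobj\n"
lines = info_to_ntriples(sample, "http://example.org/paper.pdf")
print("\n".join(lines))
```

Real PDFs often compress or encrypt these objects, or carry the metadata in an XMP packet instead, so a dedicated PDF library would be needed in practice.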

But it could be simple if PDF didn't have the issues I outlined in regards to extraction technology. Funnily enough, there's a massive opportunity for Adobe to solve this problem, especially as they've now ventured heavily into cloud-enabling their technologies. If they provide APIs from the cloud, this problem could become much simpler to address with productive solutions, and PDFs would become less of the data silos that they are today.

Possible != Simple and Productive.

Yes, but there are lots of tools that display PDF metadata, so there are some who believe that the benefit is greater than the cost.

Metadata isn't the fundamental quest here.


We want to leverage the productivity and simplicity that AWWW brings to data
representation, access, interaction, and integration.

Sure, but the additional costs, if any, on paper authors, reviewers, and readers have to be considered. If these costs are eliminated or at least minimized then this good is much more likely to be realized.

With some help from Adobe we can have the best of all worlds here. I am going to take a look at their latest cloud offerings and associated APIs.




peter







--
Regards,

Kingsley Idehen 
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog 1: http://kidehen.blogspot.com
Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
Twitter Profile: https://twitter.com/kidehen
Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen
Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this

