When I first raised my "confusion" (not "objection") over Davide's
"unstructured-to-structured" wording, my intention was to clarify
what kind of problems the bioRDF group is attempting to tackle. More
specifically, I was referring to it within the context of "GRDDL"
because, IMHO, I don't think GRDDL is designed to help RDF-ize
natural language; GRDDL is designed specifically to target XML-based
documents. Because the draft proposal of the bioRDF only says: "Learn
about GRDDL, SPARQL, OWL, etc.", I want to clarify where they are
heading.
There are several threads ongoing here, and I'm going to split this
one off from "Structured vs. Unstructured".
Xiaoshu, like you, my focus of interest here is GRDDL specifically.
Let me just give you my "take" on GRDDL, and hopefully Eric Miller
and/or others can help correct any misconceptions I have.
My understanding of GRDDL is that it was originally proposed in the
(X)HTML community. The problem it was intended to address is that
there is no way of validating arbitrary RDF using XML Schema (in
other words, there is no XSD for RDF, because XML Schema is
insufficiently expressive). Consequently, for XML instances that are
intended to be validated according to some schema (and this could
include (X)HTML), RDF embedding requires some kind of "expedient";
otherwise the RDF will "break" the schema and render the instance
non-validatable.
Many "expedients" for embedding the RDF will work (for example,
separating out the RDF into an appinfo element, attaching it as a
separate file, or hiding it inside CDATA), and all of these have been
tried successfully in one or another application setting. But the
(X)HTML community wanted a *web-standard* way of embedding RDF in such
a way that the semantic intent ("I hereby officially declare to the
WWW that this RDF is inseparably part of the semantics of this XML
instance.") would be clear.
GRDDL allows the instance author to make the public declaration above
by referencing the URL of some XML transform, which the author thereby
publicly identifies as the "key" to extract the intended RDF from
his/her instance. In this very nice way, GRDDL allows the instance
author the freedom to package his/her RDF any way he/she pleases, so
long as he/she also provides the "decoder ring" of an XML transform to
extract it. Furthermore, the author's statement of semantic
inseparability is explicitly entailed by his/her use of the GRDDL
standard to render the RDF.
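To make the "decoder ring" idea concrete, here is a minimal sketch of
how a consumer might discover the transform that an instance author
has declared. It assumes the generic-XML form of GRDDL, where the root
element carries a grddl:transformation attribute in the
http://www.w3.org/2003/g/data-view# namespace holding a
whitespace-separated list of transform URLs. The sample document, its
element names, and the example.org URLs are hypothetical, and the
sketch only locates the transform; it does not fetch or run it.

```python
import xml.etree.ElementTree as ET

# Namespace used by GRDDL for the transformation attribute.
GRDDL_NS = "http://www.w3.org/2003/g/data-view#"


def grddl_transformations(xml_text):
    """Return the transform URLs declared on the root element, if any."""
    root = ET.fromstring(xml_text)
    # ElementTree addresses namespaced attributes as "{namespace}local".
    value = root.get(f"{{{GRDDL_NS}}}transformation", "")
    # The attribute may list several transforms separated by whitespace.
    return value.split()


# Hypothetical instance document; names and URLs are illustrative only.
SAMPLE = """<record xmlns="http://example.org/hypothetical-record"
        xmlns:grddl="http://www.w3.org/2003/g/data-view#"
        grddl:transformation="http://example.org/hypothetical/record2rdf.xsl">
  <patient id="p1"/>
</record>"""

# Hypothetical instance that makes no GRDDL declaration at all.
PLAIN = "<record><patient id='p1'/></record>"

if __name__ == "__main__":
    print(grddl_transformations(SAMPLE))
    print(grddl_transformations(PLAIN))
```

A document with no declaration simply yields an empty list, which
matches the idea that the author has made no public statement about
any RDF being part of the instance.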
Eric, once again, if I'm getting any of this wrong, correct me...
It's always been my understanding that the primary use case for GRDDL
is the one where the instance author explicitly has in mind a
"finished" set of RDF triples that he/she wants to embed. He/she
"encodes" these triples, packages them into the instance XML, assigns
the intended extraction transform a URL, attaches that, and sends the
resulting instance document off into the world. Easy peasy.
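As a rough illustration of that authoring workflow, the sketch below
packages a known set of triples into an XML instance and attaches the
URL of the extraction transform. The "record" and "statement" element
names, the attribute-based encoding of each triple, and the
example.org URLs are all hypothetical choices of mine; GRDDL itself
does not prescribe how the RDF is packaged, only how the transform is
referenced.

```python
import xml.etree.ElementTree as ET

# Namespace used by GRDDL for the transformation attribute.
GRDDL_NS = "http://www.w3.org/2003/g/data-view#"
# Ask ElementTree to serialize that namespace with the "grddl" prefix.
ET.register_namespace("grddl", GRDDL_NS)


def package_instance(triples, transform_url):
    """Encode triples in a hypothetical format and declare the transform."""
    root = ET.Element("record")
    # The author's public declaration: "use this transform to get my RDF".
    root.set(f"{{{GRDDL_NS}}}transformation", transform_url)
    for subject, predicate, obj in triples:
        # One child element per triple; this encoding is illustrative only.
        ET.SubElement(root, "statement",
                      subject=subject, predicate=predicate, object=obj)
    return ET.tostring(root, encoding="unicode")


# Hypothetical triple and transform URL.
TRIPLES = [("http://example.org/patient/p1",
            "http://example.org/vocab/hasDiagnosis",
            "http://example.org/diagnosis/d7")]
TRANSFORM = "http://example.org/hypothetical/record2rdf.xsl"

if __name__ == "__main__":
    print(package_instance(TRIPLES, TRANSFORM))
```

The matching transform (not shown) would have to understand the same
hypothetical "statement" encoding in order to turn it back into RDF.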
But now here's the part that I (and, I think, maybe Xiaoshu too) am
not sure about.
Question #1 (which Eric has already answered in the affirmative):
Will this work for non-(X)HTML too? Answer: yes. And this is
important because most healthcare records documents aren't (X)HTML.
Question #2: Will this work for the case where the instance author
**doesn't** explicitly know the actual RDF triple set up front, and
the referenced extraction transform is actually acting as a "language
processor" to generate triples "that thereby see the light for the
first time"?
Question #3: If the answer to #2 is "yes", then is there a
conceivable extension to GRDDL where the GRDDL URL is not just an XML
transform but, for example, a web service fronting for some kind of
natural language processor?