On Sun, 04 Jan 2009 03:51:53 +1100, Calogero Alex Baldacchino
<alex.baldacch...@email.it> wrote:
Charles McCathieNevile wrote:
... it shouldn't be too difficult to create a custom parser, conforming
to the RDFa spec and making use of data-* attributes...
That is, since RDFa can be "emulated" somehow in HTML5 and tested
without changing the current specification, perhaps there isn't a strong
need for early adoption of the former; instead, an "emulated" merging
might be tested first within the current timeline.
In principle this is possible. But the data-* attributes are designed for
private usage, and introducing a public usage means creating a risk of
clashes that pollute RDFa data gathered this way. In other words, this is
indeed feasible, but one would expect it to show that the data generated
was unreliable (unless nobody happens to use basic terms like "about" for
private purposes). Such results have been used to suggest that poorly
implemented features should be dropped, but this hypothetical case suggests
to me that the argument is wrong: if people use such features even where
there are reasons the data would be bad, one might expect better usage from
formalising the status of those features and getting decent implementations.
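(To make the clash risk concrete, here is a rough Python sketch -- the
attribute names and markup are invented for illustration, not taken from any
spec -- of the kind of "emulated" RDFa extraction being discussed, and of how
an unrelated private use of the same data-* names pollutes the result:)

# Hypothetical sketch: RDFa-like triples carried in data-* attributes,
# plus the clash when someone else already uses "data-property" privately.
from html.parser import HTMLParser

EMULATED_RDFA = """
<div data-about="http://example.org/doc">
  <span data-property="dc:title">An example document</span>
  <span data-property="row-3-col-2">42</span>
</div>
"""

class DataStarTripleParser(HTMLParser):
    """Collects (subject, property, text) triples from data-* attributes."""
    def __init__(self):
        super().__init__()
        self.subject = None
        self.pending_property = None
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "data-about" in attrs:
            self.subject = attrs["data-about"]
        if "data-property" in attrs:
            self.pending_property = attrs["data-property"]

    def handle_data(self, data):
        if self.pending_property and data.strip():
            self.triples.append((self.subject, self.pending_property, data.strip()))
            self.pending_property = None

parser = DataStarTripleParser()
parser.feed(EMULATED_RDFA)
for triple in parser.triples:
    print(triple)
# The second "triple" is noise: nothing marks which data-property values
# were meant as RDFa-style metadata and which are private to some script.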
What is the cost of having different data use specialised formats?
If the data model, or a part of it, is not explicit as in RDF but is
implicit in code written to process it (as is the case with using scripts to
process things stored in arbitrarily named data-* attributes, and is
also the case with undocumented or semi-documented XML formats), it
requires people to understand the code as well as the data model in
order to use the data. In a corporate situation where hundreds or tens
of thousands of people are required to work with the same data, this
makes the data model very fragile.
I'm not sure RDF(a) solves such a problem. AIUI, RDFa just binds (XML)
properties and attributes (in the form of CURIEs) to RDF concepts,
modelling a certain kind of relationship, while relying on external
schemata to define those properties. Any undocumented or semi-documented
XML format may lead to misuse and, thus, to unreliably modelled data,
...
I think the same applies to data-* attributes, because _they_ describe
data (and data semantics) in a custom model and thus _they_ need to be
documented for others to be able to manipulate them; the use of a custom
script rather than a built-in parser does not change much from this
point of view.
RDFa binds data to RDF. RDF provides a well-known schema language with
machine-processable definitions of vocabularies, and of how to merge
information between them. In other words, if you get the underlying model
for your data right enough, people will be able to use it without needing
to know what you do.
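(Mechanically, the binding is nothing deeper than prefix expansion -- a
rough sketch, with the prefix table written out by hand here for
illustration; what an expanded URI *means* still comes from the external
vocabulary it points into:)

# Minimal sketch of CURIE expansion as RDFa does it.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

def expand_curie(curie: str, prefixes: dict) -> str:
    prefix, _, reference = curie.partition(":")
    if prefix in prefixes:
        return prefixes[prefix] + reference
    return curie  # not a known CURIE; leave untouched

print(expand_curie("dc:title", PREFIXES))
# -> http://purl.org/dc/elements/1.1/title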
Naturally not everyone will get their data model right, and naturally not
all information will be reliable anyway. However, it would seem to me that
making it harder to merge the data in the first place does not assist in
determining whether it is useful. On the other hand, certain forms of RDF
data such as POWDER, FOAF, Dublin Core and the like have been very
carefully modelled, and are relatively well-known and re-used in other
data models. Making it easy to parse and merge this data according to the
existing, well-developed models seems valuable.
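(A rough sketch of the merging point, assuming the rdflib library is
available -- the document URIs and the data itself are invented here: two
independently published fragments, one using FOAF and one using Dublin
Core, can be loaded into one graph and queried together because the
vocabularies are shared and machine-processable.)

from rdflib import Graph, Namespace

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
DC = Namespace("http://purl.org/dc/elements/1.1/")

FOAF_DOC = """
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://example.org/alice#me> foaf:name "Alice" ;
    foaf:made <http://example.org/report> .
"""

DC_DOC = """
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<http://example.org/report> dc:title "Quarterly report" .
"""

g = Graph()
g.parse(data=FOAF_DOC, format="turtle")  # one publisher's data
g.parse(data=DC_DOC, format="turtle")    # another publisher's data

# Join across the two sources: who made something, and what is it called?
for maker, _, work in g.triples((None, FOAF.made, None)):
    for _, _, title in g.triples((work, DC.title, None)):
        name = g.value(maker, FOAF.name)
        print(f'{name} made "{title}"')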
Ian wrote:
For search engines, I am not convinced. Google's experience is that
natural language processing of the actual information seen by the
actual end user is far, far more reliable than any source of metadata.
Thus from Google's perspective, investing in RDFa seems like a poorer
investment than investing in natural language processing.
Indeed. But Google is something of an edge case, since they can afford
to run a huge organisation with massive computer power and many
engineers to address a problem where a "near-enough" solution brings
them the users who are in turn the product they sell to advertisers.
There are many other use cases where a small group of people want a way
to reliably search trusted data.
I think the point with general-purpose search engines is a different one:
natural language processing, while expensive, gives a far more accurate
solution than RDFa and/or any other kind of metadata can, for a problem
in which the data must never need to be trusted (instead, a data processor
must be able to determine the data's level of trust without any external
aid).
No, I don't think so. Google searches based on analysis of the open web
are *not* generally more reliable than faceted searches over a reliable
dataset, and in some instances are less reliable.
The point is that only a few people can afford to invest in being a
general-purpose search engine, whereas many can afford to run a
metadata-based search system over a chosen dataset that responds to their
needs (and doesn't require either publishing their data, or paying Google
to index it).
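(As a toy illustration, with invented record data: a faceted search over a
small, trusted dataset is just exact matching on explicit fields -- something
any small group can run without Google-scale machinery:)

# Hypothetical sketch: faceted search over a chosen, trusted dataset.
RECORDS = [
    {"title": "Quarterly report", "creator": "Alice", "subject": "finance"},
    {"title": "Network survey", "creator": "Bob", "subject": "infrastructure"},
    {"title": "Budget forecast", "creator": "Alice", "subject": "finance"},
]

def faceted_search(records, **facets):
    """Return records matching every requested facet exactly."""
    return [r for r in records if all(r.get(k) == v for k, v in facets.items())]

print(faceted_search(RECORDS, creator="Alice", subject="finance"))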
Since there is no "direct" relationship between the semantics expressed
by RDFa and the real semantics of a web page's content, relying on RDFa
metadata would lead to widespread cheating, as happened when the keywords
meta tag was introduced.
Sure. There would also be many, many cases of organisations using decent
metadata, as with existing approaches. My point was that I don't expect
Google to naively trust metadata it finds on the open web, and in the
general case probably not even to look at it. However, Google is not the
measure of the Web; it is a company that sells advertising based on
information it has gleaned about users by offering them services.
So the fact that some things on the Web are not directly beneficial to
Google isn't that important. I do not see how the presence of explicit
metadata threatens Google any more than the presence of plain text (which
can also be misleading).
Thus, a trust chain/evaluation mechanism (such as the use of signatures)
would be needed,
Indeed such a thing is needed for a general purpose search engine. But
there are many cases where an alternative is fine. For example, T-mobile
publish POWDER data about web pages. Opera doesn't need to believe all the
POWDER data it finds on the Web in order to improve its offerings based on
T-mobile's data, if we can decide how to read that specific data, which
can be done by deciding that we trust a particular set of URIs more than
others. No signature necessary, beyond the already ubiquitous TLS and the
idea that we trust people we have a relationship with and whose domains we
know.
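(For illustration only -- the domain names below are invented, not
T-mobile's actual ones -- the kind of policy I mean is no more than this:)

# Hypothetical sketch: accept third-party metadata only when fetched over
# TLS from a source we have decided to trust.
from urllib.parse import urlparse

TRUSTED_METADATA_SOURCES = {"powder.example-operator.net", "metadata.example.org"}

def should_use_metadata(source_url: str) -> bool:
    parts = urlparse(source_url)
    return parts.scheme == "https" and parts.hostname in TRUSTED_METADATA_SOURCES

print(should_use_metadata("https://powder.example-operator.net/ratings.xml"))  # True
print(should_use_metadata("http://random-site.example.com/ratings.xml"))       # False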
My concern is that any data model requiring some level of trust to
achieve well-working interoperability may address only very small (and
niche) use cases, and even if many such niche use cases might be grouped
into a whole category consistently addressed by RDFa (perhaps alongside
other models), the result might not be a significant enough use case to
fit the actual specification guidelines (which are somewhat hostile to
(XML) extensibility, as far as I've understood them) -- though they
might be changed when and if really needed.
A concern of mine is that it is unclear what the required level of
usefulness is. The "google highlight" element (once called m but I think
it changed its name again) is currently in the spec, the longdesc
attribute currently isn't. I presume these facts boil down to judgement
calls by the editor while the spec is still an early draft, but it is not
easy to understand what information would determine whether something is
"sufficiently important". Which makes it hard to determine whether it is
worth the considerable investment of discussing in this group, or easier
to just go through the W3C process of objecting later on.
cheers
Chaals
--
Charles McCathieNevile Opera Software, Standards Group
je parle français -- hablo español -- jeg lærer norsk
http://my.opera.com/chaals Try Opera: http://www.opera.com