On Sun, 04 Jan 2009 03:51:53 +1100, Calogero Alex Baldacchino <alex.baldacch...@email.it> wrote:

Charles McCathieNevile wrote:
... it shouldn't be too difficult to create a custom parser, conforming to the RDFa spec and making use of data-* attributes...

That is, since RDFa can be "emulated" somehow in HTML5 and tested without changing the current specification, perhaps there isn't a strong need for an early adoption of RDFa itself; instead, such an "emulated" merger might be tested first within the current timeline.

In principle this is possible. But the data-* attributes are designed for private usage, and introducing a public usage creates a risk of clashes that pollute RDFa data gathered this way. In other words, this is indeed feasible, but one would expect it to show that the data generated was unreliable (unless nobody privately uses basic terms like "about"). Such results have been used to suggest that poorly implemented features should be dropped, but this hypothetical case suggests to me that the argument is wrong: if people use these features despite clear reasons why the data would be bad, one might expect better usage from formalising the status of such features and getting decent implementations.
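
To give a purely hypothetical illustration of what such an "emulation" and its custom parser might look like, and of the clash risk just described: the attribute names data-about, data-property and data-content and the markup below are invented, not taken from any spec. A rough sketch in Python, using only the standard library:

from html.parser import HTMLParser

class DataStarHarvester(HTMLParser):
    """Collects (subject, property, value) triples from data-* attributes."""
    def __init__(self):
        super().__init__()
        self.triples = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "data-about" in a and "data-property" in a:
            self.triples.append((a["data-about"],
                                 a["data-property"],
                                 a.get("data-content", "")))

markup = """
<p data-about="http://example.org/doc" data-property="dc:title"
   data-content="An example document">An example document</p>

<!-- unrelated, private use of the same attribute names by a widget script:
     it is silently swept up as "RDFa" and pollutes the harvested data -->
<div data-about="slide-3" data-property="fade" data-content="250ms"></div>
"""

harvester = DataStarHarvester()
harvester.feed(markup)
for triple in harvester.triples:
    print(triple)

The second triple is exactly the kind of noise one would expect in data gathered this way.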

What is the cost of having different data use specialised formats?

If the data model, or a part of it, is not explicit as in RDF but is implicit in the code written to process it (as is the case with scripts that process things stored in arbitrarily named data-* attributes, and also with undocumented or semi-documented XML formats), it requires people to understand the code as well as the data model in order to use the data. In a corporate situation where hundreds or tens of thousands of people are required to work with the same data, this makes the data model very fragile.


I'm not sure RDF(a) solves such a problem. AIUI, RDFa just binds (XML) properties and attributes (in the form of CURIEs) to RDF concepts, modelling a certain kind of relationship, while relying on external schemata to define those properties. Any undocumented or semi-documented XML format may lead to misuse and, thus, to unreliably modelled data,
...

I think the same applies to data-* attributes, because _they_ describe data (and data semantics) in a custom model and thus _they_ need to be documented for others to be able to manipulate them; the use of a custom script rather than a built-in parser does not change much from this point of view.

RDFa binds data to RDF. RDF provides a well-known schema language with machine-processable definition of vocabularies, and how to merge information between them. In other words, if you get the underlying model for your data right enough, people will be able to use it without needing to know what you do.

Naturally not everyone will get their data model right, and naturally not all information will be reliable anyway. However, it would seem to me that making it harder to merge the data in the first place does not assist in determining whether it is useful. On the other hand, certain forms of RDF data such as POWDER, FOAF, Dublin Core and the like have been very carefully modelled, and are relatively well known and re-used in other data models. Making it easy to parse this data and merge it according to the existing, well-developed models seems valuable.
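
To make the merging point concrete, here is a rough, purely illustrative Python sketch with invented sample data. Because both publishers use well-known vocabularies (Dublin Core and FOAF) whose terms expand to full URIs, their triples can simply be unioned and queried together, without either side knowing anything about the other's code:

PREFIXES = {
    "dc":   "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

def expand(curie):
    """Expand a CURIE like 'dc:title' to a full predicate URI."""
    prefix, _, local = curie.partition(":")
    return PREFIXES[prefix] + local

# Triples harvested from publisher A (a book catalogue)...
site_a = {
    ("http://example.org/book/1", expand("dc:title"),   "Weaving the Web"),
    ("http://example.org/book/1", expand("dc:creator"), "http://example.org/person/timbl"),
}
# ...and from publisher B (a personal profile), produced independently.
site_b = {
    ("http://example.org/person/timbl", expand("foaf:name"), "Tim Berners-Lee"),
}

merged = site_a | site_b   # merging is a plain set union

# Join across the two sources: each book's title plus its creator's name.
for s, p, o in merged:
    if p == expand("dc:creator"):
        names = [o2 for s2, p2, o2 in merged
                 if s2 == o and p2 == expand("foaf:name")]
        titles = [o2 for s2, p2, o2 in merged
                  if s2 == s and p2 == expand("dc:title")]
        print(titles[0], "by", names[0])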


Ian wrote:
For search engines, I am not convinced. Google's experience is that natural language processing of the actual information seen by the actual end user is far, far more reliable than any source of metadata. Thus from Google's perspective, investing in RDFa seems like a poorer investment than investing in natural language processing.

Indeed. But Google is something of an edge case, since they can afford to run a huge organisation with massive computing power and many engineers to address a problem where a "near-enough" solution brings them the users who are, in turn, the product they sell to advertisers. There are many other use cases where a small group of people want a way to reliably search trusted data.


I think the point with general-purpose search engines is a different one: natural language processing, while expensive, provides a far more accurate solution than RDFa and/or any other kind of metadata can bring to a problem where data must never need to be trusted (and where, instead, a data processor must be able to determine the data's level of trust without any external aid).

No, I don't think so. Google searches based on analysis of the open web are *not* generally more reliable than faceted searches over a reliable dataset, and in some instances are less reliable.

The point is that only a few people can afford to invest in being a general-purpose search engine, whereas many can afford to run a metadata-based search system over a chosen dataset, one that responds to their needs (and doesn't require either publishing their data or paying Google to index it).
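
As a toy illustration of that "trusted dataset" case (the records and facet values below are made up): a faceted, metadata-based search over data you control is a few lines of code, and needs neither a crawler nor natural language processing:

catalogue = [
    {"uri": "http://intranet.example/doc/1", "dc:subject": "accessibility",
     "dc:language": "en", "dc:date": "2008-11-02"},
    {"uri": "http://intranet.example/doc/2", "dc:subject": "accessibility",
     "dc:language": "no", "dc:date": "2007-05-19"},
    {"uri": "http://intranet.example/doc/3", "dc:subject": "security",
     "dc:language": "en", "dc:date": "2008-01-30"},
]

def faceted_search(records, **facets):
    """Return records whose metadata matches every requested facet exactly.
    Keyword names use '_' where the metadata key uses ':' (dc_subject -> dc:subject)."""
    return [r for r in records
            if all(r.get(k.replace("_", ":")) == v for k, v in facets.items())]

# e.g. "all English-language documents about accessibility"
for hit in faceted_search(catalogue, dc_subject="accessibility", dc_language="en"):
    print(hit["uri"])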

Since there is no "direct" relationship between the semantics expressed by RDFa and the real semantics of a web page's content, relying on RDFa metadata would lead to widespread cheating, as happened when the keywords meta tag was introduced.

Sure. There would also be many, many cases of organisations using decent metadata, as with existing approaches. My point was that I don't expect Google to naively trust metadata it finds on the open web, and in the general case probably not even to look at it. However, Google is not the measure of the Web; it is a company that sells advertising based on information it has gleaned about users by offering them services.

So the fact that some things on the Web are not directly beneficial to Google isn't that important. I do not see how the presence of explicit metadata threatens Google any more than the presence of plain text (which can also be misleading).

Thus, a trust chain/evaluation mechanism (such as the use of signatures) would be needed,

Indeed such a thing is needed for a general-purpose search engine. But there are many cases where an alternative is fine. For example, T-mobile publish POWDER data about web pages. Opera doesn't need to believe all the POWDER data it finds on the Web in order to improve its offerings based on T-mobile's data, if we can decide how to read that specific data, which can be done by deciding that we trust a particular set of URIs more than others. No signature is necessary, beyond the already ubiquitous TLS and the idea that we trust people we have a relationship with and whose domains we know.
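
To make the "trust a particular set of URIs" idea concrete, a small, purely illustrative sketch (the domains and statements are invented, and the real POWDER format is richer than these tuples): harvested labels are simply filtered by the domain of the document that asserted them:

from urllib.parse import urlparse

TRUSTED_DOMAINS = {"t-mobile.example", "opera.example"}

# (source document, subject page, claim) -- e.g. harvested POWDER-like labels
harvested = [
    ("https://t-mobile.example/powder/mobile.xml",
     "http://example.org/news", "mobileOK"),
    ("https://spammer.example/labels.xml",
     "http://example.org/casino", "mobileOK"),
]

def trusted(statements):
    """Drop any statement whose source is not on a domain we trust."""
    return [s for s in statements
            if urlparse(s[0]).hostname in TRUSTED_DOMAINS]

for source, page, claim in trusted(harvested):
    print(page, "labelled", claim, "by", source)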

My concern is that any data model requiring some level of trust to achieve workable interoperability may only address very small (and niche) use cases, and even if a lot of such niche use cases might be grouped into a whole category consistently addressed by RDFa (perhaps alongside other models), the result might not be a significant enough use case to fit the current specification guidelines (which are somewhat hostile to (XML) extensibility, as far as I've understood them) -- though those guidelines might be changed when and if really needed.

A concern of mine is that it is unclear what the required level of usefulness is. The "google highlight" element (once called m, but I think it changed its name again) is currently in the spec; the longdesc attribute currently isn't. I presume these facts boil down to judgement calls by the editor while the spec is still an early draft, but it is not easy to understand what information would determine whether something is "sufficiently important", which makes it hard to determine whether it is worth the considerable investment of discussing it in this group, or easier to just go through the W3C process of objecting later on.

cheers

Chaals

--
Charles McCathieNevile  Opera Software, Standards Group
    je parle français -- hablo español -- jeg lærer norsk
http://my.opera.com/chaals       Try Opera: http://www.opera.com
