Charles McCathieNevile wrote:
On Sun, 04 Jan 2009 03:51:53 +1100, Calogero Alex Baldacchino <alex.baldacch...@email.it> wrote:

Charles McCathieNevile wrote:
... it shouldn't be too difficult to create a custom parser, conforming to the RDFa spec and making use of data-* attributes...

That is, since RDFa can be "emulated" somehow in HTML5 and tested without changing the current specification, perhaps there isn't a strong need for early adoption of the former; instead, an "emulated" merger could be tested first within the current timeline.

In principle this is possible. But the data-* attributes are designed for private usage, and introducing a public usage means creating a risk of clashes that pollute RDFa data gathered this way. In other words, this is indeed feasible, but one would expect it to show that the data generated was unreliable (unless privately nobody is interested in basic terms like about).

This is why I was thinking of something like "data-rdfa-about", "data-rdfa-property", "data-rdfa-content" and so on. For the purposes of an RDFa processor working on top of HTML5 UAs (perhaps in a test phase, if needed at all, of course), an element's dataset would then give access to "rdfa-about" instead of just "about"; that is, the "rdfa-" prefix would act like a namespace prefix in XML (as if there were "rdfa:about" instead of "data-rdfa-about" in the markup).

This way, the public exposure of RDFa attributes on top of the generic and normally private dataset feature might be circumscribed enough to avoid clashes. That is, if RDFa shows its best benefits when used to address small-scale needs involving trusted/reliable (meta-)data, it should be fair to assume that all involved parties are aware that each of them is using RDFa, and aren't just running an RDFa processor in the hope of gathering enough information.

From this point of view, it should be quite unlikely to find people using "data-rdfa-about" to express different semantics in the same page (whereas data-property might cause ambiguity, for instance), just as it is (or should be) quite unlikely to find different namespaces using the very same prefix in the same XML document. That is, I think choosing a name that includes a namespace prefix for a data-* attribute (and also for a class on a generic container such as a div or a span, to tell that it represents an external element) can replicate XML extensibility for custom uses quite safely, to some extent, without requiring wide support for it in text/html documents, since it seems that XHTML extensibility is not a major concern, at least not enough to be worth merging it into HTML.

Just an idea, though.
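
To make the data-rdfa-* idea concrete, here is a minimal sketch in Python of a processor that reads only the prefixed attributes and ignores every other data-* attribute. The "data-rdfa-" prefix is the convention proposed above, not anything defined by the RDFa or HTML5 specs; the class and function names are invented for illustration, and the sketch strips the whole prefix to recover the plain RDFa attribute name.

```python
from html.parser import HTMLParser

# Assumption: the "data-rdfa-" convention proposed in this message.
RDFA_PREFIX = "data-rdfa-"


class DataRdfaCollector(HTMLParser):
    """Collect data-rdfa-* attributes, keyed by their RDFa attribute name."""

    def __init__(self):
        super().__init__()
        self.elements = []  # list of (tag, {rdfa_attribute: value})

    def handle_starttag(self, tag, attrs):
        # Keep only attributes carrying the prefix; strip it to get the RDFa name.
        rdfa = {
            name[len(RDFA_PREFIX):]: value
            for name, value in attrs
            if name.startswith(RDFA_PREFIX)
        }
        if rdfa:
            self.elements.append((tag, rdfa))


def collect_rdfa(html):
    """Return the prefixed attributes found in an HTML fragment."""
    collector = DataRdfaCollector()
    collector.feed(html)
    collector.close()
    return collector.elements


sample = (
    '<div data-rdfa-about="#alex" data-property="private-use">'
    '<span data-rdfa-property="foaf:name">Alex</span>'
    "</div>"
)
found = collect_rdfa(sample)
```

The privately used data-property attribute on the div is skipped, which is the clash-avoidance the prefix is meant to provide.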

However, AIUI, the current XML serialization (xhtml5) allows the use of namespaces and prefixed attributes, so couldn't a proper namespace be introduced for RDFa attributes, so that they can be used, if needed, in xhtml5 documents? I think this might be a valuable choice, because it seems to me that RDFa attributes can be used to address cases where metadata must stay as close as possible to the corresponding data, but a mistake in a piece of markup may trigger the adoption agency or foster parenting algorithms, possibly causing a separation between metadata and content and thus breaking the reliability of the gathered information. From this perspective, a parser stopping on the very first error might give quicker feedback than one rearranging misnested elements as far as is reasonably possible (which does not affect, and indeed improves, content presentation and the users' "direct" experience, but may cause side effects with metadata).

Also, if the above is true, using namespaced and prefixed attributes, instead of ones lying in the same namespace shared (in theory) by both html5 and xhtml5, might prevent the use of such metadata in a document whose parsing rules might lead to such side effects.
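
As a rough illustration of the xhtml5 variant, the following Python sketch uses the standard library's strict XML parser: prefixed attributes are read by namespace, and misnested markup is rejected outright instead of being rearranged. The namespace URI is a placeholder invented for this example (no such namespace has been defined for RDFa attributes), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: a placeholder namespace URI, invented for this example only.
RDFA_NS = "http://example.org/rdfa-attrs#"


def rdfa_attributes(xml_text):
    """Return (local tag name, {rdfa attribute: value}) pairs.

    Raises ET.ParseError on the first well-formedness error.
    """
    root = ET.fromstring(xml_text)
    prefix = "{" + RDFA_NS + "}"  # ElementTree spells namespaced names as {uri}local
    result = []
    for element in root.iter():
        attrs = {
            key[len(prefix):]: value
            for key, value in element.attrib.items()
            if key.startswith(prefix)
        }
        if attrs:
            local_name = element.tag.rsplit("}", 1)[-1]
            result.append((local_name, attrs))
    return result


well_formed = (
    '<div xmlns="http://www.w3.org/1999/xhtml" '
    'xmlns:rdfa="http://example.org/rdfa-attrs#" rdfa:about="#alex">'
    '<span rdfa:property="foaf:name">Alex</span></div>'
)
# Swap the closing tags to simulate a markup mistake.
misnested = well_formed.replace("</span></div>", "</div></span>")

extracted = rdfa_attributes(well_formed)

try:
    rdfa_attributes(misnested)
    rejected = False
except ET.ParseError:
    rejected = True  # the strict parser stops instead of repairing the tree
```

With a text/html parser the second document would be repaired silently, which is where metadata and content could drift apart.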

Such results have been used to suggest that poorly implemented features should be dropped, but this hypothetical case suggests to me that the argument is wrong, and that if people use such features even in the face of reasons why the data would be bad, one might expect better usage from formalising the status of such features and getting decent implementations.


Generally speaking, I think reasoning in terms of "poor implementation" vs "rare usage" is like a dog chasing its own tail, because poorly implemented features are necessarily rarely used, and rarely used features can't convince UA developers to implement them (in general). But if a feature is widely needed, several hacks may be born, thus providing evidence of a global problem to be solved in a certain manner by implementing a certain, well-conceived feature.

As far as I've understood it, that's the main guideline for changing the current specification, which moves forward through a kind of "bullet-tracing" evolution (perhaps weighted by the need for completely new features, as a balance between the need for innovation and that for backward compatibility), rather than a "cathedral-wise" definition of what is or can be a useful feature to implement. For this reason, I think that mapping RDFa attributes onto data-rdfa-* attributes, to experiment with a convergence between RDFa attributes and html5-specific features, might be a starting point for getting RDFa attributes both specified and widely supported by implementations (either as they're defined in the W3C Recommendation, or in the form of data-rdfa-*, hence dealt with differently from other data-* attributes, for backward compatibility with such early implementations; a slightly different (or somehow prefixed) name shouldn't be much of a problem, as long as the name is not a problem per se (e.g. it is not prone to clashes) and allows a one-to-one correspondence).

However, if a custom/small-scale solution met with wide support and deep integration into major browsers, misuses and abuses (which proper formalisation couldn't prevent) might become widespread, thus making the disadvantages greater (or appear greater) than the advantages when measured on a wider scale (the same scale as the implementation). Therefore, I think a good starting point can consist of partly introducing support on top of existing features (in the case of RDFa, either through well-groomed, custom data-* attributes in html5, or by defining a proper namespace with a proper prefix for xhtml5), without requiring deep integration of a processor for the new feature, and instead letting it be a (custom) plugin/extension, or an API for a (custom) web application that needs it. After all, a person just wishing to get access to some content, without caring about metadata and metadata reliability, can just visit a page, while an organisation wishing to interchange RDFa-modelled data with another one can run a separate processor (possibly a webapp based on a browser built-in API, or a plugin, to create a suitable interface for queries) to extract and merge information.

What is the cost of having different data use specialised formats?

If the data model, or a part of it, is not explicit as in RDF but is implicit in code made to treat it (as is the case with using scripts to process things stored in arbitrarily named data-* attributes, and is also the case in using undocumented or semi-documented XML formats), it requires people to understand the code as well as the data model in order to use the data. In a corporate situation where hundreds or tens of thousands of people are required to work with the same data, this makes the data model very fragile.


I'm not sure RDF(a) solves such a problem. AIUI, RDFa just binds (xml) properties and attributes (in the form of curies) to RDF concepts, modelling certain kinds of relationships, whereas it relies on external schemata to define such properties. Any undocumented or semi-documented XML format may lead to misuse and, thus, to unreliably modelled data,
...

I think the same applies to data-* attributes, because _they_ describe data (and data semantics) in a custom model and thus _they_ need to be documented for others to be able to manipulate them; the use of a custom script rather than a built-in parser does not change much from this point of view.

RDFa binds data to RDF. RDF provides a well-known schema language with machine-processable definition of vocabularies, and how to merge information between them. In other words, if you get the underlying model for your data right enough, people will be able to use it without needing to know what you do.

Naturally not everyone will get their data model right, and naturally not all information will be reliable anyway. However, it would seem to me that making it harder to merge the data in the first place does not assist in determining whether it is useful. On the other hand, certain forms of RDF data such as POWDER, FOAF, Dublin Core and the like have been very carefully modelled, and are relatively well-known and re-used in other data models. Making it easy to parse this data and merge it, according to the existing well-developed models seems valuable.
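
As a small, simplified illustration of why a shared model makes merging cheap: if triples are treated as plain (subject, predicate, object) tuples, merging two sources that use the same vocabulary is just a set union, and a query written against the vocabulary works on the merged data unchanged. The FOAF and Dublin Core property URIs below are the real ones; the example.org subjects and the helper name are made up, and the sketch ignores blank nodes and literal datatypes, which a real RDF merge has to handle.

```python
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
ALEX = "http://example.org/people#alex"  # hypothetical subject URI

# Triples gathered from two independent (hypothetical) sites.
site_a = {
    (ALEX, FOAF_NAME, "Alex"),
    ("http://example.org/doc/1", DC_CREATOR, ALEX),
}
site_b = {
    (ALEX, FOAF_NAME, "Alex"),
    ("http://example.org/doc/2", DC_CREATOR, ALEX),
}

# Same vocabulary, same identifiers: merging is a set union, duplicates collapse.
merged_graph = site_a | site_b


def subjects_with(graph, predicate, obj):
    """Return the sorted subjects that have the given predicate and object."""
    return sorted(s for s, p, o in graph if p == predicate and o == obj)


docs_by_alex = subjects_with(merged_graph, DC_CREATOR, ALEX)
```

Neither site needed to know about the other's code for the merged query to work; they only needed to agree on the vocabulary.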


I admit I'm not very expert in the use of RDF, so I have a few questions. Specifically, I can guess the advantages of using the same (carefully modelled and well-known) vocabulary or vocabularies; but when two organizations develop their own vocabularies, similar yet different, to model the same kind of information, is merging the data enough? Can a processor give more than a collection of triples, to be then interpreted based on knowledge of the vocabularies used?

I mean, I assume my tools can extract RDF(a) data from whatever document, but my query interface is based on my own vocabulary: when I merge information from an external vocabulary, do I need to translate one vocabulary into the other (or at least to modify the query backend, so that certain curies are recognized as representing the same concepts - e.g. to tell my software that 'foaf:name' and 'ex:someone' are equivalent, for my purposes)? If so, merging data might be the minor part of the work I need to do, compared with non-RDF(a) metadata (that is, I'd have tools to extract and merge data anyway, and once I had translated external metadata to my format, I could use my own tools to merge data), especially if the same model is used by both my organization and an external one (therefore requiring an easier translation).
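
To make the question concrete, here is a minimal Python sketch of the kind of translation step I have in mind: predicates from the external vocabulary are rewritten to my own before the two sets of triples are merged. The equivalence between 'ex:someone' and 'foaf:name' is the hypothetical one from the example above, and the function and variable names are invented; whether RDF tooling can take this step off my hands is exactly what I am asking.

```python
# Assumption: the hypothetical equivalence mentioned in the text.
EQUIVALENT = {"ex:someone": "foaf:name"}


def normalise(triples, equivalents):
    """Rewrite predicates to the preferred vocabulary, leaving others untouched."""
    return {(s, equivalents.get(p, p), o) for s, p, o in triples}


mine = {("#alex", "foaf:name", "Alex")}
theirs = {("#charles", "ex:someone", "Charles")}

# Translate first, then merge; the query only knows about 'foaf:name'.
combined = normalise(mine, EQUIVALENT) | normalise(theirs, EQUIVALENT)
names = sorted(o for s, p, o in combined if p == "foaf:name")
```

Without the normalise step, the query over 'foaf:name' would silently miss the external record.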

Thus, I'm thinking the most valuable benefit of using RDF/RDFa is the certainty that both parties are using the very same data model, despite the possible use of different vocabularies -- it seems to me that the concept of triples consisting of a subject, a predicate and an object is somewhat similar to a many-to-many association in a database, whereas one might prefer a one-to-many approach - though the former might be a natural choice for modelling data which are usually sparse, as in document prose.


Ian wrote:
For search engines, I am not convinced. Google's experience is that natural language processing of the actual information seen by the actual end user is far, far more reliable than any source of metadata. Thus from Google's perspective, investing in RDFa seems like a poorer investment than investing in natural language processing.

Indeed. But Google is something of an edge case, since they can afford to run a huge organisation with massive computer power and many engineers to address a problem where a "near-enough" solution brings them the users who are in turn the product they sell to advertisers. There are many other use cases where a small group of people want a way to reliably search trusted data.


I think the point with general-purpose search engines is a different one: natural language processing, while expensive, gives a far more accurate solution than RDFa and/or any other kind of metadata can to a problem where data can never be assumed to be trustworthy (and where, instead, a data processor must be able to determine the data's level of trust without any external aid).

No, I don't think so. Google searches based on analysis of the open web are *not* generally more reliable than faceted searches over a reliable dataset, and in some instances are less reliable.

The point is that only a few people can afford to invest in being a general-purpose search engine, whereas many can afford to run a metadata-based search system over a chosen dataset, that responds to their needs (and doesn't require either publishing their data, or paying Google to index it).


My point is that the assumptions one can make about dataset reliability are the borderline between wide-scale data extraction/classification, which is the main problem solved by a general-purpose search engine and implies that the best default assumption is that dataset reliability is uncertain, and (very) small-scale data modelling, where a direct and immediate evaluation of dataset reliability is possible and easy, so that a custom search engine could reliably be based on such metadata. I think no comparison is possible between the two scales, and thus no generalization is possible when trying to guess whether metadata can do more good than harm; instead, each case should be analysed separately, and everyone should agree on which context (possibly both) is the best one for RDFa, to understand the best way to implement it and whether it is worth introducing into html5 -- as far as I can tell, both of us agree that small-scale use is the main context.

But perhaps some edge case should be considered to draw a better picture. For instance, one such case might be a browser making use of metadata to search for a resource in its local history, or within a web page and related/linked pages (to a certain degree and level of depth). Its scale would be small with respect to the actual number of scanned resources, but wide with respect to the potential number of sources for those resources; that is, a browser implementing a metadata extraction and merging engine and a query interface to look up the gleaned information would deal with a limited number of heterogeneous sources at a given time.

Once major browsers provided (and exposed by default) such functionality, a growing number of users would (try to) use it, and thus a growing number of sites would experiment with metadata. At the beginning everything might work fine, since only honest sites (such as wikis, for instance) would experiment with honest metadata, but once the number of sites and users making use of metadata reached a threshold, spammers would start including spam metadata in their sites (with otherwise trustworthy content) and in other sites through advertisements. Such a scenario might lead to a bad balance between benefits and disadvantages for the average user, thus pushing (some) browser vendors to limit or even wholly drop native support, and I guess this is not a desirable outcome for the Semantic Web industry.

That is, choosing a proper level of integration for RDF(a) support in a web browser might make the difference between success and failure. I don't know what the best possible level is, but I guess the deepest may be the worst. So starting from external support through plugins, or scripts to be embedded in a webapp, working on top of other features might work fine and lead to better, native support by all vendors, yet limited to an API for custom applications -- whereas any change to html to include RDFa attributes would be fully meaningful only if it led to full support and exposed features for making use of metadata, which I don't think is much of a benefit for the great majority of (home) users.

All of the above is just IMHO, of course.

WBR, Alex


