Re: [whatwg] Trying to work out the problems solved by RDFa

Calogero Alex Baldacchino Thu, 08 Jan 2009 17:54:35 -0800

Charles McCathieNevile wrote:

On Sun, 04 Jan 2009 03:51:53 +1100, Calogero Alex Baldacchino <alex.baldacch...@email.it> wrote:
Charles McCathieNevile wrote:
... it shouldn't be too difficult to create a custom parser, conforming to the RDFa spec and making use of data-* attributes...
That is, since RDFa can be "emulated" somehow in HTML5 and tested without changing the current specification, perhaps there isn't a strong need for an early adoption of the former, and instead an "emulated" merger might be tested first within the current timeline.
In principle this is possible. But the data-* attributes are designed for private usage, and introducing a public usage means creating a risk of clashes that pollute RDFa data gathered this way. In other words, this is indeed feasible, but one would expect it to show that the data generated was unreliable (unless privately nobody is interested in basic terms like about).

This is why I was thinking about something like "data-rdfa-about", "data-rdfa-property", "data-rdfa-content" and so on, so that, for the purposes of an RDFa processor working on top of HTML5 UAs (perhaps in a test phase, if needed at all, of course), an element's dataset would give access to "rdfa-about" instead of just "about". That is, the prefix "rdfa-" would act like a namespace prefix in xml (as if there were "rdfa:about" instead of "data-rdfa-about" in the markup).
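
As a rough illustration of the idea above, here is a minimal sketch in Python (standard library only) of a collector that picks up data-rdfa-* attributes and ignores every other data-* attribute. The class and function names, and the sample markup, are my own assumptions for illustration; this is not an RDFa processor, only the prefix-filtering step.

```python
from html.parser import HTMLParser

# Hypothetical naming convention taken from the paragraph above.
PREFIX = "data-rdfa-"


class DataRdfaCollector(HTMLParser):
    """Collects data-rdfa-* attributes per element, skipping other data-* ones."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (tag, {rdfa_name: value})

    def handle_starttag(self, tag, attrs):
        # Keep only attributes carrying the agreed prefix, and strip it,
        # so "data-rdfa-about" is reported as "about".
        rdfa = {
            name[len(PREFIX):]: value
            for name, value in attrs
            if name.startswith(PREFIX)
        }
        if rdfa:
            self.found.append((tag, rdfa))


def collect(markup):
    """Return the (tag, attributes) pairs found in a piece of markup."""
    parser = DataRdfaCollector()
    parser.feed(markup)
    parser.close()
    return parser.found


# Made-up sample: data-property is a private attribute and must be ignored.
sample = (
    '<div data-rdfa-about="#alex" data-property="private-use">'
    '<span data-rdfa-property="foaf:name">Alex</span></div>'
)
```

Here collect(sample) reports the "about" on the div and the "property" on the span, while the private data-property attribute never reaches the collected data.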

This way, the public exposure of RDFa attributes on top of the generic and normally private dataset feature might be circumscribed enough to avoid clashes. That is, if RDFa shows its best benefits when used to address small-scale needs involving trusted/reliable (meta-)data, it should be fair to assume that all involved parties are aware that each one is using RDFa, and aren't just running an RDFa processor in the hope of gathering enough information.

From this point of view, it should be quite unlikely to find people using "data-rdfa-about" to express different semantics in the same page (whereas data-property might cause ambiguity, for instance), just as it is (or should be) quite unlikely to find different namespaces using the very same prefix in the same xml document. That is, I think choosing a name that includes a namespace prefix for a data-* attribute (and also for a class on a generic container such as a div or a span, to tell that it represents an external element) can replicate xml extensibility for custom uses quite safely, to some extent, without requiring wide support for it in text/html documents - since it seems that xhtml extensibility is not a major concern, at least not enough to be worth merging it into html.


Just an idea, though.

However, AIUI, the actual xml serialization (xhtml5) allows the use of namespaces and prefixed attributes, so couldn't a proper namespace be introduced for RDFa attributes, so that they can be used, if needed, in xhtml5 documents? I think this might be a valuable choice, because it seems to me that RDFa attributes can be used to address cases where metadata must stay as close as possible to the corresponding data, but a mistake in a piece of markup may trigger the adoption agency or foster parenting algorithms, eventually causing a separation between metadata and content, and thus possibly breaking the reliability of the gathered information. From this perspective, a parser stopping at the very first error might give quicker feedback than one rearranging misnested elements as far as is reasonably possible (which does not harm, and indeed improves, content presentation and users' "direct" experience, but may cause side-effects with metadata).
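
To make the "stop at the first error" behaviour concrete, here is a small sketch using Python's standard xml.etree.ElementTree as a stand-in for a strict xml parser. The namespace URI is a placeholder I made up (no such namespace is defined for this purpose in the thread), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI, assumed only for this example.
RDFA_NS = "http://example.org/rdfa#"


def rdfa_attributes(xml_text):
    """Return (local_name, value) pairs for attributes in RDFA_NS,
    or None if the document is not well-formed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # A strict parser refuses the whole document at the first error,
        # so no metadata can end up attached to the wrong content.
        return None
    marker = "{" + RDFA_NS + "}"
    found = []
    for element in root.iter():
        for name, value in element.attrib.items():
            if name.startswith(marker):
                found.append((name[len(marker):], value))
    return found


# Well-formed markup with a prefixed attribute.
good = (
    '<p xmlns:rdfa="http://example.org/rdfa#">'
    '<b rdfa:property="foaf:name">Alex</b></p>'
)
# Misnested <i> and <b>: an html parser would rearrange them, xml rejects them.
bad = (
    '<p xmlns:rdfa="http://example.org/rdfa#">'
    '<b rdfa:property="foaf:name"><i>Alex</b></i></p>'
)
```

With the misnested input the function returns None instead of a best-effort result, which is the quicker feedback I mean above.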

Also, if the above is true, using namespaced and prefixed attributes instead of ones lying in the same namespace shared by both html5 and xhtml5 (in theory) might prevent the use of such metadata in a document whose parsing rules might lead to such side-effects.

Such results have been used to suggest that poorly implemented features should be dropped, but this hypothetical case suggests to me that the argument is wrong, and that if in the face of reasons why the data would be bad people use them, one might expect better usage by formalising the status of such features and getting decent implementations.

Generally speaking, I think reasoning in terms of "poor implementation" vs "rare usage" is quite like a dog chasing its own tail, because poorly implemented features are necessarily rarely used, and rarely used features can't convince UA developers to implement them (in general). But, if a feature is widely needed, several hacks may be born, thus providing evidence of a global problem to be solved in a certain manner by implementing a certain, well-conceived feature.

As far as I've understood it, that's the main guideline for changing the actual specification, which moves on the basis of a bullet-tracing evolution (perhaps weighted by the need for completely new features, as a balance between the need for innovation and that for backward compatibility), rather than a "cathedral-wise" definition of what is or can be a useful feature to be implemented. For this reason, I think that mapping RDFa attributes onto data-rdfa-* attributes, to experiment with a convergence between RDFa attributes and html5 specific features, might be a starting point to get RDFa attributes both specified and widely supported by implementations. They could be supported either as they're defined in the W3C Recommendation, or in the form of data-rdfa-* (hence dealt with differently from other data-* attributes) for backward compatibility with such early implementations. A slightly different (or somehow prefixed) name shouldn't be much of a problem, as long as the name is not a problem per se (e.g. it is not prone to clashes) and allows a one-to-one correspondence.

However, if a custom/small-scale solution met with wide support and deep integration into major browsers, maybe misuses and abuses (which a proper formalisation couldn't prevent) might become widespread, thus making the disadvantages (appear or be) greater than the advantages, if measured on a wider scale (the same as the implementation). Therefore, I think a good starting point can consist of partly introducing support on top of existing features (in the case of RDFa, either through well-groomed, custom data-* attributes in html5, or by defining a proper namespace with a proper prefix for xhtml5), without requiring deep integration of a processor for the new feature, but instead letting it be a (custom) plugin/extension, or an api for a (custom) web application needing it. After all, a person just wishing to get access to some content, without caring about metadata and metadata reliability, could just visit a page, while an organisation wishing to interchange RDFa-modelled data with another one can run a separate processor (eventually a webapp based on a browser built-in API, or a plugin, to create a suitable interface for queries) to extract and merge information.

What is the cost of having different data use specialised formats?
If the data model, or a part of it, is not explicit as in RDF but is implicit in code made to treat it (as is the case with using scripts to process things stored in arbitrarily named data-* attributes, and is also the case in using undocumented or semi-documented XML formats), it requires people to understand the code as well as the data model in order to use the data. In a corporate situation where hundreds or tens of thousands of people are required to work with the same data, this makes the data model very fragile.
I'm not sure RDF(a) solves such a problem. AIUI, RDFa just binds (xml) properties and attributes (in the form of curies) to RDF concepts, modelling a certain kind of relationship, whereas it relies on external schemata to define such properties. Any undocumented or semi-documented XML formats may lead to misuses and, thus, to unreliably modelled data,
...
I think the same applies to data-* attributes, because _they_ describe data (and data semantics) in a custom model and thus _they_ need to be documented for others to be able to manipulate them; the use of a custom script rather than a built-in parser does not change much from this point of view.
RDFa binds data to RDF. RDF provides a well-known schema language with machine-processable definition of vocabularies, and how to merge information between them. In other words, if you get the underlying model for your data right enough, people will be able to use it without needing to know what you do.
Naturally not everyone will get their data model right, and naturally not all information will be reliable anyway. However, it would seem to me that making it harder to merge the data in the first place does not assist in determining whether it is useful. On the other hand, certain forms of RDF data such as POWDER, FOAF, Dublin Core and the like have been very carefully modelled, and are relatively well-known and re-used in other data models. Making it easy to parse this data and merge it, according to the existing well-developed models seems valuable.

I admit I'm not very expert in RDF use, so I have a few questions. Specifically, maybe I can guess the advantages when using the same (carefully modelled, and well-known) vocabulary/ies; but when two organizations develop their own vocabularies, similar yet different, to model the same kind of information, is merging of data enough? Can a processor give more than a collection of triples, to be then interpreted based on knowledge of the vocabulary/ies used?

I mean, I assume my tools can extract RDF(a) data from any document, but my query interface is based on my own vocabulary: when I merge information from an external vocabulary, do I need to translate one vocabulary to the other (or at least to modify the query backend, so that certain curies are recognized as representing the same concepts - e.g. to tell my software that 'foaf:name' and 'ex:someone' are equivalent, for my purposes)? If so, merging data might be the minor part of the work I need to do, compared with non-RDF(a) metadata (that is, I'd have tools to extract and merge data anyway, and once I had translated external metadata to my format, I could use my own tools to merge data), especially if the same model is used both by my own and by an external organization (therefore requiring an easier translation).
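
To show what I mean by translating one vocabulary to the other before querying, here is a minimal sketch in plain Python, with triples as tuples and a hand-written equivalence table. The table, the function names and the sample data are all assumptions of mine; a real RDF toolkit may well offer a better way to express such equivalences.

```python
# Hypothetical equivalence table: external predicate -> my own predicate.
EQUIVALENT = {"ex:someone": "foaf:name"}


def normalise(triples, equivalent=EQUIVALENT):
    """Rewrite predicates of an external vocabulary into my own."""
    return {(s, equivalent.get(p, p), o) for s, p, o in triples}


def merge(*graphs):
    """Merging itself is just a set union, once predicates are aligned."""
    merged = set()
    for graph in graphs:
        merged |= normalise(graph)
    return merged


def query(graph, predicate):
    """A query backend that only knows my own vocabulary."""
    return sorted((s, o) for s, p, o in graph if p == predicate)


# Made-up data from two organisations using different vocabularies.
mine = {("#alex", "foaf:name", "Alex")}
theirs = {("#chaals", "ex:someone", "Charles")}
```

In this sketch the union is the trivial part; the work is in writing the equivalence table, which is exactly the step I am asking about. Without it, a query for 'foaf:name' over the merged triples would miss the external entry.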

Thus, I'm thinking the most valuable benefit of using RDF/RDFa is the certainty that both parties are using the very same data model, despite the possible use of different vocabularies -- it seems to me that the concept of triples consisting of a subject, a predicate and an object is somehow similar to a many-to-many association in a database, whereas one might prefer a one-to-many approach - though the former might be a natural choice to model data which are usually sparse, as in document prose.
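
The database analogy can be sketched with Python's standard sqlite3 module: a single three-column table behaves like a generic many-to-many association, since any subject may have many objects for a predicate and any object may be shared by many subjects. The table name, column names and sample rows are made up for illustration.

```python
import sqlite3


def build(triples):
    """Store triples in one association-style table (in memory)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE triple (subject TEXT, predicate TEXT, object TEXT)"
    )
    db.executemany("INSERT INTO triple VALUES (?, ?, ?)", triples)
    return db


def objects_of(db, subject, predicate):
    """All objects linked to a subject through a given predicate."""
    rows = db.execute(
        "SELECT object FROM triple "
        "WHERE subject = ? AND predicate = ? ORDER BY object",
        (subject, predicate),
    )
    return [row[0] for row in rows]


# Made-up sample: '#alex' is both a subject and an object of foaf:knows.
db = build([
    ("#alex", "foaf:knows", "#chaals"),
    ("#alex", "foaf:knows", "#ian"),
    ("#ian", "foaf:knows", "#alex"),
])
```

Sparse data fits this shape easily, because a subject with no value for a predicate simply has no row.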

Ian wrote:
For search engines, I am not convinced. Google's experience is that natural language processing of the actual information seen by the actual end user is far, far more reliable than any source of metadata. Thus from Google's perspective, investing in RDFa seems like a poorer investment than investing in natural language processing.
Indeed. But Google is something of an edge case, since they can afford to run a huge organisation with massive computer power and many engineers to address a problem where a "near-enough" solution brings them the users who are in turn the product they sell to advertisers. There are many other use cases where a small group of people want a way to reliably search trusted data.
I think the point with general purpose search engines is another one: natural language processing, while expensive, grants a far more accurate solution than RDFa and/or any other kind of metadata can bring to a problem which requires that data never need to be trusted (and, instead, a data processor must be able to determine the data's level of trust without any external aid).
No, I don't think so. Google searches based on analysis of the open web are *not* generally more reliable than faceted searches over a reliable dataset, and in some instances are less reliable.
The point is that only a few people can afford to invest in being a general-purpose search engine, whereas many can afford to run a metadata-based search system over a chosen dataset, that responds to their needs (and doesn't require either publishing their data, or paying Google to index it).

My point is that the possible assumptions about dataset reliability mark the borderline between wide-scale data extraction/classification, which is the main problem solved by a general purpose search engine and implies that the best default assumption is that dataset reliability is uncertain, and (very) small-scale data modelling, where a direct and immediate evaluation of dataset reliability is possible and easy to do, so that a custom search engine could reliably be based on such metadata. I think no comparison is possible between the two scales, thus no generalization is possible when trying to guess whether metadata can do more good than harm. Instead each case should be analysed separately, and everyone should agree on which is the best context (eventually both) where RDFa should be used, in order to understand what's the best way to implement it and whether it's worth introducing in html5 -- as far as I can tell, both of us agree that small-scale is the main context.

But perhaps some edge case should be considered to draw a better picture. For instance, one such case might be a browser making use of metadata to search for a resource in its local history, or within a web page and related/linked pages (to a certain degree and level of depth). Its scale would be small with respect to the effective number of scanned resources, but wide with respect to the potential number of sources for those resources; that is, a browser implementing a metadata extraction and merging engine and a query interface to look for gleaned information would deal with a limited number of heterogeneous sources at a given time.

Once major browsers provided (and exposed by default) such a functionality, a growing number of users would (try and) use it, thus a growing number of sites would experiment with metadata. At the beginning everything might work fine, since only honest sites would experiment with honest metadata (such as wikis, for instance), but once the number of sites and users making use of metadata reached a threshold, spammers would start including spam metadata in their sites (with otherwise trustworthy content) and in other sites through advertisements. Such a scenario might lead to a bad balance between benefits and disadvantages for the average user, thus pushing (some) browser vendors to limit or even to wholly drop native support, and I guess this is not a desirable outcome for the Semantic Web Industry.

That is, choosing a proper level of integration for RDF(a) support into a web browser might divide success from failure. I don't know what the best possible level is, but I guess the deepest may be the worst, so starting from external support through plugins, or scripts to be embedded in a webapp, working on top of other features, might work fine and lead to better, native support by all vendors, yet limited to an API for custom applications -- whereas any changes to html to include RDFa attributes would be fully meaningful only if leading to full support and exposed features to make use of metadata, which I don't think is much of a benefit for the great majority of (home) users.


Everything, IMHO

WBR, Alex


