Summary:
I believe that there are use cases for RDFa - and that they are precisely
the sort of thing that Yahoo, Google, Ask, and their ilk are not going to
be interested in, since they are based on solving problems that those
search engines do not efficiently solve, such as (among others) using
private data or dealing with trustworthy data to answer very specific
questions automatically.
If Ian needs to understand the Semantic Web Industry and why people have
invested in the RDFa proposal, then it is important to identify the right
questions, and having him alone identify the sub-questions when he doesn't
understand the issue isn't going to help him make a well-informed decision.
Some of Ian's questions are discussed here. I cut the mail "short" since I
think it is already too long for many people, which means that the debate
will simply pass without their reading or input.
On Wed, 31 Dec 2008 20:46:01 +1100, Ian Hickson <i...@hixie.ch> wrote:
One of the outstanding issues for HTML5 is the question of whether HTML5
should solve the problem that RDFa solves, e.g. by embedding RDFa
...
Before I can determine whether we should solve this problem, and before I
can evaluate proposals for solving this problem, I need to learn what the
problem is.
Earlier this year, there was a thread on RDFa on the WHATWG list. Very
little of the thread focused on describing the problem. This e-mail is an
attempt to work out what the problem is based on that feedback, on
discussions at the recent TPAC, and on other research I have done.
On Mon, 25 Aug 2008, Manu Sporny wrote:
Ian Hickson wrote:
> I have no idea what problem RDFa is trying to solve. I have no idea
> what the requirements are.
Web browsers currently do not understand the meaning behind human
statements or concepts on a web page. If web browsers could understand
that a particular page was describing a piece of music, a movie, an
event, a person or a product, the browser could then help the user find
more information about the particular item in question. It would help
automate the browsing experience. Not only would the browsing experience
be improved, but search engine indexing quality would be better due to a
spider's ability to understand the data on the page with more accuracy.
Let's see if I can rephrase that in terms of requirements.
* Web browsers should be able to help users find information related to
the items that the page they are looking at discusses.
* Search engines should be able to determine the contents of pages with
more accuracy than today.
Is that right?
Are those the only requirements/problems that RDFa is attempting to
address? If not, what other requirements are there?
I don't think so. I think there are some other requirements:
A standard way to include arbitrary data in a web page and extract it for
machine processing, without producers and consumers having to
pre-coordinate their data models.
Since many people use RDF as an interchange, storage and processing format
for this kind of data (because it provides for automated mapping of data
from one schema to many others, without requiring anyone to touch the
original schemata or agree in advance on how they should be created), I
believe there is a requirement for a method that allows third parties to
include RDF data in, and extract it from, an HTML page.
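To make the idea concrete, here is a minimal sketch of what such generic
extraction could look like, using only the Python standard library. It is
NOT a conforming RDFa parser (no prefix resolution, no subject chaining,
and it only reads `about`/`property`/`content` attributes); the vocabulary
and example page are invented. The point is that the extractor needs no
knowledge of the vocabulary used in the page:

```python
# Sketch: pull (subject, property, object) triples out of RDFa-style
# attributes with a generic extractor that knows nothing about the
# vocabulary. Hypothetical markup; not a conforming RDFa processor.
from html.parser import HTMLParser

class RDFaSketch(HTMLParser):
    def __init__(self):
        super().__init__()
        self.subject = None      # most recent `about` attribute seen
        self.triples = []        # (subject, property, object) tuples

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "about" in a:
            self.subject = a["about"]
        if "property" in a and "content" in a and self.subject:
            self.triples.append((self.subject, a["property"], a["content"]))

page = """
<div about="http://example.org/events/1">
  <span property="cal:summary" content="Standards meeting"></span>
  <span property="cal:dtstart" content="2009-01-15"></span>
</div>
"""
p = RDFaSketch()
p.feed(page)
print(p.triples)
```

A real RDFa processor also takes property values from element text and
resolves CURIE prefixes, but the extraction step stays just as generic.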
The Microformats community has done a remarkable job of working on the
web semantics problem, creating several different methods of expressing
common human concepts (contact information (hCard), events (hCalendar),
and audio recordings (hAudio)).
Right; with Microformats, each Microformat has its own problem space and
thus each one can be evaluated separately. It is much harder to evaluate
something when the problem space is as generic as it appears RDFa's is.
The point is that there are a very large set of very small problem spaces
relevant to a small group at a time. Like RDF itself, RDFa is meeting the
problem of allowing these people to share machine-processable data without
previously coordinating their approach.
The results of the first set of Microformats efforts were some pretty
cool applications, like the following one demonstrating how a web
browser could forward event information from your PC web browser to your
phone via Bluetooth:
http://www.youtube.com/watch?v=azoNnLoJi-4
It's a technically very interesting application. What has the adoption
rate been like? How does it compare to other solutions to the problem,
like CalDav, iCal, or Microsoft Exchange? Do people publish calendar
events much? There are a lot of Web-based calendar systems, like MobileMe
or WebCalendar. Do people expose data on their Web page that can be used
to import calendar data to these systems?
In some cases this data is indeed exposed to Webpages. However, anecdotal
evidence (which unfortunately is all that is available when trying to
study the enormous collections of data in private intranets) suggests that
this is significantly more valuable when it can be done within a
restricted access website.
...
In short, RDFa addresses the problem of a lack of a standardized
semantics expression mechanism in HTML family languages.
A standardized semantics expression mechanism is a solution. The lack of
a solution isn't a problem description. What's the problem that a
standardized semantics expression mechanism solves?
There are many, many small problems involving encoding arbitrary data in
pages - apparently at least enough to convince you that the data-*
attributes are worth incorporating.
There are many cases where being able to extract that data with a simple
toolkit from someone else's content, or using someone else's toolkit
without having to tell them about your data model, solves a local problem.
The data-* attributes, because they do not represent a formal model that
can be manipulated, are insufficient to enable sharing of tools which can
extract arbitrary modelled data.
RDF, in particular, also provides established ways of merging existing
data encoded in different existing schemata.
There are many cases where people build their own dataset and queries to
solve a local problem. As an example, Opera is not interested in asking
Google to index data related to internal developer documents, and use it
to produce further documentation we need. However, we do automatically
extract various kinds of data from internal documents and re-use it. While
Opera does not in fact use the RDF toolstack for that process, there are
many other large companies and organisations who do, and who would benefit
from being able to use RDFa in that process.
RDFa not only enables the use cases described in the videos listed
above, but all use cases that struggle with enabling web browsers and
web spiders to understand the context of the current page.
It would be helpful if we could list these use cases clearly and in
detail so that we could evaluate the solutions proposed against them.
Here's a list of the use cases and requirements so far in this e-mail:
* Web browsers should be able to help users find information related to
the items that the page they are looking at discusses.
* Search engines should be able to determine the contents of pages with
more accuracy than today.
* Exposing calendar events so that users can add those events to their
calendaring systems.
* Exposing music samples on a page so that a user can listen to all the
samples.
* Getting data out of poorly written Web pages, so that the user can find
more information about the page's contents.
* Finding more information about a movie when looking at a page about the
movie, when the page contains detailed data about the movie.
Can we list some more use cases?
Here are some other questions that I would like the answers to so that I
can better understand what is being proposed here:
Does it make sense to solve all these problems with the same syntax?
That depends on the answers to your next two questions.
Moreover, that is not actually a very good question in this case. I think
the judgement call should be whether a syntax that allows people to solve
the identified problem set consistently is sufficiently valuable (measured
in terms of the advantages weighed against the disadvantages) to justify
being part of HTML5.
What are the disadvantages of doing so?
I am not sure.
What are the advantages?
Many people will be able to use standard tools which are part of their
existing infrastructure to manipulate important data. They will be able to
store that data in a visible form, in web pages. They will also be able to
present the data easily in a form that does not force them to lose
important semantics.
People will be able to build toolkits that allow for processing of data
from webpages without knowing, a priori, the data model used for that
information.
What is the
opportunity cost of encouraging everyone to expose data in the same way?
I don't know. I don't see much of an opportunity cost.
What is the cost of having different data use specialised formats?
If the data model, or part of it, is not explicit (as it is in RDF) but is
implicit in code written to process it (as is the case with scripts that
process things stored in arbitrarily named data-* attributes, and is also
the case with undocumented or semi-documented XML formats), it requires
people to understand the code as well as the data model in order to use
the data. In a corporate situation where hundreds or tens of thousands of
people are required to work with the same data, this makes the data model
very fragile.
Such considerations also apply to larger communities, for example those
dealing with complex scientific information.
Do publishers actually want to use a common data format?
It would appear so - even in cases where they don't want to publish their
data in such an easy-to-use format for commercial reasons.
How have past efforts in creating data formats fared?
Some have been pretty successful. Dublin Core is a general format for
labelling content that is widely used. MARC records have been very
successful.
Are enough data providers actually willing to expose their data in a
machine readable manner for this to be truly useful?
To make this truly useful it doesn't need to be exposed to the public. It
would appear that organisations are prepared to make large investments in
RDF data whether they expose them or not (and some very large ones do
expose data), which suggests that this data is truly useful.
If data providers
will be willing to expose their data as RDFa, why are they not already
exposing their data in machine-readable form today?
- For example, why doesn't Amazon expose a CSV file of your usage
history, or an Atom feed of the comments for each product, or an
hProduct annotated form of their product data? (Or do they? And if so,
do we know if users use this data?)
Why would they need to?
- As another example, why doesn't Craigslist like their data being
reused in mashups? Would they be willing to allow their users to reuse
their data in these new and exciting ways, or would they go out of
their way to prevent the data from being accessible as soon as a
critical mass of users started using it?
This is a key question. Why *should* a data provider be required to offer
their product (data) for other people to use, in order to demonstrate that
the data is useful? Google, a large provider of data, insists on certain
conditions being met before it makes its services available, and that
seems perfectly reasonable to me.
Whether Craigslist actively attempts to make their data easier to
aggregate, or actively avoids facilitating that process, strikes me as
irrelevant to the question of whether there is value in enabling them to
do so. Large organisations that specialise in gathering people's data,
from Flickr to Google and Facebook to government taxation departments,
are not the only consumers and producers of data that determine value
for users.
It would seem important that the Web easily enable small-time users of
data to communicate efficiently with one another, without needing one of
the giants as an intermediary. When libraries in the Dominican Republic
want to share data, and librarians in Léon want to use that data, the Web
should facilitate that without resorting to intermediaries like Amazon or
Yahoo!. Since we already have the technology to do so in a way that
enables very powerful data models to be used without requiring
coordination, it seems odd that you don't even understand how this could
be valuable.
What will the licensing situation be like for this data? Will the
licenses allow for the reuse being proposed to solve the problems and
use cases listed above?
In some cases yes, and in some cases no. In other words, making such data
available does not distort natural market conditions one way or another.
How are Web browsers going to expose user interfaces to answer user
questions?
I am glad to see that you think user interface behaviour is in fact
important to the process of specifying HTML (I had been under the
impression that you believed the spec should not touch on it). There are
various query systems already available in browsers, from the search
engine in Opera that lets you do a free-text search on pages stored in
your history, to Tabulator, a substantial RDF browser available as a
Widget for Opera or as an extension to Firefox, which allows for a variety
of pre-configured questions as well as free-form questions.
Can only previously configured, hard-coded questions be asked,
or will Web browsers be able to answer arbitrary free-form questions from
users using the data exposed by RDFa?
Both of these are possible. The value of RDFa is that it actually supports
the possibility of asking free-form questions by using a data model that
is sufficiently well specified to enable constructions of tools that are
not dependent on being preconfigured to recognise the exact type of data
being queried (unlike, say, microformats, which require an intermediate
agreement to enable people to extract the data, and don't provide for
merging data of different types for rich queries).
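A toy version of that free-form querying can be sketched in a few lines:
because the data is uniform triples, one generic matcher answers questions
it was never configured for. Real systems use SPARQL rather than anything
this naive, and the data and vocabulary below are invented:

```python
# Sketch: a generic pattern matcher over triples. Variables are strings
# starting with "?". No part of this code knows about events or locations.
triples = [
    ("ex:gig1", "rdf:type", "cal:Event"),
    ("ex:gig1", "cal:location", "Oslo"),
    ("ex:gig2", "rdf:type", "cal:Event"),
    ("ex:gig2", "cal:location", "Boston"),
]

def match(pattern, data):
    """Return one bindings dict per triple matching the pattern."""
    results = []
    for triple in data:
        bindings = {}
        ok = True
        for p, t in zip(pattern, triple):
            if p.startswith("?"):
                if bindings.get(p, t) != t:   # conflicting re-binding
                    ok = False
                    break
                bindings[p] = t
            elif p != t:                      # constant that must match
                ok = False
                break
        if ok:
            results.append(bindings)
    return results

# "Which things are located in Oslo?" -- never hard-coded anywhere above.
events = match(("?e", "cal:location", "Oslo"), triples)
print(events)   # prints [{'?e': 'ex:gig1'}]
```

A microformat-style consumer would instead need code written specifically
for events before it could answer this; here the question arrives after
the tool is built.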
How are Web browsers that expose this data going to handle data that is
not exposed in the same format? For example, if a site exposes data in
JSON or CSV format rather than RDFa, will that data be available to the
user in the same way?
Who cares? But for those who do, this is up to Web browsers. They can
choose to implement transformations between some particular CSV data and
RDFa. The difficulty here (and therefore an illustration of the value of
RDFa) is that important details of the meaning of CSV data are only
available out of band, by looking at how the data was recorded, while RDF
allows the process of merging data originally encoded in different RDFa
vocabularies to be automated.
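One way such a transformation could look is sketched below: the column
meanings that CSV keeps out of band are written down once as predicates,
after which the rows become triples that merge with anything else using
those predicates. The column names and predicate URIs are invented for
illustration:

```python
# Sketch: lifting CSV rows into triples by making the out-of-band
# column semantics explicit, once, as a column-to-predicate mapping.
# Hypothetical data and vocabulary.
import csv
import io

raw = "id,name,city\np1,Ada,London\np2,Linus,Helsinki\n"

# The knowledge CSV leaves implicit, recorded as an explicit mapping:
predicates = {"name": "foaf:name", "city": "ex:homeCity"}

triples = []
for row in csv.DictReader(io.StringIO(raw)):
    subject = "ex:" + row["id"]
    for col, pred in predicates.items():
        triples.append((subject, pred, row[col]))
print(triples)
```

The reverse direction (RDFa to CSV) loses exactly this mapping again,
which is the asymmetry the paragraph above describes.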
...
What is the expected strategy to fight spam in these systems? Is it
expected that user agents will just collect data in the background? If
so, how are user agents expected to distinguish between pages that have
reliable data and pages that expose data that is misleading or wrong?
Aggregating data in real time is relatively expensive, so it is a strategy
better suited to answering new questions as they arise. Typical systems so far
have aggregated data in the background to deal with known queries (one
example is Google, which crawls pages in advance, anticipating searches
that match terms against the content of those pages), and use live
querying for cases where the result cannot reliably be stored (e.g.
airline reservation systems like TravelJungle or LastMinute which
determine price and availability based on constantly changing data).
Different use cases will imply different strategies for fighting spam.
Some obvious ones are to rely on trusted sites and secured and signed
data, to use reputation managers, to follow the "shape" of data over time
so that anomalies can be highlighted and checked more carefully (in the
manner of Bayesian filters for email). Some use cases don't care much
about spam, or are not very interesting to spammers. Some use cases are
private data anyway.
- Systems like Yahoo! Search and Live Search expend extraordinary
amounts of resources on spam fighting technology; such technology
would not be accessible to Web browsers unless they interacted with
anti-spam services much like browsers today interact with
anti-phishing services.
Actually, at least Opera already incorporates anti-spam technology in its
mail client. Where browsers are the primary consumers of data there is
nothing at all to suggest that they cannot incorporate anti-spam
technology directly. (Indeed, the POWDER specification is designed in part
to make that easy - and it is exactly the sort of data that might
sometimes be usefully encoded in RDFa since it is based on an RDF model).
Yet anti-phishing services have been controversial, since they involve
exposing the user's browsing history to third parties; anti-spam
services would be a significantly greater problem due to the vastly
greater level of spamming compared to phishing. What is the solution
proposed to tackle this problem?
It is not clear that this problem is any different in the context of RDFa
to the general problem already faced by the Web. In general, the solutions
proposed are the same as those already used on the Web, and of course
those in development.
- Even with a mechanism to distinguish trusted sites from spammy sites,
how would Web browsers deal with trusted sites that have been subject
to spamming attacks? This is common, for instance, on blogs or wikis.
Right. But that doesn't mean we question whether browsers should enable
blogs or wikis. Why would RDFa data be different enough to make this
question relevant?
These are not rhetorical questions, and I don't know the answers to them.
Some of them seem to be poorly phrased, although if you don't understand
why people have been working on this technology and why they think it
would be valuable to have it available in HTML I guess that is almost
inevitable.
We need detailed answers to all those questions before we can really
evaluate the various proposals that have been made here.
No, we apparently need you to personally understand the Semantic Web
Industry. Determining answers to the questions which are important is
probably helpful, but also helpful is explaining when your questions are
irrelevant because they are based on a lack of understanding. This is not
intended as a slight, but to clarify the process required to have
something as large as the "Semantic Web" (capital letters, implying the
whole W3C activity, the industry based around RDF, and so on) evaluated
for potential inclusion in the HTML5 specification.
I presume the same would apply if the "Web Services" people came and asked
to have all of their things included in HTML, and offered a specification
that could be used to achieve their desires.
...
[not clear what the context was here, so citing as it was]
> I don't think more metadata is going to improve search engines. In
> practice, metadata is so highly gamed that it cannot be relied upon.
> In fact, search engines probably already "understand" pages with far
> more accuracy than most authors will ever be able to express.
You are correct, more erroneous metadata is not going to improve search
engines. More /accurate/ metadata, however, IS going to improve search
engines. Nobody is going to argue that the system could not be gamed. I
can guarantee that it will be gamed.
However, that's the reality that we have to live with when introducing
any new web-based technology. It will be mis-used, abused and corrupted.
The question is, will it do more good than harm? In the case of RDFa
/and/ Microformats, we do think it will do more good than harm.
For search engines, I am not convinced. Google's experience is that
natural language processing of the actual information seen by the actual
end user is far, far more reliable than any source of metadata. Thus from
Google's perspective, investing in RDFa seems like a poorer investment
than investing in natural language processing.
Indeed. But Google is something of an edge case, since they can afford to
run a huge organisation with massive computer power and many engineers to
address a problem where a "near-enough" solution brings them the users
who are in turn the product they sell to advertisers. There are many other
use cases where a small group of people want a way to reliably search
trusted data.
From global virtual library systems to single websites, there are many
others who find that processing structured data is more efficient for
their needs than doing free-text analysis of web pages (something that
they effectively contract out to Google, Ask, Yahoo! and their many
competitors who specialise in it). Some of these are the people who have
decided that investing in RDFa is a far more valuable exercise than trying
to out-invest Google in natural language processing.
This email is already too long for most people to get through it :( I
believe that this discussion is going to last for some time (I cannot
imagine why, given the HTML timeline, it would need to be resolved before
June), so there will be time for others to discuss more fully the many
points Ian raises as ones he would like to understand.
cheers
Chaals
--
Charles McCathieNevile Opera Software, Standards Group
je parle français -- hablo español -- jeg lærer norsk
http://my.opera.com/chaals Try Opera: http://www.opera.com