Summary:
I believe that there are use cases for RDFa - and that they are precisely
the sort of thing that Yahoo, Google, Ask, and their ilk are not going to
be interested in, since they are based on solving problems that those
search engines do not efficiently solve, such as (among others) using
private data or dealing with trustworthy data to answer very specific
questions automatically.
If Ian needs to understand the Semantic Web Industry and why people have
invested in the RDFa proposal, then it is important to identify the right
questions, and having him alone identify the sub-questions when he doesn't
understand the issue isn't going to help him make a well-informed decision.
Some of Ian's questions are discussed here. I cut the mail "short" since I
think it is already too long for many people, which means that the debate
will simply pass without their reading or input.
On Wed, 31 Dec 2008 20:46:01 +1100, Ian Hickson <i...@hixie.ch> wrote:
One of the outstanding issues for HTML5 is the question of whether HTML5
should solve the problem that RDFa solves, e.g. by embedding RDFa
...
Before I can determine whether we should solve this problem, and before I
can evaluate proposals for solving this problem, I need to learn what the
problem is.
Earlier this year, there was a thread on RDFa on the WHATWG list. Very
little of the thread focused on describing the problem. This e-mail is an
attempt to work out what the problem is based on that feedback, on
discussions at the recent TPAC, and on other research I have done.
On Mon, 25 Aug 2008, Manu Sporny wrote:
Ian Hickson wrote:
> I have no idea what problem RDFa is trying to solve. I have no idea
> what the requirements are.
Web browsers currently do not understand the meaning behind human
statements or concepts on a web page. If web browsers could understand
that a particular page was describing a piece of music, a movie, an
event, a person or a product, the browser could then help the user find
more information about the particular item in question. It would help
automate the browsing experience. Not only would the browsing experience
be improved, but search engine indexing quality would be better due to a
spider's ability to understand the data on the page with more accuracy.
Let's see if I can rephrase that in terms of requirements.
* Web browsers should be able to help users find information related to
the items that the page they are looking at discusses.
* Search engines should be able to determine the contents of pages with
more accuracy than today.
Is that right?
Are those the only requirements/problems that RDFa is attempting to
address? If not, what other requirements are there?
I don't think so. I think there are some other requirements:
A standard way to include arbitrary data in a web page and extract it for
machine processing, without producers and consumers having to
pre-coordinate their data models.
Since many people use RDF as an interchange, storage and processing format
for this kind of data (because it provides for automated mapping of data
from one schema to many others, without requiring anyone to touch the
original schemata or agree in advance on how they should be created), I
believe there is a requirement for a method that allows third parties to
include RDF data in, and extract it from, an HTML page.
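To make the idea concrete, here is a minimal sketch of what such generic
extraction could look like, using only the Python standard library. It is
NOT a conforming RDFa parser (no prefix resolution, no subject chaining,
and it only reads `about`/`property`/`content` attributes); the vocabulary
and example page are invented. The point is that the extractor needs no
knowledge of the vocabulary used in the page:

```python
# Sketch: pull (subject, property, object) triples out of RDFa-style
# attributes with a generic extractor that knows nothing about the
# vocabulary. Hypothetical markup; not a conforming RDFa processor.
from html.parser import HTMLParser

class RDFaSketch(HTMLParser):
    def __init__(self):
        super().__init__()
        self.subject = None      # most recent `about` attribute seen
        self.triples = []        # (subject, property, object) tuples

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "about" in a:
            self.subject = a["about"]
        if "property" in a and "content" in a and self.subject:
            self.triples.append((self.subject, a["property"], a["content"]))

page = """
<div about="http://example.org/events/1">
  <span property="cal:summary" content="Standards meeting"></span>
  <span property="cal:dtstart" content="2009-01-15"></span>
</div>
"""
p = RDFaSketch()
p.feed(page)
print(p.triples)
```

A real RDFa processor also takes property values from element text and
resolves CURIE prefixes, but the extraction step stays just as generic.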
The Microformats community has done a remarkable job of working on the
web semantics problem, creating several different methods of expressing
common human concepts (contact information (hCard), events (hCalendar),
and audio recordings (hAudio)).
Right; with Microformats, each Microformat has its own problem space and
thus each one can be evaluated separately. It is much harder to evaluate
something when the problem space is as generic as it appears RDFa's is.
The point is that there are a very large set of very small problem spaces
relevant to a small group at a time. Like RDF itself, RDFa is meeting the
problem of allowing these people to share machine-processable data without
previously coordinating their approach.
The results of the first set of Microformats efforts were some pretty
cool applications, like the following one demonstrating how a web
browser could forward event information from your PC web browser to your
phone via Bluetooth:
http://www.youtube.com/watch?v=azoNnLoJi-4
It's a technically very interesting application. What has the adoption
rate been like? How does it compare to other solutions to the problem,
like CalDav, iCal, or Microsoft Exchange? Do people publish calendar
events much? There are a lot of Web-based calendar systems, like MobileMe
or WebCalendar. Do people expose data on their Web page that can be used
to import calendar data to these systems?
In some cases this data is indeed exposed to Webpages. However, anecdotal
evidence (which unfortunately is all that is available when trying to
study the enormous collections of data in private intranets) suggests that
this is significantly more valuable when it can be done within a
restricted access website.
...
In short, RDFa addresses the problem of a lack of a standardized
semantics expression mechanism in HTML family languages.
A standardized semantics expression mechanism is a solution. The lack of
a solution isn't a problem description. What's the problem that a
standardized semantics expression mechanism solves?
There are many, many small problems involving encoding arbitrary data in
pages - apparently at least enough to convince you that the data-*
attributes are worth incorporating.
There are many cases where being able to extract that data with a simple
toolkit from someone else's content, or using someone else's toolkit
without having to tell them about your data model, solves a local problem.
The data-* attributes, because they do not represent a formal model that
can be manipulated, are insufficient to enable sharing of tools which can
extract arbitrary modelled data.
RDF, in particular, also provides established ways of merging existing
data encoded in different existing schemata.
There are many cases where people build their own dataset and queries to
solve a local problem. As an example, Opera is not interested in asking
Google to index data related to internal developer documents, and use it
to produce further documentation we need. However, we do automatically
extract various kinds of data from internal documents and re-use it. While
Opera does not in fact use the RDF toolstack for that process, there are
many other large companies and organisations who do, and who would benefit
from being able to use RDFa in that process.
RDFa not only enables the use cases described in the videos listed
above, but all use cases that struggle with enabling web browsers and
web spiders to understand the context of the current page.
It would be helpful if we could list these use cases clearly and in
detail so that we could evaluate the solutions proposed against them.
Here's a list of the use cases and requirements so far in this e-mail:
* Web browsers should be able to help users find information related to
the items that the page they are looking at discusses.
* Search engines should be able to determine the contents of pages with
more accuracy than today.
* Exposing calendar events so that users can add those events to their
calendaring systems.
* Exposing music samples on a page so that a user can listen to all the
samples.
* Getting data out of poorly written Web pages, so that the user can find
more information about the page's contents.
* Finding more information about a movie when looking at a page about the
movie, when the page contains detailed data about the movie.
Can we list some more use cases?
Here are some other questions that I would like the answers to so that I
can better understand what is being proposed here:
Does it make sense to solve all these problems with the same syntax?
That depends on the answers to your next two questions.
Moreover, that is not actually a very good question in this case. I think
the judgement call should be whether a syntax that allows people to solve
the identified problem set consistently is sufficiently valuable (measured
in terms of the advantages weighed against the disadvantages) to justify
being part of HTML5.
What are the disadvantages of doing so?
I am not sure.
What are the advantages?
Many people will be able to use standard tools which are part of their
existing infrastructure to manipulate important data. They will be able to
store that data in a visible form, in web pages. They will also be able to
present the data easily in a form that does not force them to lose
important semantics.
People will be able to build toolkits that allow for processing of data
from webpages without knowing, a priori, the data model used for that
information.
What is the
opportunity cost of encouraging everyone to expose data in the same way?
I don't know. I don't see much of an opportunity cost.
What is the cost of having different data use specialised formats?
If the data model, or part of it, is not explicit (as it is in RDF) but is
implicit in code written to process it (as is the case with scripts that
process things stored in arbitrarily named data-* attributes, and is also
the case with undocumented or semi-documented XML formats), it requires
people to understand the code as well as the data model in order to use
the data. In a corporate situation where hundreds or tens of thousands of
people are required to work with the same data, this makes the data model
very fragile.
Such considerations also apply to larger communities, for example those
dealing with complex scientific information.
Do publishers actually want to use a common data format?
It would appear so - even in cases where they don't want to publish their
data in such an easy-to-use format for commercial reasons.
How have past efforts in creating data formats fared?
Some have been pretty successful. Dublin Core is a general format for
labelling content that is widely used. MARC records have been very
successful.
Are enough data providers actually willing to expose their data in a
machine readable manner for this to be truly useful?
To make this truly useful it doesn't need to be exposed to the public. It
would appear that organisations are prepared to make large investments in
RDF data whether they expose them or not (and some very large ones do
expose data), which suggests that this data is truly useful.
If data providers
will be willing to expose their data as RDFa, why are they not already
exposing their data in machine-readable form today?
- For example, why doesn't Amazon expose a CSV file of your usage
history, or an Atom feed of the comments for each product, or an
hProduct annotated form of their product data? (Or do they? And if so,
do we know if users use this data?)
Why would they need to?
- As another example, why doesn't Craigslist like their data being
reused in mashups? Would they be willing to allow their users to reuse
their data in these new and exciting ways, or would they go out of
their way to prevent the data from being accessible as soon as a
critical mass of users started using it?
This is a key question. Why *should* a data provider be required to offer
their product (data) for other people to use, in order to demonstrate that
the data is useful? Google, a large provider of data, insists on certain
conditions being met before it makes its services available, and that
seems perfectly reasonable to me.
Whether Craigslist actively attempts to make their data easier to
aggregate, or actively avoids facilitating that process, strikes me as
irrelevant to the question of whether there is value in enabling them to
do so. Large organisations that specialise in gathering people's data,
from Flickr to Google and Facebook to government taxation departments,
are not the only consumers and producers of data that determine value
for users.
It would seem important that the Web easily enable small-time users of
data to communicate efficiently with one another, without needing one of
the giants as an intermediary. When libraries in the Dominican Republic
want to share data, and librarians in Léon want to use that data, the Web
should facilitate that without resorting to intermediaries like Amazon or
Yahoo!. Since we already have the technology to do so in a way that
enables very powerful data models to be used without requiring
coordination, it seems odd that you don't even understand how this could
be valuable.
What will the licensing situation be like for this data? Will the
licenses allow for the reuse being proposed to solve the problems and
use cases listed above?
In some cases yes, and in some cases no. In other words, making such data
available does not distort natural market conditions one way or another.
How are Web browsers going to expose user interfaces to answer user
questions?
I am glad to see that you think user interface behaviour is in fact
important to the process of specifying HTML (I had been under the
impression that you believed the spec should not touch on it). There are
various query systems already available in browsers, from the search
engine in Opera that lets you do a free-text search on pages stored in
your history, to Tabulator, a substantial RDF browser available as a
Widget for Opera or as an extension to Firefox, which allows for a variety
of pre-configured questions as well as free-form questions.
Can only previously configured, hard-coded questions be asked,
or will Web browsers be able to answer arbitrary free-form questions from
users using the data exposed by RDFa?
Both of these are possible. The value of RDFa is that it actually supports
the possibility of asking free-form questions by using a data model that
is sufficiently well specified to enable constructions of tools that are
not dependent on being preconfigured to recognise the exact type of data
being queried (unlike, say, microformats, which require an intermediate
agreement to enable people to extract the data, and don't provide for
merging data of different types for rich queries).
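A toy version of that free-form querying can be sketched in a few lines:
because the data is uniform triples, one generic matcher answers questions
it was never configured for. Real systems use SPARQL rather than anything
this naive, and the data and vocabulary below are invented:

```python
# Sketch: a generic pattern matcher over triples. Variables are strings
# starting with "?". No part of this code knows about events or locations.
triples = [
    ("ex:gig1", "rdf:type", "cal:Event"),
    ("ex:gig1", "cal:location", "Oslo"),
    ("ex:gig2", "rdf:type", "cal:Event"),
    ("ex:gig2", "cal:location", "Boston"),
]

def match(pattern, data):
    """Return one bindings dict per triple matching the pattern."""
    results = []
    for triple in data:
        bindings = {}
        ok = True
        for p, t in zip(pattern, triple):
            if p.startswith("?"):
                if bindings.get(p, t) != t:   # conflicting re-binding
                    ok = False
                    break
                bindings[p] = t
            elif p != t:                      # constant that must match
                ok = False
                break
        if ok:
            results.append(bindings)
    return results

# "Which things are located in Oslo?" -- never hard-coded anywhere above.
events = match(("?e", "cal:location", "Oslo"), triples)
print(events)   # prints [{'?e': 'ex:gig1'}]
```

A microformat-style consumer would instead need code written specifically
for events before it could answer this; here the question arrives after
the tool is built.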
How are Web browsers that expose this data going to handle data that is
not exposed in the same format? For example, if a site exposes data in
JSON or CSV format rather than RDFa, will that data be available to the
user in the same way?
Who cares? But for those who do, this is up to Web browsers. They can
choose to implement transformations between some particular CSV data and
RDFa. The difficulty here (and therefore an illustration of the value of
RDFa) is that important details of the meaning of CSV data are only
available out of band, by looking at how the data was recorded, while RDF
allows the process of merging data originally encoded in different RDFa
vocabularies to be automated.
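One way such a transformation could look is sketched below: the column
meanings that CSV keeps out of band are written down once as predicates,
after which the rows become triples that merge with anything else using
those predicates. The column names and predicate URIs are invented for
illustration:

```python
# Sketch: lifting CSV rows into triples by making the out-of-band
# column semantics explicit, once, as a column-to-predicate mapping.
# Hypothetical data and vocabulary.
import csv
import io

raw = "id,name,city\np1,Ada,London\np2,Linus,Helsinki\n"

# The knowledge CSV leaves implicit, recorded as an explicit mapping:
predicates = {"name": "foaf:name", "city": "ex:homeCity"}

triples = []
for row in csv.DictReader(io.StringIO(raw)):
    subject = "ex:" + row["id"]
    for col, pred in predicates.items():
        triples.append((subject, pred, row[col]))
print(triples)
```

The reverse direction (RDFa to CSV) loses exactly this mapping again,
which is the asymmetry the paragraph above describes.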
...
What is the expected strategy to fight spam in these systems? Is it
expected that user agents will just collect data in the background? If
so, how are user agents expected to distinguish between pages that have
reliable data and pages that expose data that is misleading or wrong?
Aggregating data in real time is relatively expensive, so it is a strategy
better suited to answering new questions as they arise. Typical systems so far
have aggregated data in the background to deal with known queries (one
example is Google, which crawls pages in advance, anticipating searches
that match terms against the content of those pages), and use live
querying for cases where the result cannot reliably be stored (e.g.
airline reservation systems like TravelJungle or LastMinute which
determine price and availability based on constantly changing data).
Different use cases will imply different strategies for fighting spam.
Some obvious ones are to rely on trusted sites and secured and signed
data, to use reputation managers, to follow the "shape" of data over time
so that anomalies can be highlighted and checked more carefully (in the
manner of Bayesian filters for email). Some use cases don't care much
about spam, or are not very interesting to spammers. Some use cases are
private data anyway.
- Systems like Yahoo! Search and Live Search expend extraordinary
amounts of resources on spam fighting technology; such technology
would not be accessible to Web browsers unless they interacted with
anti-spam services much like browsers today interact with
anti-phishing services.
Actually, at least Opera already incorporates anti-spam technology in its
mail client. Where browsers are the primary consumers of data there is
nothing at all to suggest that they cannot incorporate anti-spam
technology directly. (Indeed, the POWDER specification is designed in part
to make that easy - and it is exactly the sort of data that might
sometimes be usefully encoded in RDFa since it is based on an RDF model).
Yet anti-phishing services have been controversial, since they involve
exposing the user's browsing history to third parties; anti-spam
services would be a significantly greater problem due to the vastly
greater level of spamming compared to phishing. What is the solution
proposed to tackle this problem?
It is not clear that this problem is any different in the context of RDFa
to the general problem already faced by the Web. In general, the solutions
proposed are the same as those already used on the Web, and of course
those in development.
- Even with a mechanism to distinguish trusted sites from spammy sites,
how would Web browsers deal with trusted sites that have been subject
to spamming attacks? This is common, for instance, on blogs or wikis.
Right. But that doesn't mean we question whether browsers should enable
blogs or wikis. Why would RDFa data be different enough to make this
question relevant?
These are not rhetorical questions, and I don't know the answers to them.
Some of them seem to be poorly phrased, although if you don't understand
why people have been working on this technology and why they think it
would be valuable to have it available in HTML I guess that is almost
inevitable.
We need detailed answers to all those questions before we can really
evaluate the various proposals that have been made here.
No, we apparently need you to personally understand the Semantic Web
Industry. Determining answers to the questions which are important is
probably helpful, but also helpful is explaining when your questions are
irrelevant because they are based on a lack of understanding. This is not
intended as a slight, but to clarify the process required to have
something as large as the "Semantic Web" (capital letters, implying the
whole W3C activity, the industry based around RDF, and so on) evaluated
for potential inclusion in the HTML5 specification.
I presume the same would apply if the "Web Services" people came and asked
to have all of their things included in HTML, and offered a specification
that could be used to achieve their desires.
...
[not clear what the context was here, so citing as it was]
> I don't think more metadata is going to improve search engines. In
> practice, metadata is so highly gamed that it cannot be relied upon.
> In fact, search engines probably already "understand" pages with far
> more accuracy than most authors will ever be able to express.
You are correct, more erroneous metadata is not going to improve search
engines. More /accurate/ metadata, however, IS going to improve search
engines. Nobody is going to argue that the system could not be gamed. I
can guarantee that it will be gamed.
However, that's the reality that we have to live with when introducing
any new web-based technology. It will be mis-used, abused and corrupted.
The question is, will it do more good than harm? In the case of RDFa
/and/ Microformats, we do think it will do more good than harm.
For search engines, I am not convinced. Google's experience is that
natural language processing of the actual information seen by the actual
end user is far, far more reliable than any source of metadata. Thus from
Google's perspective, investing in RDFa seems like a poorer investment
than investing in natural language processing.
Indeed. But Google is something of an edge case, since they can afford to
run a huge organisation with massive computer power and many engineers to
address a problem where a "near-enough" solution brings them the users
who are in turn the product they sell to advertisers. There are many other
use cases where a small group of people want a way to reliably search
trusted data.
From global virtual library systems to single websites, there are many
others who find that processing structured data is more efficient for
their needs than doing free-text analysis of web pages (something that
they effectively contract out to Google, Ask, Yahoo! and their many
competitors who specialise in it). Some of these are the people who have
decided that investing in RDFa is a far more valuable exercise than trying
to out-invest Google in natural language processing.
This email is already too long for most people to get through it :( I
believe that this discussion is going to last for some time (I cannot
imagine why, given the HTML timeline, it would need to be resolved before
June), so there will be time for others to discuss more fully the many
points Ian raises as ones he would like to understand.
cheers
Chaals
--
Charles McCathieNevile Opera Software, Standards Group
je parle français -- hablo español -- jeg lærer norsk
http://my.opera.com/chaals Try Opera: http://www.opera.com