On Wed, Dec 31, 2008 at 10:41 PM, Charles McCathieNevile <cha...@opera.com> wrote:
> A standard way to include arbitrary data in a web page and extract it for
> machine processing, without having to pre-coordinate their data models.

This isn't a requirement (or in other words, a problem), it's a solution. What are the problems that need to be solved, and for which a standard way to include arbitrary data in a web page and extract it easily would be helpful? (Note: I think there certainly *are* problems that *would* find this helpful; I'm just trying to lead your argument in the right direction.)

(As well, since the discussion is about RDFa specifically, not data-markup in general, what are the problems that need RDFa *specifically* as a solution, as compared to the myriad other ways to embed data?)

> Since many people use RDF as an interchange, storage and processing format
> for this kind of data (because it provides for automated mapping of data
> from one schema to many others, without requiring anyone to touch the
> original schemata or agree in advance how they should be created), I believe
> there is a requirement for a method that allows third parties to include RDF
> data in, and extract it from information encoded within an HTML page.

Solutions for this already exist; embedded N3 in a <script> tag, just to name something that Ian already mentioned, allows you to mash RDF data into a page in a machine-extractable way, and brings in any of the specific ancillary benefits of RDF.

>>> The Microformats community has done a remarkable job of working on the
>>> web semantics problem, creating several different methods of expressing
>>> common human concepts (contact information (hCard), events (hCalendar),
>>> and audio recordings (hAudio)).
>>
>> Right; with Microformats, each Microformat has its own problem space and
>> thus each one can be evaluated separately. It is much harder to evaluate
>> something when the problem space is as generic as it appears RDFa's is.
>
> The point is that there are a very large set of very small problem spaces
> relevant to a small group at a time.
> Like RDF itself, RDFa is meeting the
> problem of allowing these people to share machine-processable data without
> previously coordinating their approach.

Not quite correct. Again, the problem of embedding shareable data in a web page has been solved multiple times. The specific problem of sharing *RDF* data (due to needing/wanting the specific benefits RDF can offer) has also been solved. What are the precise problems that require *RDFa* as a solution?

(I won't belabor this point, though it could be brought up several more times in your email. This is and was the primary point of contention between RDFa supporters and those of us who aren't convinced it belongs in the HTML5 spec. It is the major thrust of much of Ian's email; he's trying to help you (RDFa supporters in general, that is) find exactly what the problem is that RDFa specifically is trying to solve.)

> Moreover, that is not actually a very good question in this case. I think
> the judgement call should be whether a syntax that allows people to solve
> the identified problem set consistently is sufficiently valuable (measured
> in terms of the advantages weighed against the disadvantages) to justify
> being part of HTML5.

Well, there are many things that would offer more advantages than disadvantages by themselves. We can't possibly include all of them in the spec; think of every addition as carrying a large hidden disadvantage of 'will grow the size of the spec and the amount of work implementors have to do'. Thus the advantages must generally be significantly larger than the disadvantages; this is why the best argument for including something in the spec is often "there are already widespread hacks to accomplish this". <video>, for example, was included based on pretty much precisely that argument.

Of course, that just means that we've identified a problem that is significant enough to be solved in the spec.
There is still significant work involved in ensuring that we identify a solution that actually hits the problem squarely; the existing hacks are usually inadequate, not through any true fault of their own, but merely because their authors had not considered the problem broadly enough, or lacked enough eyes to find rough edges and missing spots.

>> What are the advantages?
>
> Many people will be able to use standard tools which are part of their
> existing infrastructure to manipulate important data. They will be able to
> store that data in a visible form, in web pages. They will also be able to
> present the data easily in a form that does not force them to lose important
> semantics.
>
> People will be able to build toolkits that allow for processing of data from
> webpages without knowing, a priori, the data model used for that
> information.

Part of the point of Ian's email is that this is not a problem that is solved by RDFa; it's a problem that's solved by *any* sufficient data format. Many solutions currently exist which don't require any addition to the spec.

>> What is the
>> opportunity cost of encouraging everyone to expose data in the same way?
>
> I don't know. I don't see much of an opportunity cost.

There is no perfect data model, or perfect representation method. Every group of data is different, has different ideal representations, and incurs some degree of cost when forced into an existing data model (that is, one not tailored to the data's specs). That cost must be considered.

>> - As another example, why doesn't Craigslist like their data being
>>   reused in mashups? Would they be willing to allow their users to reuse
>>   their data in these new and exciting ways, or would they go out of
>>   their way to prevent the data from being accessible as soon as a
>>   critical mass of users started using it?
>
> This is a key question.
> Why *should* a data provider be required to offer
> their product (data) for other people to use, in order to demonstrate that
> the data is useful. Google, a large provider of data, insists on certain
> conditions being met before it makes its services available, and that seems
> perfectly reasonably to me.
>
> Whether Craigslist actively attempts to make their data easier to aggregate,
> or actively avoids facilitating that process, strikes me as irrelevant to
> the question of whether there is value in enabling them to do so. Because
> large organisations specialising in gathering people's data, from Flickr to
> Google and Facebook to Government taxation departments are not the only
> consumers and producers of data that determine value for users.
>
> It would seem important that the Web easily enable small-time users of data
> to efficiently communicate with one another, without the need to have one of
> the giants as an intermediary. When libraries in the Dominican Republic want
> to share data, and librarians in Léon want to use that data, it seems that
> the Web should facilitate that without resorting to intermediaries like
> Amazon or Yahoo! and since we already have the technology to do so in a way
> that enables very powerful data models to be used without requiring
> coordination, it seems odd that you don't even understand how this could be
> valuable.

This is a key question precisely because many of the arguments that RDFa supporters have brought up (specifically, in the last flurry of emails to the group on this subject) hold that having RDFa will allow web users to query their browsers, which can then seek out structured data to answer their questions. If large websites are not willing to provide their data to the web-at-large in a structured format, though, then all the data formats in the world won't accomplish the goal.

In this email, though, you are largely arguing for smaller, more personal use cases.
Most of the questions are still valid, however. Problem: Librarians across the world want to share data. What are the requirements here? How does RDFa meet those requirements? Are there other solutions which meet those requirements better? Are existing solutions adequate if deployed consistently (thus negating the need for a new technology)?

Specifically, small-time users seem (to me, at least) to need RDFa as a solution the least. They can negotiate a shared data format themselves, or at least present an API that can be engineered against by others. RDF itself may be a useful tool here, if it allows reuse of existing tools and thus simplifies the process of sharing and consuming the data, but RDFa specifically is a solution for embedding this data within a web page and allowing browsers to digest it as they encounter it. This is not an appropriate solution for the sharing of catalog data between libraries; it *may* be a solution for the average web user to have their browser grab the embedded information on a page for a specific book and query for reviews on the product across the web.

This, though, once again brings up the traditional questions. Is RDFa the best solution for this? Are there existing solutions to this? Ian specifically mentioned simply Googling for the book title; this is indeed often quite adequate for a web user. Does the use of RDFa and the active involvement of the browser in this process offer enough of a benefit above just typing a phrase into the search bar to justify inclusion in the spec? If you believe so, can you explain precisely why?

>> Can only previously configured, hard-coded questions be asked,
>> or will Web browsers be able to answer arbitrary free-form questions from
>> users using the data exposed by RDFa?
>
> Both of these are possible.
> The value of RDFa is that it actually supports
> the possibility of asking free-form questions by using a data model that is
> sufficiently well specified to enable constructions of tools that are not
> dependent on being preconfigured to recognise the exact type of data being
> queried (unlike, say, microformats, which require an intermediate agreement
> to enable people to extract the data, and don't provide for merging data of
> different types for rich queries).

This is not a benefit of RDFa. It *may* be a benefit of RDF. What does RDFa bring to the table that other solutions do not? What does it take away?

> Aggregating data in real-time is relatively expensive, so is a strategy more
> suited to dealing with asking new questions. Typical systems so far have
> aggregated data in the background to deal with known queries (one example is
> Google, which crawls pages in advance, anticipating searches that match
> terms against the content of those pages),

Google is a large company, and can indeed invest resources into trawling and recording such data. This is explicitly not an option for the smaller uses you seem to be highlighting in this email, though. RDFa is specifically a (very) distributed data storage system. Can it address these sorts of problems, if the small-time users simply can't trawl the entire web for matching information? When the info is relatively contained (such that finding and reading the pages it exists on is feasible), is trawling the pages for RDFa data the best solution? Are there other solutions which would work better (such as providing an API for hitting a database)? Are there existing solutions which work adequately?

> and use live querying for cases
> where the result cannot reliably be stored (e.g. airline reservation systems
> like TravelJungle or LastMinute which determine price and availability based
> on constantly changing data).

Similarly, would these sites work by trawling reservation sites for RDFa data?
As well, what if the reservation sites aren't interested in providing the data in a machine-readable format (for example, if they want users to go directly to their sites)? Would it be better for these types of sites to hit an API provided by the reservation sites directly? Would it be better for the discount sites to trawl with custom algorithms that don't require the cooperation of the reservation sites? Within the space of page-embedded data, are there better solutions, or existing adequate solutions?

>> - Systems like Yahoo! Search and Live Search expend extraordinary
>>   amounts of resources on spam fighting technology; such technology
>>   would not be accessible to Web browsers unless they interacted with
>>   anti-spam services much like browsers today interact with
>>   anti-phishing services.
>
> Actually, at least Opera already incorporates anti-spam technology in its
> mail client. Where browsers are the primary consumers of data there is
> nothing at all to suggest that they cannot incorporate anti-spam technology
> directly. (Indeed, the POWDER specification is designed in part to make that
> easy - and it is exactly the sort of data that might sometimes be usefully
> encoded in RDFa since it is based on an RDF model).

Fighting email spam is a different problem from fighting black-hat SEO spamming. The attack surfaces presented by RDFa are much closer to the latter than the former.

>> - Even with a mechanism to distinguish trusted sites from spammy sites,
>>   how would Web browsers deal with trusted sites that have been subject
>>   to spamming attacks? This is common, for instance, on blogs or wikis.
>
> Right. But that doesn't mean we question whether browsers should enable
> blogs or wikis. Why would RDFa data be different enough to make this
> question relevant?

Users are interacting with blogs/wikis on a human level, and thus can exercise their own (admittedly poor in practice) judgement.
This is a different problem from the browser automatically parsing data on a page and removing the spam.

> I presume the same would apply if the "Web Services" people came and asked
> to have all of their things included in HTML, and offered a specification
> that could be used to achieve their desires.

Yes, they would be subject to the same questions as the RDFa spec.

> ...
>
> [not clear what the context was here, so citing as it was]
>>>
>>> > I don't think more metadata is going to improve search engines. In
>>> > practice, metadata is so highly gamed that it cannot be relied upon.
>>> > In fact, search engines probably already "understand" pages with far
>>> > more accuracy than most authors will ever be able to express.
>>>
>>> You are correct, more erroneous metadata is not going to improve search
>>> engines. More /accurate/ metadata, however, IS going to improve search
>>> engines. Nobody is going to argue that the system could not be gamed. I
>>> can guarantee that it will be gamed.
>>>
>>> However, that's the reality that we have to live with when introducing
>>> any new web-based technology. It will be mis-used, abused and corrupted.
>>> The question is, will it do more good than harm? In the case of RDFa
>>> /and/ Microformats, we do think it will do more good than harm.
>>
>> For search engines, I am not convinced. Google's experience is that
>> natural language processing of the actual information seen by the actual
>> end user is far, far more reliable than any source of metadata. Thus from
>> Google's perspective, investing in RDFa seems like a poorer investment
>> than investing in natural language processing.
>
> Indeed. But Google is something of an edge case, since they can afford to
> run a huge organisation with massive computer power and many engineers to
> address a problem where a "near-enough" solution brings themn the users who
> are in turn the product they sell to advertisers.
> There are many other use
> cases where a small group of people want a way to reliably search trusted
> data.
>
> From global virtual library systems to a single websites, there are many
> others who find that processing structured data is more efficient for their
> needs than doing free-text analysis of web pages (something that they
> effectively contract out to Google, Ask, Yahoo! and their many competitors
> who specialise in it). Some of these are the people whe have decided that
> investing in RDFa is a far more valuable exercis than trying to out-invest
> Google in natural language processing.

"Processing structured data" is something that can be done without RDFa. The reason for the resistance to RDFa from this working group so far is the lack of a sufficient number of significant problems that are best solved by RDFa specifically.

As well, the use cases for in-the-small data interchange and in-the-large data interchange are significantly different. Again, RDFa is a very distributed data storage format; you don't see the entire 'database' until you've trawled all the pages which include it. This is why there is such a focus on whether RDFa is a decent solution for search engines - they *see* the web better than anyone else, and thus appear to be able to utilize such a distributed data format more effectively than anyone else. However, Ian is pointing out that those same search engines (at least Google, though I expect Yahoo, etc. feel the same) believe that natural-language processing is a far more effective method of gathering information. It is less prone to gaming (natural language being naturally unstructured, it's harder to emit spam data that has the same statistical characteristics), and allows for extracting far more data automatically than any one user would ever think to include.

> This email is already too long for most people to get through it :( I
> believe that this discussion is going to last for some time (I cannot
> imagine why, given the HTML timeline, it would need to be resolved before
> June), so there will be time for others to discuss more fully the many
> points Ian raises as ones he would like to understand.

The HTML timeline is partially a joke (2023 is the date for 'full compliance'; there isn't a single browser yet that has fully implemented *HTML4* ^_^). We still would like things resolved with all due speed; the faster they hit the spec, the faster they'll be integrated into browsers.

Conclusion
==========

There is significant confusion (or at least lack of distinction) in your email (and generally in the arguments from RDFa supporters, in my experience) between RDFa and RDF; RDF and the general concept of data interchange formats; distributed and centralized data storage; in-the-small data interchange and in-the-large data interchange; and personal use (i.e. web users) and organization use (i.e. search engines). Each of these individually confuses the argument; when brought together as they typically are, they render many arguments completely useless.

Separating RDFa from RDF
------------------------

The bonuses/maluses of RDF itself are completely irrelevant to this discussion. This is because there already exist several methods in active use for embedding RDF in a web page. In other words, whatever problem requires you to embed RDF in a webpage has been *solved*, and without requiring any cooperation from the HTML language itself. RDFa is specifically a proposal to embed structured data in a web page using attributes on elements. *This* is the solution we need to find problems for if we want RDFa merged into the spec.

Separating RDF from general data interchange formats
----------------------------------------------------

Many of the problems that can be solved by using a common data interchange format don't require specifically what RDF brings to the table. As noted earlier in this email, every collection of data has its own shape, and its own particular 'ideal' representation. RDF forces a particular method of representation. This has its bonuses and maluses, but they are *completely separate* from the bonuses/maluses of generically using a data interchange format. Libraries don't need RDF to exchange data, they just need *some* agreement on data representation. What problems are specifically solved by RDF and its specific representation being favored in the spec over a more general method of data representation?

Separating distributed and centralized data storage
---------------------------------------------------

RDFa is a distributed data storage format - a single page includes only a fraction of the relevant data. The opposite possibility is centralized data storage - a single entity holding the data in a particular place (such as a database on their servers). The latter is very common, simple, and natural. To get at the data, you just run queries against the single database. This does require the entity with the data to produce an API to run queries against, but the same is required for use of a distributed data format (the company in charge of the site has to specifically code to expose that data in the given format). Both storage methods, though, allow sharing of data and enable all manner of useful web services. What problems are specifically solved by a distributed data strategy which are solved worse or not at all by a centralized data strategy?

Separating in-the-small and in-the-large data interchange
---------------------------------------------------------

In-the-small data interchange involves a small number of entities who can trust each other and generally receive a direct benefit from structuring and sharing their data. In-the-large data interchange involves a large number of disparate entities who *can't* trust each other and won't generally receive direct benefit from structuring their data. What problems are shared by these two situations? Which are best solved by RDFa? Are there existing solutions to these problems that are adequate?

If RDFa is intended to be for one or the other of these situations, it would be convenient for advocates to agree which it is, so that we can then focus the discussion on that. As it is, we are getting into useless arguments where someone is talking about one situation, and then someone else brings up a "Yes, but..." involving the other situation.

Separating personal consumption from corporate consumption
----------------------------------------------------------

It has already been noted that existing search engines have found metadata to be generally unreliable, and instead rely on natural-language processing to extract information from pages. Can RDFa offer better solutions to the problems of search engines than they currently employ?

Personal use is an entirely different issue. RDFa is often touted as making it easy for users to look up information about data on the page. It has also been noted, though, that simply highlighting some text (say, a song title) and selecting "Search Google for the text '...'" (specific text is from my machine; your experience may vary) does essentially the same thing, and possibly offers much more. As well, new features such as IE8's accelerators offer even more advanced functionality when you need it, such as allowing you to search IMDB.com specifically for your highlighted text, using IMDB's own search form.
Are there significant problems left in this space? Does RDFa solve them? Are they better solved by other solutions?

~TJ