Re: [whatwg] Trying to work out the problems solved by RDFa
Ben Adida wrote:
> Ian Hickson wrote:
>> We have to make sure that whatever we specify in HTML5 actually is going
>> to be useful for the purpose it is intended for. If a feature intended for
>> wide-scale automated data extraction is especially susceptible to spamming
>> attacks, then it is unlikely to be useful for wide-scale automated data
>> extraction.
>
> It's no more susceptible to spam than existing HTML, as per my previous
> response.

Perhaps this is why general-purpose search engines do not rely (entirely) on metadata and markup semantics to classify content, and neither does Yahoo with SearchMonkey. The SearchMonkey documentation points out that metadata never affects page rank, and that the semantics are not interpreted for any purpose; metadata only affects the additional information presented to the user, at the user's request, and only if the user has chosen to get information of a certain kind (gathered by a certain data service). Spammy metadata can therefore be considered contained in this case: it might corrupt SearchMonkey's additional data, but not the user's overall experience with the search engine. From this point of view, SearchMonkey is a wide-range but small-scale use case (with respect to each tool and each site the user might enable), because the user can easily choose which sources to trust (e.g. which data services to use, or which sites to consult for additional information), and in any case can get enough information without metadata.
On the other hand, a client UA implementing a feature based entirely on metadata couldn't easily contain abused metadata and bring valid information to the user's attention, nor could the average user easily tell trusted and spammy sites apart, because he wouldn't understand the problem (and a site with spammy metadata might still contain information the user was interested in previously, or in a different context). In SearchMonkey, by contrast, the average user would notice that something doesn't work in the enhanced results, but he'd still get the basic information he was looking for. Thus there are different requirements to be taken into account for different scenarios (SearchMonkey and a client UA are two such scenarios).

Moreover, SearchMonkey is a kind of centralised service based on distributed metadata. By default it doesn't need collaboration from any other UA (that is, it doesn't need support for metadata in other software), although it allows custom data services to extract metadata autonomously, always for the purposes of SearchMonkey. It only requires that web sites adhering to the project (or just willing to provide additional information) embed some kind of metadata solely to make it available to SearchMonkey services, or at least that authors create appropriate metadata and send it to Yahoo (in the form of dataRSS embedded in an Atom document). That is, SearchMonkey seems to me a clear example of a use case for metadata that doesn't require any changes to the html5 spec, since any kind of supported metadata is used by SearchMonkey as if it were custom, private metadata; whatever happens to such metadata client-side, even if it is just stripped by a browser, doesn't really matter. Furthermore, SearchMonkey supports several kinds of metadata: not only RDFa, but also eRDF, microformats and dataRSS external to the document.
So, why should SearchMonkey be the reason to introduce explicit support for RDFa and not also for eRDF, which doesn't require new attributes, just a parser? One might think one solution is better than the other, and this might be true in theory, but what really counts is what people find easier to use, and this could be determined by experience with SearchMonkey (that is, let's see what people use more often, then decide what's more needed).

Moreover, RDFa was designed for xhtml, so it can't be introduced into the html serialization just by defining a few new attributes: a processor would, or might, need some knowledge of /namespaces/, so the whole "family" of *xmlns* attributes (with and without prefixes) would have to be specified for use with the html serialization, unless an alternative mechanism, similar to the one chosen for eRDF, were defined; and that might result in a new, hybrid mechanism (stitching together pieces from eRDF and RDFa). But if we introduce xmlns and xmlns: into the html serialization, why not also prefixed attributes? That is, can RDFa be introduced into the html serialization "as is", without resorting to the whole of xml extensibility? This should be taken into account as well, because just adding new attributes to the language might work fine for xml-serialized documents, but might not for html-serialized ones. This means RDFa support might be more difficult than it seems at first glance, whereas it might not be needed for custom and/or small-scale use cases (and I think SearchMonkey is one such case).

Nobody is suggesting that user agents derive any behavior from , s
Re: [whatwg] Trying to work out the problems solved by RDFa
On 10/1/09 00:37, Ian Hickson wrote:
> On Fri, 9 Jan 2009, Ben Adida wrote:
>> Is inherent resistance to spam a condition (even a consideration) for
>> HTML5?
>
> We have to make sure that whatever we specify in HTML5 actually is going
> to be useful for the purpose it is intended for. If a feature intended for
> wide-scale automated data extraction is especially susceptible to spamming
> attacks, then it is unlikely to be useful for wide-scale automated data
> extraction.

I've been looking at such concerns a bit for RDFa. One issue (shared with HTML in general, I think) is user-supplied content (e.g. blog comments and 'rel=nofollow' scenarios). Is there any way in HTML5 to indicate that a whole chunk of a Web page is from an (in some to-be-defined sense) untrusted source?

I see http://www.whatwg.org/specs/web-apps/current-work/#link-type-nofollow

"The nofollow keyword indicates that the link is not endorsed by the original author or publisher of the page, or that the link to the referenced document was included primarily because of a commercial relationship between people affiliated with the two pages."

While I'm unsure about the "commercial relationship" clause quite capturing what's needed, the basic idea seems sound. Is there any provision (or plans) for applying this notion to entire blocks of markup, rather than just to simple hyperlinks? This would be rather useful for distinguishing embedded metadata that comes from the page author from that included from blog comments or similar.

Thanks for any pointers,

cheers,

Dan

--
http://danbri.org/
Re: [whatwg] Trying to work out the problems solved by RDFa
On Fri, 9 Jan 2009, Ben Adida wrote:
>
> SearchMonkey, which you continue to ignore, is an important use case.

When did I ignore it? I discussed it in depth in my e-mail in December, listing a number of use cases and requirements that I thought it demonstrated, and asking if there were any others I'd missed.

> Before I invest significant time in responding to your barrage of
> questions, I'm looking for a hint of objective evaluation on your end.

All I'm trying to do is evaluate things objectively. I don't know how much more I can "hint" towards this. Indeed, every question I asked in the aforementioned e-mail had no reason _other_ than to enable me to objectively evaluate the proposals.

> > Note that search engines aren't the problem here
>
> Actually, we were discussing SearchMonkey, so I think it's very much the
> context for this sub-thread.

I meant that search engines weren't the problem when it came to spam. Search engines can deal with distributed spam. The techniques developed to combat distributed spam don't really work on the scale of a single user's machine and browser.

--
Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.
Re: [whatwg] Trying to work out the problems solved by RDFa
Ian Hickson wrote:
> We have to make sure that whatever we specify in HTML5 actually is going
> to be useful for the purpose it is intended for. If a feature intended for
> wide-scale automated data extraction is especially susceptible to spamming
> attacks, then it is unlikely to be useful for wide-scale automated data
> extraction.

It's no more susceptible to spam than existing HTML, as per my previous response.

> Nobody is suggesting that user agents derive any behavior from <title>, so
> it doesn't matter if <title> is spammed or not.

And RDFa does not mandate any specific behavior, only the ability to express structure. The power lies in products like SearchMonkey that make use of this structure with innovative applications. Can one imagine tools that make poor use of this structured data so that they incentivize spam? Absolutely. Is this the bar for HTML5? If bad or poorly conceived applications can be imagined, then it's not in the standard?

> It is less likely for a user to intentionally visit a
> spammy page than for a user to visit a page that happens to contain spammy
> content embedded within it (e.g. in blog comments).

You've done plenty of web security work, and I suspect you know well that spammy RDFa is the least in a large set of problems that come with accepting arbitrary markup in blog comments. This is a strawman.

> However, browsers don't do this kind of processing --
> indeed, this kind of processing appears to be exactly what RDFa proponents
> are trying to enable (though to what end, I'm still trying to find out,
> since nobody has actually replied to all the questions I asked yet [1]).

While client-side processing is indeed an important use case (Ubiquity, Fuzzbot, etc...), it's not the only one. SearchMonkey, which you continue to ignore, is an important use case. Before I invest significant time in responding to your barrage of questions, I'm looking for a hint of objective evaluation on your end.
I thought I saw an opportunity for productive discussion based on common ground with SearchMonkey, but this has led again into a new and close-to-bogus reason for blocking consideration of RDFa. > Note that search engines aren't the problem here Actually, we were discussing SearchMonkey, so I think it's very much the context for this sub-thread. You continue to ignore SearchMonkey, for reasons which, as I've pointed out in a response earlier today, are factually incorrect. -Ben
Re: [whatwg] Trying to work out the problems solved by RDFa
Tab Atkins Jr. wrote:
> To answer your specific question, <title> is under the control of the
> site author, and search engines already have elaborate methods to tell
> a spammy site from a hammy one, thus downranking them.

And RDFa is also entirely under the control of the site author.

> On the other hand, the hypothetical attack scenario I outlined was
> about metadata that could be added to the page by external parties.

I thought your attack concerned both author markup and commenter markup. But it seems we agree on author markup: no additional risk there.

So on to commenter markup. Most blogging software already white-lists the HTML elements and attributes it allows; otherwise it is easily hacked with XSS. This means that, by default, most blogging software will strip RDFa from comments, which is exactly the right approach, since comments should not have authority over the structured data of the page.

-Ben
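To make the white-listing point concrete, here is a minimal sketch of how comment markup might be filtered so that RDFa attributes never survive. It is not taken from any particular blogging package: the `ALLOWED` table, `SAFE_URL_PREFIXES`, `CommentSanitizer` and `sanitize_comment` are hypothetical names chosen for illustration, and a production sanitizer would need considerably more care than this.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist: tag name -> attributes a commenter may use.
# RDFa attributes (about, property, typeof, rel, ...) are simply absent.
ALLOWED = {
    "a": {"href"},
    "em": set(),
    "strong": set(),
    "p": set(),
    "blockquote": set(),
}

# Assumption: only plain web links are acceptable in comments.
SAFE_URL_PREFIXES = ("http://", "https://")


class CommentSanitizer(HTMLParser):
    """Rebuilds comment markup, keeping only allowlisted tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop the tag itself; its text content is still kept
        rendered = ""
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue  # attribute is not on the allowlist
            if name == "href" and not value.lower().startswith(SAFE_URL_PREFIXES):
                continue  # refuse javascript: and similar URLs
            rendered += f' {name}="{escape(value, quote=True)}"'
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is re-escaped so it can never reintroduce markup.
        self.out.append(escape(data, quote=False))


def sanitize_comment(markup):
    parser = CommentSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

With a filter of this shape, a comment carrying `property`, `typeof` or `rel` attributes comes out with its text intact but without any structured-data claims, which is the behaviour described above.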
Re: [whatwg] Trying to work out the problems solved by RDFa
On Fri, 9 Jan 2009, Ben Adida wrote:
>
> Is inherent resistance to spam a condition (even a consideration) for
> HTML5?

We have to make sure that whatever we specify in HTML5 actually is going to be useful for the purpose it is intended for. If a feature intended for wide-scale automated data extraction is especially susceptible to spamming attacks, then it is unlikely to be useful for wide-scale automated data extraction.

> If so, where is the concern around <title>, which is clearly featured in
> search engine results?

Nobody is suggesting that user agents derive any behavior from <title>, so it doesn't matter if <title> is spammed or not. The only effect would be some spam in the user's session history. Furthermore, <title> is page-wide, meaning that the actual page author would have to spam the page for it to be spammed. It is less likely for a user to intentionally visit a spammy page than for a user to visit a page that happens to contain spammy content embedded within it (e.g. in blog comments).

If browsers were expected to crawl all pages for all links and then populate the browser's interface with the most popular links, then one would quickly expect everyone's browsers to be advertising Viagra, porn sites, and the like. However, browsers don't do this kind of processing -- indeed, this kind of processing appears to be exactly what RDFa proponents are trying to enable (though to what end, I'm still trying to find out, since nobody has actually replied to all the questions I asked yet [1]).

Note that search engines aren't the problem here -- large operations like search engines are quite capable of running the massive processing required to filter spam. The problem is automated processing on the client, where those resources aren't available.

[1] http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2008-December/018023.html

--
Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.
Re: [whatwg] Trying to work out the problems solved by RDFa
On Fri, Jan 9, 2009 at 5:13 PM, Ben Adida wrote:
> Tab Atkins Jr. wrote:
>> This brings up different issues, however.
>
> Is inherent resistance to spam a condition (even a consideration) for
> HTML5? If so, where is the concern around <title>, which is clearly
> featured in search engine results?

Well, it's something that we probably want to keep in mind, because it's so relevant for the success of any such proposal. I wouldn't want to lend support to a feature that turned out to be immediately useless due to spam. Lot of wasted effort on the WG's, Ian's, and possibly browser developers' part.

To answer your specific question, <title> is under the control of the site author, and search engines already have elaborate methods to tell a spammy site from a hammy one, thus downranking them. On the other hand, the hypothetical attack scenario I outlined was about metadata that could be added to the page by external parties.

If we were today discussing adding <title> to HTML5 to help search engines provide a short summary of a page, and part of the proposal might allow blog commenters to change the title of pages on a whim, I'd certainly be equally concerned. ^_^

~TJ
Re: [whatwg] Trying to work out the problems solved by RDFa
Tab Atkins Jr. wrote:
> This brings up different issues, however.

Is inherent resistance to spam a condition (even a consideration) for HTML5? If so, where is the concern around <title>, which is clearly featured in search engine results?

-Ben
Re: [whatwg]
We started putting a wiki page together for this that will be kept up to date here: http://esw.w3.org/topic/foaf+ssl

Henry

On 9 Jan 2009, at 00:28, Story Henry wrote:

Dear WhatWG,

I just subscribed to this list having noticed a thread earlier this month on the topic of the <keygen> tag. As it happens we are working on a protocol, foaf+ssl, where keygen turns out to be extremely useful. It allows us to create web services to give people very secure certificates which can then be used to build a secure distributed social network based on a web of trust.

The foaf+ssl protocol works as it happens with most existing browsers - though we have not done a detailed study of this yet (if people could help this would be greatly appreciated). The protocol is summarized here: http://www.w3.org/2008/09/msnws/papers/foaf+ssl.html

And you can find more on my blog at http://blogs.sun.com/bblfish .

The discussion on <keygen>, which produces spkac public keys which it sends to the server, can be found on the foaf-protocols mailing list archive under 'spkac': http://lists.foaf-project.org/pipermail/foaf-protocols/2009-January/date.html

To tell you the truth I just discovered this tag recently myself, wrote some code to test that it worked, and found it to work on Opera, Netscape, and Firefox, though it works slightly differently on each platform. http://lists.foaf-project.org/pipermail/foaf-protocols/2009-January/000153.html

I also put up a page on wikipedia: http://en.wikipedia.org/wiki/Spkac

So please do keep the <keygen> tag, and perhaps work on making it easier to work with.

Henry

Blog: http://blogs.sun.com/bblfish

Ian Hickson wrote on January 6 2009:

Over the years, several people (most of them bcc'ed) have asked for HTML5 to include a definition of <keygen>. Some have even gone as far as finding documentation on the element -- thank you.

As I understand it based on the documentation, <keygen> basically generates a public/private asymmetric cryptographic key pair, and then sends the public component as its form value.

Unfortunately, this seems completely and utterly useless, as at no point does there seem to be any way to ever use the private component either for signing or for decrypting anything, nor does there appear to be a way to use the certificate for authentication.

Without further information along these lines describing how to actually make practical use of the element, I do not intend to document <keygen> in the HTML5 specification. If anyone can fill in these holes that would be very helpful.

Cheers,
Re: [whatwg] Origins, reprise
Adam Barth wrote: On Fri, Jan 9, 2009 at 10:42 AM, Boris Zbarsky wrote: 3) Those for which the URI is same-origin with itself but no other URI (not to be confused with the globally unique identifier case). Can you give an example of this kind of URI? Yes, of course. IMAP URIs [1] have an authority component which is the IMAP server. At the same time, each message needs to be treated as a separate trust domain. Similar for the proposed nntp URIs [2]. -Boris [1] http://www.rfc-editor.org/rfc/rfc5092.txt [2] http://tools.ietf.org/html/draft-ellermann-news-nntp-uri-11
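The distinction Boris draws can be sketched as a comparison function. This is only an illustration under stated assumptions, not the HTML5 origin algorithm: which schemes fall into which group (`TUPLE_ORIGIN_SCHEMES`, `SELF_ONLY_SCHEMES`), the default-port table, and the choice to treat unknown schemes as never same-origin are all assumptions made here, and only the "same-origin with itself but no other URI" case comes from the message above.

```python
from urllib.parse import urlsplit

# Assumed classification, for illustration only.
TUPLE_ORIGIN_SCHEMES = {"http", "https", "ftp"}
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}
# Schemes where each URI is its own trust domain (Boris's case 3).
SELF_ONLY_SCHEMES = {"imap", "nntp"}


def same_origin(a, b):
    """Return True if URIs a and b should share a trust domain."""
    pa, pb = urlsplit(a), urlsplit(b)
    scheme_a, scheme_b = pa.scheme.lower(), pb.scheme.lower()
    if scheme_a != scheme_b:
        return False
    if scheme_a in TUPLE_ORIGIN_SCHEMES:
        # Classic (scheme, host, port) comparison.
        port_a = pa.port or DEFAULT_PORTS[scheme_a]
        port_b = pb.port or DEFAULT_PORTS[scheme_b]
        return (pa.hostname, port_a) == (pb.hostname, port_b)
    if scheme_a in SELF_ONLY_SCHEMES:
        # Same authority is not enough: every message is a separate
        # trust domain, so only an identical URI matches.
        return a == b
    # Assumption: anything else is never same-origin with another URI.
    return False
```

Two IMAP URIs on the same server therefore compare as different origins even though they share an authority component, while either one compares as same-origin with itself.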
Re: [whatwg] Trying to work out the problems solved by RDFa
On Fri, Jan 9, 2009 at 3:22 PM, Ben Adida wrote: > Tab Atkins Jr. wrote: >> However, Ian has a point in his first paragraph. SearchMonkey does >> *not* do auto-discovery; it relies entirely on site owners telling it >> precisely what data to extract, where it's allowed to extract it from, >> and how to present it. > > That's incorrect. > > You can build a SearchMonkey infobar that is set to function on all URLs > (just use "*" in your URL field.) > > For example, the Creative Commons SearchMonkey application: > > http://gallery.search.yahoo.com/application?smid=kVf.s > > (currently broken because of a recent change in the SearchMonkey PHP API > that we need to address, so here's a photo: > > http://www.flickr.com/photos/ysearchblog/2869419185/ > ) > > By adding the CC RDFa markup to your page, it will show up with the > infobar in Yahoo searches. Ah, hadn't considered a net-wide SearchMonkey script. Interesting. This brings up different issues, however. Something I see immediately: Say I'm a scammer. I know that the CC SearchMonkey app is in wide use (pretend, here). I start putting CC-RDF data in spam blog comments, with my own spammy stuff in the relevant fields. Now people don't even have to click on the blog link in the search results and read my obviously spammy comment to be introduced to my offers for discount Viagra! They'll just see a little CC bar, click on it to have it open in-place, and there I am. I could even hide my link in legitimate license data, so that people only hit my malicious site when they click the link to see more information about the license. Issues like these make wide-scale auto-trusted use of metadata difficult. It also makes me more reluctant to want it in the spec yet. I'd rather see the community work out these problems first. It may be that there's a relatively simple solution. It may be that the crawlers can reliably distinguish between ham and spam CC data. 
But then, it may be that there *is* no good solution enabling us to use this approach, and this kind of metadata on arbitrary sites just can't be trusted. I, personally, don't know the answer to this yet. I suspect that you don't, either; if the arbitrary-site CC infobar works at all, it's because few people *use* CC RDF yet, and so it's still limited to a community with implicit trust. > So site-specific microformats are clearly less powerful. And > vocabulary-specific microformats, while useful, are also not as useful > here (consider a SearchMonkey application that picks up CC-licensed > items, be they video, audio, books, scientific data, etc... Different > microformats = development hell.) Indeed, they are less powerful. As I explored above, though, too much power can be damning. It may be that the site-specific little-m microformat (or something equivalent, allowing a developer to extract metadata through actively targeting site structure) is powerful enough to be useful, but weak enough to *remain* useful in the face of abuse. (Also, I know CC is sort of the darling of the RDFa community, but there's significant enough debate over in-band vs out-of-band licensing info, etc. that detracts from the core issues we're trying to discuss here that it's probably not the best example to use.) > Have you read the RDFa Primer? > http://www.w3.org/TR/xhtml-rdfa-primer/ > > It describes (pre-SearchMonkey) the kind of applications that can be > built with RDFa. SearchMonkey is an ideal example, but it's by no means > the only one. Yup; I was an active participant in this discussion when it started last August. The example applications discussed in the paper, unfortunately, are precisely the kind where trusting metadata is likely a *bad* idea. For example, finding reviews of shows produced by friends of Alice, using foaf and hreview, is rife with opportunity for spamming. 
SearchMonkey seems to avoid this for the most part; when designing applications for particular URLs, at least, you are relying on relatively trustworthy data, not arbitrary data scattered across the web. Perhaps something similar has application within trusted networks, but in that case it comprises a completely different use case than what SearchMonkey hits, with possibly different requirements. ~TJ
Re: [whatwg] Trying to work out the problems solved by RDFa
Ben Adida wrote:
> Tab Atkins Jr. wrote:
>> Actually, SearchMonkey is an excellent use case, and provides a problem
>> statement.
>
> I'm surprised, but very happily so, that you agree. My confusion stems
> from the fact that Ian clearly mentioned SearchMonkey in his email a few
> days ago, then proceeded to say it wasn't a good use case.
>
> -Ben

It seems to me that this is a very custom use case. It does require metadata to be embedded in a large number of pages, but that is an optional requirement, because search results don't rely only on metadata: the server uses metadata as an optional source of information, and no collaboration from other kinds of UA is required (except, at most, from some custom data services). By contrast, a search engine using the mark element to highlight a keyword, for instance, would require a client UA to understand and style it properly; I expect it wouldn't work in IE6, for instance, because IE browsers handle unknown elements as if their content were misplaced. That is, Yahoo might develop its own data model and work fine with sites implementing it; perhaps RDF(a) was chosen because they think RDF is a natural way to model data that is scattered across a web page (and re-mapping microformats onto RDF might result in an easier implementation). In any case, the only UA that needs to understand RDFa here is SearchMonkey itself, so a client browser might just drop RDFa attributes without breaking SearchMonkey's functionality; at least, this is my first impression.

Furthermore, it's a very recent (yet potentially interesting) application, so why not wait and see how it grows, and whether the opt-in mechanism effectively prevents spam? (For example, spammers might model data on widely used vocabularies and data services, and find a way to make such data appear in searches when users ask for additional information, for instance through an ad within a page of an accomplice author, or by exploiting errors in authors' selection of URLs to be crawled for metadata, or the like.) Why not also see which model becomes the most used among RDFa, eRDF, Microformats, Atom embedding dataRSS and whatever else Yahoo might decide to support, before choosing to include one or the other in the html5 specification (or to include each of them because they are equally widespread)?

Moreover, it seems that some xml processing is needed to create a custom data service, so it might be natural to use xhtml (possibly along with namespaces and prefixed attributes) to provide metadata to such a data service, which might rely on an xml parser instead of implementing one from scratch (and an html parser might not support namespaces for the purpose of exposing them through DOM interfaces, as I understand the html serialization). The use of prefixed RDFa attributes, or perhaps even unprefixed ones, within an xml-serialized document shouldn't require formalization in the html5 spec, as long as there is no strict requirement for UAs to support RDF processing, as is the case for the purposes of SearchMonkey and its related data services.

WBR, Alex
Re: [whatwg] Trying to work out the problems solved by RDFa
Tab Atkins Jr. wrote: > However, Ian has a point in his first paragraph. SearchMonkey does > *not* do auto-discovery; it relies entirely on site owners telling it > precisely what data to extract, where it's allowed to extract it from, > and how to present it. That's incorrect. You can build a SearchMonkey infobar that is set to function on all URLs (just use "*" in your URL field.) For example, the Creative Commons SearchMonkey application: http://gallery.search.yahoo.com/application?smid=kVf.s (currently broken because of a recent change in the SearchMonkey PHP API that we need to address, so here's a photo: http://www.flickr.com/photos/ysearchblog/2869419185/ ) By adding the CC RDFa markup to your page, it will show up with the infobar in Yahoo searches. So site-specific microformats are clearly less powerful. And vocabulary-specific microformats, while useful, are also not as useful here (consider a SearchMonkey application that picks up CC-licensed items, be they video, audio, books, scientific data, etc... Different microformats = development hell.) Have you read the RDFa Primer? http://www.w3.org/TR/xhtml-rdfa-primer/ It describes (pre-SearchMonkey) the kind of applications that can be built with RDFa. SearchMonkey is an ideal example, but it's by no means the only one. -Ben
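As a rough illustration of what "adding the CC RDFa markup to your page" gives a consumer to work with, the sketch below pulls `rel="license"` links out of a page. It is not a conforming RDFa processor and says nothing about how SearchMonkey itself is implemented: it ignores `about`, CURIE prefixes and every other RDFa rule, and the names `LicenseFinder` and `find_licenses` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LicenseFinder(HTMLParser):
    """Collects the targets of rel="license" links found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        # rel is a space-separated list of tokens, compared case-insensitively.
        rel_tokens = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if "license" in rel_tokens and href:
            # Resolve relative links against the page URL.
            self.licenses.append(urljoin(self.base_url, href))


def find_licenses(markup, base_url):
    finder = LicenseFinder(base_url)
    finder.feed(markup)
    finder.close()
    return finder.licenses
```

Because the extractor keys on the vocabulary rather than on any one site's layout, the same few lines apply to a photo page, a video page or a data set, which is the contrast with site-specific extraction drawn above.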
Re: [whatwg] Origins, reprise
On Fri, Jan 9, 2009 at 10:42 AM, Boris Zbarsky wrote: > 3) Those for which the URI is same-origin with itself but no other URI > (not to be confused with the globally unique identifier case). Can you give an example of this kind of URI? Thanks, Adam
Re: [whatwg] Trying to work out the problems solved by RDFa
On Fri, Jan 9, 2009 at 2:17 PM, Ben Adida wrote: > Tab Atkins Jr. wrote: >> Actually, SearchMonkey is an excellent use case, and provides a >> problem statement. > > I'm surprised, but very happily so, that you agree. > > My confusion stems from the fact that Ian clearly mentioned SearchMonkey > in his email a few days ago, then proceeded to say it wasn't a good use > case. I apologize; looking back into my archives, it appears there was an entire subthread specifically about SearchMonkey! Also, Ian did indeed mention it in his first email in this thread. He actually gave it more attention than any other single use-case, though. I'll quote the relevant part: > On Tue, 26 Aug 2008, Ben Adida wrote: > > > > Here's one example. This is not the only way that RDFa can be helpful, > > but it should help make things more concrete: > > > > http://developer.yahoo.com/searchmonkey/ > > > > Using semantic markup in HTML (microformats and, soon, RDFa), you, as a > > publisher, can choose to surface more relevant information straight into > > Yahoo search results. > > This doesn't seem to require RDFa or any generic data syntax at all. Since > the system is site-specific anyway (you have to list the URLs you wish to > act against), the same kind of mechanism could be done by just extracting > the data straight out of the page. This would have the advantage of > working with any Web page without requiring the page to be written using a > particular syntax. > > However, if SearchMonkey is an example of a use case, then we should > determine the requirements for this feature. It seems, based on reading > the documentation, that it basically boils down to: > > * Pages should be able to expose nested lists of name-value pairs on a > page-by-page basis. > > * It should be possible to define globally-unique names, but the syntax > should be optimised for a set of predefined vocabularies. > > * Adding this data to a page should be easy. 
> > * The syntax for adding this data should encourage the data to remain > accurate when the page is changed. > > * The syntax should be resilient to intentional copy-and-paste authoring: > people copying data into the page from a page that already has data > should not have to know about any declarations far from the data. > > * The syntax should be resilient to unintentional copy-and-paste > authoring: people copying markup from the page who do not know about > these features should not inadvertently mark up their page with > inapplicable data. > > Are there any other requirements that we can derive from SearchMonkey? I agree with Ian in that SearchMonkey is not *necessarily* speaking in favor of RDFa; that may be what caused you to think he was dismissing it. In truth, Ian is merely trying to take current examples of RDFa use and distill them into their essence. (To grab my previous example, it is similar to seeing what all the various rounded-corners hacks were doing, without necessarily implying that the final solution will be anything like them. It's important to distill the actual problems that users are solving from the details of particular solutions they are using.) Like I said, I think SearchMonkey sounds absolutely awesome, and genuinely useful on a level I haven't yet seen any apps of similar nature reach. I'm exclusively a Google user, but that's something I'd love to have ported over. It's similar in nature to IE8's Accelerators, in that it's an opt-in application for users that reduces clicks to get to information they actively decide they want. However, Ian has a point in his first paragraph. SearchMonkey does *not* do auto-discovery; it relies entirely on site owners telling it precisely what data to extract, where it's allowed to extract it from, and how to present it. It is likely that this can be done entirely within the confines of current html, and the fact that SearchMonkey can use Microformats suggests that this is true. 
A possible approach is a site-owner producing an ad-hoc microformat (little m) that the crawler can match against pages and index the information of, and then offer to the SearchMonkey application for presentation as the developer wills. This would require specified parsing rules for such things (which, as mentioned in an earlier email, the big-m Microformats community is working on). The question is, would this be sufficient? Are other approaches easier for authors? RDFa, as noted, already has a specified parsing model. Does this make it easier for authors to design data templates? Easier to communicate templates to a crawler? Easier to deploy in a site? Easier to parse for a crawler? SearchMonkey makes mention of developers producing SearchMonkey apps without the explicit permission of site owners. This use would almost certainly be better served with a looser data discovery model than RDFa, so that a site owner doesn't have to explicitly comply in order for others to extract useful data from their p
Re: [whatwg] Trying to work out the problems solved by RDFa
Tab Atkins Jr. wrote: > Actually, SearchMonkey is an excellent use case, and provides a > problem statement. I'm surprised, but very happily so, that you agree. My confusion stems from the fact that Ian clearly mentioned SearchMonkey in his email a few days ago, then proceeded to say it wasn't a good use case. -Ben
Re: [whatwg] Trying to work out the problems solved by RDFa
On Fri, Jan 9, 2009 at 1:48 PM, Ben Adida wrote: > Julian Reschke wrote: >>> Because the issue is that we don't yet know if we want to support >>> RDFa. That's the whole point of this thread. Nobody's given a useful >>> problem statement yet, so we can't evaluate whether there's a problem >>> we need to solve, or how we should solve it. >> >> For the record: I disagree with that. I have the impression that no >> matter how many problems are presented, the answer is going to be: "not >> that stone -- fetch me another stone". > > For the record: I completely agree with Julian. This is why I haven't > jumped into this thread yet again. > > The key piece of evidence here is SearchMonkey, a product by Yahoo that > specifically uses RDFa. Even its microformat support funnels everything > to an RDF-like metadata approach. With thousands of application > developers and some concrete examples that specifically use RDFa (the > Creative Commons application being one of them), the message from many > on this list remains "not good enough." > > I'm not sure where the bar is, but it seems far from objective. Actually, SearchMonkey is an excellent use case, and provides a problem statement. Problem === Site owners want a way to provide enhanced search results to the engines, so that an entry in the search results page is more than just a bare link and snippet of text, and provides additional resources for users straight on the search page without them having to click into the page and discover those resources themselves. For example (taken directly from the SearchMonkey docs), yelp.com may want to provide additional information on restaurants they have reviews for, pushing info on price, rating, and phone number directly into the search results, along with links straight to their reviews or photos of the restaurant. Different sites will have vastly different needs and requirements in this regard, preventing natural discovery by crawlers from being effective. 
(SearchMonkey itself relies on the user registering an add-in on their Yahoo account, so spammers can't exploit this - the user has to proactively decide they want additional information from a site to show up in their results, then they click a link and the rest is automagical.) That really wasn't hard. I'd never seen SearchMonkey before (it's possible it was mentioned, but I know that it was never explicitly described), but it's a really sweet app that helps both authors and users. That's a check mark in my book. ~TJ
Re: [whatwg] Trying to work out the problems solved by RDFa
Julian Reschke wrote:
> Calogero Alex Baldacchino wrote:
>> ...
>> This is why I was thinking about somewhat "data-rdfa-about", "data-rdfa-property", "data-rdfa-content" and so on, so that, for the purposes of an RDFa processor working on top of HTML5 UAs (perhaps in a test phase, if needed at all, of course), an element dataset would give access to "rdfa-about", instead of just "about", that is using the prefix "rdfa-" as acting as a namespace prefix in xml (hence, as if there were "rdfa:about" instead of "data-rdfa-about" in the markup).
>> ...
>
> That clashed with the documented purpose of data-*.

Hmm, I'm not sure there is a clash, since I was suggesting a *custom* and essentially *private* mechanism to experiment with RDFa in the HTML serialization, for the *small-scale* needs of some organizations willing to embed RDFa metadata in text/html documents and to exchange them with each other, using a convention likely to avoid name clashes with other private metadata. I think it's unlikely to find data-rdfa-* used with different semantics in the very same page, and in a small-scale scenario involving a few *selected* sources of RDFa-modelled information, the parties would likely know in advance that someone else is using the same conventions. A document modelled this way might be used in conjunction with an external RDFa processor, thus avoiding any direct support in a browser.

However, such a convention might be "clash-free" enough to work on a wider scale, so it might become widespread and provide evidence that the web /needs/, or at least /has chosen/, to use RDFa as (one of) the most common ways to embed metadata in a document. That might be enough to justify adding native support for the whole range of "RDFa" attributes, possibly along with support for the earlier experimental ones (such as "data-rdfa-*" and "rdfa:*", for backward compatibility).
And actually I can't see much of a problem if a feature born private became the basis of a widespread and widely accepted convention. (I'm not saying the spec should name data-rdfa-* as a means to implement RDFa; rather I think that, if a general agreement on whether and how RDFa must be spec'ed out and implemented can't be found, such an experiment might be proposed to the semantic web industry, and we could wait for the results, given that a lack of support might prevent any interested party from using RDFa and HTML5 together.)

> *If* we want to support RDFa, why not add the attributes the way they are already named???

For instance, to test whether it is worth changing the "if we want" into "we do want", without requiring an early implementation and specification, and without relying on whether and what a certain browser vendor might want to experiment with differently from others (such a convention would only require support for HTML5 datasets and a script or a plugin capable of handling them as representing RDFa metadata). The point here is that, after introducing data-* attributes as a means to support custom attributes, browser vendors might decide to drop support for other kinds of custom attributes in the HTML serialization (that is, for attributes that are neither part of the language nor data-* ones). Therefore, if they (or any of them) decided not to support RDFa attributes until they were introduced in a specification, there might be no means to experiment with them (in general, that is cross-browser) without resorting either to data-* or to "rdfa:*" (the latter in xhtml).

Anyway, /in general/ what should a browser do with RDFa metadata, on a *wide scale*, other than classifying a portion of the open web (e.g. in its local history), possibly allowing users to select trusted sources? Actually, I don't think that would bring enough benefit to *average* users, compared to the risk of getting a lot of spam metadata from /heterogeneous/ sources.
I really don't expect average users to understand how to filter sites based on metadata reliability (and just for the purpose of using a metadata-based query interface, because a site with wrong metadata might still contain useful information); instead they might just try to use a query interface the same way they use a default search bar, get wrong results (once spam metadata became widespread) and decide the mechanism doesn't work well (possibly complaining about that). Some kind of antispam filter might help, but I think that working out whether metadata are reliable, that is whether they really correspond to a web page's content, is a hard problem for a bot to solve without a good degree of Artificial Intelligence (filtering emails by looking for suspicious patterns is far easier than implementing a filter capable of /understanding/ metadata, /understanding/ natural language and comparing /semantics/). Likewise, I don't expect the great majority of web pages to contain "valid" metadata: most people would not care about them, and a potentially growing number might copy
Re: [whatwg] Trying to work out the problems solved by RDFa
Julian Reschke wrote: >> Because the issue is that we don't yet know if we want to support >> RDFa. That's the whole point of this thread. Nobody's given a useful >> problem statement yet, so we can't evaluate whether there's a problem >> we need to solve, or how we should solve it. > > For the record: I disagree with that. I have the impression that no > matter how many problems are presented, the answer is going to be: "not > that stone -- fetch me another stone". For the record: I completely agree with Julian. This is why I haven't jumped into this thread yet again. The key piece of evidence here is SearchMonkey, a product by Yahoo that specifically uses RDFa. Even its microformat support funnels everything to an RDF-like metadata approach. With thousands of application developers and some concrete examples that specifically use RDFa (the Creative Commons application being one of them), the message from many on this list remains "not good enough." I'm not sure where the bar is, but it seems far from objective. -Ben
Re: [whatwg] Trying to work out the problems solved by RDFa
Tab Atkins Jr. wrote:
>> *If* we want to support RDFa, why not add the attributes the way they are already named???
>
> Because the issue is that we don't yet know if we want to support RDFa. That's the whole point of this thread. Nobody's given a useful problem statement yet, so we can't evaluate whether there's a problem we need to solve, or how we should solve it.

For the record: I disagree with that. I have the impression that no matter how many problems are presented, the answer is going to be: "not that stone -- fetch me another stone".

> Alex's suggestion, while officially against spec, has the benefit of allowing RDFa supporters to sort out their use cases through experience. That's the back door into the spec, after all; you don't

If something that is against the spec is acceptable, then it's *much* easier to just use the already defined attributes. Better breaking the spec by using new attributes than abusing existing ones.

> ...

BR, Julian
Re: [whatwg] Trying to work out the problems solved by RDFa
On Fri, Jan 9, 2009 at 5:46 AM, Julian Reschke wrote: > Calogero Alex Baldacchino wrote: >> >> ... >> This is why I was thinking about somewhat "data-rdfa-about", >> "data-rdfa-property", "data-rdfa-content" and so on, so that, for the >> purposes of an RDFa processor working on top of HTML5 UAs (perhaps in a test >> phase, if needed at all, of course), an element dataset would give access to >> "rdfa-about", instead of just "about", that is using the prefix "rdfa-" as >> acting as a namespace prefix in xml (hence, as if there were "rdfa:about" >> instead of "data-rdfa-about" in the markup). >> ... > > That clashed with the documented purpose of data-*. > > *If* we want to support RDFa, why not add the attributes the way they are > already named??? Because the issue is that we don't yet know if we want to support RDFa. That's the whole point of this thread. Nobody's given a useful problem statement yet, so we can't evaluate whether there's a problem we need to solve, or how we should solve it. Alex's suggestion, while officially against spec, has the benefit of allowing RDFa supporters to sort out their use cases through experience. That's the back door into the spec, after all; you don't have to do as much work to formulate a problem statement if you can point to large amounts of people hacking around a current lack, as that's a pretty strong indicator that there *is* a problem needing to be solved. As an added benefit, the fact that there's already multiple independent attempts at a solution gives us a wide pool of experience to draw from in formulating the actual spec, so as to make the use as easy as possible for authors. (An example that comes to mind in this regard is rounded corners. Usually you have to break semantics and put in junk elements to get rounded corners on a flexible box. 
This became so common that the question of whether or not rounded corners were significant enough to be added in CSS answered itself - people are trying hard to hack the support in, so it's clearly something they want, and thus it's worthwhile to spec a method (the border-radius property) to give them it. It solves a problem that authors, through their actions, made extremely clear, and it does so in a way that is enormously simpler 99% of the time. Win-win.) ~Tj
[whatwg] Origins, reprise
I've recently come across another issue with the origin definition. Right now, this says: 1) If url does not use a server-based naming authority, or if parsing url failed, or if url is not an absolute URL, then return a new globally unique identifier. 2) Return the tuple (scheme, host, port). (with some steps to determine the tuple thrown in). In Gecko, we actually have three classes of URIs for security purposes: 1) Those for which the URI is not same-origin with anything (the globally unique identifier case). 2) Those for which the URI is same-origin with anything with the same scheme+host+port. 3) Those for which the URI is same-origin with itself but no other URI (not to be confused with the globally unique identifier case). It would be nice if we could express this in terms of the origin setup, but it doesn't seem to me like that's workable as things stand... -Boris
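The three classes described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not Gecko's or the spec's actual data structures: the class names, the small default-port table standing in for "server-based naming authority", and the `self_only_schemes` parameter are all hypothetical. Which schemes belong to the third class is left to the caller, because the message does not say.

```python
from dataclasses import dataclass
from itertools import count
from urllib.parse import urlsplit

# Assumption: only these schemes are treated as server-based here.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

_next_id = count(1)


@dataclass(frozen=True)
class TupleOrigin:
    """Class 2: same-origin with anything sharing scheme+host+port."""
    scheme: str
    host: str
    port: int


@dataclass(frozen=True)
class OpaqueOrigin:
    """Class 1: a fresh identifier, so two computed origins never match."""
    ident: int


@dataclass(frozen=True)
class SelfOnlyOrigin:
    """Class 3: same-origin with the same URI, but with no other URI."""
    uri: str


def origin_of(url, self_only_schemes=frozenset()):
    """Map a URL string to one of the three origin classes."""
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError for a malformed port
    except ValueError:
        # "parsing url failed" in step 1 of the definition
        return OpaqueOrigin(next(_next_id))
    scheme = parts.scheme.lower()
    if scheme in self_only_schemes:  # hypothetical hook for class 3
        return SelfOnlyOrigin(url)
    if scheme not in DEFAULT_PORTS or not parts.hostname:
        # not server-based, or not an absolute URL
        return OpaqueOrigin(next(_next_id))
    if port is None:
        port = DEFAULT_PORTS[scheme]
    return TupleOrigin(scheme, parts.hostname, port)


def same_origin(a, b):
    """Origins match only when they are the same kind with equal fields."""
    return a == b
```

The third class cannot be folded into the first, because a fresh identifier is minted on every call, so even the same URI would not be same-origin with itself; hence the separate `SelfOnlyOrigin` type in this sketch.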
Re: [whatwg] Fuzzbot (Firefox RDFa semantics processor) (was: Trying to work out the problems solved by RDFa)
Calogero Alex Baldacchino wrote: > That is, choosing a proper level of integration for RDF(a) support into > a web browser might divide success from failure. I don't know what's the > best possible level, but I guess the deepest may be the worst, thus > starting from an external support through out plugins, or scripts to be > embedded in a webbapp, and working on top of other feature might work > fine and lead to a better, native support by all vendors, yet limited to > an API for custom applications There seems to be a bit of confusion over what RDFa can and can't do as well as the current state of the art. We have created an RDFa Firefox plugin called Fuzzbot (for Windows, Linux and Mac OS X) that is a very rough demonstration of how a browser-based RDFa processor might operate. If you're new to RDFa, you can use it to edit and debug RDFa pages in order to get a better sense of how RDFa works. There is a primer[1] to the semantic web and an RDFa basics[2] tutorial on YouTube for the completely uninitiated. The rdfa.info wiki[3] has further information. (sent to public-r...@w3.org earlier this week): We've just released a new version of Fuzzbot[4], this time with packages for all major platforms, which we're going to be using at the upcoming RDFa workshop at the Web Directions North 2009 conference[5]. Fuzzbot uses librdfa as the RDFa processing back-end and can display triples extracted from webpages via the Firefox UI. It is currently most useful when debugging RDFa web page triples. We use it to ensure that the RDFa web pages that we are editing are generating the expected triples - it is part of our suite of Firefox web development plug-ins. 
There are three versions of the Firefox XPI: Windows XP/Vista (i386) http://rdfa.digitalbazaar.com/fuzzbot/download/fuzzbot-windows.xpi Mac OS X (i386) http://rdfa.digitalbazaar.com/fuzzbot/download/fuzzbot-macosx-i386.xpi Linux (i386) - you must have xulrunner-1.9 installed http://rdfa.digitalbazaar.com/fuzzbot/download/fuzzbot-linux.xpi There is also very preliminary support for the Audio RDF and Video RDF vocabularies, demos of which can be found on YouTube[6][7]. To try it out on the Audio RDF vocab, install the plugin, then click on the Fuzzbot icon at the bottom of the Firefox window (in the status bar): http://bitmunk.com/media/6566872 There should be a number of triples that show up in the frame at the bottom of the screen as well as a music note icon that shows up in the Firefox 3 AwesomeBar. To try out the Video RDF vocab, do the same at this URL: http://rdfa.digitalbazaar.com/fuzzbot/demo/video.html Please report any installation or run-time issues (such as the plug-in not working on your platform) to me, or on the librdfa bugs page: http://rdfa.digitalbazaar.com/librdfa/trac -- manu [1] http://www.youtube.com/watch?v=OGg8A2zfWKg [2] http://www.youtube.com/watch?v=ldl0m-5zLz4 [3] http://rdfa.info/wiki [4] http://rdfa.digitalbazaar.com/fuzzbot/ [5] http://north.webdirections.org/ [6] http://www.youtube.com/watch?v=oPWNgZ4peuI [7] http://www.youtube.com/watch?v=PVGD9HQloDI -- Manu Sporny President/CEO - Digital Bazaar, Inc. blog: Fibers are the Future: Scaling Past 100K Concurrent Requests http://blog.digitalbazaar.com/2008/10/21/scaling-webservices-part-2
Re: [whatwg] Trying to work out the problems solved by RDFa
Calogero Alex Baldacchino wrote:
> ...
> This is why I was thinking about somewhat "data-rdfa-about", "data-rdfa-property", "data-rdfa-content" and so on, so that, for the purposes of an RDFa processor working on top of HTML5 UAs (perhaps in a test phase, if needed at all, of course), an element dataset would give access to "rdfa-about", instead of just "about", that is using the prefix "rdfa-" as acting as a namespace prefix in xml (hence, as if there were "rdfa:about" instead of "data-rdfa-about" in the markup).
> ...

That clashed with the documented purpose of data-*.

*If* we want to support RDFa, why not add the attributes the way they are already named???

> ...
> However, AIUI, actual xml serialization (xhtml5) allows the use of namespaces and prefixed attributes, thus couldn't a proper namespace be introduced for RDFa attributes, so they can be used, if needed, in xhtml5 documents? I think such might be a valuable choice, because it seems to me RDFa attributes can be used to address such cases where metadata must stay as close as possible to correspondent data, but a mistake in a piece of markup may trigger the adoption agency or foster parenting algorithms, eventually causing a separation between metadata and content, thus possibly breaking reliability of gathered informations. From this perspective, a parser stopping on the very first error might give a quicker feedback than one rearranging misnested elements as far as it is reasonably possible (not affecting, and instead improving, content presentation and users' "direct" experience, but possibly causing side-effects with metadata).
> ...

That would make RDFa as used in XHTML 1.* and RDFa used in HTML 5 incompatible. What for?

> ...

BR, Julian
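
For readers following the data-rdfa-* proposal debated in this thread, here is a rough Python sketch of what an external script working from such attributes might do. It is not an RDFa processor and not anything the participants wrote: it only collects attributes named data-rdfa-* per element and maps them back to the RDFa attribute names ("about", "property", "content"). The class and function names are hypothetical; `html.parser` from the standard library stands in for a browser's dataset access.

```python
from html.parser import HTMLParser


class DataRdfaCollector(HTMLParser):
    """Collects data-rdfa-* attributes, keyed by the RDFa attribute name."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (tag, {rdfa_attribute: value}) pairs

    def handle_starttag(self, tag, attrs):
        rdfa = {}
        for name, value in attrs:
            if not name.startswith("data-"):
                continue
            # What a dataset-style lookup would expose, e.g. "rdfa-about".
            dataset_name = name[len("data-"):]
            if dataset_name.startswith("rdfa-"):
                # Treat "rdfa-" like a namespace prefix and drop it.
                rdfa[dataset_name[len("rdfa-"):]] = value
        if rdfa:
            self.found.append((tag, rdfa))


def collect(markup):
    """Return the data-rdfa-* attributes found in markup, in document order."""
    parser = DataRdfaCollector()
    parser.feed(markup)
    parser.close()
    return parser.found
```

Other data-* attributes on the same element are ignored, which is the "private convention" aspect of the proposal: only the rdfa- prefix marks an attribute as belonging to this scheme.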