Re: [whatwg] Trying to work out the problems solved by RDFa

2009-01-09 Thread Calogero Alex Baldacchino

Ben Adida wrote:
> Ian Hickson wrote:
>> We have to make sure that whatever we specify in HTML5 actually is going
>> to be useful for the purpose it is intended for. If a feature intended for
>> wide-scale automated data extraction is especially susceptible to spamming
>> attacks, then it is unlikely to be useful for wide-scale automated data
>> extraction.
>
> It's no more susceptible to spam than existing HTML, as per my previous
> response.

Perhaps this is why general-purpose search engines do not rely
(entirely) on metadata and markup semantics to classify content, and
neither does Yahoo with SearchMonkey. The SearchMonkey documentation
points out that metadata never affects page rank, nor are its semantics
interpreted for any purpose; metadata only affects additional
information presented to the user at the user's request, and only if
the user chose to get information of a certain kind (gathered by a
certain data service). Spammy metadata can therefore be considered
contained in this case: it might corrupt SearchMonkey's additional
data, but not the user's overall experience with the search engine.
From this point of view, SearchMonkey is a wide-range but small-scale
use case (with respect to each tool and each site the user might
enable), because the user can easily choose which sources to trust
(e.g. which data services to use, or which sites to consult for
additional information), and in any case can get enough information
without metadata.


On the other hand, a client UA implementing a feature entirely based on
metadata couldn't easily contain abused metadata and bring valid
information to the user's attention, nor could the average user easily
tell trusted and spammy sites apart, because he wouldn't understand
the problem (and a site with spammy metadata might still contain
information users were interested in previously, or in a different
context), whereas in SearchMonkey the average user would notice that
something doesn't work in the enhanced results, but would still get the
basic information he was looking for. Thus different requirements have
to be taken into account for different scenarios (SearchMonkey and a
client UA are two such scenarios).


Moreover, SearchMonkey is a kind of centralised service based on
distributed metadata. By default it doesn't need collaboration from any
other UA (that is, it doesn't need support for metadata in other
software), although it allows custom data services to extract metadata
autonomously, always for the purposes of SearchMonkey. It only requires
that web sites adhering to the project (or just willing to provide
additional information) embed some kind of metadata for the sole
purpose of making it available to SearchMonkey services, or at least
that authors create appropriate metadata and send it to Yahoo (in the
form of dataRSS embedded in an Atom document). That is, SearchMonkey
seems to me a clear example of a use case for metadata that does not
require any changes to the html5 spec, since every kind of supported
metadata is used by SearchMonkey as if it were custom, private
metadata; whatever happens to such metadata client-side, even if it is
just stripped by a browser, doesn't really matter.


Furthermore, SearchMonkey supports several kinds of metadata: not only
RDFa, but also eRDF, microformats and dataRSS external to the document.
So, why should SearchMonkey be the reason to introduce explicit support
for RDFa and not also for eRDF, which doesn't require new attributes,
but just a parser? One might think one solution is better than the
other, and this might be true in theory, but what really counts is what
people find easier to use, and this might be determined by experience
with SearchMonkey (that is, let's see what people use more often, then
decide what's more needed).


Moreover, RDFa was designed for xhtml, thus it can't be introduced in
the html serialization just by defining a few new attributes: a
processor would or might need some knowledge of /namespaces/, thus the
whole "family" of *xmlns* attributes (with and without prefixes) should
be specified for use with the html serialization, unless an alternative
mechanism, similar to the one chosen for eRDF, were defined, and maybe
that would result in a new, hybrid mechanism (stitching together pieces
from eRDF and RDFa). But if we introduce xmlns and xmlns: into the html
serialization, why not also prefixed attributes? That is, can RDFa be
introduced into the html serialization "as is", without resorting to
the whole xml extensibility? This should be taken into account as well,
because just adding new attributes to the language might work fine for
xml-serialized documents, but might not for html-serialized ones. This
means RDFa support might be more difficult than it may seem at first
glance, whereas it might not be needed for custom and/or small-scale
use cases (and I think SearchMonkey is one such case).
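
To make the namespace point concrete, here is a minimal Python sketch.
It is purely illustrative and is not taken from the RDFa specification
or from any existing parser. It assumes an html parser hands over
attributes as plain name/value strings, so the processor itself has to
treat xmlns:* attributes as prefix declarations and expand values such
as cc:license. The helper names collect_prefixes and expand_curie are
hypothetical, and the sketch only looks at a single element, whereas
real prefix declarations would also be inherited from ancestors.

```python
def collect_prefixes(attrs):
    """Return {prefix: namespace URI} from the xmlns:* attributes of one element."""
    prefixes = {}
    for name, value in attrs.items():
        if name.startswith("xmlns:"):
            prefixes[name[len("xmlns:"):]] = value
    return prefixes


def expand_curie(curie, prefixes):
    """Expand 'prefix:local' to a full URI, or return None if the prefix is unknown."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in prefixes:
        return None
    return prefixes[prefix] + local


# Attributes as a namespace-unaware html parser might report them.
attrs = {
    "xmlns:cc": "http://creativecommons.org/ns#",
    "rel": "cc:license",
    "href": "http://creativecommons.org/licenses/by/3.0/",
}
prefixes = collect_prefixes(attrs)
license_predicate = expand_curie(attrs["rel"], prefixes)
```

Without the xmlns:cc declaration the value cc:license cannot be
resolved at all, which is the dependency described above.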


Nobody is suggesting that user agents derive any behavior from , s

Re: [whatwg] Trying to work out the problems solved by RDFa

2009-01-09 Thread Dan Brickley

On 10/1/09 00:37, Ian Hickson wrote:
> On Fri, 9 Jan 2009, Ben Adida wrote:
>> Is inherent resistance to spam a condition (even a consideration) for
>> HTML5?
>
> We have to make sure that whatever we specify in HTML5 actually is going
> to be useful for the purpose it is intended for. If a feature intended for
> wide-scale automated data extraction is especially susceptible to spamming
> attacks, then it is unlikely to be useful for wide-scale automated data
> extraction.


I've been looking at such concerns a bit for RDFa. One issue (shared
with HTML in general, I think) is user-supplied content (e.g. blog
comments and 'rel=nofollow' scenarios).  Is there any way in HTML5 to
indicate that a whole chunk of Web page is from an (in some
to-be-defined sense) untrusted source?


I see http://www.whatwg.org/specs/web-apps/current-work/#link-type-nofollow

"The nofollow keyword indicates that the link is not endorsed by the 
original author or publisher of the page, or that the link to the 
referenced document was included primarily because of a commercial 
relationship between people affiliated with the two pages."


While I'm unsure about the "commercial relationship" clause quite 
capturing what's needed, the basic idea seems sound. Is there any 
provision (or plans) for applying this notion to entire blocks of 
markup, rather than just to simple hyperlinks? This would be rather 
useful for distinguishing embedded metadata that comes from the page 
author from that included from blog comments or similar.
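
As a rough illustration of what block-level marking could buy a
metadata consumer, here is a minimal Python sketch. The data-untrusted
attribute is purely hypothetical (nothing like it is specified
anywhere, as far as I know), and the class name and the simplified
handling of the property/content attributes are assumptions made for
the example. The collector ignores any metadata found inside a subtree
carrying that marker.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not affect depth counting.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TrustedPropertyCollector(HTMLParser):
    """Collect (property, content) pairs, skipping hypothetical untrusted blocks."""

    def __init__(self):
        super().__init__()
        self.untrusted_depth = 0   # > 0 while inside an untrusted subtree
        self.properties = []       # pairs found in trusted markup only

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # "data-untrusted" is a made-up marker attribute for this sketch.
        if self.untrusted_depth > 0 or "data-untrusted" in attrs:
            if tag not in VOID:
                self.untrusted_depth += 1
            return
        if "property" in attrs:
            self.properties.append((attrs["property"], attrs.get("content")))

    def handle_endtag(self, tag):
        if self.untrusted_depth > 0 and tag not in VOID:
            self.untrusted_depth -= 1


page = """
<div>
  <span property="dc:creator" content="Page Author"></span>
  <div data-untrusted>
    <p>Nice post!<br>
      <span property="dc:creator" content="Spammer"></span></p>
  </div>
  <span property="dc:title" content="My Post"></span>
</div>
"""
collector = TrustedPropertyCollector()
collector.feed(page)
```

The sketch assumes reasonably well-nested markup; a real consumer would
need a proper tree builder to cope with tag soup.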


Thanks for any pointers,

cheers,

Dan

--
http://danbri.org/


Re: [whatwg] Trying to work out the problems solved by RDFa

2009-01-09 Thread Ian Hickson
On Fri, 9 Jan 2009, Ben Adida wrote:
> 
> SearchMonkey, which you continue to ignore, is an important use case.

When did I ignore it? I discussed it in depth in my e-mail in December, 
listing a number of use cases and requirements that I thought it 
demonstrated, and asking if there were any others I'd missed.


> Before I invest significant time in responding to your barrage of 
> questions, I'm looking for a hint of objective evaluation on your end.

All I'm trying to do is evaluate things objectively. I don't know how much 
more I can "hint" towards this.

Indeed, every question I asked in the aforementioned e-mail had no reason 
_other_ than to enable me to objectively evaluate the proposals.


> > Note that search engines aren't the problem here
> 
> Actually, we were discussing SearchMonkey, so I think it's very much the 
> context for this sub-thread.

I meant that search engines weren't the problem when it came to spam. 
Search engines can deal with distributed spam. The techniques developed to 
combat distributed spam don't really work on the scale of a single user's 
machine and browser.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Trying to work out the problems solved by RDFa

2009-01-09 Thread Ben Adida
Ian Hickson wrote:
> We have to make sure that whatever we specify in HTML5 actually is going 
> to be useful for the purpose it is intended for. If a feature intended for 
> wide-scale automated data extraction is especially susceptible to spamming 
> attacks, then it is unlikely to be useful for wide-scale automated data 
> extraction.

It's no more susceptible to spam than existing HTML, as per my previous
response.

> Nobody is suggesting that user agents derive any behavior from <title>, so
> it doesn't matter if <title> is spammed or not.

And RDFa does not mandate any specific behavior, only the ability to
express structure. The power lies in products like SearchMonkey that
make use of this structure with innovative applications.

Can one imagine tools that make poor use of this structured data so that
they incentivize spam? Absolutely. Is this the bar for HTML5? If bad or
poorly conceived applications can be imagined, then it's not in the
standard?

> It is less likely for a user to intentionally visit a 
> spammy page than for a user to visit a page that happens to contain spammy 
> content embedded within it (e.g. in blog comments).

You've done plenty of web security work, and I suspect you know well
that spammy RDFa is the least in a large set of problems that come with
accepting arbitrary markup in blog comments. This is a strawman.

> However, browsers don't do this kind of processing -- 
> indeed, this kind of processing appears to be exactly what RDFa proponents 
> are trying to enable (though to what end, I'm still trying to find out, 
> since nobody has actually replied to all the questions I asked yet [1]).

While client-side processing is indeed an important use case (Ubiquity,
Fuzzbot, etc...), it's not the only one. SearchMonkey, which you
continue to ignore, is an important use case.

Before I invest significant time in responding to your barrage of
questions, I'm looking for a hint of objective evaluation on your end. I
thought I saw an opportunity for productive discussion based on common
ground with SearchMonkey, but this has led again into a new and
close-to-bogus reason for blocking consideration of RDFa.

> Note that search engines aren't the problem here

Actually, we were discussing SearchMonkey, so I think it's very much the
context for this sub-thread. You continue to ignore SearchMonkey, for
reasons which, as I've pointed out in a response earlier today, are
factually incorrect.

-Ben


Re: [whatwg] Trying to work out the problems solved by RDFa

2009-01-09 Thread Ben Adida
Tab Atkins Jr. wrote:
> To answer your specific question, <title> is under the control of the
> site author, and search engines already have elaborate methods to tell
> a spammy site from a hammy one, thus downranking them.

And RDFa is also entirely under the control of the site author.

> On the other hand, the hypothetical attack scenario I outlined was
> about metadata that could be added to the page by external parties.

I thought your attack concerned both author markup and commenter markup.
But it seems we agree on author markup: no additional risk there.

So on to commenter markup.

Most blogging software already white-lists the HTML elements and
attributes they allow, otherwise they are easily hacked with XSS. This
means that, by default, most blogging software will strip RDFa from
comments, which is exactly the right approach, since comments should not
have authority over the structured data of the page.
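
A minimal Python sketch of the white-listing idea, for illustration
only: the ALLOWED table and the CommentSanitizer/sanitize names are
made up here and are not taken from any particular blogging package.
It shows how RDFa attributes such as rel or property vanish simply
because they are not on the list. It is not a complete XSS filter (it
does not, for instance, check the scheme of href values).

```python
from html import escape
from html.parser import HTMLParser

# Illustrative white-list: tag name -> attributes permitted on that tag.
ALLOWED = {
    "a": {"href"},
    "em": set(),
    "strong": set(),
    "p": set(),
}


class CommentSanitizer(HTMLParser):
    """Re-emit only white-listed tags and attributes; escape all text."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop the tag itself, keep processing its text content
        kept = [(n, v) for n, v in attrs if n in ALLOWED[tag] and v is not None]
        rendered = "".join(f' {n}="{escape(v, quote=True)}"' for n, v in kept)
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def sanitize(comment_html):
    """Return the comment with everything outside the white-list removed."""
    parser = CommentSanitizer()
    parser.feed(comment_html)
    parser.close()
    return "".join(parser.out)


dirty = ('<p>Great <a href="http://example.org/" rel="cc:license" '
         'property="dc:title">post</a></p>')
clean = sanitize(dirty)
```

Here the link survives but its rel and property attributes do not, so
the comment contributes no structured data to the page.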

-Ben


Re: [whatwg] Trying to work out the problems solved by RDFa

2009-01-09 Thread Ian Hickson
On Fri, 9 Jan 2009, Ben Adida wrote:
> 
> Is inherent resistance to spam a condition (even a consideration) for 
> HTML5?

We have to make sure that whatever we specify in HTML5 actually is going 
to be useful for the purpose it is intended for. If a feature intended for 
wide-scale automated data extraction is especially susceptible to spamming 
attacks, then it is unlikely to be useful for wide-scale automated data 
extraction.


> If so, where is the concern around <title>, which is clearly featured in
> search engine results?

Nobody is suggesting that user agents derive any behavior from <title>, so
it doesn't matter if <title> is spammed or not. The only effect would be
some spam in the user's session history. Furthermore, <title> is page-wide,
meaning that the actual page author would have to spam the page for
it to be spammed. It is less likely for a user to intentionally visit a
spammy page than for a user to visit a page that happens to contain spammy
content embedded within it (e.g. in blog comments).

If browsers were expected to crawl all pages for all links and then 
populate the browser's interface with the most popular links, then one 
would quickly expect everyone's browsers to be advertising Viagra, porn 
sites, and the like. However, browsers don't do this kind of processing -- 
indeed, this kind of processing appears to be exactly what RDFa proponents 
are trying to enable (though to what end, I'm still trying to find out, 
since nobody has actually replied to all the questions I asked yet [1]).

Note that search engines aren't the problem here -- large operations like 
search engines are quite capable of running the massive processing 
required to filter spam. The problem is automated processing on the 
client, where those resources aren't available.

[1] 
http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2008-December/018023.html

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Trying to work out the problems solved by RDFa

2009-01-09 Thread Tab Atkins Jr.
On Fri, Jan 9, 2009 at 5:13 PM, Ben Adida wrote:
> Tab Atkins Jr. wrote:
>> This brings up different issues, however.
>
> Is inherent resistance to spam a condition (even a consideration) for
> HTML5? If so, where is the concern around <title>, which is clearly
> featured in search engine results?

Well, it's something that we probably want to keep in mind, because
it's so relevant for the success of any such proposal.  I wouldn't
want to lend support to a feature that turned out to be immediately
useless due to spam.  Lot of wasted effort on the WG's, Ian's, and
possibly browser developers' part.

To answer your specific question, <title> is under the control of the
site author, and search engines already have elaborate methods to tell
a spammy site from a hammy one, thus downranking them.

On the other hand, the hypothetical attack scenario I outlined was
about metadata that could be added to the page by external parties.

If we were today discussing adding <title> to HTML5 to help search
engines provide a short summary of a page, and part of the proposal
might allow blog commenters to change the title of pages on a whim,
I'd certainly be equally concerned.  ^_^

~TJ


Re: [whatwg] Trying to work out the problems solved by RDFa

2009-01-09 Thread Ben Adida
Tab Atkins Jr. wrote:
> This brings up different issues, however.

Is inherent resistance to spam a condition (even a consideration) for
HTML5? If so, where is the concern around <title>, which is clearly
featured in search engine results?

-Ben


Re: [whatwg]

2009-01-09 Thread Story Henry
We started putting a wiki page together for this that will be kept up  
to date here:


http://esw.w3.org/topic/foaf+ssl

Henry

On 9 Jan 2009, at 00:28, Story Henry wrote:


Dear WhatWG,

I just subscribed to this list having noticed a thread earlier this
month on the topic of the <keygen> tag. As it happens we are working
on a protocol, foaf+ssl, where keygen turns out to be extremely useful.
It allows us to create web services to give people very secure
certificates which can then be used to build a secure distributed
social network based on a web of trust.


The foaf+ssl protocol works as it happens with most existing  
browsers - though we have not done a detailed study of this yet (if  
people could help this would be greatly appreciated). The protocol  
is summarized here:


http://www.w3.org/2008/09/msnws/papers/foaf+ssl.html

And you can find more on my blog at http://blogs.sun.com/bblfish .

The discussion on <keygen>, which produces spkac public keys which it
sends to the server, can be found on the foaf-protocols mailing list
archive under 'spkac'


http://lists.foaf-project.org/pipermail/foaf-protocols/2009-January/date.html

To tell you the truth I just discovered this tag recently myself,  
wrote some code to test that it worked, found it to work on Opera,  
Netscape, and Firefox, though it works slightly differently on each  
platform.


http://lists.foaf-project.org/pipermail/foaf-protocols/2009-January/000153.html

I also put up a page on wikipedia:

http://en.wikipedia.org/wiki/Spkac

So please do keep the tag, and perhaps work on making it easier to  
work with.


Henry

Blog: http://blogs.sun.com/bblfish


Ian Hickson wrote on January 6 2009:
Over the years, several people (most of them bcc'ed) have asked for
HTML5 to include a definition of <keygen>. Some have even gone as
far as finding documentation on the element -- thank you. As I
understand it based on the documentation, <keygen> basically
generates a public/private asymmetric cryptographic key pair, and
then sends the public component as its form value.  Unfortunately,
this seems completely and utterly useless, as at no point does
there seem to be any way to ever use the private component either
for signing or for decrypting anything, nor does there appear to be
a way to use the certificate for authentication. Without further
information along these lines describing how to actually make
practical use of the element, I do not intend to document <keygen>
in the HTML5 specification. If anyone can fill in these holes that
would be very helpful. Cheers,









Re: [whatwg] Origins, reprise

2009-01-09 Thread Boris Zbarsky

Adam Barth wrote:
> On Fri, Jan 9, 2009 at 10:42 AM, Boris Zbarsky wrote:
>> 3) Those for which the URI is same-origin with itself but no other URI
>>   (not to be confused with the globally unique identifier case).
>
> Can you give an example of this kind of URI?


Yes, of course.  IMAP URIs [1] have an authority component which is the 
IMAP server.  At the same time, each message needs to be treated as a 
separate trust domain.


Similar for the proposed nntp URIs [2].
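
A small Python sketch of what such a comparison might look like, under
stated assumptions: the PER_URI_SCHEMES set and the origin_key and
same_origin helpers are hypothetical names for illustration, not
anything from the HTML5 draft, and default ports are not normalised.
For the listed schemes the whole URI acts as the origin, so a message
is same-origin with itself and nothing else, even on the same server.

```python
from urllib.parse import urlsplit

# Assumption for illustration: schemes where each URI is its own trust domain.
PER_URI_SCHEMES = {"imap", "nntp"}


def origin_key(uri):
    """Return a comparable key standing in for the URI's origin."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    if scheme in PER_URI_SCHEMES:
        # The authority (the IMAP server) is not enough: use the whole URI.
        return (scheme, uri)
    return (scheme, parts.hostname, parts.port)


def same_origin(a, b):
    """True if both URIs map to the same origin key."""
    return origin_key(a) == origin_key(b)


msg1 = "imap://mail.example.org/INBOX/;UID=1"
msg2 = "imap://mail.example.org/INBOX/;UID=2"
web_pages_match = same_origin("http://example.org/a", "http://example.org/b")
message_matches_itself = same_origin(msg1, msg1)
messages_match = same_origin(msg1, msg2)
```

Note that this differs from the globally unique identifier case, where
an origin would not even compare equal across two loads of the same
URI.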

-Boris

[1] http://www.rfc-editor.org/rfc/rfc5092.txt
[2] http://tools.ietf.org/html/draft-ellermann-news-nntp-uri-11


Re: [whatwg] Trying to work out the problems solved by RDFa

2009-01-09 Thread Tab Atkins Jr.
On Fri, Jan 9, 2009 at 3:22 PM, Ben Adida wrote:
> Tab Atkins Jr. wrote:
>> However, Ian has a point in his first paragraph.  SearchMonkey does
>> *not* do auto-discovery; it relies entirely on site owners telling it
>> precisely what data to extract, where it's allowed to extract it from,
>> and how to present it.
>
> That's incorrect.
>
> You can build a SearchMonkey infobar that is set to function on all URLs
> (just use "*" in your URL field.)
>
> For example, the Creative Commons SearchMonkey application:
>
> http://gallery.search.yahoo.com/application?smid=kVf.s
>
> (currently broken because of a recent change in the SearchMonkey PHP API
> that we need to address, so here's a photo:
>
> http://www.flickr.com/photos/ysearchblog/2869419185/
> )
>
> By adding the CC RDFa markup to your page, it will show up with the
> infobar in Yahoo searches.

Ah, hadn't considered a net-wide SearchMonkey script.  Interesting.

This brings up different issues, however.  Something I see
immediately: Say I'm a scammer.  I know that the CC SearchMonkey app
is in wide use (pretend, here).  I start putting CC-RDF data in spam
blog comments, with my own spammy stuff in the relevant fields.  Now
people don't even have to click on the blog link in the search results
and read my obviously spammy comment to be introduced to my offers for
discount Viagra!  They'll just see a little CC bar, click on it to
have it open in-place, and there I am.  I could even hide my link in
legitimate license data, so that people only hit my malicious site
when they click the link to see more information about the license.

Issues like these make wide-scale auto-trusted use of metadata
difficult.  It also makes me more reluctant to want it in the spec
yet.  I'd rather see the community work out these problems first.  It
may be that there's a relatively simple solution.  It may be that the
crawlers can reliably distinguish between ham and spam CC data.  But
then, it may be that there *is* no good solution enabling us to use
this approach, and this kind of metadata on arbitrary sites just can't
be trusted.

I, personally, don't know the answer to this yet.  I suspect that you
don't, either; if the arbitrary-site CC infobar works at all, it's
because few people *use* CC RDF yet, and so it's still limited to a
community with implicit trust.

> So site-specific microformats are clearly less powerful. And
> vocabulary-specific microformats, while useful, are also not as useful
> here (consider a SearchMonkey application that picks up CC-licensed
> items, be they video, audio, books, scientific data, etc... Different
> microformats = development hell.)

Indeed, they are less powerful.  As I explored above, though, too much
power can be damning. It may be that the site-specific little-m
microformat (or something equivalent, allowing a developer to extract
metadata through actively targeting site structure) is powerful enough
to be useful, but weak enough to *remain* useful in the face of abuse.

(Also, I know CC is sort of the darling of the RDFa community, but
there's significant enough debate over in-band vs out-of-band
licensing info, etc. that detracts from the core issues we're trying
to discuss here that it's probably not the best example to use.)

> Have you read the RDFa Primer?
> http://www.w3.org/TR/xhtml-rdfa-primer/
>
> It describes (pre-SearchMonkey) the kind of applications that can be
> built with RDFa. SearchMonkey is an ideal example, but it's by no means
> the only one.

Yup; I was an active participant in this discussion when it started
last August.  The example applications discussed in the paper,
unfortunately, are precisely the kind where trusting metadata is
likely a *bad* idea.  For example, finding reviews of shows produced
by friends of Alice, using foaf and hreview, is rife with opportunity
for spamming.  SearchMonkey seems to avoid this for the most part;
when designing applications for particular URLs, at least, you are
relying on relatively trustworthy data, not arbitrary data scattered
across the web.  Perhaps something similar has application within
trusted networks, but in that case it comprises a completely different
use case than what SearchMonkey hits, with possibly different
requirements.

~TJ


Re: [whatwg] Trying to work out the problems solved by RDFa

2009-01-09 Thread Calogero Alex Baldacchino

Ben Adida wrote:
> Tab Atkins Jr. wrote:
>> Actually, SearchMonkey is an excellent use case, and provides a
>> problem statement.
>
> I'm surprised, but very happily so, that you agree.
>
> My confusion stems from the fact that Ian clearly mentioned SearchMonkey
> in his email a few days ago, then proceeded to say it wasn't a good use
> case.
>
> -Ben


It seems to me that this is a very custom use case. It does require
metadata to be embedded in a large number of pages, but that is an
optional requirement, because search results don't rely only on
metadata: metadata are used as an optional source of information by the
server and don't require any collaboration from other kinds of UA
(excluding, at most, some custom data services). By contrast, a search
engine using the mark element to highlight a keyword would require a
client UA to understand and style it properly; I expect that not to
work in IE6, for instance, because IE browsers treat unknown elements
as if their content were misplaced. That is, Yahoo might develop its
own data model and work fine with sites implementing it; perhaps RDF(a)
was chosen because they think RDF is a natural way to model data which
are sparse in a web page (and re-mapping microformats onto RDF might
result in an easier implementation). In any case, the only UA that
needs to understand RDFa here is SearchMonkey itself, thus a client
browser might just drop RDFa attributes without breaking SearchMonkey
functionality -- at least, this is my first impression.


Furthermore, it's a very recent (yet potentially interesting)
application, so why not wait and see how it grows before choosing to
include one model or another in the html5 specification (or to include
each of them because they are equally widespread)? We would see whether
the opt-in mechanism effectively prevents spam (e.g. spammers might
model data on widely used vocabularies and data services, and find a
way to make such data available in searches when users ask for
additional information, for instance through an ad within a page of an
accomplice author, or by exploiting some kind of error in authors'
selection of URLs to be crawled for metadata, or the like), and which
model becomes the most used among RDFa, eRDF, Microformats, Atom
embedding dataRSS and whatever else Yahoo might decide to support.

Moreover, it seems that some xml processing is needed to create a
custom data service, thus it might be natural to use xhtml (possibly
along with namespaces and prefixed attributes) to provide metadata to
such a data service, which might rely on an existing xml parser instead
of implementing one from scratch (and an html parser might not support
namespaces for the purpose of exposing them through DOM interfaces, as
I understand the html serialization). The use of prefixed RDFa
attributes, or perhaps even unprefixed ones, within an xml-serialized
document shouldn't require formalization in the html5 spec, as long as
there is no strict requirement for UAs to support RDF processing - as
is the case for the purposes of SearchMonkey and its related data
services.


WBR, Alex




Re: [whatwg] Trying to work out the problems solved by RDFa

2009-01-09 Thread Ben Adida
Tab Atkins Jr. wrote:
> However, Ian has a point in his first paragraph.  SearchMonkey does
> *not* do auto-discovery; it relies entirely on site owners telling it
> precisely what data to extract, where it's allowed to extract it from,
> and how to present it.

That's incorrect.

You can build a SearchMonkey infobar that is set to function on all URLs
(just use "*" in your URL field.)

For example, the Creative Commons SearchMonkey application:

http://gallery.search.yahoo.com/application?smid=kVf.s

(currently broken because of a recent change in the SearchMonkey PHP API
that we need to address, so here's a photo:

http://www.flickr.com/photos/ysearchblog/2869419185/
)

By adding the CC RDFa markup to your page, it will show up with the
infobar in Yahoo searches.

So site-specific microformats are clearly less powerful. And
vocabulary-specific microformats, while useful, are also not as useful
here (consider a SearchMonkey application that picks up CC-licensed
items, be they video, audio, books, scientific data, etc... Different
microformats = development hell.)

Have you read the RDFa Primer?
http://www.w3.org/TR/xhtml-rdfa-primer/

It describes (pre-SearchMonkey) the kind of applications that can be
built with RDFa. SearchMonkey is an ideal example, but it's by no means
the only one.

-Ben


Re: [whatwg] Origins, reprise

2009-01-09 Thread Adam Barth
On Fri, Jan 9, 2009 at 10:42 AM, Boris Zbarsky wrote:
> 3) Those for which the URI is same-origin with itself but no other URI
>   (not to be confused with the globally unique identifier case).

Can you give an example of this kind of URI?

Thanks,
Adam


Re: [whatwg] Trying to work out the problems solved by RDFa

2009-01-09 Thread Tab Atkins Jr.
On Fri, Jan 9, 2009 at 2:17 PM, Ben Adida wrote:
> Tab Atkins Jr. wrote:
>> Actually, SearchMonkey is an excellent use case, and provides a
>> problem statement.
>
> I'm surprised, but very happily so, that you agree.
>
> My confusion stems from the fact that Ian clearly mentioned SearchMonkey
> in his email a few days ago, then proceeded to say it wasn't a good use
> case.

I apologize; looking back into my archives, it appears there was an
entire subthread specifically about SearchMonkey!  Also, Ian did
indeed mention it in his first email in this thread.  He actually gave
it more attention than any other single use-case, though.  I'll quote
the relevant part:

> On Tue, 26 Aug 2008, Ben Adida wrote:
> >
> > Here's one example. This is not the only way that RDFa can be helpful,
> > but it should help make things more concrete:
> >
> >   http://developer.yahoo.com/searchmonkey/
> >
> > Using semantic markup in HTML (microformats and, soon, RDFa), you, as a
> > publisher, can choose to surface more relevant information straight into
> > Yahoo search results.
>
> This doesn't seem to require RDFa or any generic data syntax at all. Since
> the system is site-specific anyway (you have to list the URLs you wish to
> act against), the same kind of mechanism could be done by just extracting
> the data straight out of the page. This would have the advantage of
> working with any Web page without requiring the page to be written using a
> particular syntax.
>
> However, if SearchMonkey is an example of a use case, then we should
> determine the requirements for this feature. It seems, based on reading
> the documentation, that it basically boils down to:
>
>  * Pages should be able to expose nested lists of name-value pairs on a
>   page-by-page basis.
>
>  * It should be possible to define globally-unique names, but the syntax
>   should be optimised for a set of predefined vocabularies.
>
>  * Adding this data to a page should be easy.
>
>  * The syntax for adding this data should encourage the data to remain
>   accurate when the page is changed.
>
>  * The syntax should be resilient to intentional copy-and-paste authoring:
>   people copying data into the page from a page that already has data
>   should not have to know about any declarations far from the data.
>
>  * The syntax should be resilient to unintentional copy-and-paste
>   authoring: people copying markup from the page who do not know about
>   these features should not inadvertently mark up their page with
>   inapplicable data.
>
> Are there any other requirements that we can derive from SearchMonkey?

I agree with Ian in that SearchMonkey is not *necessarily* speaking in
favor of RDFa; that may be what caused you to think he was dismissing
it.  In truth, Ian is merely trying to take current examples of RDFa
use and distill them into their essence.  (To grab my previous
example, it is similar to seeing what all the various rounded-corners
hacks were doing, without necessarily implying that the final solution
will be anything like them.  It's important to distill the actual
problems that users are solving from the details of particular
solutions they are using.)

Like I said, I think SearchMonkey sounds absolutely awesome, and
genuinely useful on a level I haven't yet seen any apps of similar
nature reach.  I'm exclusively a Google user, but that's something I'd
love to have ported over.  It's similar in nature to IE8's
Accelerators, in that it's an opt-in application for users that
reduces clicks to get to information they actively decide they want.

However, Ian has a point in his first paragraph.  SearchMonkey does
*not* do auto-discovery; it relies entirely on site owners telling it
precisely what data to extract, where it's allowed to extract it from,
and how to present it.  It is likely that this can be done entirely
within the confines of current html, and the fact that SearchMonkey
can use Microformats suggests that this is true.  A possible approach
is a site-owner producing an ad-hoc microformat (little m) that the
crawler can match against pages and index the information of, and then
offer to the SearchMonkey application for presentation as the
developer wills.  This would require specified parsing rules for such
things (which, as mentioned in an earlier email, the big-m
Microformats community is working on).
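
A minimal sketch of the ad-hoc, class-based matching described above,
assuming the site owner hands the crawler a simple field-to-class-name
template. The class names ("biz-price" and so on), the TEMPLATE mapping
and the extract() helper are hypothetical illustrations, not anything
SearchMonkey or the Microformats community defines, and the sketch
assumes reasonably well-nested markup:

```python
from html.parser import HTMLParser

# Hypothetical template a site owner might publish: field name -> class name.
TEMPLATE = {"price": "biz-price", "rating": "biz-rating", "phone": "biz-phone"}

# Void elements never get an end tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TemplateExtractor(HTMLParser):
    """Collects the text found inside elements carrying a templated class."""

    def __init__(self, template):
        super().__init__()
        self.class_to_field = {cls: field for field, cls in template.items()}
        self.fields = {}
        self._stack = []  # one entry per open element: a field name or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        field = next((self.class_to_field[c] for c in classes
                      if c in self.class_to_field), None)
        self._stack.append(field)

    def handle_endtag(self, tag):
        if tag in VOID or not self._stack:
            return
        self._stack.pop()

    def handle_data(self, data):
        # Attribute text to the innermost open element that matched a field.
        for field in reversed(self._stack):
            if field is not None:
                self.fields[field] = self.fields.get(field, "") + data
                break


def extract(html, template=TEMPLATE):
    """Return {field: text} for every templated class found in the page."""
    parser = TemplateExtractor(template)
    parser.feed(html)
    parser.close()
    return {name: text.strip() for name, text in parser.fields.items()}
```

The point of the sketch is only that nothing beyond class attributes and
an agreed template is needed for this kind of extraction; whether that
is easier for authors than RDFa is the open question.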

The question is, would this be sufficient?  Are other approaches
easier for authors?  RDFa, as noted, already has a specified parsing
model.  Does this make it easier for authors to design data templates?
Easier to communicate templates to a crawler?  Easier to deploy in a
site?  Easier to parse for a crawler?

SearchMonkey makes mention of developers producing SearchMonkey apps
without the explicit permission of site owners.  This use would almost
certainly be better served with a looser data discovery model than
RDFa, so that a site owner doesn't have to explicitly comply in order
for others to extract useful data from their p

Re: [whatwg] Trying to work out the problems solved by RDFa

2009-01-09 Thread Ben Adida
Tab Atkins Jr. wrote:
> Actually, SearchMonkey is an excellent use case, and provides a
> problem statement.

I'm surprised, but very happily so, that you agree.

My confusion stems from the fact that Ian clearly mentioned SearchMonkey
in his email a few days ago, then proceeded to say it wasn't a good use
case.

-Ben



Re: [whatwg] Trying to work out the problems solved by RDFa

2009-01-09 Thread Tab Atkins Jr.
On Fri, Jan 9, 2009 at 1:48 PM, Ben Adida wrote:
> Julian Reschke wrote:
>>> Because the issue is that we don't yet know if we want to support
>>> RDFa.  That's the whole point of this thread.  Nobody's given a useful
>>> problem statement yet, so we can't evaluate whether there's a problem
>>> we need to solve, or how we should solve it.
>>
>> For the record: I disagree with that. I have the impression that no
>> matter how many problems are presented, the answer is going to be: "not
>> that stone -- fetch me another stone".
>
> For the record: I completely agree with Julian. This is why I haven't
> jumped into this thread yet again.
>
> The key piece of evidence here is SearchMonkey, a product by Yahoo that
> specifically uses RDFa. Even its microformat support funnels everything
> to an RDF-like metadata approach. With thousands of application
> developers and some concrete examples that specifically use RDFa (the
> Creative Commons application being one of them), the message from many
> on this list remains "not good enough."
>
> I'm not sure where the bar is, but it seems far from objective.

Actually, SearchMonkey is an excellent use case, and provides a
problem statement.

Problem
===

Site owners want a way to provide enhanced search results to the
engines, so that an entry in the search results page is more than just
a bare link and snippet of text, and provides additional resources for
users straight on the search page without them having to click into
the page and discover those resources themselves.

For example (taken directly from the SearchMonkey docs), yelp.com may
want to provide additional information on restaurants they have
reviews for, pushing info on price, rating, and phone number directly
into the search results, along with links straight to their reviews or
photos of the restaurant.

Different sites will have vastly different needs and requirements in
this regard, preventing natural discovery by crawlers from being
effective.

(SearchMonkey itself relies on the user registering an add-in on their
Yahoo account, so spammers can't exploit this - the user has to
proactively decide they want additional information from a site to
show up in their results, then they click a link and the rest is
automagical.)


That really wasn't hard.  I'd never seen SearchMonkey before (it's
possible it was mentioned, but I know that it was never explicitly
described), but it's a really sweet app that helps both authors and
users.  That's a check mark in my book.

~TJ


Re: [whatwg] Trying to work out the problems solved by RDFa

2009-01-09 Thread Calogero Alex Baldacchino

Julian Reschke wrote:

Calogero Alex Baldacchino wrote:

...
This is why I was thinking about somewhat "data-rdfa-about", 
"data-rdfa-property", "data-rdfa-content" and so on, so that, for the 
purposes of an RDFa processor working on top of HTML5 UAs (perhaps in 
a test phase, if needed at all, of course), an element dataset would 
give access to "rdfa-about", instead of just "about", that is using 
the prefix "rdfa-" as acting as a namespace prefix in xml (hence, as 
if there were "rdfa:about" instead of "data-rdfa-about" in the markup).

...


That clashes with the documented purpose of data-*.


Hmm, I'm not sure there is a clash, since I was suggesting a *custom* 
and essentially *private* mechanism to experiment with RDFa in the 
HTML serialization, for the *small-scale* needs of organizations 
willing to embed RDFa metadata in text/html documents and to exchange 
them with each other, using a convention that is likely to avoid name 
clashes with other private metadata. I think it's unlikely that 
data-rdfa-* would be used with different semantics in the very same 
page, and in a small-scale scenario involving a few *selected* sources 
of RDFa-modelled information, the parties should know in advance that 
they are using the same convention. A document marked up this way could 
be handled by an external RDFa processor, avoiding any need for direct 
support in the browser.
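
A rough sketch of what such an external processor could look like,
assuming only the "data-rdfa-about", "data-rdfa-property" and
"data-rdfa-content" names mentioned in this thread. The processing model
here is deliberately simplified (the subject is inherited from the
nearest ancestor carrying data-rdfa-about, and the value is either
data-rdfa-content or the element's text); it is not the RDFa processing
rules, and the class name, helper name and example data are hypothetical:

```python
from html.parser import HTMLParser

PREFIX = "data-rdfa-"

# Void elements never get an end tag, so they are not kept on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class DataRdfaSketch(HTMLParser):
    """Turns data-rdfa-* attributes into (subject, property, value) tuples."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._stack = []  # open elements: subject in scope, pending property

    def handle_starttag(self, tag, attrs):
        # Strip the prefix so "data-rdfa-about" is looked up as "about".
        rdfa = {k[len(PREFIX):]: v for k, v in attrs if k.startswith(PREFIX)}
        inherited = self._stack[-1]["subject"] if self._stack else None
        subject = rdfa.get("about", inherited)
        prop = rdfa.get("property")
        content = rdfa.get("content")
        if prop is not None and content is not None:
            # An explicit content attribute wins over the element's text.
            self.triples.append((subject, prop, content))
            prop = None
        if tag not in VOID:
            self._stack.append({"subject": subject, "prop": prop, "text": []})

    def handle_data(self, data):
        # Text belongs to every open element still waiting for a value.
        for frame in self._stack:
            if frame["prop"] is not None:
                frame["text"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID or not self._stack:
            return
        frame = self._stack.pop()
        if frame["prop"] is not None:
            value = "".join(frame["text"]).strip()
            self.triples.append((frame["subject"], frame["prop"], value))


def extract_triples(html):
    """Parse an HTML string and return the list of extracted tuples."""
    parser = DataRdfaSketch()
    parser.feed(html)
    parser.close()
    return parser.triples
```

Nothing here depends on browser support: the same convention could be
consumed by a script, a plugin or a crawler, which is the whole point of
keeping the experiment private.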


However, such a convention might be "clash-free" enough to work on a 
wider scale. If it became widespread, that would be evidence that the 
web /needs/, or at least /has chosen/, RDFa as one of the most common 
ways to embed metadata in a document, and that might be enough to 
justify native support for the whole range of RDFa attributes, 
possibly along with support for the earlier experimental ones (such as 
"data-rdfa-*" and "rdfa:*", for backward compatibility). I can't see 
much of a problem if a feature that began as a private convention 
became the basis of a widespread, widely accepted one. (I'm not saying 
the spec should name data-rdfa-* as a means to implement RDFa. Rather, 
if no general agreement can be reached on whether and how RDFa should 
be specified and implemented, such an experiment might be proposed to 
the semantic web industry while we wait for the results, given that a 
lack of support might otherwise prevent interested parties from using 
RDFa and HTML5 together.)




*If* we want to support RDFa, why not add the attributes the way they 
are already named???




For instance, to find out by experiment whether it is worth changing 
the "if we want" into "we do want", without requiring an early 
specification and implementation, and without depending on whether and 
how a given browser vendor might experiment differently from the others 
(such a convention would only require support for HTML5 datasets and a 
script or plugin capable of treating them as RDFa metadata). The point 
here is that, once data-* attributes are introduced as the means to 
support custom attributes, any browser vendor might decide to drop 
support for other kinds of custom attributes in the HTML serialization 
(that is, attributes that are neither part of the language nor data-* 
ones). If they (or any of them) then declined to support RDFa 
attributes until these were introduced in a specification, there might 
be no way to experiment with them cross-browser without resorting 
either to data-* or to "rdfa:*" (the latter in XHTML).


Anyway, /in general/ what should a browser do with RDFa metadata, on a 
*wide scale*, other than classifying a portion of the open web (e.g. in 
its local history), possibly allowing users to select trusted sources?


Actually, I don't think that would bring enough benefit to *average* 
users, compared to the risk of getting a lot of spam metadata from 
/heterogeneous/ sources. I really don't expect average users to 
understand how to filter sites based on metadata reliability (and just 
for the sake of using a metadata-based query interface, because a site 
with wrong metadata might still contain useful information); instead 
they might just use the query interface the same way they use a 
default search bar, get wrong results (once spam metadata became 
widespread) and decide the mechanism doesn't work (possibly 
complaining about it). Some kind of antispam filter might help, but I 
think that deciding whether metadata are reliable, that is, whether 
they really correspond to a page's content, is a hard problem for a bot 
to solve without a good degree of Artificial Intelligence (filtering 
emails by looking for suspicious patterns is far easier than 
implementing a filter capable of /understanding/ metadata, 
/understanding/ natural language and comparing /semantics/).


Likewise, I don't expect the great majority of web pages to contain 
"valid" metadata: most people would not care about them, and a potentially 
growing number might copy

Re: [whatwg] Trying to work out the problems solved by RDFa

2009-01-09 Thread Ben Adida
Julian Reschke wrote:
>> Because the issue is that we don't yet know if we want to support
>> RDFa.  That's the whole point of this thread.  Nobody's given a useful
>> problem statement yet, so we can't evaluate whether there's a problem
>> we need to solve, or how we should solve it.
> 
> For the record: I disagree with that. I have the impression that no
> matter how many problems are presented, the answer is going to be: "not
> that stone -- fetch me another stone".

For the record: I completely agree with Julian. This is why I haven't
jumped into this thread yet again.

The key piece of evidence here is SearchMonkey, a product by Yahoo that
specifically uses RDFa. Even its microformat support funnels everything
to an RDF-like metadata approach. With thousands of application
developers and some concrete examples that specifically use RDFa (the
Creative Commons application being one of them), the message from many
on this list remains "not good enough."

I'm not sure where the bar is, but it seems far from objective.

-Ben


Re: [whatwg] Trying to work out the problems solved by RDFa

2009-01-09 Thread Julian Reschke

Tab Atkins Jr. wrote:

*If* we want to support RDFa, why not add the attributes the way they are
already named???


Because the issue is that we don't yet know if we want to support
RDFa.  That's the whole point of this thread.  Nobody's given a useful
problem statement yet, so we can't evaluate whether there's a problem
we need to solve, or how we should solve it.


For the record: I disagree with that. I have the impression that no 
matter how many problems are presented, the answer is going to be: "not 
that stone -- fetch me another stone".



Alex's suggestion, while officially against spec, has the benefit of
allowing RDFa supporters to sort out their use cases through
experience.  That's the back door into the spec, after all; you don't


If something that is against the spec is acceptable, then it's *much* 
easier to just use the already defined attributes. Better to break the 
spec by using new attributes than to abuse existing ones.


> ...

BR, Julian


Re: [whatwg] Trying to work out the problems solved by RDFa

2009-01-09 Thread Tab Atkins Jr.
On Fri, Jan 9, 2009 at 5:46 AM, Julian Reschke wrote:
> Calogero Alex Baldacchino wrote:
>>
>> ...
>> This is why I was thinking about somewhat "data-rdfa-about",
>> "data-rdfa-property", "data-rdfa-content" and so on, so that, for the
>> purposes of an RDFa processor working on top of HTML5 UAs (perhaps in a test
>> phase, if needed at all, of course), an element dataset would give access to
>> "rdfa-about", instead of just "about", that is using the prefix "rdfa-" as
>> acting as a namespace prefix in xml (hence, as if there were "rdfa:about"
>> instead of "data-rdfa-about" in the markup).
>> ...
>
> That clashes with the documented purpose of data-*.
>
> *If* we want to support RDFa, why not add the attributes the way they are
> already named???

Because the issue is that we don't yet know if we want to support
RDFa.  That's the whole point of this thread.  Nobody's given a useful
problem statement yet, so we can't evaluate whether there's a problem
we need to solve, or how we should solve it.

Alex's suggestion, while officially against spec, has the benefit of
allowing RDFa supporters to sort out their use cases through
experience.  That's the back door into the spec, after all; you don't
have to do as much work to formulate a problem statement if you can
point to large numbers of people hacking around a current lack, as
that's a pretty strong indicator that there *is* a problem needing to
be solved.  As an added benefit, the fact that there are already
multiple independent attempts at a solution gives us a wide pool of
experience to draw from in formulating the actual spec, so as to make
the use as easy as possible for authors.

(An example that comes to mind in this regard is rounded corners.
Usually you have to break semantics and put in junk elements to get
rounded corners on a flexible box.  This became so common that the
question of whether or not rounded corners were significant enough to
be added in CSS answered itself - people are trying hard to hack the
support in, so it's clearly something they want, and thus it's
worthwhile to spec a method (the border-radius property) to give them
it.  It solves a problem that authors, through their actions, made
extremely clear, and it does so in a way that is enormously simpler
99% of the time.  Win-win.)

~Tj


[whatwg] Origins, reprise

2009-01-09 Thread Boris Zbarsky

I've recently come across another issue with the origin definition.

Right now, this says:

1) If url does not use a server-based naming authority, or if parsing
   url failed, or if url is not an absolute URL, then return a new
   globally unique identifier.
2) Return the tuple (scheme, host, port).

(with some steps to determine the tuple thrown in).

In Gecko, we actually have three classes of URIs for security purposes:

1) Those for which the URI is not same-origin with anything (the
   globally unique identifier case).
2) Those for which the URI is same-origin with anything with the same
   scheme+host+port.
3) Those for which the URI is same-origin with itself but no other URI
   (not to be confused with the globally unique identifier case).

It would be nice if we could express this in terms of the origin setup, 
but it doesn't seem to me like that's workable as things stand...
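
To make the distinction concrete, here is a minimal sketch of the two
quoted steps plus the third Gecko class. It is an illustration under
stated assumptions, not the spec algorithm or Gecko's implementation:
the origin() helper, the tagged-tuple representation and the
self_only_schemes parameter are hypothetical, the default-port table is
a small stand-in, and "x-self" in the usage note is a made-up scheme,
since the message does not say which URIs fall into the third class.

```python
import uuid
from urllib.parse import urlsplit

# Small stand-in table; a real implementation would know more schemes.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def origin(url, self_only_schemes=frozenset()):
    """Return a comparable origin value for url.

    ("unique", id)                a fresh identifier on every call, so it
                                  never equals any other origin (class 1)
    ("tuple", scheme, host, port) same-origin by scheme+host+port (class 2)
    ("self", url)                 same-origin only with the identical URL
                                  (class 3, hypothetical representation)
    """
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return ("unique", uuid.uuid4().hex)
    scheme = parts.scheme.lower()
    if scheme in self_only_schemes:
        return ("self", url)
    host = parts.hostname
    if not scheme or not host:
        # Not absolute, or no server-based naming authority.
        return ("unique", uuid.uuid4().hex)
    if port is None:
        port = DEFAULT_PORTS.get(scheme)
    return ("tuple", scheme, host, port)
```

With this representation, computing the origin of the same class-1 URL
twice yields two different values, whereas origin("x-self:doc-1",
frozenset({"x-self"})) compares equal to itself but not to the origin of
any other URL. That difference is what the current two-step definition
has no way to express.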


-Boris


Re: [whatwg] Fuzzbot (Firefox RDFa semantics processor) (was: Trying to work out the problems solved by RDFa)

2009-01-09 Thread Manu Sporny
Calogero Alex Baldacchino wrote:
> That is, choosing a proper level of integration for RDF(a) support into
> a web browser might divide success from failure. I don't know what's the
> best possible level, but I guess the deepest may be the worst, thus
> starting from an external support through plugins, or scripts to be
> embedded in a webapp, and working on top of other feature might work
> fine and lead to a better, native support by all vendors, yet limited to
> an API for custom applications

There seems to be a bit of confusion over what RDFa can and can't do as
well as the current state of the art. We have created an RDFa Firefox
plugin called Fuzzbot (for Windows, Linux and Mac OS X) that is a very
rough demonstration of how a browser-based RDFa processor might
operate. If you're new to RDFa, you can use it to edit and debug RDFa
pages in order to get a better sense of how RDFa works.

There is a primer[1] to the semantic web and an RDFa basics[2] tutorial
on YouTube for the completely un-initiated. The rdfa.info wiki[3] has
further information.


(sent to public-r...@w3.org earlier this week):

We've just released a new version of Fuzzbot[4], this time with packages
for all major platforms, which we're going to be using at the upcoming
RDFa workshop at the Web Directions North 2009 conference[5].

Fuzzbot uses librdfa as the RDFa processing back-end and can display
triples extracted from webpages via the Firefox UI. It is currently most
useful when debugging RDFa web page triples. We use it to ensure that
the RDFa web pages that we are editing are generating the expected
triples - it is part of our suite of Firefox web development plug-ins.

There are three versions of the Firefox XPI:

Windows XP/Vista (i386)
http://rdfa.digitalbazaar.com/fuzzbot/download/fuzzbot-windows.xpi

Mac OS X (i386)
http://rdfa.digitalbazaar.com/fuzzbot/download/fuzzbot-macosx-i386.xpi

Linux (i386) - you must have xulrunner-1.9 installed
http://rdfa.digitalbazaar.com/fuzzbot/download/fuzzbot-linux.xpi

There is also very preliminary support for the Audio RDF and Video RDF
vocabularies, demos of which can be found on YouTube[6][7].

To try it out on the Audio RDF vocab, install the plugin, then click on
the Fuzzbot icon at the bottom of the Firefox window (in the status bar):

http://bitmunk.com/media/6566872

There should be a number of triples that show up in the frame at the
bottom of the screen as well as a music note icon that shows up in the
Firefox 3 AwesomeBar.

To try out the Video RDF vocab, do the same at this URL:

http://rdfa.digitalbazaar.com/fuzzbot/demo/video.html

Please report any installation or run-time issues (such as the plug-in
not working on your platform) to me, or on the librdfa bugs page:

http://rdfa.digitalbazaar.com/librdfa/trac

-- manu

[1] http://www.youtube.com/watch?v=OGg8A2zfWKg
[2] http://www.youtube.com/watch?v=ldl0m-5zLz4
[3] http://rdfa.info/wiki
[4] http://rdfa.digitalbazaar.com/fuzzbot/
[5] http://north.webdirections.org/
[6] http://www.youtube.com/watch?v=oPWNgZ4peuI
[7] http://www.youtube.com/watch?v=PVGD9HQloDI

-- 
Manu Sporny
President/CEO - Digital Bazaar, Inc.

blog: Fibers are the Future: Scaling Past 100K Concurrent Requests
http://blog.digitalbazaar.com/2008/10/21/scaling-webservices-part-2


Re: [whatwg] Trying to work out the problems solved by RDFa

2009-01-09 Thread Julian Reschke

Calogero Alex Baldacchino wrote:

...
This is why I was thinking about somewhat "data-rdfa-about", 
"data-rdfa-property", "data-rdfa-content" and so on, so that, for the 
purposes of an RDFa processor working on top of HTML5 UAs (perhaps in a 
test phase, if needed at all, of course), an element dataset would give 
access to "rdfa-about", instead of just "about", that is using the 
prefix "rdfa-" as acting as a namespace prefix in xml (hence, as if 
there were "rdfa:about" instead of "data-rdfa-about" in the markup).

...


That clashes with the documented purpose of data-*.

*If* we want to support RDFa, why not add the attributes the way they 
are already named???



...
However, AIUI, actual xml serialization (xhtml5) allows the use of 
namespaces and prefixed attributes, thus couldn't a proper namespace be 
introduced for RDFa attributes, so they can be used, if needed, in 
xhtml5 documents? I think such might be a valuable choice, because it 
seems to me RDFa attributes can be used to address such cases where 
metadata must stay as close as possible to the corresponding data, but a 
mistake in a piece of markup may trigger the adoption agency or foster 
parenting algorithms, eventually causing a separation between metadata 
and content, thus possibly breaking the reliability of the gathered 
information. From this perspective, a parser stopping on the very first 
error might give a quicker feedback than one rearranging misnested 
elements as far as it is reasonably possible (not affecting, and instead 
improving, content presentation and users' "direct" experience, but 
possibly causing side-effects with metadata).

...


That would make RDFa as used in XHTML 1.* and RDFa used in HTML 5 
incompatible. What for?


> ...

BR, Julian