Re: Metadata statistics from Yahoo! Search

2012-04-17 Thread Juan Sequeda
On Tue, Apr 17, 2012 at 5:14 PM, Kingsley Idehen wrote: > On 4/17/12 11:06 AM, Peter Mika wrote: > >> Hi All, >> >> To add one more data point to the previous discussion about >> webdatacommons.org, we have recently presented a short position paper at >> the LDOW 2012 workshop at WWW 2012. Online

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-04-17 Thread Martin Hepp
Hi Chris, Thanks for your e-mail. > we clearly say on the WebDataCommons website as well as in the announcement > that we are extracting data from 1.4 billion web pages only. > > The Web is obviously much larger. Thus it is also obvious that we don't have > all data in our dataset. It's not a

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-04-17 Thread Chris Bizer
Hi Martin, we clearly say on the WebDataCommons website as well as in the announcement that we are extracting data from 1.4 billion web pages only. The Web is obviously much larger. Thus it is also obvious that we don't have all data in our dataset. See http://lists.w3.org/Archives/Public/publi

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-04-17 Thread Kingsley Idehen
On 4/17/12 3:29 PM, Martin Hepp wrote: That would be a nice first step. And then to stop claiming that the stats show the actual status of the "data web" ;-) Narrative should be more about: showcasing the least amount of structured data on the burgeoning Web of Linked Data :-) Kingsley

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-04-17 Thread Martin Hepp
That would be a nice first step. And then to stop claiming that the stats show the actual status of the "data web" ;-) On Apr 17, 2012, at 9:23 PM, Dan Brickley wrote: > How about adding a disclaimer line to the webdatacommons.org site like > > "Note that the many database-backed sites contai

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-04-17 Thread Dan Brickley
How about adding a disclaimer line to the webdatacommons.org site like "Note that the many database-backed sites contain a huge long tail of rarely-visited, rarely-linked pages (e.g. product catalogues), but which increasingly contain useful structured data. It is best not to assume that this coll

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-04-17 Thread Martin Hepp
Hi Dan, Peter: I think we are in agreement - the CommonCrawl project is nice and useful. I am not questioning that at all. However, webdatacommons.org uses the CommonCrawl corpus, extracts all structured data found, and then 1. advocates the results as a kind of definitive statistics on "data on th

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-04-17 Thread Dan Brickley
On 17 April 2012 18:56, Peter Mika wrote: > > Hi Martin, > > It's not as simple as that, because PageRank is a probabilistic algorithm (it > includes random jumps between pages), and I wouldn't expect that wayfair.com > would include 2M links on a single page (that would be one very long webpage

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-04-17 Thread Martin Hepp
All that you say is true, but: 1. The typical fan-out rates of e.g. shops mean that deep links typically do not get a PageRank greater than 0. 2. If a shop does not link to all of its detail pages (e.g. you can only find some items via a search interface or based on individualized recommendat

Yet more metadata statistics out - from Sindice

2012-04-17 Thread Giovanni Tummarello
Hi Peter, all, to add (a probably small element of discussion) to this, I am happy to say that last week we released on the front page some analytics stats which are freshly updated every week. At the moment they come from 500 million+ web URLs. Maybe not much, but please notice we ONLY retain web URLs w

Re: Metadata statistics from Yahoo! Search

2012-04-17 Thread Kingsley Idehen
On 4/17/12 11:06 AM, Peter Mika wrote: Hi All, To add one more data point to the previous discussion about webdatacommons.org, we have recently presented a short position paper at the LDOW 2012 workshop at WWW 2012. Online at http://events.linkeddata.org/ldow2012/papers/ldow2012-inv-paper-1.

Metadata statistics from Yahoo! Search

2012-04-17 Thread Peter Mika
Hi All, To add one more data point to the previous discussion about webdatacommons.org, we have recently presented a short position paper at the LDOW 2012 workshop at WWW 2012. Online at http://events.linkeddata.org/ldow2012/papers/ldow2012-inv-paper-1.pdf Please compare this carefully with

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-04-17 Thread Peter Mika
Hi Martin, By incorporating PageRank into the decision of what pages to crawl, CommonCrawl is actually trying to approximate what search engine crawlers are doing. In general, search engines would collect pages that would be more likely to rank higher in search results, and PageRank is an imp

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-04-17 Thread Martin Hepp
Dear Chris, all, while reading the paper [1], I think I found a possible explanation for why WebDataCommons.org does not fulfill the high expectations regarding completeness and coverage. It seems that CommonCrawl filters pages by PageRank in order to determine the feasible subset of URIs for t