Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-04-18 Thread Leon Derczynski
On 17 April 2012 16:43, Martin Hepp wrote: > Hi Peter, > > Thanks for your feedback. However, > >> PageRank does transfer along the edges of the web graph, so a highly ranked >> homepage would transfer its PageRank to the pages leading from it. > > Do you mean that if http://wayfair.com/ can pas

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-04-18 Thread Chris Bizer
r into the websites that we identified and publish the resulting data. > > This would be a really useful service to the community in addition to > criticizing other people's work. > > Cheers, > > Chris > > > -----Original Message-----

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-04-17 Thread Martin Hepp
ntified and publish the resulting data. > > This would be a really useful service to the community in addition to > criticizing other people's work. > > Cheers, > > Chris > > > -----Original Message----- > From: Martin Hepp [mailto:martin.h...@unibw.d

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-04-17 Thread Chris Bizer
laries; public-lod@w3.org; Chris Bizer Subject: Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites Dear Chris, all, while reading the paper [1] I think I found a possible explanation why WebDataCommons.org does not f

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-04-17 Thread Kingsley Idehen
On 4/17/12 3:29 PM, Martin Hepp wrote: That would be a nice first step. And then to stop claiming that the stats show the actual status of the "data web" ;-) Narrative should be more about: showcasing the least amount of structured data on the burgeoning Web of Linked Data :-) Kingsley

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-04-17 Thread Martin Hepp
That would be a nice first step. And then to stop claiming that the stats show the actual status of the "data web" ;-) On Apr 17, 2012, at 9:23 PM, Dan Brickley wrote: > How about adding a disclaimer line to the webdatacommons.org site like > > "Note that the many database-backed sites contai

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-04-17 Thread Dan Brickley
How about adding a disclaimer line to the webdatacommons.org site like "Note that the many database-backed sites contain a huge long tail of rarely-visited, rarely-linked pages (e.g. product catalogues), but which increasingly contain useful structured data. It is best not to assume that this coll

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-04-17 Thread Martin Hepp
Hi Dan, Peter: I think we are in agreement - the CommonCrawl project is nice and useful. I am not questioning that at all. However, webdatacommons.org uses the CommonCrawl corpus, extracts all structured data found, and then 1. advocates the results as a kind of definitive statistics on "data on th

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-04-17 Thread Dan Brickley
On 17 April 2012 18:56, Peter Mika wrote: > > Hi Martin, > > It's not as simple as that, because PageRank is a probabilistic algorithm (it > includes random jumps between pages), and I wouldn't expect that wayfair.com > would include 2M links on a single page (that would be one very long webpage

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-04-17 Thread Martin Hepp
All that you say is true, but: 1. The typical fan-out rates of e.g. shops mean that the deep links typically do not get a PageRank greater than 0. 2. If a shop does not link to all of its detail pages (e.g. you can only find some items via a search interface or based on individualized recommendat

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-04-17 Thread Peter Mika
Hi Martin, By incorporating PageRank into the decision of what pages to crawl, CommonCrawl is actually trying to approximate what search engine crawlers are doing. In general, search engines would collect pages that would be more likely to rank higher in search results, and PageRank is an imp

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-04-17 Thread Martin Hepp
Dear Chris, all, while reading the paper [1] I think I found a possible explanation why WebDataCommons.org does not fulfill the high expectations regarding the completeness and coverage. It seems that CommonCrawl filters pages by PageRank in order to determine the feasible subset of URIs for t

Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-03-26 Thread Martin Hepp
Hi, a quote from the mission statement on the webdatacommons.org page: "More and more websites have started to embed structured data describing products, people, organizations, places, events into their HTML pages. The Web Data Commons project extracts this data from several billion web pages a

ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Miroformat data extracted from 65.4 million websites

2012-03-22 Thread Chris Bizer
Hi all, we are happy to announce WebDataCommons.org, a joint project of Freie Universität Berlin and the Karlsruhe Institute of Technology to extract all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently availabl