Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Microformat data extracted from 65.4 million websites

Martin Hepp Tue, 17 Apr 2012 06:28:58 -0700

Dear Chris, all,

while reading the paper [1], I think I found a possible explanation for why 
WebDataCommons.org does not fulfill the high expectations regarding 
completeness and coverage.


It seems that CommonCrawl filters pages by Pagerank in order to determine the 
feasible subset of URIs for the crawl. While this may be okay for a generic Web 
crawl, for linguistics purposes, or for training machine-learning components, 
it is a dead end if you want to extract structured data, since the interesting 
markup typically resides in the *deep links* of dynamic Web applications, e.g. 
the product item pages in shops, the individual event pages in ticket systems, 
etc.

Those pages often have a very low Pagerank, even when they are part of very 
prestigious Web sites with a high Pagerank for the main landing page.

Example:

1. Main page:   http://www.wayfair.com/ 
--> Pagerank 5 of 10

2. Category page:       http://www.wayfair.com/Lighting-C77859.html
--> Pagerank 3 of 10

3. Item page:   
http://www.wayfair.com/Golden-Lighting-Cerchi-Flush-Mount-in-Chrome-1030-FM-CH-GNL1849.html
--> Pagerank 0 of 10

Now, the RDFa on this site is only in the 2 million item pages. Filtering out 
the deep links in the original crawl means you are removing the HTML that 
contains the actual data.

In your paper [1], you downplay that limitation by saying that this approach 
yielded "snapshots of the popular part of the web". I think "popular" is very 
misleading here, because Pagerank does not work well for the "deep" Web: those 
pages typically lack external links almost completely, and because there are 
so many of them per site, each one earns only a minimal share of Pagerank from 
the pages of its own site that link to it.
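
To make the dilution argument concrete, here is a minimal sketch of a plain 
power-iteration Pagerank on a toy shop site. Everything in it is an assumption 
for illustration: the link structure (one home page, 10 category pages, 100 
item pages per category, 20 external pages linking only to the home page), the 
damping factor of 0.85, and the function names. It is not CommonCrawl's actual 
ranking or filtering procedure, and the scores are not on the 0-10 toolbar 
scale used in the example above.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict {page: [linked pages]}."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the same small "teleport" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                # A page splits its rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Pages without outgoing links spread their rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


def build_toy_shop(num_categories=10, items_per_category=100, num_external=20):
    """Hypothetical shop: home -> categories -> items; externals -> home only."""
    links = {}
    categories = [f"category{c}" for c in range(num_categories)]
    links["home"] = list(categories)
    for c, category in enumerate(categories):
        items = [f"item{c}_{i}" for i in range(items_per_category)]
        links[category] = ["home"] + items
        for item in items:
            # Item pages only have site navigation links, no external inlinks.
            links[item] = ["home", category]
    for e in range(num_external):
        links[f"external{e}"] = ["home"]
    return links


if __name__ == "__main__":
    rank = pagerank(build_toy_shop())
    for page in ("home", "category0", "item0_0"):
        print(f"{page:10s} {rank[page]:.6f}")
```

In this toy graph the home page collects the external links and the 
navigation links of every item page, while each item page receives only one 
hundred-and-first of its category page's outgoing rank. Any cutoff that keeps 
only the top-ranked pages would therefore retain the home and category pages 
and drop the item pages first.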

So, once again, I think your approach is NOT suitable for yielding a corpus of 
usable data at Web scale, and the statistics you derive are likely very much 
skewed, because you look only at landing pages and popular overview pages of 
sites, while the real data is in HTML pages not contained in the basic crawl.

Please interpret your findings in the light of these limitations. I am saying 
this so strongly because I have already seen many tweets hailing the paper as 
"now we have the definitive statistics on structured data on the Web".


Best wishes

Martin

Note: To estimate the Pagerank in this example, I used the online service 
[2], which may provide only an approximation.


[1] http://events.linkeddata.org/ldow2012/papers/ldow2012-inv-paper-2.pdf

[2] http://www.prchecker.info/check_page_rank.php

--------------------------------------------------------
martin hepp
e-business & web science research group
universitaet der bundeswehr muenchen

e-mail:  h...@ebusiness-unibw.org
phone:   +49-(0)89-6004-4217
fax:     +49-(0)89-6004-4620
www:     http://www.unibw.de/ebusiness/ (group)
         http://www.heppnetz.de/ (personal)
skype:   mfhepp 
twitter: mfhepp

Check out GoodRelations for E-Commerce on the Web of Linked Data!
=================================================================
* Project Main Page: http://purl.org/goodrelations/
