On 17 April 2012 16:43, Martin Hepp wrote:
> Hi Peter,
>
> Thanks for your feedback. However,
>
>> PageRank does transfer along the edges of the web graph, so a highly ranked
>> homepage would transfer its PageRank to the pages linked from it.
>
> Do you mean that if http://wayfair.com/ can pass …
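For a rough sense of the numbers behind this question, here is a back-of-the-envelope calculation. All figures are illustrative assumptions: d = 0.85 is the conventional damping factor, the homepage score is hypothetical, and the 2M outlinks echo the figure that comes up later in the thread.

# Back-of-the-envelope PageRank share per outlink (illustrative numbers only).
d = 0.85                     # conventional damping factor (assumption)
homepage_rank = 0.001        # hypothetical PageRank of the shop homepage
outlinks = 2_000_000         # order of magnitude mentioned later in the thread
print(d * homepage_rank / outlinks)   # ~4.3e-10 passed along each deep link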
> …r into the websites that we identified and publish the resulting
> data.
>
> This would be a really useful service to the community in addition to
> criticizing other people's work.
>
> Cheers,
>
> Chris
>
>
> -----Original Message-----
> From: Martin Hepp [mailto:martin.h...@unibw.de]
> …laries; public-lod@w3.org; Chris Bizer
> Subject: Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current
> RDFa, Microdata and Microformat data extracted from 65.4 million websites
Dear Chris, all,
while reading the paper [1] I think I found a possible explanation why
WebDataCommons.org does not fulfill the high expectations regarding the
completeness and coverage. …
On 4/17/12 3:29 PM, Martin Hepp wrote:
> That would be a nice first step. And then stopping to claim that the stats show the
> actual status of the "data web" ;-)
Narrative should be more about: showcasing the least amount of
structured data on the burgeoning Web of Linked Data :-)
Kingsley
On Apr 17, 2012, at 9:23 PM, Dan Brickley wrote:
How about adding a disclaimer line to the webdatacommons.org site like
"Note that the many database-backed sites contain a huge long tail of
rarely-visited, rarely-linked pages (e.g. product catalogues), but
which increasingly contain useful structured data. It is best not to
assume that this coll
Hi Dan, Peter:
I think we are in agreement - the CommonCrawl project is nice and useful. I am
not questioning that at all.
However, webdatacommons.org uses the CommonCrawl corpus, extracts all
structured data found, and then
1. advocates the results as a kind of definitive statistics on "data on the Web" …
On 17 April 2012 18:56, Peter Mika wrote:
>
> Hi Martin,
>
> It's not as simple as that, because PageRank is a probabilistic algorithm (it
> includes random jumps between pages), and I wouldn't expect that wayfair.com
> would include 2M links on a single page (that would be one very long webpage). …
All that you say is true, but:
1. The typical fan-out rates of e.g. shops mean that the deep links typically
do not get a PageRank greater than 0.
2. If a shop does not link to all of its detail pages (e.g. you can only find
some items via a search interface or based on individualized recommendations) …
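The toy simulation below illustrates both sides of this exchange: PageRank with random jumps (the damping term Peter mentions) and the dilution caused by a shop-like fan-out. The graph shape, the damping value 0.85 and the page count are illustrative assumptions, not data about any real site, and this is nobody's production code.

# Minimal PageRank sketch (power iteration with damping) on a hypothetical
# shop graph: one homepage linking to N product pages, each linking back.

def pagerank(links, d=0.85, iterations=30):
    """links: dict mapping each node to the list of nodes it links to."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - d) / n for node in nodes}   # random-jump floor
        for node, outlinks in links.items():
            share = d * rank[node] / (len(outlinks) or n)
            for target in (outlinks or nodes):                # dangling: spread evenly
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical shop: homepage -> 100,000 product pages -> homepage.
N = 100_000
links = {"home": [f"product{i}" for i in range(N)]}
links.update({f"product{i}": ["home"] for i in range(N)})

scores = pagerank(links)
print(scores["home"])        # ~0.46: the homepage holds nearly half of the total rank
print(scores["product0"])    # ~5e-06: close to the random-jump floor

Each deep page ends up several orders of magnitude below the homepage, which is exactly the kind of page a PageRank-based crawl cutoff would drop.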
Hi Martin,
By incorporating PageRank into the decision of what pages to crawl,
CommonCrawl is actually trying to approximate what search engine
crawlers are doing. In general, search engines would collect pages that
would be more likely to rank higher in search results, and PageRank is
an important …
Dear Chris, all,
while reading the paper [1] I think I found a possible explanation why
WebDataCommons.org does not fulfill the high expectations regarding the
completeness and coverage.
It seems that CommonCrawl filters pages by PageRank in order to determine the
feasible subset of URIs for the crawl. …
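If that is what happens, the selection step could look roughly like the sketch below: rank the known URLs by a PageRank-like score and keep only as many as the crawl budget allows. The scores, URLs and budget are invented for illustration; this is not CommonCrawl's actual selection logic.

# Illustrative only: pick the "feasible subset" of URIs by PageRank score.
def select_crawl_frontier(scores, budget):
    """Keep the `budget` highest-scoring URLs."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [url for url, _ in ranked[:budget]]

scores = {                                          # hypothetical scores
    "http://shop.example.com/": 4.6e-01,
    "http://shop.example.com/category/sofas": 2.3e-03,
    "http://shop.example.com/product/12345": 5.4e-06,
    "http://shop.example.com/product/67890": 5.1e-06,
}

print(select_crawl_frontier(scores, budget=2))
# ['http://shop.example.com/', 'http://shop.example.com/category/sofas']
# The deep product pages drop out, and with them any Microdata or RDFa they carry.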
Hi,
a quote from the mission statement on the webdatacommons.org page:
"More and more websites have started to embed structured data describing
products, people, organizations, places, events into their HTML pages. The Web
Data Commons project extracts this data from several billion web pages and …
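As a concrete illustration of the kind of markup the mission statement refers to, here is a minimal, hand-rolled Microdata reader. It is not the Web Data Commons extractor: it handles only flat, text-valued itemprops, and the example page, URL and output shape are assumptions made for this sketch.

# Minimal Microdata sketch: walk the HTML, remember the current itemscope,
# and collect one (subject, property, value, source page) tuple per itemprop.
from html.parser import HTMLParser

class MicrodataSketch(HTMLParser):
    """Collect quad-like tuples from flat, text-valued Microdata."""
    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url   # the page the data was found on
        self.subject = None        # blank-node label for the current item
        self.prop = None           # itemprop currently being collected
        self.count = 0
        self.quads = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.count += 1
            self.subject = f"_:item{self.count}"
            if attrs.get("itemtype"):
                self.quads.append((self.subject, "rdf:type",
                                   attrs["itemtype"], self.page_url))
        if "itemprop" in attrs and self.subject:
            self.prop = attrs["itemprop"]

    def handle_data(self, data):
        if self.prop and data.strip():
            self.quads.append((self.subject, self.prop, data.strip(),
                               self.page_url))
            self.prop = None

page = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Red Sofa</span>
  <span itemprop="price">499.00</span>
</div>
"""

parser = MicrodataSketch("http://shop.example.com/product/1")  # hypothetical URL
parser.feed(page)
for quad in parser.quads:
    print(quad)   # one (item, property, value, source page) tuple per statement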
Hi all,
we are happy to announce WebDataCommons.org, a joint project of Freie
Universität Berlin and the Karlsruhe Institute of Technology to extract all
Microformat, Microdata and RDFa data from the Common Crawl web corpus, the
largest and most up-to-date web corpus that is currently available. …
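For readers who want to try something similar on a very small scale, the sketch below walks a locally downloaded WARC file and picks out HTML responses that look like they carry Microdata or RDFa. It assumes the third-party warcio library and a hypothetical local file name; it is not the pipeline the project used on the full corpus.

# Sketch only: iterate a local WARC file and flag pages with likely markup.
from warcio.archiveiterator import ArchiveIterator

def iter_html_pages(warc_path):
    """Yield (url, html) for every HTML response record in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            ctype = record.http_headers.get_header("Content-Type") if record.http_headers else ""
            if "html" not in (ctype or ""):
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            yield url, record.content_stream().read().decode("utf-8", "replace")

for url, html in iter_html_pages("example.warc.gz"):   # hypothetical file name
    if "itemscope" in html or "typeof=" in html:        # crude Microdata/RDFa hint
        print(url)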