On Tue, Apr 17, 2012 at 5:14 PM, Kingsley Idehen wrote:
> On 4/17/12 11:06 AM, Peter Mika wrote:
>
>> Hi All,
>>
>> To add one more data point to the previous discussion about
>> webdatacommons.org, we have recently presented a short position paper at
>> the LDOW 2012 workshop at WWW 2012. Online at
>> http://events.linkeddata.org/ldow2012/papers/ldow2012-inv-paper-1.pdf
Hi Chris,
Thanks for your e-mail.
> we clearly say on the WebDataCommons website as well as in the announcement
> that we are extracting data from only 1.4 billion web pages.
>
> The Web is obviously much larger. Thus it is also obvious that we don't have
> all data in our dataset.
It's not a …
Hi Martin,
we clearly say on the WebDataCommons website as well as in the announcement
that we are extracting data from only 1.4 billion web pages.
The Web is obviously much larger. Thus it is also obvious that we don't have
all data in our dataset.
See http://lists.w3.org/Archives/Public/publi
On 4/17/12 3:29 PM, Martin Hepp wrote:
> That would be a nice first step. And then to stop claiming that the stats
> show the actual status of the "data web" ;-)
The narrative should be more about showcasing the least amount of
structured data on the burgeoning Web of Linked Data :-)
Kingsley
On Apr 17, 2012, at 9:23 PM, Dan Brickley wrote:
> How about adding a disclaimer line to the webdatacommons.org site like
>
> "Note that many database-backed sites contain a huge long tail of
> rarely-visited, rarely-linked pages (e.g. product catalogues), which
> increasingly contain useful structured data. It is best not to
> assume that this coll…"
Hi Dan, Peter:
I think we are in agreement - the CommonCrawl project is nice and useful. I am
not questioning that at all.
However, webdatacommons.org uses the CommonCrawl corpus, extracts all
structured data found, and then
1. advocates the results as a kind of definitive statistics on "data on th…
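To make concrete what "extracts all structured data found" means in practice: the WebDataCommons pipeline pulls markup such as HTML microdata out of each crawled page. A minimal, stdlib-only sketch of that extraction step is below; the real WDC extraction is built on the Any23 toolkit, and the sample page here is invented for illustration.

```python
# Minimal sketch of microdata extraction (itemscope / itemprop / itemtype),
# the kind of per-page step webdatacommons.org runs over the crawl corpus.
# Stdlib-only illustration; the sample HTML is hypothetical.
from html.parser import HTMLParser

class MicrodataProps(HTMLParser):
    def __init__(self):
        super().__init__()
        self.itemtypes = []   # itemtype of each itemscope found
        self.props = []       # (property, value) pairs
        self._prop = None     # itemprop waiting for its text content

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "itemscope" in a:
            self.itemtypes.append(a.get("itemtype", ""))
        if "itemprop" in a:
            self._prop = a["itemprop"]

    def handle_data(self, data):
        if self._prop and data.strip():
            self.props.append((self._prop, data.strip()))
            self._prop = None

page = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Blue Sofa</span>
  <span itemprop="price">499 EUR</span>
</div>
"""
p = MicrodataProps()
p.feed(page)
print(p.itemtypes, p.props)
```

Run over a billion-page corpus, aggregating these (type, property, value) triples per domain is what produces the WDC statistics under discussion.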
On 17 April 2012 18:56, Peter Mika wrote:
>
> Hi Martin,
>
> It's not as simple as that, because PageRank is a probabilistic algorithm (it
> includes random jumps between pages), and I wouldn't expect that wayfair.com
> would include 2M links on a single page (that would be one very long webpage …)
All that you say is true, but:
1. The typical fan-out rates of e.g. shops mean that the deep links
typically do not get a PageRank greater than 0.
2. If a shop has no links to all of its detail pages (e.g. you can only find
some items via a search interface or based on individualized recommendat…
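Martin's point 1 is easy to see on a toy graph: a shop homepage links to a few category pages, each category fans out to many product detail pages, and nothing else links to the products. Under standard PageRank with damping, the detail pages sit barely above the random-jump floor. The sketch below uses an invented graph shape and sizes, not numbers from the thread.

```python
# Toy PageRank (power iteration, damping 0.85) illustrating why high
# fan-out pushes deep detail pages toward the random-jump floor.
# Graph and sizes are hypothetical.

def pagerank(links, damping=0.85, iters=100):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:  # dangling page: redistribute its mass uniformly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical shop: 1 homepage, 3 categories, 30 products per category.
links = {"home": [f"cat{c}" for c in range(3)]}
for c in range(3):
    products = [f"cat{c}/p{i}" for i in range(30)]
    links[f"cat{c}"] = products + ["home"]  # categories link back home
    for p in products:
        links[p] = ["home"]                 # products link only home

r = pagerank(links)
print(round(r["home"], 4), round(r["cat0"], 4), round(r["cat0/p0"], 5))
```

The ranks come out strictly ordered home > category > product, with each product page holding only a sliver of mass; a crawl that keeps only pages above a rank threshold drops the whole product tail.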
Hi Peter, all,
To add (probably a small element of discussion) to this,
I am happy to say that last week we released on the front page some
analytics stats which are freshly updated every week.
At the moment they come from 500 million+ web URLs. Maybe not much, but
please notice we ONLY retain web URLs w…
On 4/17/12 11:06 AM, Peter Mika wrote:
Hi All,
To add one more data point to the previous discussion about
webdatacommons.org, we have recently presented a short position paper at
the LDOW 2012 workshop at WWW 2012. Online at
http://events.linkeddata.org/ldow2012/papers/ldow2012-inv-paper-1.pdf
Please compare this carefully with …
Hi Martin,
By incorporating PageRank into the decision of what pages to crawl,
CommonCrawl is actually trying to approximate what search engine
crawlers are doing. In general, search engines collect pages that
are more likely to rank higher in search results, and PageRank is
an imp…
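The selection step Peter describes can be sketched in a few lines: given rank estimates for known URLs and a crawl budget, keep the highest-ranked pages. The URLs, scores, and budget below are invented for illustration; the point is only that the long tail never makes the cut.

```python
# Hypothetical sketch of rank-based crawl frontier selection.
# All URLs and rank estimates below are invented for illustration.
import heapq

rank_estimates = {
    "shop.example/": 0.12,
    "shop.example/category/sofas": 0.03,
    "shop.example/product/4711": 0.0004,  # deep detail page
    "shop.example/product/4712": 0.0003,  # deep detail page
}

def select_frontier(ranks, budget):
    # Keep the `budget` highest-ranked URLs for the next crawl round.
    return heapq.nlargest(budget, ranks, key=ranks.get)

# With a budget of 2, only the homepage and the category page are crawled;
# both product detail pages (and their structured data) are skipped.
print(select_frontier(rank_estimates, 2))
```

This is exactly the behavior Martin flags: the selection is reasonable for approximating search-engine crawls, but it systematically excludes the structured-data-rich long tail.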
Dear Chris, all,
while reading the paper [1] I think I found a possible explanation for why
WebDataCommons.org does not fulfill the high expectations regarding
completeness and coverage.
It seems that CommonCrawl filters pages by PageRank in order to determine the
feasible subset of URIs for t…