Ivan Lazar Miljenovic wrote:
On 18 August 2010 12:12, wren ng thornton <w...@freegeek.org> wrote:
Johan Tibell wrote:
To my knowledge, the data we have about the prevalence of encodings on the web is accurate. We crawl all the pages we can get our hands on, starting from a set of seeds and then following all the links. You cannot be sure that you've reached all web sites, as there might be cliques in the web graph, but we try our best to get them all. You're unlikely to get a better estimate anywhere else; I doubt many organizations have the machinery required to crawl most of the web.
There was a study on this recently. They found that the web graph has four main parts:

* a densely connected core, where from any site you can get to any other
* an "in cone" of sites that can reach the core but cannot be reached from it (if they could, they'd be part of the core)
* an "out cone" of sites that can be reached from the core but cannot reach back into it
* and unconnected islands
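
In graph terms this is just forward and backward reachability from the core. A rough sketch of the classification in Haskell (toy adjacency-map representation and names are mine, not the paper's; it assumes the core SCC is already known, and the reachability function is the same computation a crawl from a set of seeds performs):

import qualified Data.Map as Map
import qualified Data.Set as Set
import           Data.Map (Map)
import           Data.Set (Set)

type Graph = Map String [String]

data Part = Core | InCone | OutCone | Island
  deriving (Eq, Show)

-- All nodes reachable from the given seeds by repeatedly following 'step'.
reachable :: (String -> [String]) -> [String] -> Set String
reachable step = go Set.empty
  where
    go seen []       = seen
    go seen (v:rest)
      | v `Set.member` seen = go seen rest
      | otherwise           = go (Set.insert v seen) (step v ++ rest)

-- Classify each node by whether it can reach the core, be reached from
-- the core, both (the core itself), or neither (an island).
classify :: Graph -> [String] -> Map String Part
classify g core = Map.fromList [ (v, part v) | v <- Map.keys g ]
  where
    succs v  = Map.findWithDefault [] v g
    preds v  = [ u | (u, vs) <- Map.toList g, v `elem` vs ]
    fromCore = reachable succs core   -- core plus out-cone
    intoCore = reachable preds core   -- core plus in-cone
    part v
      | v `Set.member` fromCore && v `Set.member` intoCore = Core
      | v `Set.member` intoCore                            = InCone
      | v `Set.member` fromCore                            = OutCone
      | otherwise                                          = Island

The crawl described above is essentially the 'reachable' part of this: whatever isn't reachable from the seed set never gets seen at all.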

I'm guessing here that you're referring to what I've heard called the "hidden web": databases and the like that require sign-ins, as stuff that isn't in the core to differing degrees (some of these databases are indexed by Google, but you can't actually read them without an account)?

Not so far as I recall. I'd have to find a copy of the paper to be sure though. Because the metric used was graph connectivity, if those hidden pages have links out into non-hidden pages (e.g., the login page), then they'd be counted in the same way as the non-hidden pages reachable from them.
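
Concretely, in terms of the toy classify function sketched above (names made up purely for illustration), a password-protected page that links out to a public login page lands in the in-cone exactly like any other page with that link pattern:

-- Reuses the hypothetical Graph/classify sketch above; purely illustrative.
-- "secret" requires an account to read, but it links to "login", which
-- links into the core, so by connectivity alone it is filed as in-cone.
toyWeb :: Graph
toyWeb = Map.fromList
  [ ("core1" , ["core2"])
  , ("core2" , ["core1", "feed"])
  , ("feed"  , [])                -- reachable from the core: out-cone
  , ("login" , ["core1"])         -- public login page, reaches the core
  , ("secret", ["login"])         -- hidden page, but it has an out-link
  , ("island", [])                -- not connected either way
  ]

toyParts :: Map String Part
toyParts = classify toyWeb ["core1", "core2"]
-- secret and login both come out as InCone, feed as OutCone,
-- island as Island, core1 and core2 as Core.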

--
Live well,
~wren