Hi,
After taking some time to look into the Nutch source
code (v0.7.1), I noticed that the current file format
for storing page content may not be very efficient.
If I understand correctly, to retrieve the content of
a page with a docID of, say, 20, the code checks the
"index" file first, since the de
mapred.local.dir temp dir. space allocation limited by smallest area
Key: NUTCH-181
URL: http://issues.apache.org/jira/browse/NUTCH-181
Project: Nutch
Type: Bug
Components: indexer
Version
Hi Ken,
First of all, thanks for sharing your insights; that was a very
interesting read.
Ken Krugler wrote:
This sounds like the TrustRank algorithm. See
http://www.vldb.org/conf/2004/RS15P3.PDF. This talks about trust
attenuation via trust dampening (reducing the trust level as you get
furt
Hi Andrzej,
I've been toying with the following idea, which is an extension of
the existing URLFilter mechanism and the concept of a "crawl
frontier".
Let's suppose we have several initial seed URLs, each with a
different subjective quality. We would like to crawl these, and
expand the "cra