date:20060116

[Nutch-dev] question/suggestion on nutch file format

2006-01-16 Thread Tom

Hi, After taking some time to look into the nutch source code (v0.7.1), I notice that the current file format for storing page content may not be very efficient. If I understand correctly, to retrieve the content of a page with a docID, say, 20, the code checks the "index" file first, since the de

[Nutch-dev] [jira] Created: (NUTCH-181) mapred.local.dir temp dir. space allocation limited by smallest area

2006-01-16 Thread Paul Baclace (JIRA)

mapred.local.dir temp dir. space allocation limited by smallest area - Key: NUTCH-181 URL: http://issues.apache.org/jira/browse/NUTCH-181 Project: Nutch Type: Bug Components: indexer Version

[Nutch-dev] Re: Per-page crawling policy

2006-01-16 Thread Andrzej Bialecki

Hi Ken, First of all, thanks for sharing your insights, that's a very interesting read. Ken Krugler wrote: This sounds like the TrustRank algorithm. See http://www.vldb.org/conf/2004/RS15P3.PDF. This talks about trust attenuation via trust dampening (reducing the trust level as you get furt

[Nutch-dev] Re: Per-page crawling policy

2006-01-16 Thread Ken Krugler

Hi Andrzej, I've been toying with the following idea, which is an extension of the existing URLFilter mechanism and the concept of a "crawl frontier". Let's suppose we have several initial seed urls, each with a different subjective quality. We would like to crawl these, and expand the "cra