Is there a config value that could be setting the topN value? I definitely don't use it in my script:

#!/bin/bash

HADOOP_DIR=/nutch/hadoop

./nutch generate crawldb segs
# Pick the newest segment from the listing: last line, 8th column is the path
newSeg=$("$HADOOP_DIR"/bin/hadoop dfs -ls segs/ | tail -1 | awk '{print $8}')
echo "$newSeg"

./nutch fetch $newSeg
./nutch parse $newSeg
./nutch updatedb crawldb $newSeg
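
One quick way to rule out a stray topN coming from a config file is to grep the active configuration (a sketch; the conf directory path is an assumption about your deployment):

```shell
# Look for any topN-related setting in the Nutch configuration files
# (the conf path is an assumption; adjust to your install)
CONF_DIR=${NUTCH_CONF_DIR:-/nutch/conf}
grep -n "topN" "$CONF_DIR"/nutch-site.xml "$CONF_DIR"/nutch-default.xml 2>/dev/null \
  || echo "no topN setting found"
```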

Are there any tests for the generator, so that I can see what it will select?

Thank You


On 02.11.2011 15:30, Markus Jelsma wrote:


On Wednesday 02 November 2011 15:08:42 Marek Bachmann wrote:
On 02.11.2011 14:17, Markus Jelsma wrote:
Hi Marek,

With your settings the generator should select all records that are
_eligible_ for fetch due to their fetch time being expired. I suspect
that you generate, fetch, update and generate again. In the meanwhile
the DB may have changed so this would explain this behaviour.

Indeed, I do so, but I do the cycles in 15 to 30 min intervals (thx to
the small hadoop cluster ;-) )

My fetch intervals are:

<property>
    <name>db.fetch.interval.max</name>
    <value>1209600</value>
    <description>
          1209600 s =  14 days
    </description>
</property>


<property>
    <name>db.fetch.interval.default</name>
    <value>603450</value>
    <description>
          603450 s = ~7 days
    </description>
</property>
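
As a quick sanity check on those values (note that 7 days is exactly 604800 s, slightly more than the configured 603450):

```shell
# Convert the configured fetch intervals from seconds to days (86400 s/day)
echo "db.fetch.interval.max:     $((1209600 / 86400)) days"
awk 'BEGIN { printf "db.fetch.interval.default: %.2f days\n", 603450 / 86400 }'
```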

I think that the status "unfetched" is for URLs that have never been
fetched, am I right?

Yes. See the CrawlDatum source for more descriptions on all status codes.


So what I expect is that when, after a Generate-Fetch-Parse-Update cycle,
there are 20k unfetched URLs, the generator should add all of them to the fetch list.

An example:

Started with:
11/11/02 14:48:14 INFO crawl.CrawlDbReader: Statistics for CrawlDb: crawldb
11/11/02 14:48:14 INFO crawl.CrawlDbReader: TOTAL urls: 241798
11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 0:    236834
11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 1:    4794
11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 2:    170
11/11/02 14:48:14 INFO crawl.CrawlDbReader: min score:  0.0
11/11/02 14:48:14 INFO crawl.CrawlDbReader: avg score:  2.48141E-5
11/11/02 14:48:14 INFO crawl.CrawlDbReader: max score:  1.0
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 1 (db_unfetched):   18314
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 2 (db_fetched):     202241
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 3 (db_gone):        8369
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 4 (db_redir_temp):  9181
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 5 (db_redir_perm):  2896
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 797
11/11/02 14:48:14 INFO crawl.CrawlDbReader: CrawlDb statistics: done

I ran a GFPU cycle and then:

11/11/02 15:07:58 INFO crawl.CrawlDbReader: TOTAL urls: 246753
11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 0:    241755
11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 1:    4810
11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 2:    188
11/11/02 15:07:58 INFO crawl.CrawlDbReader: min score:  0.0
11/11/02 15:07:58 INFO crawl.CrawlDbReader: avg score:  2.4315814E-5
11/11/02 15:07:58 INFO crawl.CrawlDbReader: max score:  1.0
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 1 (db_unfetched):   13753
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 2 (db_fetched):     211389
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 3 (db_gone):        8602
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 4 (db_redir_temp):  9303
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 5 (db_redir_perm):  2909
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 797
11/11/02 15:07:58 INFO crawl.CrawlDbReader: CrawlDb statistics: done

As you can see, there were ~18k unfetched URLs but only ~9.5k have been
processed (from the Hadoop job details):

Yes, I would expect it to generate all db_unfetched records too, but I
cannot reproduce such behaviour. If I don't use topN to cut it off, I get fetch
lists with 100 million URLs, including all db_unfetched.

FetcherStatus:
moved           16
exception       85
access_denied   109
success         9214
temp_moved      135
notfound        111
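
Summing those FetcherStatus counters shows where the processed total comes from:

```shell
# Total records the fetcher actually processed in this segment
echo $((16 + 85 + 109 + 9214 + 135 + 111))   # 9670
```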


Thank you once again, Markus

PS: What's the magic trick the generator does to determine a URL as
eligible? :)

You should check the mapper method in the source to get a full picture.
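
As a rough sketch of that check (an approximation, not the exact Nutch source): a record is eligible when its scheduled fetch time has already passed.

```shell
# Approximate shape of the generator's eligibility test (an assumption,
# not the exact Nutch code): a record is due when fetchTime <= now.
now=$(date +%s)
fetch_time=$((now - 3600))   # e.g. a record that came due an hour ago
if [ "$fetch_time" -le "$now" ]; then
  echo "eligible"
else
  echo "not yet due"
fi
```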


If you do not update the DB, it will (by default) always generate
identical fetch lists under similar circumstances.

I think it sometimes generates only ~1k because you already fetched all
other records.

Cheers

On Wednesday 02 November 2011 14:03:08 Marek Bachmann wrote:
Hello people,

can someone explain to me how the generator generates the fetch lists?

In particular:

I don't understand why it generates fetch lists with very different
numbers of URLs.

Sometimes it generates >25k URLs and sometimes >1k.

In every case there were more than 25k URLs unfetched in the crawldb.
So I was expecting it to always generate ~25k URLs. But as I said
before, sometimes it's only ~1k.

In my nutch-site.xml I have defined following values:

<property>

     <name>generate.max.count</name>
     <value>-1</value>
     <description>The maximum number of urls in a single
     fetchlist.  -1 if unlimited. The urls are counted according
     to the value of the parameter generator.count.mode.
     </description>

</property>

Any ideas?

Thanks

