Is there a config value that could be setting the topN value? I
definitely don't use it in my script:
#!/bin/bash
HADOOP_DIR=/nutch/hadoop
./nutch generate crawldb segs
newSeg=$("$HADOOP_DIR"/bin/hadoop dfs -ls segs/ | tail -1 | awk '{print $8}')
echo "$newSeg"
./nutch fetch "$newSeg"
./nutch parse "$newSeg"
./nutch updatedb crawldb "$newSeg"
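As an aside, the `tail | awk` pipeline that picks the newest segment can be exercised locally without a cluster. The listing lines below are made up for illustration (the real `hadoop dfs -ls` output on your cluster will differ), but the extraction of the 8th column works the same way:

```shell
#!/bin/sh
# Simulated output of `hadoop dfs -ls segs/`; the paths are hypothetical.
# On each entry line, field 8 is the segment path.
listing='Found 2 items
drwxr-xr-x   - user supergroup          0 2011-11-02 14:48 /user/nutch/segs/20111102144800
drwxr-xr-x   - user supergroup          0 2011-11-02 15:07 /user/nutch/segs/20111102150700'

# Same extraction as in the script: take the last line, print the 8th field.
newSeg=$(printf '%s\n' "$listing" | tail -1 | awk '{print $8}')
echo "$newSeg"    # /user/nutch/segs/20111102150700
```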
Are there any tests for the generator, so that I can see what it will select?
Thank You
On 02.11.2011 15:30, Markus Jelsma wrote:
On Wednesday 02 November 2011 15:08:42 Marek Bachmann wrote:
On 02.11.2011 14:17, Markus Jelsma wrote:
Hi Marek,
With your settings the generator should select all records that are
_eligible_ for fetch due to their fetch time being expired. I suspect
that you generate, fetch, update and generate again. In the meanwhile
the DB may have changed so this would explain this behaviour.
Indeed, I do so, but I do the cycles in 15 to 30 min intervals (thx to
the small hadoop cluster ;-) )
My fetch intervals are:
<property>
<name>db.fetch.interval.max</name>
<value>1209600</value>
<description>
1209600 s = 14 days
</description>
</property>
<property>
<name>db.fetch.interval.default</name>
<value>603450</value>
<description>
603450 s ≈ 7 days (exactly 7 days would be 604800 s)
</description>
</property>
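The two intervals above are easy to sanity-check with shell arithmetic; note that exactly 7 days is 604800 s, a little more than the configured 603450:

```shell
#!/bin/sh
# Sanity-check the fetch intervals in seconds.
max_interval=$((14 * 24 * 3600))       # 1209600, matches db.fetch.interval.max
default_interval=$((7 * 24 * 3600))    # 604800; the configured 603450 is ~22 min short
echo "$max_interval $default_interval"
```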
I think that the status "unfetched" is for urls that have never been
fetched, am I right?
Yes. See the CrawlDatum source for more descriptions on all status codes.
So what I expect is that when, after a Generate-Fetch-Parse-Update cycle,
there are 20k unfetched urls, the generator should add all of them to the fetch list.
An example:
Started with:
11/11/02 14:48:14 INFO crawl.CrawlDbReader: Statistics for CrawlDb: crawldb
11/11/02 14:48:14 INFO crawl.CrawlDbReader: TOTAL urls: 241798
11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 0: 236834
11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 1: 4794
11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 2: 170
11/11/02 14:48:14 INFO crawl.CrawlDbReader: min score: 0.0
11/11/02 14:48:14 INFO crawl.CrawlDbReader: avg score: 2.48141E-5
11/11/02 14:48:14 INFO crawl.CrawlDbReader: max score: 1.0
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 18314
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 2 (db_fetched): 202241
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 3 (db_gone): 8369
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 9181
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 2896
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 797
11/11/02 14:48:14 INFO crawl.CrawlDbReader: CrawlDb statistics: done
I ran a GFPU cycle and then:
11/11/02 15:07:58 INFO crawl.CrawlDbReader: TOTAL urls: 246753
11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 0: 241755
11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 1: 4810
11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 2: 188
11/11/02 15:07:58 INFO crawl.CrawlDbReader: min score: 0.0
11/11/02 15:07:58 INFO crawl.CrawlDbReader: avg score: 2.4315814E-5
11/11/02 15:07:58 INFO crawl.CrawlDbReader: max score: 1.0
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 13753
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 2 (db_fetched): 211389
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 3 (db_gone): 8602
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 9303
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 2909
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 797
11/11/02 15:07:58 INFO crawl.CrawlDbReader: CrawlDb statistics: done
As you can see, there were ~18k unfetched urls but only ~9.5k were
processed (from the Hadoop job details):
Yes, I would expect it to generate all db_unfetched records too, but I
cannot reproduce such behaviour. If I don't use topN to cut it off, I get fetch
lists with 100 million URLs, including all db_unfetched.
FetcherStatus:
moved 16
exception 85
access_denied 109
success 9.214
temp_moved 135
notfound 111
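For what it's worth, the fetcher counters roughly line up with the crawldb stats above: db_fetched grew from 202241 to 211389 between the two readdb runs, a delta close to the 9.214 fetcher successes (the small gap is plausibly redirects and re-fetches). A quick check:

```shell
#!/bin/sh
# Growth of db_fetched between the two `readdb -stats` runs quoted above.
delta=$((211389 - 202241))
echo "$delta"    # 9148, close to the fetcher's 9214 successes
```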
Thank you once again, Markus
PS: What's the magic trick the generator does to determine that a url is
eligible? :)
You should check the mapper method in the source to get a full picture.
If you do not update the DB it will (by default) always generate
identical fetch lists under similar circumstances.
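In essence the rule is time-based: a record becomes eligible once its stored fetch time has passed. As a simplified sketch only (the real Generator mapper also applies URL filters, scoring thresholds, and generate.max.count; the timestamps below are hypothetical):

```shell
#!/bin/sh
# Simplified eligibility rule: a record is selectable when its
# fetchTime (epoch seconds) is not later than the generate time.
is_eligible() {
    fetch_time=$1
    cur_time=$2
    [ "$fetch_time" -le "$cur_time" ]
}

now=1320246000                        # hypothetical generate time
is_eligible 1320000000 "$now" && echo "eligible"      # fetch time has passed
is_eligible 1321000000 "$now" || echo "not yet due"   # fetch time in the future
```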
I think it sometimes generates only ~1k because you already fetched all
other records.
Cheers
On Wednesday 02 November 2011 14:03:08 Marek Bachmann wrote:
Hello people,
can someone explain to me how the generator generates the fetch lists?
In particular:
I don't understand why it generates fetch lists with very different
amounts of urls.
Sometimes it generates > 25k urls and sometimes > 1k.
In every case there were more than 25k urls unfetched in the crawldb,
so I was expecting it to always generate ~25k urls. But as I said
before, sometimes it is only ~1k.
In my nutch-site.xml I have defined following values:
<property>
<name>generate.max.count</name>
<value>-1</value>
<description>The maximum number of urls in a single
fetchlist. -1 if unlimited. The urls are counted according
to the value of the parameter generator.count.mode.
</description>
</property>
Any ideas?
Thanks