On Wednesday 02 November 2011 16:24:09 Marek Bachmann wrote:
> Is there a config value that could be setting the topN value? I
> definitely don't use it in my script:
-topN is passed as a command-line parameter to the generate job; there is no config property for it.
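For example, to cap each fetch list at 25k (the 25000 here is only an illustration; without -topN the generator selects every eligible record):

```
./nutch generate crawldb segs -topN 25000
```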
>
> #!/bin/bash
>
> HADOOP_DIR=/nutch/hadoop/
>
> ./nutch generate crawldb segs
> newSeg=`/nutch/hadoop/bin/hadoop dfs -ls segs/ | tail -1 | awk '{print $8}'`
> echo $newSeg
>
> ./nutch fetch $newSeg
> ./nutch parse $newSeg
> ./nutch updatedb crawldb $newSeg
>
> Are there any tests for the generator, so that I can see what it will
> select?
>
> Thank You
>
> On 02.11.2011 15:30, Markus Jelsma wrote:
> > On Wednesday 02 November 2011 15:08:42 Marek Bachmann wrote:
> >> On 02.11.2011 14:17, Markus Jelsma wrote:
> >>> Hi Marek,
> >>>
> >>> With your settings the generator should select all records that are
> >>> _eligible_ for fetch due to their fetch time being expired. I suspect
> >>> that you generate, fetch, update and generate again; in the meantime
> >>> the DB may have changed, which would explain this behaviour.
> >>
> >> Indeed, I do so, but I do the cycles in 15 to 30 min intervals (thx to
> >> the small hadoop cluster ;-) )
> >>
> >> My fetch intervals are:
> >>
> >> <property>
> >>   <name>db.fetch.interval.max</name>
> >>   <value>1209600</value>
> >>   <description>1209600 s = 14 days</description>
> >> </property>
> >>
> >>
> >> <property>
> >>   <name>db.fetch.interval.default</name>
> >>   <value>603450</value>
> >>   <description>603450 s, just under 7 days</description>
> >> </property>
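A quick sanity check of the interval arithmetic (note that exactly 7 days would be 604800 s, not 603450):

```shell
# Days to seconds for the two fetch intervals above.
echo $((14 * 24 * 3600))   # 1209600
echo $((7 * 24 * 3600))    # 604800
```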
> >>
> >> I think that the status "unfetched" is for urls that have never been
> >> fetched, am I right?
> >
> > Yes. See the CrawlDatum source for more descriptions on all status codes.
> >
> >> So, what I expect is that when, after a Generate-Fetch-Parse-Update
> >> cycle, there are 20k unfetched URLs, the generator should add all of
> >> them to the fetch list.
> >>
> >> An example:
> >>
> >> Started with:
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: Statistics for CrawlDb: crawldb
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: TOTAL urls: 241798
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 0: 236834
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 1: 4794
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 2: 170
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: min score: 0.0
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: avg score: 2.48141E-5
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: max score: 1.0
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 18314
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 2 (db_fetched): 202241
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 3 (db_gone): 8369
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 9181
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 2896
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 797
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: CrawlDb statistics: done
> >>
> >> I ran a GFPU cycle and then:
> >>
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: TOTAL urls: 246753
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 0: 241755
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 1: 4810
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 2: 188
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: min score: 0.0
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: avg score: 2.4315814E-5
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: max score: 1.0
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 13753
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 2 (db_fetched): 211389
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 3 (db_gone): 8602
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 9303
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 2909
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 797
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: CrawlDb statistics: done
> >>
> >> As you can see, there were ~18k unfetched URLs but only ~9.5k were
> >> processed (from the Hadoop job details):
> >
> > Yes, I would expect it to generate all db_unfetched records too, but I
> > cannot reproduce such behaviour. If I don't use topN to cut it off, I get
> > fetch lists of 100 million URLs, including all db_unfetched.
> >
> >> FetcherStatus:
> >> moved 16
> >> exception 85
> >> access_denied 109
> >> success 9,214
> >> temp_moved 135
> >> notfound 111
> >>
> >>
> >> Thank you once again, Markus
> >>
> >> PS: What's the magic trick the generator uses to determine whether a
> >> URL is eligible? :)
> >
> > You should check the mapper method in the source to get a full picture.
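In the meantime, you can inspect a single record's stored status and fetch time directly; the generator only selects records whose fetch time lies in the past. The URL below is just a placeholder:

```
# readdb -url prints one record's status, fetch time and retry interval.
./nutch readdb crawldb -url http://www.example.com/
```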
> >
> >>> If you do not update the DB, it will (by default) always generate
> >>> identical fetch lists under similar circumstances.
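If you run overlapping cycles before updatedb has finished, there is also the (off-by-default) generate.update.crawldb property, which marks generated records so a subsequent generate does not select them again. A sketch of how it could look in nutch-site.xml:

```
<property>
  <name>generate.update.crawldb</name>
  <value>true</value>
  <description>Update the crawldb after the generate job, marking
  generated records so that a later generate does not select them
  again before updatedb has run.
  </description>
</property>
```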
> >>>
> >>> I think it sometimes generates only ~1k because you already fetched all
> >>> other records.
> >>>
> >>> Cheers
> >>>
> >>> On Wednesday 02 November 2011 14:03:08 Marek Bachmann wrote:
> >>>> Hello people,
> >>>>
> >>>> can someone explain to me how the generator generates the fetch lists?
> >>>>
> >>>> In particular:
> >>>>
> >>>> I don't understand why it generates fetch lists with very different
> >>>> numbers of URLs.
> >>>>
> >>>> Sometimes it generates > 25k URLs and sometimes only ~1k.
> >>>>
> >>>> In every case there were more than 25k URLs unfetched in the crawldb,
> >>>> so I was expecting it to always generate ~25k URLs. But as I said
> >>>> before, sometimes it's only ~1k.
> >>>>
> >>>> In my nutch-site.xml I have defined following values:
> >>>>
> >>>> <property>
> >>>>   <name>generate.max.count</name>
> >>>>   <value>-1</value>
> >>>>   <description>The maximum number of urls in a single
> >>>>   fetchlist. -1 if unlimited. The urls are counted according
> >>>>   to the value of the parameter generator.count.mode.
> >>>>   </description>
> >>>> </property>
> >>>>
> >>>> Any ideas?
> >>>>
> >>>> Thanks
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350