On Wednesday 02 November 2011 16:24:09 Marek Bachmann wrote:
> Is there a config value that could be setting the topN value? I
> definitely don't use it in my script:
-topN is passed as a command-line parameter to the generate job; there is no config property for it.
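For example, to cap each fetch list at 25k (the 25000 here is only an illustration; without -topN the generator selects every eligible record):

```
./nutch generate crawldb segs -topN 25000
```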
>
> #!/bin/bash
>
> HADOOP_DIR=/nutch/hadoop/
>
> ./nutch generate crawldb segs
> newSeg=`/nutch/hadoop/bin/hadoop dfs -ls segs/ | tail -1 | awk '{print $8}'`
> echo $newSeg
>
> ./nutch fetch $newSeg
> ./nutch parse $newSeg
> ./nutch updatedb crawldb $newSeg
>
> Are there any tests for the generator, so that I can see what it will
> select?
>
> Thank You
>
> On 02.11.2011 15:30, Markus Jelsma wrote:
> > On Wednesday 02 November 2011 15:08:42 Marek Bachmann wrote:
> >> On 02.11.2011 14:17, Markus Jelsma wrote:
> >>> Hi Marek,
> >>>
> >>> With your settings the generator should select all records that are
> >>> _eligible_ for fetch due to their fetch time being expired. I suspect
> >>> that you generate, fetch, update and generate again; in the meantime
> >>> the DB may have changed, which would explain this behaviour.
> >>
> >> Indeed, I do so, but I do the cycles in 15 to 30 min intervals (thx to
> >> the small hadoop cluster ;-) )
> >>
> >> My fetch intervals are:
> >>
> >> <property>
> >>   <name>db.fetch.interval.max</name>
> >>   <value>1209600</value>
> >>   <description>1209600 s = 14 days</description>
> >> </property>
> >>
> >>
> >> <property>
> >>   <name>db.fetch.interval.default</name>
> >>   <value>603450</value>
> >>   <description>603450 s, just under 7 days</description>
> >> </property>
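A quick sanity check of the interval arithmetic (note that exactly 7 days would be 604800 s, not 603450):

```shell
# Days to seconds for the two fetch intervals above.
echo $((14 * 24 * 3600))   # 1209600
echo $((7 * 24 * 3600))    # 604800
```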
> >>
> >> I think that the status "unfetched" is for urls that have never been
> >> fetched, am I right?
> >
> > Yes. See the CrawlDatum source for more descriptions on all status codes.
> >
> >> So, what I expect is that when, after a Generate-Fetch-Parse-Update
> >> cycle, there are 20k unfetched URLs, the generator should add all of
> >> them to the fetch list.
> >>
> >> An example:
> >>
> >> Started with:
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: Statistics for CrawlDb: crawldb
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: TOTAL urls: 241798
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 0: 236834
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 1: 4794
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 2: 170
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: min score: 0.0
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: avg score: 2.48141E-5
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: max score: 1.0
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 18314
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 2 (db_fetched): 202241
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 3 (db_gone): 8369
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 9181
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 2896
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 797
> >> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: CrawlDb statistics: done
> >>
> >> I ran a GFPU cycle and then:
> >>
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: TOTAL urls: 246753
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 0: 241755
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 1: 4810
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 2: 188
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: min score: 0.0
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: avg score: 2.4315814E-5
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: max score: 1.0
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 13753
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 2 (db_fetched): 211389
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 3 (db_gone): 8602
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 9303
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 2909
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 797
> >> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: CrawlDb statistics: done
> >>
> >> As you can see, there were ~18k unfetched URLs but only ~9.5k were
> >> processed (from the Hadoop job details):
> >
> > Yes, I would expect it to generate all db_unfetched records too, but I
> > cannot reproduce such behaviour. If I don't use topN to cut it off, I get
> > fetch lists of 100 million URLs, including all db_unfetched.
> >
> >> FetcherStatus:
> >> moved 16
> >> exception 85
> >> access_denied 109
> >> success 9,214
> >> temp_moved 135
> >> notfound 111
> >>
> >>
> >> Thank you once again, Markus
> >>
> >> PS: What's the magic trick the generator uses to determine whether a
> >> URL is eligible? :)
> >
> > You should check the mapper method in the source to get a full picture.
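In the meantime, you can inspect a single record's stored status and fetch time directly; the generator only selects records whose fetch time lies in the past. The URL below is just a placeholder:

```
# readdb -url prints one record's status, fetch time and retry interval.
./nutch readdb crawldb -url http://www.example.com/
```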
> >
> >>> If you do not update the DB, it will (by default) always generate
> >>> identical fetch lists under similar circumstances.
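If you run overlapping cycles before updatedb has finished, there is also the (off-by-default) generate.update.crawldb property, which marks generated records so a subsequent generate does not select them again. A sketch of how it could look in nutch-site.xml:

```
<property>
  <name>generate.update.crawldb</name>
  <value>true</value>
  <description>Update the crawldb after the generate job, marking
  generated records so that a later generate does not select them
  again before updatedb has run.
  </description>
</property>
```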
> >>>
> >>> I think it sometimes generates only ~1k because you already fetched all
> >>> other records.
> >>>
> >>> Cheers
> >>>
> >>> On Wednesday 02 November 2011 14:03:08 Marek Bachmann wrote:
> >>>> Hello people,
> >>>>
> >>>> can someone explain to me how the generator generates the fetch lists?
> >>>>
> >>>> In particular:
> >>>>
> >>>> I don't understand why it generates fetch lists with very different
> >>>> numbers of URLs.
> >>>>
> >>>> Sometimes it generates > 25k URLs and sometimes only ~1k.
> >>>>
> >>>> In every case there were more than 25k URLs unfetched in the crawldb,
> >>>> so I was expecting it to always generate ~25k URLs. But as I said
> >>>> before, sometimes it's only ~1k.
> >>>>
> >>>> In my nutch-site.xml I have defined following values:
> >>>>
> >>>> <property>
> >>>>   <name>generate.max.count</name>
> >>>>   <value>-1</value>
> >>>>   <description>The maximum number of urls in a single
> >>>>   fetchlist. -1 if unlimited. The urls are counted according
> >>>>   to the value of the parameter generator.count.mode.
> >>>>   </description>
> >>>> </property>
> >>>>
> >>>> Any ideas?
> >>>>
> >>>> Thanks
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350