On Wednesday 02 November 2011 15:08:42 Marek Bachmann wrote:
> On 02.11.2011 14:17, Markus Jelsma wrote:
> > Hi Marek,
> >
> > With your settings the generator should select all records that are
> > _eligible_ for fetch due to their fetch time being expired. I suspect
> > that you generate, fetch, update and generate again. In the meanwhile
> > the DB may have changed, so this would explain this behaviour.
>
> Indeed, I do, but I run the cycles at 15 to 30 minute intervals (thanks
> to the small Hadoop cluster ;-) )
>
> My fetch intervals are:
>
> <property>
>   <name>db.fetch.interval.max</name>
>   <value>1209600</value>
>   <description>
>     1209600 s = 14 days
>   </description>
> </property>
>
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>603450</value>
>   <description>
>     603450 s ≈ 7 days
>   </description>
> </property>
>
> I think that the status "unfetched" is for URLs that have never been
> fetched, am I right?
Yes. See the CrawlDatum source for descriptions of all the status codes.

> So what I expect is that when, after a Generate-Fetch-Parse-Update cycle,
> there are 20k unfetched URLs, the generator should add all of them to the
> fetch list.
>
> An example. Started with:
>
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: Statistics for CrawlDb: crawldb
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: TOTAL urls: 241798
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 0: 236834
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 1: 4794
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 2: 170
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: min score: 0.0
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: avg score: 2.48141E-5
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: max score: 1.0
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 18314
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 2 (db_fetched): 202241
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 3 (db_gone): 8369
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 9181
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 2896
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 797
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: CrawlDb statistics: done
>
> I ran a Generate-Fetch-Parse-Update cycle and then:
>
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: TOTAL urls: 246753
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 0: 241755
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 1: 4810
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 2: 188
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: min score: 0.0
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: avg score: 2.4315814E-5
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: max score: 1.0
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 13753
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 2 (db_fetched): 211389
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 3 (db_gone): 8602
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 9303
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 2909
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 797
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: CrawlDb statistics: done
>
> As you can see, there were ~18k unfetched URLs but only ~9.5k have been
> processed (from the Hadoop job details):

Yes, I would expect it to generate all db_unfetched records too, but I cannot
reproduce such behaviour. If I don't use topN to cut it off, I get fetch lists
of 100 million URLs, including all db_unfetched.

> FetcherStatus:
>   moved            16
>   exception        85
>   access_denied   109
>   success       9,214
>   temp_moved      135
>   notfound        111
>
> Thank you once again, Markus
>
> PS: What's the magic trick the generator uses to determine a URL as
> eligible? :)

You should check the mapper method in the source to get the full picture.

> > If you do not update the DB it will (by default) always generate
> > identical fetch lists under similar circumstances.
> >
> > I think it sometimes generates only ~1k because you already fetched all
> > the other records.
> >
> > Cheers
> >
> > On Wednesday 02 November 2011 14:03:08 Marek Bachmann wrote:
> >> Hello people,
> >>
> >> Can someone explain to me how the generator generates the fetch lists?
> >>
> >> In particular, I don't understand why it generates fetch lists with
> >> very different amounts of URLs.
> >>
> >> Sometimes it generates >25k URLs and sometimes only ~1k.
> >>
> >> In every case there were more than 25k URLs unfetched in the crawldb,
> >> so I was expecting it to always generate ~25k URLs. But as I said
> >> before, sometimes it is only ~1k.
> >>
> >> In my nutch-site.xml I have defined the following values:
> >>
> >> <property>
> >>   <name>generate.max.count</name>
> >>   <value>-1</value>
> >>   <description>The maximum number of urls in a single
> >>   fetchlist. -1 if unlimited. The urls are counted according
> >>   to the value of the parameter generator.count.mode.
> >>   </description>
> >> </property>
> >>
> >> Any ideas?
> >>
> >> Thanks

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
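For readers following the thread: the Generate-Fetch-Parse-Update cycle and the CrawlDb statistics quoted above correspond roughly to the Nutch 1.x commands sketched below. This is a sketch, not a verbatim script from the thread; `crawldb` and `segments` are placeholder paths, and the commands are shown commented out. The arithmetic at the end converts the fetch intervals from Marek's configuration into whole days.

```shell
#!/bin/sh
# Sketch of one Generate-Fetch-Parse-Update (GFPU) cycle in Nutch 1.x.
# 'crawldb' and 'segments' are placeholder paths, not required names.

# bin/nutch generate crawldb segments         # no -topN: take every eligible record
# SEGMENT=$(ls -d segments/* | tail -1)       # newest segment just generated
# bin/nutch fetch "$SEGMENT"
# bin/nutch parse "$SEGMENT"
# bin/nutch updatedb crawldb "$SEGMENT"
# bin/nutch readdb crawldb -stats             # prints the statistics quoted above

# Fetch intervals from the thread, converted to whole days:
echo $(( 1209600 / 86400 ))   # db.fetch.interval.max: 14 days
echo $((  603450 / 86400 ))   # db.fetch.interval.default: just under 7 full days
```

Note that with `db.fetch.interval.default` slightly under seven full days (603450 s rather than 604800 s), records come due for refetch a little earlier than a strict weekly schedule.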

