> Hi Markus, hi List,
>
> I used the CrawlDBScanner to look at the remaining unfetched urls.
>
> Markus, you are, as usual, absolutely correct. The one and only reason why
> the urls weren't scheduled was that their refetch time hadn't come yet.
>
> As I inspected the unfetched URLs I noticed that they all have
> java.net errors, either SocketTimeoutException or UnknownHostException.
>
> Here are two examples:
>
> http://cape.gforge.cs.uni-kassel.de/    Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Wed Nov 02 21:19:07 CET 2011
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 1
> Retry interval: 603450 seconds (6 days)
> Score: 0.0
> Signature: null
> Metadata: _pst_: exception(16), lastModified=0:
> java.net.UnknownHostException: cape.gforge.cs.uni-kassel.de
>
> and
>
> http://bst-ws1.statik.bauingenieure.uni-kassel.de/web/Mitarbeiter    Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Thu Nov 03 13:28:45 CET 2011
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 1
> Retry interval: 603450 seconds (6 days)
> Score: 0.0
> Signature: null
> Metadata: _pst_: exception(16), lastModified=0:
> java.net.SocketTimeoutException: connect timed out
>
> For some reason I thought that if a page couldn't be loaded it would
> disappear from the list of unfetched urls.
>
> I know better now. :)
>
> But now another question comes up for me. I set
>
> <property>
>   <name>db.fetch.retry.max</name>
>   <value>2</value>
>   <description>The maximum number of times a url that has encountered
>   recoverable errors is generated for fetch.</description>
> </property>
>
> but in my (old) crawldb there are urls that have up to "retry 11" status.
> Does db.fetch.retry.max mean how often a url is selected for retry, even
> if its recrawl time hasn't come?
> And if so, when will urls that can't be loaded be deleted from the
> crawldb?

They are marked as DB_GONE but not removed. IIRC the retries setting is
for records with this status. The number of retries you see is normal; it
gets incremented every time. See the code:

http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup

and

http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java?view=markup

It's a bit tricky, so I can't give a complete answer now, but by following
the code paths you can understand it. The two sources above play an
integral role.

> Thank you very much
> Cheers

> Am 02.11.2011 17:16, schrieb Markus Jelsma:
> > On Wednesday 02 November 2011 16:24:09 Marek Bachmann wrote:
> >> Is there a config value that could be setting the topN value? I
> >> definitely don't use it in my script:
> >
> > -topN as command parameter
> >
> >> #!/bin/bash
> >>
> >> HADOOP_DIR=/nutch/hadoop/
> >>
> >> ./nutch generate crawldb segs
> >> newSeg=`/nutch/hadoop/bin/hadoop dfs -ls segs/ | tail -1 | awk {'print $8'}`
> >> echo $newSeg
> >>
> >> ./nutch fetch $newSeg
> >> ./nutch parse $newSeg
> >> ./nutch updatedb crawldb $newSeg
> >>
> >> Are there any tests for the generator? So that I can see what it will
> >> select?
> >>
> >> Thank you
> >>
> >> On 02.11.2011 15:30, Markus Jelsma wrote:
> >>> On Wednesday 02 November 2011 15:08:42 Marek Bachmann wrote:
> >>>> On 02.11.2011 14:17, Markus Jelsma wrote:
> >>>>> Hi Marek,
> >>>>>
> >>>>> With your settings the generator should select all records that are
> >>>>> _eligible_ for fetch due to their fetch time being expired. I suspect
> >>>>> that you generate, fetch, update and generate again. In the meanwhile
> >>>>> the DB may have changed so this would explain this behaviour.
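To make the code paths concrete, here is a very simplified model of the retry bookkeeping a crawldb update performs (this is an illustration only, not the actual Nutch classes; the real logic is spread across CrawlDatum, the CrawlDb update job and AbstractFetchSchedule): on every recoverable fetch error the retry counter is incremented, and once it exceeds db.fetch.retry.max the record's status flips to DB_GONE rather than the record being removed. The counter itself keeps counting, which is why values like "retry 11" can show up even with retry.max=2.

```java
// Simplified sketch of crawldb retry bookkeeping -- NOT the real Nutch code.
// Illustrates that records are never deleted: only their status and retry
// counter change, and the counter can grow past db.fetch.retry.max.
public class RetrySketch {
    static final int DB_UNFETCHED = 1;
    static final int DB_GONE = 3;

    static class Datum {
        int status = DB_UNFETCHED;
        int retries = 0;
    }

    /** Apply one recoverable fetch error, with retryMax = db.fetch.retry.max. */
    static void onRecoverableError(Datum d, int retryMax) {
        d.retries++;                 // incremented on every failed fetch
        if (d.retries > retryMax) {
            d.status = DB_GONE;      // marked gone, but still in the crawldb
        }
    }

    public static void main(String[] args) {
        Datum d = new Datum();
        for (int i = 0; i < 3; i++) onRecoverableError(d, 2);
        // Three failures with retry.max=2: the record is DB_GONE, and the
        // retry counter still records the full history.
        System.out.println(d.status + " " + d.retries); // prints "3 3"
    }
}
```

In this toy model, as in the thread, purging DB_GONE records would be a separate cleanup step; the update itself only rewrites the datum.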
> >>>>
> >>>> Indeed, I do so, but I do the cycles in 15 to 30 min intervals (thx to
> >>>> the small hadoop cluster ;-) )
> >>>>
> >>>> My fetch intervals are:
> >>>>
> >>>> <property>
> >>>>   <name>db.fetch.interval.max</name>
> >>>>   <value>1209600</value>
> >>>>   <description>
> >>>>     1209600 s = 14 days
> >>>>   </description>
> >>>> </property>
> >>>>
> >>>> <property>
> >>>>   <name>db.fetch.interval.default</name>
> >>>>   <value>603450</value>
> >>>>   <description>
> >>>>     603450 s = ~7 days
> >>>>   </description>
> >>>> </property>
> >>>>
> >>>> I think that the status "unfetched" is for urls that have never been
> >>>> fetched, am I right?
> >>>
> >>> Yes. See the CrawlDatum source for more descriptions of all status
> >>> codes.
> >>>
> >>>> So, what I expect is that when, after a Generate-Fetch-Parse-Update
> >>>> cycle, there are 20k unfetched urls, the generator should add all of
> >>>> them to the fetch list.
> >>>>
> >>>> An example:
> >>>>
> >>>> Started with:
> >>>>
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: Statistics for CrawlDb: crawldb
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: TOTAL urls: 241798
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 0: 236834
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 1: 4794
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 2: 170
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: min score: 0.0
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: avg score: 2.48141E-5
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: max score: 1.0
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 18314
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 2 (db_fetched): 202241
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 3 (db_gone): 8369
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 9181
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 2896
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 797
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: CrawlDb statistics: done
> >>>>
> >>>> I ran a GFPU cycle and then:
> >>>>
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: TOTAL urls: 246753
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 0: 241755
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 1: 4810
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 2: 188
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: min score: 0.0
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: avg score: 2.4315814E-5
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: max score: 1.0
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 13753
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 2 (db_fetched): 211389
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 3 (db_gone): 8602
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 9303
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 2909
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 797
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: CrawlDb statistics: done
> >>>>
> >>>> As you can see, there were ~18k unfetched urls but only ~9.5k have
> >>>> been processed (from the Hadoop Job Details):
> >>>
> >>> Yes, I would expect it to generate all db_unfetched records too, but
> >>> I cannot reproduce such behaviour. If I don't use topN to cut it off,
> >>> I get fetch lists with 100 million URLs incl. all db_unfetched.
> >>>
> >>>> FetcherStatus:
> >>>> moved          16
> >>>> exception      85
> >>>> access_denied  109
> >>>> success        9.214
> >>>> temp_moved     135
> >>>> notfound       111
> >>>>
> >>>> Thank you once again, Markus
> >>>>
> >>>> PS: What's the magic trick the generator does to determine a url as
> >>>> eligible? :)
> >>>
> >>> You should check the mapper method in the source to get a full picture.
> >>>
> >>>>> If you do not update the DB it will (by default) always generate
> >>>>> identical fetch lists under similar circumstances.
> >>>>>
> >>>>> I think it sometimes generates only ~1k because you already fetched
> >>>>> all other records.
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> On Wednesday 02 November 2011 14:03:08 Marek Bachmann wrote:
> >>>>>> Hello people,
> >>>>>>
> >>>>>> can someone explain to me how the generator generates the fetch
> >>>>>> lists?
> >>>>>>
> >>>>>> In particular:
> >>>>>>
> >>>>>> I don't understand why it generates fetch lists with very different
> >>>>>> amounts of urls.
> >>>>>>
> >>>>>> Sometimes it generates >25k urls and sometimes only ~1k.
> >>>>>>
> >>>>>> In every case there were more than 25k urls unfetched in the
> >>>>>> crawldb, so I was expecting that it always generates ~25k urls.
> >>>>>> But as I said before, sometimes it's only ~1k.
> >>>>>>
> >>>>>> In my nutch-site.xml I have defined the following value:
> >>>>>>
> >>>>>> <property>
> >>>>>>   <name>generate.max.count</name>
> >>>>>>   <value>-1</value>
> >>>>>>   <description>The maximum number of urls in a single
> >>>>>>   fetchlist. -1 if unlimited. The urls are counted according
> >>>>>>   to the value of the parameter generator.count.mode.
> >>>>>>   </description>
> >>>>>> </property>
> >>>>>>
> >>>>>> Any ideas?
> >>>>>>
> >>>>>> Thanks
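As a rough answer to the "magic trick" question in the thread: the selection boils down to emitting every record whose fetch time has passed, sorting by score, and letting an optional -topN cap the list. The sketch below is a simplified stand-in with invented field names, not the Generator's actual mapper, but it shows why a record whose refetch time hasn't come yet is skipped even with a high score:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified model of fetch-list generation -- NOT the real Nutch Generator.
// Eligible = fetchTime has expired; topN = -1 means unlimited.
public class GeneratorSketch {
    static class Record {
        String url; long fetchTime; float score;
        Record(String url, long fetchTime, float score) {
            this.url = url; this.fetchTime = fetchTime; this.score = score;
        }
    }

    static List<Record> select(List<Record> db, long curTime, long topN) {
        List<Record> eligible = new ArrayList<>();
        for (Record r : db) {
            if (r.fetchTime <= curTime) eligible.add(r);   // fetch time expired
        }
        // Highest-scoring records first, then cut off at topN if set.
        eligible.sort(Comparator.comparingDouble((Record r) -> r.score).reversed());
        if (topN >= 0 && eligible.size() > topN) {
            return eligible.subList(0, (int) topN);
        }
        return eligible;
    }

    public static void main(String[] args) {
        List<Record> db = List.of(
            new Record("http://a.example/", 100L, 0.5f),   // due for fetch
            new Record("http://b.example/", 100L, 1.0f),   // due for fetch
            new Record("http://c.example/", 9999L, 2.0f)); // refetch time not reached
        System.out.println(select(db, 200L, -1).size());          // prints 2
        System.out.println(select(db, 200L, 1).get(0).url);       // prints http://b.example/
    }
}
```

In this model, a url that just failed and was given a fresh retry interval simply has a fetchTime in the future, so it stays out of the next few fetch lists no matter how many db_unfetched records exist -- which matches the behaviour observed above.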

