> Hi Markus, hi List,
>
> I used the CrawlDBScanner to look at the remaining unfetched urls.
>
> Markus, you are, as usual, absolutely correct. The one and only reason why
> the urls weren't scheduled was that their refetch time hadn't come yet.
>
> As I inspected the unfetched URLs I noticed that they all have
> java.net errors, either SocketTimeoutException or UnknownHostException.
>
> Here are two examples:
>
> http://cape.gforge.cs.uni-kassel.de/    Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Wed Nov 02 21:19:07 CET 2011
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 1
> Retry interval: 603450 seconds (6 days)
> Score: 0.0
> Signature: null
> Metadata: _pst_: exception(16), lastModified=0:
> java.net.UnknownHostException: cape.gforge.cs.uni-kassel.de
>
> and
>
> http://bst-ws1.statik.bauingenieure.uni-kassel.de/web/Mitarbeiter    Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Thu Nov 03 13:28:45 CET 2011
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 1
> Retry interval: 603450 seconds (6 days)
> Score: 0.0
> Signature: null
> Metadata: _pst_: exception(16), lastModified=0:
> java.net.SocketTimeoutException: connect timed out
>
> For some reason I thought that if a page couldn't be loaded it would
> disappear from the list of unfetched urls.
>
> I know better now. :)
>
> But now another question comes up for me. I set
>
> <property>
>   <name>db.fetch.retry.max</name>
>   <value>2</value>
>   <description>The maximum number of times a url that has encountered
>   recoverable errors is generated for fetch.</description>
> </property>
>
> but in my (old) crawldb there are urls that have up to "retry 11" status.
> Does db.fetch.retry.max mean how often a url is selected for retry, even
> if its recrawl time hasn't come?
> And if so, when will urls that can't be loaded be deleted from the
> crawldb?

They are marked as DB_GONE but not removed. IIRC the retries setting is
for records with this status. The number of retries you see is normal; it
gets incremented every time. See the code:

http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup

and

http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java?view=markup

It's a bit tricky, so I can't give a complete answer now, but by following
the code paths you can understand it. The two sources above play an
integral role.

> Thank you very much
> Cheers

> Am 02.11.2011 17:16, schrieb Markus Jelsma:
> > On Wednesday 02 November 2011 16:24:09 Marek Bachmann wrote:
> >> Is there a config value that could be setting the topN value? I
> >> definitely don't use it in my script:
> >
> > -topN as command parameter
> >
> >> #!/bin/bash
> >>
> >> HADOOP_DIR=/nutch/hadoop/
> >>
> >> ./nutch generate crawldb segs
> >> newSeg=`/nutch/hadoop/bin/hadoop dfs -ls segs/ | tail -1 | awk {'print $8'}`
> >> echo $newSeg
> >>
> >> ./nutch fetch $newSeg
> >> ./nutch parse $newSeg
> >> ./nutch updatedb crawldb $newSeg
> >>
> >> Are there any tests for the generator? So that I can see what it will
> >> select?
> >>
> >> Thank you
> >>
> >> On 02.11.2011 15:30, Markus Jelsma wrote:
> >>> On Wednesday 02 November 2011 15:08:42 Marek Bachmann wrote:
> >>>> On 02.11.2011 14:17, Markus Jelsma wrote:
> >>>>> Hi Marek,
> >>>>>
> >>>>> With your settings the generator should select all records that are
> >>>>> _eligible_ for fetch due to their fetch time being expired. I suspect
> >>>>> that you generate, fetch, update and generate again. In the meanwhile
> >>>>> the DB may have changed so this would explain this behaviour.
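To make the code paths concrete, here is a very simplified model of the retry bookkeeping a crawldb update performs (this is an illustration only, not the actual Nutch classes; the real logic is spread across CrawlDatum, the CrawlDb update job and AbstractFetchSchedule): on every recoverable fetch error the retry counter is incremented, and once it exceeds db.fetch.retry.max the record's status flips to DB_GONE rather than the record being removed. The counter itself keeps counting, which is why values like "retry 11" can show up even with retry.max=2.

```java
// Simplified sketch of crawldb retry bookkeeping -- NOT the real Nutch code.
// Illustrates that records are never deleted: only their status and retry
// counter change, and the counter can grow past db.fetch.retry.max.
public class RetrySketch {
    static final int DB_UNFETCHED = 1;
    static final int DB_GONE = 3;

    static class Datum {
        int status = DB_UNFETCHED;
        int retries = 0;
    }

    /** Apply one recoverable fetch error, with retryMax = db.fetch.retry.max. */
    static void onRecoverableError(Datum d, int retryMax) {
        d.retries++;                 // incremented on every failed fetch
        if (d.retries > retryMax) {
            d.status = DB_GONE;      // marked gone, but still in the crawldb
        }
    }

    public static void main(String[] args) {
        Datum d = new Datum();
        for (int i = 0; i < 3; i++) onRecoverableError(d, 2);
        // Three failures with retry.max=2: the record is DB_GONE, and the
        // retry counter still records the full history.
        System.out.println(d.status + " " + d.retries); // prints "3 3"
    }
}
```

In this toy model, as in the thread, purging DB_GONE records would be a separate cleanup step; the update itself only rewrites the datum.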
> >>>>
> >>>> Indeed, I do so, but I do the cycles in 15 to 30 min intervals (thx to
> >>>> the small hadoop cluster ;-) )
> >>>>
> >>>> My fetch intervals are:
> >>>>
> >>>> <property>
> >>>>   <name>db.fetch.interval.max</name>
> >>>>   <value>1209600</value>
> >>>>   <description>
> >>>>     1209600 s = 14 days
> >>>>   </description>
> >>>> </property>
> >>>>
> >>>> <property>
> >>>>   <name>db.fetch.interval.default</name>
> >>>>   <value>603450</value>
> >>>>   <description>
> >>>>     603450 s = ~7 days
> >>>>   </description>
> >>>> </property>
> >>>>
> >>>> I think that the status "unfetched" is for urls that have never been
> >>>> fetched, am I right?
> >>>
> >>> Yes. See the CrawlDatum source for more descriptions of all status
> >>> codes.
> >>>
> >>>> So, what I expect is that when, after a Generate-Fetch-Parse-Update
> >>>> cycle, there are 20k unfetched urls, the generator should add all of
> >>>> them to the fetch list.
> >>>>
> >>>> An example:
> >>>>
> >>>> Started with:
> >>>>
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: Statistics for CrawlDb: crawldb
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: TOTAL urls: 241798
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 0: 236834
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 1: 4794
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 2: 170
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: min score: 0.0
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: avg score: 2.48141E-5
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: max score: 1.0
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 18314
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 2 (db_fetched): 202241
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 3 (db_gone): 8369
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 9181
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 2896
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 797
> >>>> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: CrawlDb statistics: done
> >>>>
> >>>> I ran a GFPU cycle and then:
> >>>>
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: TOTAL urls: 246753
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 0: 241755
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 1: 4810
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 2: 188
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: min score: 0.0
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: avg score: 2.4315814E-5
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: max score: 1.0
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 13753
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 2 (db_fetched): 211389
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 3 (db_gone): 8602
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 9303
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 2909
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 797
> >>>> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: CrawlDb statistics: done
> >>>>
> >>>> As you can see, there were ~18k unfetched urls but only ~9.5k have
> >>>> been processed (from the Hadoop Job Details):
> >>>
> >>> Yes, I would expect it to generate all db_unfetched records too, but
> >>> I cannot reproduce such behaviour. If I don't use topN to cut it off,
> >>> I get fetch lists with 100 million URLs incl. all db_unfetched.
> >>>
> >>>> FetcherStatus:
> >>>> moved          16
> >>>> exception      85
> >>>> access_denied  109
> >>>> success        9.214
> >>>> temp_moved     135
> >>>> notfound       111
> >>>>
> >>>> Thank you once again, Markus
> >>>>
> >>>> PS: What's the magic trick the generator does to determine a url as
> >>>> eligible? :)
> >>>
> >>> You should check the mapper method in the source to get a full picture.
> >>>
> >>>>> If you do not update the DB it will (by default) always generate
> >>>>> identical fetch lists under similar circumstances.
> >>>>>
> >>>>> I think it sometimes generates only ~1k because you already fetched
> >>>>> all other records.
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> On Wednesday 02 November 2011 14:03:08 Marek Bachmann wrote:
> >>>>>> Hello people,
> >>>>>>
> >>>>>> can someone explain to me how the generator generates the fetch
> >>>>>> lists?
> >>>>>>
> >>>>>> In particular:
> >>>>>>
> >>>>>> I don't understand why it generates fetch lists with very different
> >>>>>> amounts of urls.
> >>>>>>
> >>>>>> Sometimes it generates >25k urls and sometimes only ~1k.
> >>>>>>
> >>>>>> In every case there were more than 25k urls unfetched in the
> >>>>>> crawldb, so I was expecting that it always generates ~25k urls.
> >>>>>> But as I said before, sometimes it's only ~1k.
> >>>>>>
> >>>>>> In my nutch-site.xml I have defined the following value:
> >>>>>>
> >>>>>> <property>
> >>>>>>   <name>generate.max.count</name>
> >>>>>>   <value>-1</value>
> >>>>>>   <description>The maximum number of urls in a single
> >>>>>>   fetchlist. -1 if unlimited. The urls are counted according
> >>>>>>   to the value of the parameter generator.count.mode.
> >>>>>>   </description>
> >>>>>> </property>
> >>>>>>
> >>>>>> Any ideas?
> >>>>>>
> >>>>>> Thanks
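As a rough answer to the "magic trick" question in the thread: the selection boils down to emitting every record whose fetch time has passed, sorting by score, and letting an optional -topN cap the list. The sketch below is a simplified stand-in with invented field names, not the Generator's actual mapper, but it shows why a record whose refetch time hasn't come yet is skipped even with a high score:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified model of fetch-list generation -- NOT the real Nutch Generator.
// Eligible = fetchTime has expired; topN = -1 means unlimited.
public class GeneratorSketch {
    static class Record {
        String url; long fetchTime; float score;
        Record(String url, long fetchTime, float score) {
            this.url = url; this.fetchTime = fetchTime; this.score = score;
        }
    }

    static List<Record> select(List<Record> db, long curTime, long topN) {
        List<Record> eligible = new ArrayList<>();
        for (Record r : db) {
            if (r.fetchTime <= curTime) eligible.add(r);   // fetch time expired
        }
        // Highest-scoring records first, then cut off at topN if set.
        eligible.sort(Comparator.comparingDouble((Record r) -> r.score).reversed());
        if (topN >= 0 && eligible.size() > topN) {
            return eligible.subList(0, (int) topN);
        }
        return eligible;
    }

    public static void main(String[] args) {
        List<Record> db = List.of(
            new Record("http://a.example/", 100L, 0.5f),   // due for fetch
            new Record("http://b.example/", 100L, 1.0f),   // due for fetch
            new Record("http://c.example/", 9999L, 2.0f)); // refetch time not reached
        System.out.println(select(db, 200L, -1).size());          // prints 2
        System.out.println(select(db, 200L, 1).get(0).url);       // prints http://b.example/
    }
}
```

In this model, a url that just failed and was given a fresh retry interval simply has a fetchTime in the future, so it stays out of the next few fetch lists no matter how many db_unfetched records exist -- which matches the behaviour observed above.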

