On Wednesday 02 November 2011 15:08:42 Marek Bachmann wrote:
> On 02.11.2011 14:17, Markus Jelsma wrote:
> > Hi Marek,
> >
> > With your settings the generator should select all records that are
> > _eligible_ for fetch due to their fetch time being expired. I suspect
> > that you generate, fetch, update and generate again. In the meanwhile
> > the DB may have changed, so this would explain this behaviour.
>
> Indeed, I do, but I run the cycles at 15 to 30 minute intervals (thanks
> to the small Hadoop cluster ;-) )
>
> My fetch intervals are:
>
> <property>
>   <name>db.fetch.interval.max</name>
>   <value>1209600</value>
>   <description>
>     1209600 s = 14 days
>   </description>
> </property>
>
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>603450</value>
>   <description>
>     603450 s ≈ 7 days
>   </description>
> </property>
>
> I think that the status "unfetched" is for URLs that have never been
> fetched, am I right?
Yes. See the CrawlDatum source for descriptions of all the status codes.

> So what I expect is that when, after a Generate-Fetch-Parse-Update cycle,
> there are 20k unfetched URLs, the generator should add all of them to the
> fetch list.
>
> An example. Started with:
>
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: Statistics for CrawlDb: crawldb
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: TOTAL urls: 241798
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 0: 236834
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 1: 4794
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 2: 170
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: min score: 0.0
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: avg score: 2.48141E-5
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: max score: 1.0
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 18314
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 2 (db_fetched): 202241
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 3 (db_gone): 8369
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 9181
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 2896
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 797
> 11/11/02 14:48:14 INFO crawl.CrawlDbReader: CrawlDb statistics: done
>
> I ran a Generate-Fetch-Parse-Update cycle and then:
>
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: TOTAL urls: 246753
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 0: 241755
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 1: 4810
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 2: 188
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: min score: 0.0
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: avg score: 2.4315814E-5
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: max score: 1.0
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 13753
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 2 (db_fetched): 211389
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 3 (db_gone): 8602
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 9303
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 2909
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 797
> 11/11/02 15:07:58 INFO crawl.CrawlDbReader: CrawlDb statistics: done
>
> As you can see, there were ~18k unfetched URLs but only ~9.5k have been
> processed (from the Hadoop job details):

Yes, I would expect it to generate all db_unfetched records too, but I cannot
reproduce such behaviour. If I don't use topN to cut it off, I get fetch lists
of 100 million URLs, including all db_unfetched.

> FetcherStatus:
>   moved            16
>   exception        85
>   access_denied   109
>   success       9,214
>   temp_moved      135
>   notfound        111
>
> Thank you once again, Markus
>
> PS: What's the magic trick the generator uses to determine a URL as
> eligible? :)

You should check the mapper method in the source to get the full picture.

> > If you do not update the DB it will (by default) always generate
> > identical fetch lists under similar circumstances.
> >
> > I think it sometimes generates only ~1k because you already fetched all
> > the other records.
> >
> > Cheers
> >
> > On Wednesday 02 November 2011 14:03:08 Marek Bachmann wrote:
> >> Hello people,
> >>
> >> Can someone explain to me how the generator generates the fetch lists?
> >>
> >> In particular, I don't understand why it generates fetch lists with
> >> very different amounts of URLs.
> >>
> >> Sometimes it generates >25k URLs and sometimes only ~1k.
> >>
> >> In every case there were more than 25k URLs unfetched in the crawldb,
> >> so I was expecting it to always generate ~25k URLs. But as I said
> >> before, sometimes it is only ~1k.
> >>
> >> In my nutch-site.xml I have defined the following values:
> >>
> >> <property>
> >>   <name>generate.max.count</name>
> >>   <value>-1</value>
> >>   <description>The maximum number of urls in a single
> >>   fetchlist. -1 if unlimited. The urls are counted according
> >>   to the value of the parameter generator.count.mode.
> >>   </description>
> >> </property>
> >>
> >> Any ideas?
> >>
> >> Thanks

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
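For readers following the thread: the Generate-Fetch-Parse-Update cycle and the CrawlDb statistics quoted above correspond roughly to the Nutch 1.x commands sketched below. This is a sketch, not a verbatim script from the thread; `crawldb` and `segments` are placeholder paths, and the commands are shown commented out. The arithmetic at the end converts the fetch intervals from Marek's configuration into whole days.

```shell
#!/bin/sh
# Sketch of one Generate-Fetch-Parse-Update (GFPU) cycle in Nutch 1.x.
# 'crawldb' and 'segments' are placeholder paths, not required names.

# bin/nutch generate crawldb segments         # no -topN: take every eligible record
# SEGMENT=$(ls -d segments/* | tail -1)       # newest segment just generated
# bin/nutch fetch "$SEGMENT"
# bin/nutch parse "$SEGMENT"
# bin/nutch updatedb crawldb "$SEGMENT"
# bin/nutch readdb crawldb -stats             # prints the statistics quoted above

# Fetch intervals from the thread, converted to whole days:
echo $(( 1209600 / 86400 ))   # db.fetch.interval.max: 14 days
echo $((  603450 / 86400 ))   # db.fetch.interval.default: just under 7 full days
```

Note that with `db.fetch.interval.default` slightly under seven full days (603450 s rather than 604800 s), records come due for refetch a little earlier than a strict weekly schedule.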

