> > > 1) generate.max.count sets a limit on the number of URLs for a single
> > > host or domain - this is different from the overall limit set by the
> > > generate -topN parameter.
> > >
> > > 2) the generator only skips the URLs which are beyond the max number
> > > allowed for the host (in your case 3K). This does not mean that ALL
> > > urls for that host are skipped.
> > >
> > > Makes sense?
> >
> > Hey Julien, thank you. Yes, your description makes sense to me. So if I
> > want to fetch a list with only 3k urls, I just have to run:
> >
> > ./nutch parse $seg -topN 3000
>
> No, topN applies to the generator.
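To make the distinction concrete, here is a hedged sketch of where each limit lives. The crawldb/segment paths are placeholders, not taken from the thread: `generate.max.count` is a configuration property, while `-topN` is a flag of the generate step only.

```shell
# Sketch only; paths are illustrative. In conf/nutch-site.xml:
#   <property>
#     <name>generate.max.count</name>   <!-- per-host/domain cap -->
#     <value>3000</value>
#   </property>
#
# -topN caps the total number of urls put into the segment and belongs
# to generate, not parse:
bin/nutch generate crawl/crawldb crawl/segments -topN 3000

# parse is run on a segment and takes no -topN flag:
# bin/nutch parse crawl/segments/<segment-name>
```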
good catch Markus - I'd read "generate". Marek - this has nothing to do
with the parsing.

> > right?
> >
> > But I still don't get this message:
> > 2011-08-16 13:55:55,087 INFO crawl.Generator - Host or domain
> > cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
> >
> > What is meant by "more than 3000 URLs for all 1 segments"? Does
> > "skipping" mean that it will skip the urls after the first 3k?

With generate.max.count=3000, all urls above 3000 for a given host/domain
are skipped when generating the segment.

> > But for now you helped to solve my problem. :)
> >
> > On 16 August 2011 14:16, Marek Bachmann <[email protected]> wrote:
> > >> Hello,
> > >>
> > >> there are two things I don't understand regarding the generator:
> > >>
> > >> 1.) If I set the generate.max.count value to e.g. 3000, it seems
> > >> that this value is ignored. In every run about 20000 pages are
> > >> fetched.
> > >>
> > >> TOTAL urls: 102396
> > >> retry 0: 101679
> > >> retry 1: 325
> > >> retry 2: 392
> > >> min score: 1.0
> > >> avg score: 1.0
> > >> max score: 1.0
> > >> status 1 (db_unfetched): 33072
> > >> status 2 (db_fetched): 57146
> > >> status 3 (db_gone): 6878
> > >> status 4 (db_redir_temp): 2510
> > >> status 5 (db_redir_perm): 2509
> > >> status 6 (db_notmodified): 281
> > >> CrawlDb statistics: done
> > >>
> > >> After a generate / fetch / parse / update cycle:
> > >>
> > >> TOTAL urls: 122885
> > >> retry 0: 121816
> > >> retry 1: 677
> > >> retry 2: 392
> > >> min score: 1.0
> > >> avg score: 1.0
> > >> max score: 1.0
> > >> status 1 (db_unfetched): 32153
> > >> status 2 (db_fetched): 75366
> > >> status 3 (db_gone): 9167
> > >> status 4 (db_redir_temp): 2979
> > >> status 5 (db_redir_perm): 2878
> > >> status 6 (db_notmodified): 342
> > >> CrawlDb statistics: done
> > >>
> > >> 2.)
> > >> The next thing is related to the first one:
> > >>
> > >> The generator tells me in the log files:
> > >> 2011-08-16 13:55:55,087 INFO crawl.Generator - Host or domain
> > >> cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
> > >>
> > >> But when the fetcher is running it fetches many urls which the
> > >> generator told me it had skipped before, like:
> > >>
> > >> 2011-08-16 13:56:31,119 INFO fetcher.Fetcher - fetching
> > >> http://cms.uni-kassel.de/unicms/index.php?id=27436
> > >> 2011-08-16 13:56:31,706 INFO fetcher.Fetcher - fetching
> > >> http://cms.uni-kassel.de/unicms/index.php?id=24287&L=1
> > >>
> > >> A second example:
> > >>
> > >> 2011-08-16 13:55:59,362 INFO crawl.Generator - Host or domain
> > >> www.iset.uni-kassel.de has more than 3000 URLs for all 1 segments -
> > >> skipping
> > >>
> > >> 2011-08-16 13:56:30,783 INFO fetcher.Fetcher - fetching
> > >> http://www.iset.uni-kassel.de/abt/FB-I/publication/2011-017_Visualizing_and_Optimizing-Paper.pdf
> > >> 2011-08-16 13:56:30,813 INFO fetcher.Fetcher - fetching
> > >> http://www.iset.uni-kassel.de/abt/FB-A/publication/2010/2010_Degner_Staffelstein.pdf
> > >>
> > >> Did I do something wrong? I don't get it :)
> > >>
> > >> Thank you all
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
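The per-host behaviour discussed in the thread (only the urls beyond the cap are skipped, never the whole host) can be mimicked with a small shell sketch. The url list and the cap of 2 are made up for illustration; this is not Nutch's actual code, just the counting idea behind generate.max.count:

```shell
# Toy url list (hypothetical): three urls on one host, one on another.
cat > /tmp/urls.txt <<'EOF'
http://cms.uni-kassel.de/unicms/index.php?id=1
http://cms.uni-kassel.de/unicms/index.php?id=2
http://cms.uni-kassel.de/unicms/index.php?id=3
http://www.iset.uni-kassel.de/index.html
EOF

# Keep at most 2 urls per host ($3 is the host in an http:// url);
# everything beyond the cap is skipped, but the first urls of that
# host still make it into the "segment" and get fetched.
awk -F/ -v max=2 '{ if (++seen[$3] <= max) print }' /tmp/urls.txt
```

This prints the first two cms.uni-kassel.de urls plus the iset url, matching the log above: a host can appear in a "has more than 3000 URLs ... skipping" message and still show up in the fetcher output.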

