On 16.08.2011 15:53, Julien Nioche wrote:
1) generate.max.count sets a limit on the number of URLs for a single host
or domain - this is different from the overall limit set by the generate
-topN parameter.

2) the generator only skips the URLs which are beyond the max number allowed
for the host (in your case 3K). This does not mean that ALL URLs for that
host are skipped.

Makes sense?
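The interaction between the two limits can be sketched like this (an illustrative Python sketch of the behaviour described above, not Nutch's actual Generator code - the function and parameter names are made up for the example):

```python
from urllib.parse import urlparse

def select_for_fetch(urls, top_n, max_per_host):
    """Sketch of generate.max.count vs. -topN: take at most top_n
    URLs overall, but at most max_per_host from any single host.
    URLs beyond a host's quota are skipped individually; the host
    itself is not dropped entirely."""
    per_host = {}
    selected = []
    for url in urls:
        if len(selected) >= top_n:            # overall -topN limit reached
            break
        host = urlparse(url).netloc
        if per_host.get(host, 0) >= max_per_host:
            continue                          # skip only this URL
        per_host[host] = per_host.get(host, 0) + 1
        selected.append(url)
    return selected

# Two hosts, per-host cap of 2, overall cap of 5:
urls = [f"http://a.example/{i}" for i in range(4)] + \
       [f"http://b.example/{i}" for i in range(4)]
print(select_for_fetch(urls, top_n=5, max_per_host=2))
# → ['http://a.example/0', 'http://a.example/1',
#    'http://b.example/0', 'http://b.example/1']
```

So a host that exceeds the cap still contributes URLs up to the cap - which is why the fetcher log below still shows URLs from the "skipped" hosts.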

Hey Julien, thank you. Yes, your description makes sense to me. So if I want to fetch a list of only 3k URLs, I just have to run:

./nutch generate $crawldb $segments -topN 3000

right?

But I still don't get this message:
2011-08-16 13:55:55,087 INFO  crawl.Generator - Host or domain
cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping

What is meant by "more than 3000 URLs for all 1 segments"? Does "skipping" mean that it will skip any URLs beyond the first 3k?

But for now you have helped solve my problem. :)


On 16 August 2011 14:16, Marek Bachmann <[email protected]> wrote:

Hello,

there are two things I don't understand regarding the generator:

1.) If I set generate.max.count to a value, e.g. 3000, it seems to be
ignored: in every run about 20000 pages are fetched.

TOTAL urls: 102396
retry 0:    101679
retry 1:    325
retry 2:    392
min score:  1.0
avg score:  1.0
max score:  1.0
status 1 (db_unfetched):    33072
status 2 (db_fetched):      57146
status 3 (db_gone): 6878
status 4 (db_redir_temp):   2510
status 5 (db_redir_perm):   2509
status 6 (db_notmodified):  281
CrawlDb statistics: done

After a generate / fetch / parse / update cycle:

TOTAL urls:     122885
retry 0:        121816
retry 1:        677
retry 2:        392
min score:      1.0
avg score:      1.0
max score:      1.0
status 1 (db_unfetched):        32153
status 2 (db_fetched):  75366
status 3 (db_gone):     9167
status 4 (db_redir_temp):       2979
status 5 (db_redir_perm):       2878
status 6 (db_notmodified):      342
CrawlDb statistics: done

2.) The next thing is related to the first one:

The generator tells me in the log files:
2011-08-16 13:55:55,087 INFO  crawl.Generator - Host or domain
cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping

But when the fetcher is running it fetches many urls which the generator
told me it had skipped before, like:

2011-08-16 13:56:31,119 INFO  fetcher.Fetcher - fetching
http://cms.uni-kassel.de/unicms/index.php?id=27436
2011-08-16 13:56:31,706 INFO  fetcher.Fetcher - fetching
http://cms.uni-kassel.de/unicms/index.php?id=24287&L=1

A second example:

2011-08-16 13:55:59,362 INFO  crawl.Generator - Host or domain
www.iset.uni-kassel.de has more than 3000 URLs for all 1 segments -
skipping

2011-08-16 13:56:30,783 INFO  fetcher.Fetcher - fetching
http://www.iset.uni-kassel.de/abt/FB-I/publication/2011-017_Visualizing_and_Optimizing-Paper.pdf
2011-08-16 13:56:30,813 INFO  fetcher.Fetcher - fetching
http://www.iset.uni-kassel.de/abt/FB-A/publication/2010/2010_Degner_Staffelstein.pdf

Did I do something wrong? I don't get it :)

Thank you all
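(For reference, the per-host/domain cap discussed in this thread is configured via two properties in nutch-site.xml. A sketch, using the value from this thread - check nutch-default.xml in your Nutch distribution for the exact defaults:)

```xml
<!-- Cap the number of URLs per host (or domain) in a
     generated segment; -1 disables the limit. -->
<property>
  <name>generate.max.count</name>
  <value>3000</value>
</property>

<!-- Whether the cap above is counted per "host" or per "domain". -->
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
```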
