Hello,

there are two things I don't understand about the generator:

1.) If I set generate.max.count to a value, e.g. 3000, the value seems to be ignored: every run fetches about 20,000 pages.
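For reference, this is roughly how the relevant properties are set in my nutch-site.xml (a sketch; property names as documented for Nutch 1.x, the values are illustrative):

```xml
<!-- Cap the number of URLs per host (or domain) in each generated segment. -->
<property>
  <name>generate.max.count</name>
  <value>3000</value>
</property>

<!-- Whether the cap above counts per "host" or per "domain". -->
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
```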

TOTAL urls: 102396
retry 0:    101679
retry 1:    325
retry 2:    392
min score:  1.0
avg score:  1.0
max score:  1.0
status 1 (db_unfetched):    33072
status 2 (db_fetched):      57146
status 3 (db_gone): 6878
status 4 (db_redir_temp):   2510
status 5 (db_redir_perm):   2509
status 6 (db_notmodified):  281
CrawlDb statistics: done

After one generate / fetch / parse / update cycle the statistics look like this:

TOTAL urls:     122885
retry 0:        121816
retry 1:        677
retry 2:        392
min score:      1.0
avg score:      1.0
max score:      1.0
status 1 (db_unfetched):        32153
status 2 (db_fetched):  75366
status 3 (db_gone):     9167
status 4 (db_redir_temp):       2979
status 5 (db_redir_perm):       2878
status 6 (db_notmodified):      342
CrawlDb statistics: done
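For completeness, the cycle I run is roughly the following (a sketch; the paths and the -topN value are illustrative, not my exact invocation):

```shell
# Generate a new segment from the crawldb; -topN caps the segment's total size.
bin/nutch generate crawl/crawldb crawl/segments -topN 20000

# Pick the newest segment and run fetch, parse, and updatedb on it.
SEGMENT=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"
bin/nutch updatedb crawl/crawldb "$SEGMENT"
```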

2.) The next issue is related to the first one.

The generator reports in the log files:
2011-08-16 13:55:55,087 INFO crawl.Generator - Host or domain cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping

But when the fetcher runs, it fetches many URLs from the very hosts the generator said it had skipped, such as:

2011-08-16 13:56:31,119 INFO fetcher.Fetcher - fetching http://cms.uni-kassel.de/unicms/index.php?id=27436
2011-08-16 13:56:31,706 INFO fetcher.Fetcher - fetching http://cms.uni-kassel.de/unicms/index.php?id=24287&L=1

A second example:

2011-08-16 13:55:59,362 INFO crawl.Generator - Host or domain www.iset.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping

2011-08-16 13:56:30,783 INFO fetcher.Fetcher - fetching http://www.iset.uni-kassel.de/abt/FB-I/publication/2011-017_Visualizing_and_Optimizing-Paper.pdf
2011-08-16 13:56:30,813 INFO fetcher.Fetcher - fetching http://www.iset.uni-kassel.de/abt/FB-A/publication/2010/2010_Degner_Staffelstein.pdf

Did I do something wrong? I don't get it :)

Thank you all
