PS: I have run one more crawl and the generator gives me
2025-03-28 09:44:49,629 WARN crawl.Generator: Generator: 0 records selected
for fetching, exiting ...
Generate returned 1 (no new segments created)
even though the stats still show db_unfetched pages:
2025-03-28 09:46:48,458 INFO crawl.CrawlDbReader: Statistics for CrawlDb:
/nutch/crawl/crawldb
2025-03-28 09:46:48,459 INFO crawl.CrawlDbReader: TOTAL urls: 11761224
2025-03-28 09:46:48,462 INFO crawl.CrawlDbReader: shortest fetch interval:
91 days, 07:27:11
2025-03-28 09:46:48,463 INFO crawl.CrawlDbReader: avg fetch interval: 912
days, 21:15:04
2025-03-28 09:46:48,463 INFO crawl.CrawlDbReader: longest fetch interval:
24855 days, 03:14:07
2025-03-28 09:46:48,471 INFO crawl.CrawlDbReader: earliest fetch time: Wed
Mar 26 10:12:00 CET 2025
2025-03-28 09:46:48,472 INFO crawl.CrawlDbReader: avg of fetch times: Mon
Apr 14 23:27:00 CEST 2025
2025-03-28 09:46:48,472 INFO crawl.CrawlDbReader: latest fetch time: Tue
Aug 12 09:01:00 CEST 2025
2025-03-28 09:46:48,472 INFO crawl.CrawlDbReader: retry 0: 11686131
2025-03-28 09:46:48,472 INFO crawl.CrawlDbReader: retry 1: 37311
2025-03-28 09:46:48,472 INFO crawl.CrawlDbReader: retry 2: 37782
2025-03-28 09:46:48,486 INFO crawl.CrawlDbReader: score quantile 0.01: 0.0
2025-03-28 09:46:48,487 INFO crawl.CrawlDbReader: score quantile 0.05: 0.0
2025-03-28 09:46:48,487 INFO crawl.CrawlDbReader: score quantile 0.1:
6.1295694039572264E-9
2025-03-28 09:46:48,487 INFO crawl.CrawlDbReader: score quantile 0.2:
1.5246109515271756E-7
2025-03-28 09:46:48,487 INFO crawl.CrawlDbReader: score quantile 0.25:
3.2369442678906294E-7
2025-03-28 09:46:48,487 INFO crawl.CrawlDbReader: score quantile 0.3:
5.25055238256845E-7
2025-03-28 09:46:48,487 INFO crawl.CrawlDbReader: score quantile 0.4:
1.2284163852809848E-6
2025-03-28 09:46:48,487 INFO crawl.CrawlDbReader: score quantile 0.5:
2.749959157424023E-6
2025-03-28 09:46:48,487 INFO crawl.CrawlDbReader: score quantile 0.6:
5.929214179568089E-6
2025-03-28 09:46:48,487 INFO crawl.CrawlDbReader: score quantile 0.7:
1.4190426764291562E-5
2025-03-28 09:46:48,488 INFO crawl.CrawlDbReader: score quantile 0.75:
3.793288859535107E-5
2025-03-28 09:46:48,488 INFO crawl.CrawlDbReader: score quantile 0.8:
6.67417677730286E-5
2025-03-28 09:46:48,488 INFO crawl.CrawlDbReader: score quantile 0.9:
6.945855590774825E-4
2025-03-28 09:46:48,488 INFO crawl.CrawlDbReader: score quantile 0.95:
0.005837679396330209
2025-03-28 09:46:48,488 INFO crawl.CrawlDbReader: score quantile 0.99:
0.8907802032304117
2025-03-28 09:46:48,489 INFO crawl.CrawlDbReader: min score: 0.0
2025-03-28 09:46:48,489 INFO crawl.CrawlDbReader: avg score:
0.01544958644397896
2025-03-28 09:46:48,489 INFO crawl.CrawlDbReader: max score:
183.34019470214844
2025-03-28 09:46:48,505 INFO crawl.CrawlDbReader: status 1 (db_unfetched):
9406711
2025-03-28 09:46:48,505 INFO crawl.CrawlDbReader: status 2 (db_fetched):
1976990
2025-03-28 09:46:48,505 INFO crawl.CrawlDbReader: status 3 (db_gone):
98857
2025-03-28 09:46:48,505 INFO crawl.CrawlDbReader: status 4 (db_redir_temp):
69608
2025-03-28 09:46:48,505 INFO crawl.CrawlDbReader: status 5 (db_redir_perm):
169813
2025-03-28 09:46:48,505 INFO crawl.CrawlDbReader: status 6
(db_notmodified): 6037
2025-03-28 09:46:48,505 INFO crawl.CrawlDbReader: status 7 (db_duplicate):
33208
2025-03-28 09:46:48,505 INFO crawl.CrawlDbReader: CrawlDb statistics: done
Why is that?
On Fri, 28 Mar 2025 at 09:19, Maciek Puzianowski <[email protected]>
wrote:
> Sebastian,
>
> thank you for your quick answer,
>
> I have set generate.max.count to 1,000,000, so I don't think it should be
> a problem. generate.count.mode is set to the default byHost, and
> generate.min.score to 0.
> It is a fresh start and I have set the refetch time to 90 days, so I
> don't blame the pages that were already fetched; I expect them to be
> refetched within those 90 days. It is the db_unfetched pages that bother
> me.
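>
> Expressed as command-line overrides, my configuration amounts to the
> following (a sketch only, assuming the crawl script forwards -D
> properties as in recent Nutch versions; db.fetch.interval.default stands
> for the 90-day refetch time, 7,776,000 seconds):
>
>   bin/crawl -i --num-fetchers 4 --num-tasks 45 --num-threads 20 \
>     --size-fetchlist 500000 \
>     -D generate.max.count=1000000 \
>     -D generate.count.mode=byHost \
>     -D generate.min.score=0 \
>     -D db.fetch.interval.default=7776000 \
>     /nutch/crawl 1
>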
> I have looked at the statistics you mentioned, but I can't find the cause
> in them. I will share them below; maybe you will notice something:
>
> readdb -stats:
> 2025-03-28 09:12:54,544 INFO crawl.CrawlDbReader: Statistics for CrawlDb:
> /nutch/crawl/crawldb
> 2025-03-28 09:12:54,545 INFO crawl.CrawlDbReader: TOTAL urls: 11761224
> 2025-03-28 09:12:54,551 INFO crawl.CrawlDbReader: shortest fetch interval:
> 91 days, 07:27:11
> 2025-03-28 09:12:54,552 INFO crawl.CrawlDbReader: avg fetch interval:
> 912 days, 21:15:04
> 2025-03-28 09:12:54,552 INFO crawl.CrawlDbReader: longest fetch interval:
> 24855 days, 03:14:07
> 2025-03-28 09:12:54,563 INFO crawl.CrawlDbReader: earliest fetch time:
> Wed Mar 26 10:12:00 CET 2025
> 2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: avg of fetch times:
> Mon Apr 14 23:27:00 CEST 2025
> 2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: latest fetch time:
> Tue Aug 12 09:01:00 CEST 2025
> 2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 0: 11686131
> 2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 1: 37311
> 2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 2: 37782
> 2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.01: 0.0
> 2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.05: 0.0
> 2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.1:
> 2.8397222762258305E-9
> 2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.2:
> 1.5968612389262938E-7
> 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.25:
> 3.0930187090863353E-7
> 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.3:
> 5.316786869545881E-7
> 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.4:
> 1.1587939392930727E-6
> 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.5:
> 2.5814283409185258E-6
> 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.6:
> 5.908654160550735E-6
> 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.7:
> 1.7618150654886576E-5
> 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.75:
> 2.9190806586558215E-5
> 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.8:
> 8.921349358020518E-5
> 2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.9:
> 5.628835416129725E-4
> 2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.95:
> 0.007623361698306406
> 2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.99:
> 0.9327549852155366
> 2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: min score: 0.0
> 2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: avg score:
> 0.015449582458424396
> 2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: max score:
> 183.34019470214844
> 2025-03-28 09:12:54,602 INFO crawl.CrawlDbReader: status 1 (db_unfetched):
> 9406711
> 2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 2 (db_fetched):
> 1978175
> 2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 3 (db_gone):
> 98857
> 2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 4
> (db_redir_temp): 69608
> 2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 5
> (db_redir_perm): 169813
> 2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 6
> (db_notmodified): 6045
> 2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 7 (db_duplicate):
> 32015
> 2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: CrawlDb statistics: done
>
> Please keep in mind that the "longest fetch interval: 24855 days" is not
> an issue; a custom plugin of mine sets it deliberately to prevent certain
> pages from ever being fetched again.
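>
> If it helps, I can also inspect individual entries (URL purely
> illustrative) to see the exact status, fetch time, retry count and
> interval stored for them:
>
>   bin/nutch readdb /nutch/crawl/crawldb -url https://example.com/some/page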
>
> Generate counters:
> First one:
> 2025-03-28 07:59:06,502 INFO mapreduce.Job: Counters: 57
> File System Counters
> FILE: Number of bytes read=259336882
> FILE: Number of bytes written=731826855
> FILE: Number of read operations=0
> FILE: Number of large read operations=0
> FILE: Number of write operations=0
> HDFS: Number of bytes read=1460078564
> HDFS: Number of bytes written=66867570
> HDFS: Number of read operations=1680
> HDFS: Number of large read operations=0
> HDFS: Number of write operations=561
> HDFS: Number of bytes read erasure-coded=0
> Job Counters
> Launched map tasks=140
> Launched reduce tasks=140
> Data-local map tasks=137
> Rack-local map tasks=3
> Total time spent by all maps in occupied slots
> (ms)=16111006
> Total time spent by all reduces in occupied slots
> (ms)=20877184
> Total time spent by all map tasks (ms)=8055503
> Total time spent by all reduce tasks (ms)=10438592
> Total vcore-milliseconds taken by all map tasks=8055503
> Total vcore-milliseconds taken by all reduce tasks=10438592
> Total megabyte-milliseconds taken by all map
> tasks=65990680576
> Total megabyte-milliseconds taken by all reduce
> tasks=85512945664
> Map-Reduce Framework
> Map input records=11556117
> Map output records=9192906
> Map output bytes=1171410543
> Map output materialized bytes=373113013
> Input split bytes=19740
> Combine input records=0
> Combine output records=0
> Reduce input groups=2022938
> Reduce shuffle bytes=373113013
> Reduce input records=9192906
> Reduce output records=0
> Spilled Records=18385812
> Shuffled Maps =19600
> Failed Shuffles=0
> Merged Map outputs=19600
> GC time elapsed (ms)=87351
> CPU time spent (ms)=7536820
> Physical memory (bytes) snapshot=159464194048
> Virtual memory (bytes) snapshot=2564550729728
> Total committed heap usage (bytes)=602469826560
> Peak Map Physical memory (bytes)=703803392
> Peak Map Virtual memory (bytes)=9257803776
> Peak Reduce Physical memory (bytes)=1188073472
> Peak Reduce Virtual memory (bytes)=9272107008
> Generator
> MALFORMED_URL=2
> SCHEDULE_REJECTED=2363211
> Shuffle Errors
> BAD_ID=0
> CONNECTION=0
> IO_ERROR=0
> WRONG_LENGTH=0
> WRONG_MAP=0
> WRONG_REDUCE=0
> File Input Format Counters
> Bytes Read=1460058824
> File Output Format Counters
> Bytes Written=0
>
> Looks fine: 11,556,117 map input records minus 2,363,211 SCHEDULE_REJECTED
> gives exactly the 9,192,906 map output records, close to the 9,406,711
> db_unfetched pages.
>
> Second one:
> 2025-03-28 07:59:55,038 INFO mapreduce.Job: Counters: 55
> File System Counters
> FILE: Number of bytes read=24605358
> FILE: Number of bytes written=95719210
> FILE: Number of read operations=0
> FILE: Number of large read operations=0
> FILE: Number of write operations=0
> HDFS: Number of bytes read=66892770
> HDFS: Number of bytes written=62854154
> HDFS: Number of read operations=580
> HDFS: Number of large read operations=0
> HDFS: Number of write operations=8
> HDFS: Number of bytes read erasure-coded=0
> Job Counters
> Launched map tasks=140
> Launched reduce tasks=4
> Data-local map tasks=136
> Rack-local map tasks=4
> Total time spent by all maps in occupied slots (ms)=6395528
> Total time spent by all reduces in occupied slots
> (ms)=106208
> Total time spent by all map tasks (ms)=3197764
> Total time spent by all reduce tasks (ms)=53104
> Total vcore-milliseconds taken by all map tasks=3197764
> Total vcore-milliseconds taken by all reduce tasks=53104
> Total megabyte-milliseconds taken by all map
> tasks=26196082688
> Total megabyte-milliseconds taken by all reduce
> tasks=435027968
> Map-Reduce Framework
> Map input records=499940
> Map output records=499940
> Map output bytes=94641374
> Map output materialized bytes=20372766
> Input split bytes=25200
> Combine input records=0
> Combine output records=0
> Reduce input groups=499914
> Reduce shuffle bytes=20372766
> Reduce input records=499940
> Reduce output records=499940
> Spilled Records=999880
> Shuffled Maps =560
> Failed Shuffles=0
> Merged Map outputs=560
> GC time elapsed (ms)=32655
> CPU time spent (ms)=1206490
> Physical memory (bytes) snapshot=68925956096
> Virtual memory (bytes) snapshot=1315029512192
> Total committed heap usage (bytes)=309841625088
> Peak Map Physical memory (bytes)=519888896
> Peak Map Virtual memory (bytes)=9196109824
> Peak Reduce Physical memory (bytes)=517226496
> Peak Reduce Virtual memory (bytes)=9233317888
> Shuffle Errors
> BAD_ID=0
> CONNECTION=0
> IO_ERROR=0
> WRONG_LENGTH=0
> WRONG_MAP=0
> WRONG_REDUCE=0
> File Input Format Counters
> Bytes Read=66867570
> File Output Format Counters
> Bytes Written=62854154
>
> Again, the input/output record counts (499,940) closely match the topN
> parameter of 500,000.
>
> Finally, the fetcher counters:
> 2025-03-28 08:57:25,256 INFO mapreduce.Job: Counters: 65
> File System Counters
> FILE: Number of bytes read=1670508693
> FILE: Number of bytes written=2454067433
> FILE: Number of read operations=0
> FILE: Number of large read operations=0
> FILE: Number of write operations=0
> HDFS: Number of bytes read=7590522
> HDFS: Number of bytes written=778632266
> HDFS: Number of read operations=156
> HDFS: Number of large read operations=0
> HDFS: Number of write operations=840
> HDFS: Number of bytes read erasure-coded=0
> Job Counters
> Launched map tasks=4
> Launched reduce tasks=140
> Other local map tasks=4
> Total time spent by all maps in occupied slots
> (ms)=14400056
> Total time spent by all reduces in occupied slots
> (ms)=758192886
> Total time spent by all map tasks (ms)=7200028
> Total time spent by all reduce tasks (ms)=379096443
> Total vcore-milliseconds taken by all map tasks=7200028
> Total vcore-milliseconds taken by all reduce
> tasks=379096443
> Total megabyte-milliseconds taken by all map
> tasks=58982629376
> Total megabyte-milliseconds taken by all reduce
> tasks=3105558061056
> Map-Reduce Framework
> Map input records=59533
> Map output records=106816
> Map output bytes=3132357328
> Map output materialized bytes=743594506
> Input split bytes=636
> Combine input records=0
> Combine output records=0
> Reduce input groups=54834
> Reduce shuffle bytes=743594506
> Reduce input records=106816
> Reduce output records=106816
> Spilled Records=352386
> Shuffled Maps =560
> Failed Shuffles=0
> Merged Map outputs=560
> GC time elapsed (ms)=35203
> CPU time spent (ms)=9098390
> Physical memory (bytes) snapshot=226334003200
> Virtual memory (bytes) snapshot=1317038624768
> Total committed heap usage (bytes)=309841625088
> Peak Map Physical memory (bytes)=1892413440
> Peak Map Virtual memory (bytes)=9298661376
> Peak Reduce Physical memory (bytes)=1637662720
> Peak Reduce Virtual memory (bytes)=9212592128
> FetcherStatus
> access_denied=98
> bytes_downloaded=3062491817
> exception=1302
> gone=26
> moved=2598
> notfound=723
> redirect_count_exceeded=146
> robots_denied=1145
> robots_denied_maxcrawldelay=16
> success=47650
> temp_moved=1432
> Shuffle Errors
> BAD_ID=0
> CONNECTION=0
> IO_ERROR=0
> WRONG_LENGTH=0
> WRONG_MAP=0
> WRONG_REDUCE=0
> File Input Format Counters
> Bytes Read=7589886
> File Output Format Counters
> Bytes Written=778632266
> 2025-03-28 08:57:25,258 INFO fetcher.Fetcher: Fetcher: finished, elapsed:
> 3444497 ms
>
> The input/output records do not match the previous steps: the fetcher job
> read only 59,533 map input records although the generator wrote 499,940.
> What am I missing?
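>
> In case it narrows things down, I can also count what was actually
> written into the segment (segment path illustrative):
>
>   # prints generated/fetched/parsed counts per segment
>   bin/nutch readseg -list /nutch/crawl/segments/<segment>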
>
> Best,
> Maciej
>
> On Fri, 28 Mar 2025 at 08:54, Sebastian Nagel
> <[email protected]> wrote:
>
>> Hi Maciek,
>>
>> there are multiple configuration properties which limit the number of
>> items during fetch list generation (see the command sketch after the
>> list below).
>>
>> - topN (Ok, it's obviously not the reason)
>>
>> - a limit per host is defined by the property generate.max.count
>> - default is -1 (no limit)
>> - you may want to set a limit per host to prevent a single host with
>> an overlong fetch list from slowing down the overall crawl
>> - via generate.count.mode this limit can instead be applied per
>> registered domain or per IP address
>>
>> - generate.min.score (default: 0.0): only CrawlDatum items with a higher
>> score are put into fetch lists
>>
>> - the fetch schedule: pages are re-fetched only after a certain amount
>> of time (default: 30 days); a page which failed with a transient error
>> (not a 404) is retried only after a 1-day wait
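>>
>> As a quick sketch, these limits can be made explicit when calling the
>> generator directly (the values shown are the defaults named above; the
>> -D generic options must precede the remaining arguments):
>>
>>   bin/nutch generate \
>>     -D generate.max.count=-1 \
>>     -D generate.min.score=0 \
>>     crawldb segments -topN 500000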
>>
>>
>> Running the CrawlDb statistics
>>
>> bin/nutch readdb crawldb -stats
>>
>> shows the number of items per status, retry count, the distribution
>> of scores and fetch intervals.
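>>
>> You can also dump a sample of only the unfetched entries to inspect
>> their individual fetch times and intervals, e.g. (output directory name
>> arbitrary):
>>
>>   bin/nutch readdb crawldb -dump crawldb-unfetched \
>>     -format csv -status db_unfetched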
>>
>>
>> > should generate a segment with 500k pages to fetch.
>>
>> Not necessarily, see above. The final size of the fetch list
>> is shown by the counter
>> Reduce output records=NNN
>> Note: it appears a second time in the generator log because of
>> NUTCH-3059.
>>
>> > fetches only around 100k pages.
>>
>> The fetcher counters also show how many items were skipped because of
>> the fetcher time limit and the like (see the example below).
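>>
>> The time limit is set via the property fetcher.timelimit.mins; the crawl
>> script passes it to every fetch job (180 minutes by default there). For
>> example, when running the fetcher directly:
>>
>>   bin/nutch fetch -D fetcher.timelimit.mins=180 \
>>     crawl/segments/<segment> -threads 20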
>>
>>
>> Let us know whether you need more information. If possible, please share
>> the CrawlDb statistics or the generator and fetcher counters; it might
>> help to find the reason.
>>
>> Best,
>> Sebastian
>>
>> On 3/27/25 18:56, Maciek Puzianowski wrote:
>> > Hi,
>> > I have a problem with the topN value in Apache Nutch.
>> > I have 8 million+ db_unfetched pages in the CrawlDb. I use the crawl
>> > script with the following command:
>> > bin/crawl -i --num-fetchers 4 --num-tasks 45 --num-threads 20
>> > --size-fetchlist 500000 /nutch/crawl 1
>> > The --size-fetchlist parameter is the topN for the generate step,
>> > meaning it should generate a segment with 500k pages to fetch. However,
>> > the fetcher fetches only around 100k pages. I also get a
>> > SCHEDULE_REJECTED count of around 1 million in the generate step, but I
>> > think those are just pages that I have already fetched.
>> >
>> > I have checked the URL filters and they affect only a few pages.
>> >
>> > What could be causing such a big difference?
>> >
>> > Best,
>> > Maciej
>> >
>>
>>