Hi Maciek,

> Input/output records do not match with previous steps. What do I miss?

Indeed:

- generator
>                  Launched reduce tasks=4
>                  Reduce output records=499940

- fetcher
>          Job Counters
>                  Launched map tasks=4
>                  Map input records=59533

The number of fetch lists (fetcher map tasks) matches,
but the generator's output records (499940) and the fetcher's input
records (59533) do not.

No idea why. I'd first count items in the segment:

  bin/nutch readseg -list ...
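For example (the segment path below is hypothetical):

  bin/nutch readseg -list /nutch/crawl/segments/20250328080102

It prints, among other columns, the number of GENERATED and FETCHED
records per segment. GENERATED should equal the generator's "Reduce
output records" (499940); if it does, the records were lost on the
fetcher side, otherwise the segment was already written short.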


> PS I have run one more crawl and generator gives me
> 2025-03-28 09:44:49,629 WARN crawl.Generator: Generator: 0 records selected
> for fetching, exiting ...
> Generate returned 1 (no new segments created)
> even though stats still have db_unfetched pages:

Also no idea why this happens. Any hints in the last generator log?
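If it isn't at hand: with the default local log4j setup the messages go
to logs/hadoop.log (adjust the path if your setup differs), e.g.

  grep -E 'crawl\.Generator|WARN|ERROR' logs/hadoop.log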


One minor point:

> 2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.05:  0.0

This means that 5% of the records have score == 0.0.
With the default generate.min.score, they're never fetched.
These are likely redirects, caused by NUTCH-2749. I definitely should fix it!
There's a trivial fix / work-around: comment out the line
  datum.setScore(0.0f);
in OPICScoringFilter's initialScore method.
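Roughly, a sketch from memory (the plugin lives in src/plugin/scoring-opic):

  // OPICScoringFilter.java -- work-around for NUTCH-2749
  public void initialScore(Text url, CrawlDatum datum)
      throws ScoringFilterException {
    // datum.setScore(0.0f);  // commented out: keep whatever score the
    //                        // datum already carries, e.g. the score
    //                        // passed on from a redirecting page
  }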

You could also set http.redirect.max to a reasonable value (3-5), so that redirects are followed directly in the fetcher.
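For example, in conf/nutch-site.xml:

  <property>
    <name>http.redirect.max</name>
    <value>3</value>
  </property>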


Best,
Sebastian

On 3/28/25 09:19, Maciek Puzianowski wrote:
Sebastian,

thank you for your quick answer,

I have set generate.max.count to 1 000 000, so I think it shouldn't be a
problem; generate.count.mode is set to the default byHost, and
generate.min.score to 0.
It is a fresh start and I have set the refetch time to 90 days, so I don't
blame the pages that were already fetched; I believe they are going to be
re-fetched within those 90 days. The db_unfetched pages are what bothers me.
I was looking at the statistics you mentioned, however I can't derive the
cause from them. I will share them, maybe you will notice something... so:

readdb -stats:
2025-03-28 09:12:54,544 INFO crawl.CrawlDbReader: Statistics for CrawlDb:
/nutch/crawl/crawldb
2025-03-28 09:12:54,545 INFO crawl.CrawlDbReader: TOTAL urls:   11761224
2025-03-28 09:12:54,551 INFO crawl.CrawlDbReader: shortest fetch interval:
      91 days, 07:27:11
2025-03-28 09:12:54,552 INFO crawl.CrawlDbReader: avg fetch interval:   912
days, 21:15:04
2025-03-28 09:12:54,552 INFO crawl.CrawlDbReader: longest fetch interval:
     24855 days, 03:14:07
2025-03-28 09:12:54,563 INFO crawl.CrawlDbReader: earliest fetch time:  Wed
Mar 26 10:12:00 CET 2025
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: avg of fetch times:   Mon
Apr 14 23:27:00 CEST 2025
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: latest fetch time:    Tue
Aug 12 09:01:00 CEST 2025
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 0:      11686131
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 1:      37311
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 2:      37782
2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.01:  0.0
2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.05:  0.0
2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.1:
2.8397222762258305E-9
2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.2:
1.5968612389262938E-7
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.25:
  3.0930187090863353E-7
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.3:
5.316786869545881E-7
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.4:
1.1587939392930727E-6
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.5:
2.5814283409185258E-6
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.6:
5.908654160550735E-6
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.7:
1.7618150654886576E-5
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.75:
  2.9190806586558215E-5
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.8:
8.921349358020518E-5
2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.9:
5.628835416129725E-4
2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.95:
  0.007623361698306406
2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.99:
  0.9327549852155366
2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: min score:    0.0
2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: avg score:
  0.015449582458424396
2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: max score:
  183.34019470214844
2025-03-28 09:12:54,602 INFO crawl.CrawlDbReader: status 1 (db_unfetched):
      9406711
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 2 (db_fetched):
      1978175
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 3 (db_gone):
98857
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 4 (db_redir_temp):
     69608
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 5 (db_redir_perm):
     169813
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 6
(db_notmodified):    6045
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 7 (db_duplicate):
      32015
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: CrawlDb statistics: done

Please keep in mind that "longest fetch interval:       24855 days" is not
an issue, as I have a custom plugin that sets this value to prevent certain
pages from ever being fetched again.

Generate counters:
First one:
2025-03-28 07:59:06,502 INFO mapreduce.Job: Counters: 57
         File System Counters
                 FILE: Number of bytes read=259336882
                 FILE: Number of bytes written=731826855
                 FILE: Number of read operations=0
                 FILE: Number of large read operations=0
                 FILE: Number of write operations=0
                 HDFS: Number of bytes read=1460078564
                 HDFS: Number of bytes written=66867570
                 HDFS: Number of read operations=1680
                 HDFS: Number of large read operations=0
                 HDFS: Number of write operations=561
                 HDFS: Number of bytes read erasure-coded=0
         Job Counters
                 Launched map tasks=140
                 Launched reduce tasks=140
                 Data-local map tasks=137
                 Rack-local map tasks=3
                 Total time spent by all maps in occupied slots (ms)=16111006
                 Total time spent by all reduces in occupied slots
(ms)=20877184
                 Total time spent by all map tasks (ms)=8055503
                 Total time spent by all reduce tasks (ms)=10438592
                 Total vcore-milliseconds taken by all map tasks=8055503
                 Total vcore-milliseconds taken by all reduce tasks=10438592
                 Total megabyte-milliseconds taken by all map
tasks=65990680576
                 Total megabyte-milliseconds taken by all reduce
tasks=85512945664
         Map-Reduce Framework
                 Map input records=11556117
                 Map output records=9192906
                 Map output bytes=1171410543
                 Map output materialized bytes=373113013
                 Input split bytes=19740
                 Combine input records=0
                 Combine output records=0
                 Reduce input groups=2022938
                 Reduce shuffle bytes=373113013
                 Reduce input records=9192906
                 Reduce output records=0
                 Spilled Records=18385812
                 Shuffled Maps =19600
                 Failed Shuffles=0
                 Merged Map outputs=19600
                 GC time elapsed (ms)=87351
                 CPU time spent (ms)=7536820
                 Physical memory (bytes) snapshot=159464194048
                 Virtual memory (bytes) snapshot=2564550729728
                 Total committed heap usage (bytes)=602469826560
                 Peak Map Physical memory (bytes)=703803392
                 Peak Map Virtual memory (bytes)=9257803776
                 Peak Reduce Physical memory (bytes)=1188073472
                 Peak Reduce Virtual memory (bytes)=9272107008
         Generator
                 MALFORMED_URL=2
                 SCHEDULE_REJECTED=2363211
         Shuffle Errors
                 BAD_ID=0
                 CONNECTION=0
                 IO_ERROR=0
                 WRONG_LENGTH=0
                 WRONG_MAP=0
                 WRONG_REDUCE=0
         File Input Format Counters
                 Bytes Read=1460058824
         File Output Format Counters
                 Bytes Written=0

Looks fine; the output records match db_unfetched.

Second one:
2025-03-28 07:59:55,038 INFO mapreduce.Job: Counters: 55
         File System Counters
                 FILE: Number of bytes read=24605358
                 FILE: Number of bytes written=95719210
                 FILE: Number of read operations=0
                 FILE: Number of large read operations=0
                 FILE: Number of write operations=0
                 HDFS: Number of bytes read=66892770
                 HDFS: Number of bytes written=62854154
                 HDFS: Number of read operations=580
                 HDFS: Number of large read operations=0
                 HDFS: Number of write operations=8
                 HDFS: Number of bytes read erasure-coded=0
         Job Counters
                 Launched map tasks=140
                 Launched reduce tasks=4
                 Data-local map tasks=136
                 Rack-local map tasks=4
                 Total time spent by all maps in occupied slots (ms)=6395528
                 Total time spent by all reduces in occupied slots
(ms)=106208
                 Total time spent by all map tasks (ms)=3197764
                 Total time spent by all reduce tasks (ms)=53104
                 Total vcore-milliseconds taken by all map tasks=3197764
                 Total vcore-milliseconds taken by all reduce tasks=53104
                 Total megabyte-milliseconds taken by all map
tasks=26196082688
                 Total megabyte-milliseconds taken by all reduce
tasks=435027968
         Map-Reduce Framework
                 Map input records=499940
                 Map output records=499940
                 Map output bytes=94641374
                 Map output materialized bytes=20372766
                 Input split bytes=25200
                 Combine input records=0
                 Combine output records=0
                 Reduce input groups=499914
                 Reduce shuffle bytes=20372766
                 Reduce input records=499940
                 Reduce output records=499940
                 Spilled Records=999880
                 Shuffled Maps =560
                 Failed Shuffles=0
                 Merged Map outputs=560
                 GC time elapsed (ms)=32655
                 CPU time spent (ms)=1206490
                 Physical memory (bytes) snapshot=68925956096
                 Virtual memory (bytes) snapshot=1315029512192
                 Total committed heap usage (bytes)=309841625088
                 Peak Map Physical memory (bytes)=519888896
                 Peak Map Virtual memory (bytes)=9196109824
                 Peak Reduce Physical memory (bytes)=517226496
                 Peak Reduce Virtual memory (bytes)=9233317888
         Shuffle Errors
                 BAD_ID=0
                 CONNECTION=0
                 IO_ERROR=0
                 WRONG_LENGTH=0
                 WRONG_MAP=0
                 WRONG_REDUCE=0
         File Input Format Counters
                 Bytes Read=66867570
         File Output Format Counters
                 Bytes Written=62854154

Again, the output/input record counts match the topN parameter.

Finally, the fetcher counters:
2025-03-28 08:57:25,256 INFO mapreduce.Job: Counters: 65
         File System Counters
                 FILE: Number of bytes read=1670508693
                 FILE: Number of bytes written=2454067433
                 FILE: Number of read operations=0
                 FILE: Number of large read operations=0
                 FILE: Number of write operations=0
                 HDFS: Number of bytes read=7590522
                 HDFS: Number of bytes written=778632266
                 HDFS: Number of read operations=156
                 HDFS: Number of large read operations=0
                 HDFS: Number of write operations=840
                 HDFS: Number of bytes read erasure-coded=0
         Job Counters
                 Launched map tasks=4
                 Launched reduce tasks=140
                 Other local map tasks=4
                 Total time spent by all maps in occupied slots (ms)=14400056
                 Total time spent by all reduces in occupied slots
(ms)=758192886
                 Total time spent by all map tasks (ms)=7200028
                 Total time spent by all reduce tasks (ms)=379096443
                 Total vcore-milliseconds taken by all map tasks=7200028
                 Total vcore-milliseconds taken by all reduce tasks=379096443
                 Total megabyte-milliseconds taken by all map
tasks=58982629376
                 Total megabyte-milliseconds taken by all reduce
tasks=3105558061056
         Map-Reduce Framework
                 Map input records=59533
                 Map output records=106816
                 Map output bytes=3132357328
                 Map output materialized bytes=743594506
                 Input split bytes=636
                 Combine input records=0
                 Combine output records=0
                 Reduce input groups=54834
                 Reduce shuffle bytes=743594506
                 Reduce input records=106816
                 Reduce output records=106816
                 Spilled Records=352386
                 Shuffled Maps =560
                 Failed Shuffles=0
                 Merged Map outputs=560
                 GC time elapsed (ms)=35203
                 CPU time spent (ms)=9098390
                 Physical memory (bytes) snapshot=226334003200
                 Virtual memory (bytes) snapshot=1317038624768
                 Total committed heap usage (bytes)=309841625088
                 Peak Map Physical memory (bytes)=1892413440
                 Peak Map Virtual memory (bytes)=9298661376
                 Peak Reduce Physical memory (bytes)=1637662720
                 Peak Reduce Virtual memory (bytes)=9212592128
         FetcherStatus
                 access_denied=98
                 bytes_downloaded=3062491817
                 exception=1302
                 gone=26
                 moved=2598
                 notfound=723
                 redirect_count_exceeded=146
                 robots_denied=1145
                 robots_denied_maxcrawldelay=16
                 success=47650
                 temp_moved=1432
         Shuffle Errors
                 BAD_ID=0
                 CONNECTION=0
                 IO_ERROR=0
                 WRONG_LENGTH=0
                 WRONG_MAP=0
                 WRONG_REDUCE=0
         File Input Format Counters
                 Bytes Read=7589886
         File Output Format Counters
                 Bytes Written=778632266
2025-03-28 08:57:25,258 INFO fetcher.Fetcher: Fetcher: finished, elapsed:
3444497 ms

Input/output records do not match with previous steps. What do I miss?

Best,
Maciej

On Fri, Mar 28, 2025 at 08:54 Sebastian Nagel <[email protected]>
wrote:

Hi Maciek,

there are multiple configuration properties which set a limit on the
number of items during fetch list generation.

- topN (Ok, it's obviously not the reason)

- a limit per host is defined by the property generate.max.count
    - default is -1 (no limit)
    - you may want to set a limit per host, in order to avoid a single
      host with an overlong fetch list slowing down the overall crawl
    - via generate.count.mode this limit can be applied per registered
      domain or IP address instead (see the config sketch after this list)

- generate.min.score (default: 0.0): only CrawlDatum items with a higher
    score are put into fetch lists

- the fetch schedule: pages are re-fetched after a certain amount of
    time (default: 30 days); a page which failed to fetch with an error
    (not a 404) is retried after waiting 1 day
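
A small config sketch for the generator limits (example values only,
put into conf/nutch-site.xml):

  <property>
    <name>generate.max.count</name>
    <value>50000</value>
  </property>
  <property>
    <!-- default: host; alternatives: domain, ip -->
    <name>generate.count.mode</name>
    <value>domain</value>
  </property>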


Running the CrawlDb statistics

    bin/nutch readdb crawldb -stats

shows the number of items per status, retry count, the distribution
of scores and fetch intervals.


  > should generate a segment with 500k pages to fetch.

Not necessarily, see above. The final size of the fetch list
is shown by the counter
    Reduce output records=NNN
Note: because of NUTCH-3059 this counter appears twice in the
generator log; the second appearance is the relevant one.

  > fetches only around 100k pages.

The fetcher counters also show how many items were skipped because of
the fetcher timelimit and the like.


Let us know whether you need more information. If possible please share the
CrawlDb statistics or the generator and fetcher counters. It might help
to find the reason.

Best,
Sebastian

On 3/27/25 18:56, Maciek Puzianowski wrote:
Hi,
I have a problem with the topN value in Apache Nutch.
I have 8 million+ db_unfetched pages in the CrawlDb. I use the crawl
script with the following command:
bin/crawl -i --num-fetchers 4 --num-tasks 45 --num-threads 20
--size-fetchlist 500000 /nutch/crawl 1
The --size-fetchlist parameter is the topN for the generate step, meaning
that it should generate a segment with 500k pages to fetch. However, the
fetcher fetches only around 100k pages. I also get a SCHEDULE_REJECTED
counter of around 1 million in the generate step, but I think it's just
pages that I have already fetched.

I have checked the URL filters and they affect only a few pages.

What could be causing such a big difference?

Best,
Maciej




