Hi Maciek,
> Input/output records do not match previous steps. What am I missing?
Indeed:
- generator
> Launched reduce tasks=4
> Reduce output records=499940
- fetcher
> Job Counters
> Launched map tasks=4
> Map input records=59533
The number of fetch lists (fetcher map tasks) matches,
but generator output records and fetcher input records do not.
No idea why. I'd first count items in the segment:
bin/nutch readseg -list ...
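For example (the segment path is hypothetical, use the one created in
this crawl cycle):

  bin/nutch readseg -list /nutch/crawl/segments/20250328084500

If I remember the output right, the GENERATED column of the listing
should equal the generator's "Reduce output records".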
> PS I have run one more crawl and generator gives me
> 2025-03-28 09:44:49,629 WARN crawl.Generator: Generator: 0 records selected for fetching, exiting ...
> Generate returned 1 (no new segments created)
> even though stats still have db_unfetched pages:
Also no idea why this happens. Any hints in the last generator log?
One minor point:
> 2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.05: 0.0
This means that 5% of the records have score == 0.0.
With the default generate.min.score, they're never fetched.
These are likely redirects, because of NUTCH-2749. I definitely should fix it!
There's a trivial fix / work-around: comment out the line
  datum.setScore(0.0f);
in OPICScoringFilter's initialScore method.
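A minimal sketch of the patched method (assuming the method body of
recent Nutch versions -- please check your version's source before
editing):

  // in src/plugin/scoring-opic/.../OPICScoringFilter.java (sketch, untested)
  public void initialScore(Text url, CrawlDatum datum)
      throws ScoringFilterException {
    // datum.setScore(0.0f); // commented out: don't force the initial
    //                       // score of newly added records to 0.0
  }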
Also, you could set http.redirect.max to a reasonable value (3-5), so that
redirects are followed directly in the fetcher.
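For example, in nutch-site.xml (the value 3 is just an illustration):

  <property>
    <name>http.redirect.max</name>
    <value>3</value>
  </property>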
Best,
Sebastian
On 3/28/25 09:19, Maciek Puzianowski wrote:
Sebastian,
thank you for your quick answer,
I have set generate.max.count to 1,000,000, so I think it shouldn't be a
problem; generate.count.mode is set to the default byHost, as well as
generate.min.score to 0.
It is a fresh start and I have set the refetch time to 90 days, so I
don't blame the pages that were already fetched; I believe they will be
re-fetched within those 90 days. It's the db_unfetched pages that are
bothering me.
I was looking at the statistics you mentioned, but I can't get the
cause from them. I will share them, maybe you will notice something...
so:
readdb -stats:
2025-03-28 09:12:54,544 INFO crawl.CrawlDbReader: Statistics for CrawlDb: /nutch/crawl/crawldb
2025-03-28 09:12:54,545 INFO crawl.CrawlDbReader: TOTAL urls: 11761224
2025-03-28 09:12:54,551 INFO crawl.CrawlDbReader: shortest fetch interval: 91 days, 07:27:11
2025-03-28 09:12:54,552 INFO crawl.CrawlDbReader: avg fetch interval: 912 days, 21:15:04
2025-03-28 09:12:54,552 INFO crawl.CrawlDbReader: longest fetch interval: 24855 days, 03:14:07
2025-03-28 09:12:54,563 INFO crawl.CrawlDbReader: earliest fetch time: Wed Mar 26 10:12:00 CET 2025
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: avg of fetch times: Mon Apr 14 23:27:00 CEST 2025
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: latest fetch time: Tue Aug 12 09:01:00 CEST 2025
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 0: 11686131
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 1: 37311
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 2: 37782
2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.01: 0.0
2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.05: 0.0
2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.1: 2.8397222762258305E-9
2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.2: 1.5968612389262938E-7
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.25: 3.0930187090863353E-7
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.3: 5.316786869545881E-7
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.4: 1.1587939392930727E-6
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.5: 2.5814283409185258E-6
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.6: 5.908654160550735E-6
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.7: 1.7618150654886576E-5
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.75: 2.9190806586558215E-5
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.8: 8.921349358020518E-5
2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.9: 5.628835416129725E-4
2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.95: 0.007623361698306406
2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.99: 0.9327549852155366
2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: min score: 0.0
2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: avg score: 0.015449582458424396
2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: max score: 183.34019470214844
2025-03-28 09:12:54,602 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 9406711
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 2 (db_fetched): 1978175
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 3 (db_gone): 98857
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 69608
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 169813
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 6045
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 7 (db_duplicate): 32015
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: CrawlDb statistics: done
Please keep in mind that "longest fetch interval: 24855 days" is not
an issue, as I have a custom plugin that sets it to prevent certain
pages from ever being fetched again.
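For reference: 24855 days, 03:14:07 is exactly Integer.MAX_VALUE
seconds, so the plugin essentially does something like this with the
CrawlDatum (a hypothetical sketch, not the actual plugin code):

  // effectively "never" re-fetch this page again
  datum.setFetchInterval(Integer.MAX_VALUE); // 2147483647 s ~= 24855 days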
Generate counters:
First one:
2025-03-28 07:59:06,502 INFO mapreduce.Job: Counters: 57
File System Counters
FILE: Number of bytes read=259336882
FILE: Number of bytes written=731826855
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1460078564
HDFS: Number of bytes written=66867570
HDFS: Number of read operations=1680
HDFS: Number of large read operations=0
HDFS: Number of write operations=561
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=140
Launched reduce tasks=140
Data-local map tasks=137
Rack-local map tasks=3
Total time spent by all maps in occupied slots (ms)=16111006
Total time spent by all reduces in occupied slots (ms)=20877184
Total time spent by all map tasks (ms)=8055503
Total time spent by all reduce tasks (ms)=10438592
Total vcore-milliseconds taken by all map tasks=8055503
Total vcore-milliseconds taken by all reduce tasks=10438592
Total megabyte-milliseconds taken by all map tasks=65990680576
Total megabyte-milliseconds taken by all reduce tasks=85512945664
Map-Reduce Framework
Map input records=11556117
Map output records=9192906
Map output bytes=1171410543
Map output materialized bytes=373113013
Input split bytes=19740
Combine input records=0
Combine output records=0
Reduce input groups=2022938
Reduce shuffle bytes=373113013
Reduce input records=9192906
Reduce output records=0
Spilled Records=18385812
Shuffled Maps =19600
Failed Shuffles=0
Merged Map outputs=19600
GC time elapsed (ms)=87351
CPU time spent (ms)=7536820
Physical memory (bytes) snapshot=159464194048
Virtual memory (bytes) snapshot=2564550729728
Total committed heap usage (bytes)=602469826560
Peak Map Physical memory (bytes)=703803392
Peak Map Virtual memory (bytes)=9257803776
Peak Reduce Physical memory (bytes)=1188073472
Peak Reduce Virtual memory (bytes)=9272107008
Generator
MALFORMED_URL=2
SCHEDULE_REJECTED=2363211
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1460058824
File Output Format Counters
Bytes Written=0
Looks fine, the output records roughly match db_unfetched.
Second one:
2025-03-28 07:59:55,038 INFO mapreduce.Job: Counters: 55
File System Counters
FILE: Number of bytes read=24605358
FILE: Number of bytes written=95719210
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=66892770
HDFS: Number of bytes written=62854154
HDFS: Number of read operations=580
HDFS: Number of large read operations=0
HDFS: Number of write operations=8
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=140
Launched reduce tasks=4
Data-local map tasks=136
Rack-local map tasks=4
Total time spent by all maps in occupied slots (ms)=6395528
Total time spent by all reduces in occupied slots (ms)=106208
Total time spent by all map tasks (ms)=3197764
Total time spent by all reduce tasks (ms)=53104
Total vcore-milliseconds taken by all map tasks=3197764
Total vcore-milliseconds taken by all reduce tasks=53104
Total megabyte-milliseconds taken by all map tasks=26196082688
Total megabyte-milliseconds taken by all reduce tasks=435027968
Map-Reduce Framework
Map input records=499940
Map output records=499940
Map output bytes=94641374
Map output materialized bytes=20372766
Input split bytes=25200
Combine input records=0
Combine output records=0
Reduce input groups=499914
Reduce shuffle bytes=20372766
Reduce input records=499940
Reduce output records=499940
Spilled Records=999880
Shuffled Maps =560
Failed Shuffles=0
Merged Map outputs=560
GC time elapsed (ms)=32655
CPU time spent (ms)=1206490
Physical memory (bytes) snapshot=68925956096
Virtual memory (bytes) snapshot=1315029512192
Total committed heap usage (bytes)=309841625088
Peak Map Physical memory (bytes)=519888896
Peak Map Virtual memory (bytes)=9196109824
Peak Reduce Physical memory (bytes)=517226496
Peak Reduce Virtual memory (bytes)=9233317888
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=66867570
File Output Format Counters
Bytes Written=62854154
Again, the output/input records match the topN parameter.
Finally, the fetcher counters:
2025-03-28 08:57:25,256 INFO mapreduce.Job: Counters: 65
File System Counters
FILE: Number of bytes read=1670508693
FILE: Number of bytes written=2454067433
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=7590522
HDFS: Number of bytes written=778632266
HDFS: Number of read operations=156
HDFS: Number of large read operations=0
HDFS: Number of write operations=840
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=4
Launched reduce tasks=140
Other local map tasks=4
Total time spent by all maps in occupied slots (ms)=14400056
Total time spent by all reduces in occupied slots (ms)=758192886
Total time spent by all map tasks (ms)=7200028
Total time spent by all reduce tasks (ms)=379096443
Total vcore-milliseconds taken by all map tasks=7200028
Total vcore-milliseconds taken by all reduce tasks=379096443
Total megabyte-milliseconds taken by all map tasks=58982629376
Total megabyte-milliseconds taken by all reduce tasks=3105558061056
Map-Reduce Framework
Map input records=59533
Map output records=106816
Map output bytes=3132357328
Map output materialized bytes=743594506
Input split bytes=636
Combine input records=0
Combine output records=0
Reduce input groups=54834
Reduce shuffle bytes=743594506
Reduce input records=106816
Reduce output records=106816
Spilled Records=352386
Shuffled Maps =560
Failed Shuffles=0
Merged Map outputs=560
GC time elapsed (ms)=35203
CPU time spent (ms)=9098390
Physical memory (bytes) snapshot=226334003200
Virtual memory (bytes) snapshot=1317038624768
Total committed heap usage (bytes)=309841625088
Peak Map Physical memory (bytes)=1892413440
Peak Map Virtual memory (bytes)=9298661376
Peak Reduce Physical memory (bytes)=1637662720
Peak Reduce Virtual memory (bytes)=9212592128
FetcherStatus
access_denied=98
bytes_downloaded=3062491817
exception=1302
gone=26
moved=2598
notfound=723
redirect_count_exceeded=146
robots_denied=1145
robots_denied_maxcrawldelay=16
success=47650
temp_moved=1432
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=7589886
File Output Format Counters
Bytes Written=778632266
2025-03-28 08:57:25,258 INFO fetcher.Fetcher: Fetcher: finished, elapsed: 3444497 ms
Input/output records do not match previous steps. What am I missing?
Best,
Maciej
On Fri, 28 Mar 2025 at 08:54 Sebastian Nagel <[email protected].invalid> wrote:
Hi Maciek,
there are multiple configurations which set a limit on the items
during fetch list generation (see the config sketch after this list):
- topN (OK, it's obviously not the reason)
- a limit per host is defined by the property generate.max.count
  - default is -1 (no limit)
  - eventually you want to set a limit per host, in order to avoid
    that a single host with overlong fetch lists slows down the
    overall crawling
  - by generate.count.mode this limit can be applied per registered
    domain or IP address
- generate.min.score (default: 0.0): only CrawlDatum items with a
  higher score are put into fetch lists
- the fetch scheduling: re-fetch pages after a certain amount of time
  (default: 30 days), also wait 1 day for retrying a page which
  failed to fetch with an error (not a 404)
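An illustrative sketch of these properties for nutch-site.xml (example
values, not recommendations -- see the nutch-default.xml of your
version for the authoritative descriptions):

  <property>
    <name>generate.max.count</name>
    <value>50000</value> <!-- hypothetical per-host limit -->
  </property>
  <property>
    <name>generate.count.mode</name>
    <value>host</value> <!-- or domain / ip -->
  </property>
  <property>
    <name>generate.min.score</name>
    <value>0.0</value>
  </property>
  <property>
    <name>db.fetch.interval.default</name>
    <value>2592000</value> <!-- 30 days in seconds -->
  </property>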
Running the CrawlDb statistics
bin/nutch readdb crawldb -stats
shows the number of items per status, retry count, the distribution
of scores and fetch intervals.
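Individual records can be inspected as well, e.g. to dump the
unfetched ones (the exact readdb options may vary between Nutch
versions):

  bin/nutch readdb crawldb -dump dump_unfetched -status db_unfetched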
> should generate a segment with 500k pages to fetch.
Not necessarily, see above. The final size of the fetch list
is shown by the counter
Reduce output records=NNN
Note: it appears a second time in the generator log because of NUTCH-3059.
> fetches only around 100k pages.
The fetcher counters also show how many items are skipped because of the
fetcher timelimit and the like.
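If I remember correctly, the crawl script caps every fetch cycle via
fetcher.timelimit.mins (180 minutes by default). A sketch of raising
the limit, assuming your crawl script version forwards -D properties:

  bin/crawl -i -D fetcher.timelimit.mins=360 --num-fetchers 4 --num-tasks 45 --num-threads 20 --size-fetchlist 500000 /nutch/crawl 1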
Let us know whether you need more information. If possible please share the
CrawlDb statistics or the generator and fetcher counters. It might help
to find the reason.
Best,
Sebastian
On 3/27/25 18:56, Maciek Puzianowski wrote:
Hi,
I have a problem with the topN value in Apache Nutch.
I have 8 million+ db_unfetched pages in the CrawlDb. I use the crawl
script with the following command:
bin/crawl -i --num-fetchers 4 --num-tasks 45 --num-threads 20 --size-fetchlist 500000 /nutch/crawl 1
The --size-fetchlist parameter is the topN for the generate step,
meaning that it should generate a segment with 500k pages to fetch.
However, the fetcher fetches only around 100k pages. Also, I get a
SCHEDULE_REJECTED counter of around 1 million in the generate step,
but I think it's just the pages that I have already fetched.
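For reference, if I understand the crawl script right, the generate
step roughly corresponds to:

  bin/nutch generate /nutch/crawl/crawldb /nutch/crawl/segments -topN 500000 -numFetchers 4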
I have checked the URL filters and they affect only a few pages.
What could be causing such a big difference?
Best,
Maciej