Re: Generator or fetcher does not get topN pages

Maciek Puzianowski Sat, 05 Apr 2025 10:55:00 -0700

Sebastian,
bin/nutch readseg -list ... returns after fetching:
NAME            GENERATED       FETCHER START           FETCHER END
    FETCHED PARSED
20250328111939  1499960         2025-03-28T11:24:17     2025-03-28T12:25:20
    259802  ?
which seems right because I have set topN to 1500000 and as you can see it
fetched more than 100k.
I did an experiment and tried to set topN to 2500000 to see if it rises
proportionally. However, with topN set to 2.5 million I still get around
260k fetched pages. Maybe that fact can be a hint.


Nothing new in generator log. I ran it after few minutes and generator
didn`t say there is no more pages to fetch... maybe that is another hint.

There is a fetcher log from topN=2.5million run:

2025-03-28 16:17:16,479 INFO mapreduce.Job: Counters: 65
        File System Counters
                FILE: Number of bytes read=10247252578
                FILE: Number of bytes written=13839868610
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=33380868
                HDFS: Number of bytes written=3826710903
                HDFS: Number of read operations=156
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=840
                HDFS: Number of bytes read erasure-coded=0
        Job Counters
                Launched map tasks=4
                Launched reduce tasks=140
                Other local map tasks=4
                Total time spent by all maps in occupied slots (ms)=29492618
                Total time spent by all reduces in occupied slots
(ms)=909990524
                Total time spent by all map tasks (ms)=14746309
                Total time spent by all reduce tasks (ms)=454995262
                Total vcore-milliseconds taken by all map tasks=14746309
                Total vcore-milliseconds taken by all reduce tasks=454995262
                Total megabyte-milliseconds taken by all map
tasks=120801763328
                Total megabyte-milliseconds taken by all reduce
tasks=3727321186304
        Map-Reduce Framework
                Map input records=259372
                Map output records=531984
                Map output bytes=15944298687
                Map output materialized bytes=3596875971
                Input split bytes=636
                Combine input records=0
                Combine output records=0
                Reduce input groups=271454
                Reduce shuffle bytes=3596875971
                Reduce input records=531984
                Reduce output records=531984
                Spilled Records=2053170
                Shuffled Maps =560
                Failed Shuffles=0
                Merged Map outputs=560
                GC time elapsed (ms)=42297
                CPU time spent (ms)=15190940
                Physical memory (bytes) snapshot=239405228032
                Virtual memory (bytes) snapshot=1317027561472
                Total committed heap usage (bytes)=309841625088
                Peak Map Physical memory (bytes)=1934864384
                Peak Map Virtual memory (bytes)=9297256448
                Peak Reduce Physical memory (bytes)=1720426496
                Peak Reduce Virtual memory (bytes)=9214562304
        FetcherStatus
                access_denied=588
                bytes_downloaded=15524376452
                exception=4933
                gone=85
                moved=14393
                notfound=4081
                redirect_count_exceeded=46
                robots_denied=4769
                robots_denied_maxcrawldelay=48
                success=233691
                temp_moved=10665
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=33380232
        File Output Format Counters
                Bytes Written=3826710903
2025-03-28 16:17:16,480 INFO fetcher.Fetcher: Fetcher: finished, elapsed:
4749335 ms

To summarize, generator works probably fine as readseg gives me good amount
of pages, but somehow fetcher loses most of them. I will try to look in
that way, however I will appriciate any more help.

Best,
Maciej

pt., 28 mar 2025 o 11:53 Sebastian Nagel <[email protected]>
napisał(a):

> Hi Maciek,
>
>  > Input/output records do not match with previous steps. What do I miss?
>
> Indeed:
>
> - generator
>  >                  Launched reduce tasks=4
>  >                  Reduce output records=499940
>
> - fetcher
>  >          Job Counters
>  >                  Launched map tasks=4
>  >                  Map input records=59533
>
> The number of fetch lists (fetcher map tasks) does fit,
> but generator output records and fetcher input records do not.
>
> No idea why. I'd first count items in the segment:
>
>    bin/nutch readseg -list ...
>
>
>  > PS I have run one more crawl and generator gives me
>  > 2025-03-28 09:44:49,629 WARN crawl.Generator: Generator: 0 records
> selected
>  > for fetching, exiting ...
>  > Generate returned 1 (no new segments created)
>  > even though stats still have db_unfetched pages:
>
> Also no idea why this happens. Any hints in the last generator log?
>
>
> One minor point:
>
>  > 2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.05:
> 0.0
>
> This means there are 5% of the records with score == 0.0
> With the default generate.min.score, they're never fetched.
> It's likely redirects because of NUTCH-2749. I definitely should fix it!
> There's an trivial fix / work-around: comment the line
>    datum.setScore(0.0f);
> in OPICScoringFilter initialScore.
>
> Also you could set http.redirect.max to a reasonable value (3-5) and
> redirects
> are followed directly in the fetcher.
>
>
> Best,
> Sebastian
>
> On 3/28/25 09:19, Maciek Puzianowski wrote:
> > Sebastian,
> >
> > thank you for your quick answer,
> >
> > I have set generate.max.count to 1 000 000, so I think it shouldn`t be a
> > problem, generate.count.mode is set to default byHost as well as
> > generate.min.score to 0.
> > It is a fresh start and I have set refetch time to 90 days so I don`t
> blame
> > the pages that were already fetched, I believe they are going to fetch in
> > those 90 days. db_unfetched pages are bothering me.
> > I was looking at the statistics you mentioned, however I can`t get the
> > cause from them. I will share them, maybe you will notice something...
> so:
> >
> > readdb -stats:
> > 2025-03-28 09:12:54,544 INFO crawl.CrawlDbReader: Statistics for CrawlDb:
> > /nutch/crawl/crawldb
> > 2025-03-28 09:12:54,545 INFO crawl.CrawlDbReader: TOTAL urls:   11761224
> > 2025-03-28 09:12:54,551 INFO crawl.CrawlDbReader: shortest fetch
> interval:
> >       91 days, 07:27:11
> > 2025-03-28 09:12:54,552 INFO crawl.CrawlDbReader: avg fetch interval:
>  912
> > days, 21:15:04
> > 2025-03-28 09:12:54,552 INFO crawl.CrawlDbReader: longest fetch interval:
> >      24855 days, 03:14:07
> > 2025-03-28 09:12:54,563 INFO crawl.CrawlDbReader: earliest fetch time:
> Wed
> > Mar 26 10:12:00 CET 2025
> > 2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: avg of fetch times:
>  Mon
> > Apr 14 23:27:00 CEST 2025
> > 2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: latest fetch time:
> Tue
> > Aug 12 09:01:00 CEST 2025
> > 2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 0:      11686131
> > 2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 1:      37311
> > 2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 2:      37782
> > 2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.01:
> 0.0
> > 2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.05:
> 0.0
> > 2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.1:
> > 2.8397222762258305E-9
> > 2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.2:
> > 1.5968612389262938E-7
> > 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.25:
> >   3.0930187090863353E-7
> > 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.3:
> > 5.316786869545881E-7
> > 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.4:
> > 1.1587939392930727E-6
> > 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.5:
> > 2.5814283409185258E-6
> > 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.6:
> > 5.908654160550735E-6
> > 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.7:
> > 1.7618150654886576E-5
> > 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.75:
> >   2.9190806586558215E-5
> > 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.8:
> > 8.921349358020518E-5
> > 2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.9:
> > 5.628835416129725E-4
> > 2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.95:
> >   0.007623361698306406
> > 2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.99:
> >   0.9327549852155366
> > 2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: min score:    0.0
> > 2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: avg score:
> >   0.015449582458424396
> > 2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: max score:
> >   183.34019470214844
> > 2025-03-28 09:12:54,602 INFO crawl.CrawlDbReader: status 1
> (db_unfetched):
> >       9406711
> > 2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 2 (db_fetched):
> >       1978175
> > 2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 3 (db_gone):
> > 98857
> > 2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 4
> (db_redir_temp):
> >      69608
> > 2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 5
> (db_redir_perm):
> >      169813
> > 2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 6
> > (db_notmodified):    6045
> > 2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 7
> (db_duplicate):
> >       32015
> > 2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: CrawlDb statistics:
> done
> >
> > Please keep in mind, that "longest fetch interval:       24855 days" is
> not
> > an issue as I have custom plugin that is doing so to prevent certain
> pages
> > from fetching ever again.
> >
> > Generate counters:
> > First one:
> > 2025-03-28 07:59:06,502 INFO mapreduce.Job: Counters: 57
> >          File System Counters
> >                  FILE: Number of bytes read=259336882
> >                  FILE: Number of bytes written=731826855
> >                  FILE: Number of read operations=0
> >                  FILE: Number of large read operations=0
> >                  FILE: Number of write operations=0
> >                  HDFS: Number of bytes read=1460078564
> >                  HDFS: Number of bytes written=66867570
> >                  HDFS: Number of read operations=1680
> >                  HDFS: Number of large read operations=0
> >                  HDFS: Number of write operations=561
> >                  HDFS: Number of bytes read erasure-coded=0
> >          Job Counters
> >                  Launched map tasks=140
> >                  Launched reduce tasks=140
> >                  Data-local map tasks=137
> >                  Rack-local map tasks=3
> >                  Total time spent by all maps in occupied slots
> (ms)=16111006
> >                  Total time spent by all reduces in occupied slots
> > (ms)=20877184
> >                  Total time spent by all map tasks (ms)=8055503
> >                  Total time spent by all reduce tasks (ms)=10438592
> >                  Total vcore-milliseconds taken by all map tasks=8055503
> >                  Total vcore-milliseconds taken by all reduce
> tasks=10438592
> >                  Total megabyte-milliseconds taken by all map
> > tasks=65990680576
> >                  Total megabyte-milliseconds taken by all reduce
> > tasks=85512945664
> >          Map-Reduce Framework
> >                  Map input records=11556117
> >                  Map output records=9192906
> >                  Map output bytes=1171410543
> >                  Map output materialized bytes=373113013
> >                  Input split bytes=19740
> >                  Combine input records=0
> >                  Combine output records=0
> >                  Reduce input groups=2022938
> >                  Reduce shuffle bytes=373113013
> >                  Reduce input records=9192906
> >                  Reduce output records=0
> >                  Spilled Records=18385812
> >                  Shuffled Maps =19600
> >                  Failed Shuffles=0
> >                  Merged Map outputs=19600
> >                  GC time elapsed (ms)=87351
> >                  CPU time spent (ms)=7536820
> >                  Physical memory (bytes) snapshot=159464194048
> >                  Virtual memory (bytes) snapshot=2564550729728
> >                  Total committed heap usage (bytes)=602469826560
> >                  Peak Map Physical memory (bytes)=703803392
> >                  Peak Map Virtual memory (bytes)=9257803776
> >                  Peak Reduce Physical memory (bytes)=1188073472
> >                  Peak Reduce Virtual memory (bytes)=9272107008
> >          Generator
> >                  MALFORMED_URL=2
> >                  SCHEDULE_REJECTED=2363211
> >          Shuffle Errors
> >                  BAD_ID=0
> >                  CONNECTION=0
> >                  IO_ERROR=0
> >                  WRONG_LENGTH=0
> >                  WRONG_MAP=0
> >                  WRONG_REDUCE=0
> >          File Input Format Counters
> >                  Bytes Read=1460058824
> >          File Output Format Counters
> >                  Bytes Written=0
> >
> > Looks fine, output records match with db_unfetched.
> >
> > Second one:
> > 2025-03-28 07:59:55,038 INFO mapreduce.Job: Counters: 55
> >          File System Counters
> >                  FILE: Number of bytes read=24605358
> >                  FILE: Number of bytes written=95719210
> >                  FILE: Number of read operations=0
> >                  FILE: Number of large read operations=0
> >                  FILE: Number of write operations=0
> >                  HDFS: Number of bytes read=66892770
> >                  HDFS: Number of bytes written=62854154
> >                  HDFS: Number of read operations=580
> >                  HDFS: Number of large read operations=0
> >                  HDFS: Number of write operations=8
> >                  HDFS: Number of bytes read erasure-coded=0
> >          Job Counters
> >                  Launched map tasks=140
> >                  Launched reduce tasks=4
> >                  Data-local map tasks=136
> >                  Rack-local map tasks=4
> >                  Total time spent by all maps in occupied slots
> (ms)=6395528
> >                  Total time spent by all reduces in occupied slots
> > (ms)=106208
> >                  Total time spent by all map tasks (ms)=3197764
> >                  Total time spent by all reduce tasks (ms)=53104
> >                  Total vcore-milliseconds taken by all map tasks=3197764
> >                  Total vcore-milliseconds taken by all reduce tasks=53104
> >                  Total megabyte-milliseconds taken by all map
> > tasks=26196082688
> >                  Total megabyte-milliseconds taken by all reduce
> > tasks=435027968
> >          Map-Reduce Framework
> >                  Map input records=499940
> >                  Map output records=499940
> >                  Map output bytes=94641374
> >                  Map output materialized bytes=20372766
> >                  Input split bytes=25200
> >                  Combine input records=0
> >                  Combine output records=0
> >                  Reduce input groups=499914
> >                  Reduce shuffle bytes=20372766
> >                  Reduce input records=499940
> >                  Reduce output records=499940
> >                  Spilled Records=999880
> >                  Shuffled Maps =560
> >                  Failed Shuffles=0
> >                  Merged Map outputs=560
> >                  GC time elapsed (ms)=32655
> >                  CPU time spent (ms)=1206490
> >                  Physical memory (bytes) snapshot=68925956096
> >                  Virtual memory (bytes) snapshot=1315029512192
> >                  Total committed heap usage (bytes)=309841625088
> >                  Peak Map Physical memory (bytes)=519888896
> >                  Peak Map Virtual memory (bytes)=9196109824
> >                  Peak Reduce Physical memory (bytes)=517226496
> >                  Peak Reduce Virtual memory (bytes)=9233317888
> >          Shuffle Errors
> >                  BAD_ID=0
> >                  CONNECTION=0
> >                  IO_ERROR=0
> >                  WRONG_LENGTH=0
> >                  WRONG_MAP=0
> >                  WRONG_REDUCE=0
> >          File Input Format Counters
> >                  Bytes Read=66867570
> >          File Output Format Counters
> >                  Bytes Written=62854154
> >
> > Again output/input records matches with topN parameter.
> >
> > Finally, fetcher counter:
> > 2025-03-28 08:57:25,256 INFO mapreduce.Job: Counters: 65
> >          File System Counters
> >                  FILE: Number of bytes read=1670508693
> >                  FILE: Number of bytes written=2454067433
> >                  FILE: Number of read operations=0
> >                  FILE: Number of large read operations=0
> >                  FILE: Number of write operations=0
> >                  HDFS: Number of bytes read=7590522
> >                  HDFS: Number of bytes written=778632266
> >                  HDFS: Number of read operations=156
> >                  HDFS: Number of large read operations=0
> >                  HDFS: Number of write operations=840
> >                  HDFS: Number of bytes read erasure-coded=0
> >          Job Counters
> >                  Launched map tasks=4
> >                  Launched reduce tasks=140
> >                  Other local map tasks=4
> >                  Total time spent by all maps in occupied slots
> (ms)=14400056
> >                  Total time spent by all reduces in occupied slots
> > (ms)=758192886
> >                  Total time spent by all map tasks (ms)=7200028
> >                  Total time spent by all reduce tasks (ms)=379096443
> >                  Total vcore-milliseconds taken by all map tasks=7200028
> >                  Total vcore-milliseconds taken by all reduce
> tasks=379096443
> >                  Total megabyte-milliseconds taken by all map
> > tasks=58982629376
> >                  Total megabyte-milliseconds taken by all reduce
> > tasks=3105558061056
> >          Map-Reduce Framework
> >                  Map input records=59533
> >                  Map output records=106816
> >                  Map output bytes=3132357328
> >                  Map output materialized bytes=743594506
> >                  Input split bytes=636
> >                  Combine input records=0
> >                  Combine output records=0
> >                  Reduce input groups=54834
> >                  Reduce shuffle bytes=743594506
> >                  Reduce input records=106816
> >                  Reduce output records=106816
> >                  Spilled Records=352386
> >                  Shuffled Maps =560
> >                  Failed Shuffles=0
> >                  Merged Map outputs=560
> >                  GC time elapsed (ms)=35203
> >                  CPU time spent (ms)=9098390
> >                  Physical memory (bytes) snapshot=226334003200
> >                  Virtual memory (bytes) snapshot=1317038624768
> >                  Total committed heap usage (bytes)=309841625088
> >                  Peak Map Physical memory (bytes)=1892413440
> >                  Peak Map Virtual memory (bytes)=9298661376
> >                  Peak Reduce Physical memory (bytes)=1637662720
> >                  Peak Reduce Virtual memory (bytes)=9212592128
> >          FetcherStatus
> >                  access_denied=98
> >                  bytes_downloaded=3062491817
> >                  exception=1302
> >                  gone=26
> >                  moved=2598
> >                  notfound=723
> >                  redirect_count_exceeded=146
> >                  robots_denied=1145
> >                  robots_denied_maxcrawldelay=16
> >                  success=47650
> >                  temp_moved=1432
> >          Shuffle Errors
> >                  BAD_ID=0
> >                  CONNECTION=0
> >                  IO_ERROR=0
> >                  WRONG_LENGTH=0
> >                  WRONG_MAP=0
> >                  WRONG_REDUCE=0
> >          File Input Format Counters
> >                  Bytes Read=7589886
> >          File Output Format Counters
> >                  Bytes Written=778632266
> > 2025-03-28 08:57:25,258 INFO fetcher.Fetcher: Fetcher: finished, elapsed:
> > 3444497 ms
> >
> > Input/output records do not match with previous steps. What do I miss?
> >
> > Best,
> > Maciej
> >
> > pt., 28 mar 2025 o 08:54 Sebastian Nagel <[email protected]
> .invalid>
> > napisał(a):
> >
> >> Hi Maciek,
> >>
> >> there are multiple configurations which set a limit on the items
> >> during fetch list generations.
> >>
> >> - topN (Ok, it's obviously not the reason)
> >>
> >> - a limit per host is defined by the property generate.max.count
> >>     - default is -1 (no limit)
> >>     - eventually you want to set a limit per host, in order to avoid
> that
> >>       a single host with overlong fetch lists slows down the overall
> >> crawling
> >>     - by generate.count.mode this limit can be applied per registered
> >>       domain or IP address
> >>
> >> - generate.min.score (default: 0.0): only CrawlDatum items with a higher
> >>     score are put into fetch lists
> >>
> >> - the fetch scheduling: re-fetch pages after a certain amount of time
> >>     (default: 30 days), also wait 1 day for retrying a page which failed
> >>     to fetch with an error (not a 404)
> >>
> >>
> >> Running the CrawlDb statistics
> >>
> >>     bin/nutch readdb crawldb -stats
> >>
> >> shows the number of items per status, retry count, the distribution
> >> of scores and fetch intervals.
> >>
> >>
> >>   > should generate a segment with 500k pages to fetch.
> >>
> >> Not necessarily, see above. The final size of the fetch list
> >> is shown by the counter
> >>     Reduce output records=NNN
> >> Note: the second appearance of it in the generator log because of
> >> NUTCH-3059.
> >>
> >>   > fetches only around 100k pages.
> >>
> >> The fetcher counters also show how many items are skipped because of the
> >> fetcher timelimit and alike.
> >>
> >>
> >> Let us know whether you need more information. If possible please share
> the
> >> CrawlDb statistics or the generator and fetcher counters. It might help
> >> to find the reason.
> >>
> >> Best,
> >> Sebastian
> >>
> >> On 3/27/25 18:56, Maciek Puzianowski wrote:
> >>> Hi,
> >>> I have a problem with topN value in Apache Nutch.
> >>> I have 8 million+ db_unfetched pages in crawldb. I use crawl script
> with
> >>> following command:
> >>> bin/crawl -i --num-fetchers 4 --num-tasks 45 --num-threads 20
> >>> --size-fetchlist 500000 /nutch/crawl 1
> >>> --size-fetchlist parameter is the topN for generate method, meaning
> that
> >> it
> >>> should generate a segment with 500k pages to fetch. However, the
> fetcher
> >>> fetches only around 100k pages. Also I get around 1 million
> >>> SCHEDULE_REJECTED counter in generate method, but I think its just
> pages
> >>> that I have already fetched.
> >>>
> >>> I have checked url filters and they affect only few pages.
> >>>
> >>> What can be causing such issue with such a big difference?
> >>>
> >>> Best,
> >>> Maciej
> >>>
> >>
> >>
> >
>
>

Re: Generator or fetcher does not get topN pages

Reply via email to