Re: Generator or fetcher does not get topN pages

Maciek Puzianowski Mon, 31 Mar 2025 05:40:16 -0700

PS
I think I manged to overpass it, but I don`t think it is supposed to be
like that. I have modified crawl command to:
bin/crawl -i --num-fetchers 140 --num-tasks 170 --num-threads 50
/nutch/crawl 1
I have 4 worker nodes in my hadoop cluster so previously I set
--num-fetchers to 4 as crawl script recommends. Now I changed it to 140 and
now my fetcher job returns:
Map-Reduce Framework
                Map input records=487214
                Map output records=883753


So it looks like it works fine now, however why does it fix my problem?

Best,
Maciej

pt., 28 mar 2025 o 16:51 Maciek Puzianowski <[email protected]>
napisał(a):

> Sebastian,
> bin/nutch readseg -list ... returns after fetching:
> NAME            GENERATED       FETCHER START           FETCHER END
>       FETCHED PARSED
> 20250328111939  1499960         2025-03-28T11:24:17
> 2025-03-28T12:25:20     259802  ?
> which seems right because I have set topN to 1500000 and as you can see it
> fetched more than 100k.
> I did an experiment and tried to set topN to 2500000 to see if it rises
> proportionally. However, with topN set to 2.5 million I still get around
> 260k fetched pages. Maybe that fact can be a hint.
>
> Nothing new in generator log. I ran it after few minutes and generator
> didn`t say there is no more pages to fetch... maybe that is another hint.
>
> There is a fetcher log from topN=2.5million run:
>
> 2025-03-28 16:17:16,479 INFO mapreduce.Job: Counters: 65
>         File System Counters
>                 FILE: Number of bytes read=10247252578
>                 FILE: Number of bytes written=13839868610
>                 FILE: Number of read operations=0
>                 FILE: Number of large read operations=0
>                 FILE: Number of write operations=0
>                 HDFS: Number of bytes read=33380868
>                 HDFS: Number of bytes written=3826710903
>                 HDFS: Number of read operations=156
>                 HDFS: Number of large read operations=0
>                 HDFS: Number of write operations=840
>                 HDFS: Number of bytes read erasure-coded=0
>         Job Counters
>                 Launched map tasks=4
>                 Launched reduce tasks=140
>                 Other local map tasks=4
>                 Total time spent by all maps in occupied slots
> (ms)=29492618
>                 Total time spent by all reduces in occupied slots
> (ms)=909990524
>                 Total time spent by all map tasks (ms)=14746309
>                 Total time spent by all reduce tasks (ms)=454995262
>                 Total vcore-milliseconds taken by all map tasks=14746309
>                 Total vcore-milliseconds taken by all reduce
> tasks=454995262
>                 Total megabyte-milliseconds taken by all map
> tasks=120801763328
>                 Total megabyte-milliseconds taken by all reduce
> tasks=3727321186304
>         Map-Reduce Framework
>                 Map input records=259372
>                 Map output records=531984
>                 Map output bytes=15944298687
>                 Map output materialized bytes=3596875971
>                 Input split bytes=636
>                 Combine input records=0
>                 Combine output records=0
>                 Reduce input groups=271454
>                 Reduce shuffle bytes=3596875971
>                 Reduce input records=531984
>                 Reduce output records=531984
>                 Spilled Records=2053170
>                 Shuffled Maps =560
>                 Failed Shuffles=0
>                 Merged Map outputs=560
>                 GC time elapsed (ms)=42297
>                 CPU time spent (ms)=15190940
>                 Physical memory (bytes) snapshot=239405228032
>                 Virtual memory (bytes) snapshot=1317027561472
>                 Total committed heap usage (bytes)=309841625088
>                 Peak Map Physical memory (bytes)=1934864384
>                 Peak Map Virtual memory (bytes)=9297256448
>                 Peak Reduce Physical memory (bytes)=1720426496
>                 Peak Reduce Virtual memory (bytes)=9214562304
>         FetcherStatus
>                 access_denied=588
>                 bytes_downloaded=15524376452
>                 exception=4933
>                 gone=85
>                 moved=14393
>                 notfound=4081
>                 redirect_count_exceeded=46
>                 robots_denied=4769
>                 robots_denied_maxcrawldelay=48
>                 success=233691
>                 temp_moved=10665
>         Shuffle Errors
>                 BAD_ID=0
>                 CONNECTION=0
>                 IO_ERROR=0
>                 WRONG_LENGTH=0
>                 WRONG_MAP=0
>                 WRONG_REDUCE=0
>         File Input Format Counters
>                 Bytes Read=33380232
>         File Output Format Counters
>                 Bytes Written=3826710903
> 2025-03-28 16:17:16,480 INFO fetcher.Fetcher: Fetcher: finished, elapsed:
> 4749335 ms
>
> To summarize, generator works probably fine as readseg gives me good
> amount of pages, but somehow fetcher loses most of them. I will try to look
> in that way, however I will appriciate any more help.
>
> Best,
> Maciej
>
> pt., 28 mar 2025 o 11:53 Sebastian Nagel
> <[email protected]> napisał(a):
>
>> Hi Maciek,
>>
>>  > Input/output records do not match with previous steps. What do I miss?
>>
>> Indeed:
>>
>> - generator
>>  >                  Launched reduce tasks=4
>>  >                  Reduce output records=499940
>>
>> - fetcher
>>  >          Job Counters
>>  >                  Launched map tasks=4
>>  >                  Map input records=59533
>>
>> The number of fetch lists (fetcher map tasks) does fit,
>> but generator output records and fetcher input records do not.
>>
>> No idea why. I'd first count items in the segment:
>>
>>    bin/nutch readseg -list ...
>>
>>
>>  > PS I have run one more crawl and generator gives me
>>  > 2025-03-28 09:44:49,629 WARN crawl.Generator: Generator: 0 records
>> selected
>>  > for fetching, exiting ...
>>  > Generate returned 1 (no new segments created)
>>  > even though stats still have db_unfetched pages:
>>
>> Also no idea why this happens. Any hints in the last generator log?
>>
>>
>> One minor point:
>>
>>  > 2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile
>> 0.05:  0.0
>>
>> This means there are 5% of the records with score == 0.0
>> With the default generate.min.score, they're never fetched.
>> It's likely redirects because of NUTCH-2749. I definitely should fix it!
>> There's an trivial fix / work-around: comment the line
>>    datum.setScore(0.0f);
>> in OPICScoringFilter initialScore.
>>
>> Also you could set http.redirect.max to a reasonable value (3-5) and
>> redirects
>> are followed directly in the fetcher.
>>
>>
>> Best,
>> Sebastian
>>
>> On 3/28/25 09:19, Maciek Puzianowski wrote:
>> > Sebastian,
>> >
>> > thank you for your quick answer,
>> >
>> > I have set generate.max.count to 1 000 000, so I think it shouldn`t be a
>> > problem, generate.count.mode is set to default byHost as well as
>> > generate.min.score to 0.
>> > It is a fresh start and I have set refetch time to 90 days so I don`t
>> blame
>> > the pages that were already fetched, I believe they are going to fetch
>> in
>> > those 90 days. db_unfetched pages are bothering me.
>> > I was looking at the statistics you mentioned, however I can`t get the
>> > cause from them. I will share them, maybe you will notice something...
>> so:
>> >
>> > readdb -stats:
>> > 2025-03-28 09:12:54,544 INFO crawl.CrawlDbReader: Statistics for
>> CrawlDb:
>> > /nutch/crawl/crawldb
>> > 2025-03-28 09:12:54,545 INFO crawl.CrawlDbReader: TOTAL urls:   11761224
>> > 2025-03-28 09:12:54,551 INFO crawl.CrawlDbReader: shortest fetch
>> interval:
>> >       91 days, 07:27:11
>> > 2025-03-28 09:12:54,552 INFO crawl.CrawlDbReader: avg fetch interval:
>>  912
>> > days, 21:15:04
>> > 2025-03-28 09:12:54,552 INFO crawl.CrawlDbReader: longest fetch
>> interval:
>> >      24855 days, 03:14:07
>> > 2025-03-28 09:12:54,563 INFO crawl.CrawlDbReader: earliest fetch time:
>> Wed
>> > Mar 26 10:12:00 CET 2025
>> > 2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: avg of fetch times:
>>  Mon
>> > Apr 14 23:27:00 CEST 2025
>> > 2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: latest fetch time:
>> Tue
>> > Aug 12 09:01:00 CEST 2025
>> > 2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 0:      11686131
>> > 2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 1:      37311
>> > 2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 2:      37782
>> > 2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.01:
>> 0.0
>> > 2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.05:
>> 0.0
>> > 2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.1:
>> > 2.8397222762258305E-9
>> > 2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.2:
>> > 1.5968612389262938E-7
>> > 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.25:
>> >   3.0930187090863353E-7
>> > 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.3:
>> > 5.316786869545881E-7
>> > 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.4:
>> > 1.1587939392930727E-6
>> > 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.5:
>> > 2.5814283409185258E-6
>> > 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.6:
>> > 5.908654160550735E-6
>> > 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.7:
>> > 1.7618150654886576E-5
>> > 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.75:
>> >   2.9190806586558215E-5
>> > 2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.8:
>> > 8.921349358020518E-5
>> > 2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.9:
>> > 5.628835416129725E-4
>> > 2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.95:
>> >   0.007623361698306406
>> > 2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.99:
>> >   0.9327549852155366
>> > 2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: min score:    0.0
>> > 2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: avg score:
>> >   0.015449582458424396
>> > 2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: max score:
>> >   183.34019470214844
>> > 2025-03-28 09:12:54,602 INFO crawl.CrawlDbReader: status 1
>> (db_unfetched):
>> >       9406711
>> > 2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 2 (db_fetched):
>> >       1978175
>> > 2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 3 (db_gone):
>> > 98857
>> > 2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 4
>> (db_redir_temp):
>> >      69608
>> > 2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 5
>> (db_redir_perm):
>> >      169813
>> > 2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 6
>> > (db_notmodified):    6045
>> > 2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 7
>> (db_duplicate):
>> >       32015
>> > 2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: CrawlDb statistics:
>> done
>> >
>> > Please keep in mind, that "longest fetch interval:       24855 days" is
>> not
>> > an issue as I have custom plugin that is doing so to prevent certain
>> pages
>> > from fetching ever again.
>> >
>> > Generate counters:
>> > First one:
>> > 2025-03-28 07:59:06,502 INFO mapreduce.Job: Counters: 57
>> >          File System Counters
>> >                  FILE: Number of bytes read=259336882
>> >                  FILE: Number of bytes written=731826855
>> >                  FILE: Number of read operations=0
>> >                  FILE: Number of large read operations=0
>> >                  FILE: Number of write operations=0
>> >                  HDFS: Number of bytes read=1460078564
>> >                  HDFS: Number of bytes written=66867570
>> >                  HDFS: Number of read operations=1680
>> >                  HDFS: Number of large read operations=0
>> >                  HDFS: Number of write operations=561
>> >                  HDFS: Number of bytes read erasure-coded=0
>> >          Job Counters
>> >                  Launched map tasks=140
>> >                  Launched reduce tasks=140
>> >                  Data-local map tasks=137
>> >                  Rack-local map tasks=3
>> >                  Total time spent by all maps in occupied slots
>> (ms)=16111006
>> >                  Total time spent by all reduces in occupied slots
>> > (ms)=20877184
>> >                  Total time spent by all map tasks (ms)=8055503
>> >                  Total time spent by all reduce tasks (ms)=10438592
>> >                  Total vcore-milliseconds taken by all map tasks=8055503
>> >                  Total vcore-milliseconds taken by all reduce
>> tasks=10438592
>> >                  Total megabyte-milliseconds taken by all map
>> > tasks=65990680576
>> >                  Total megabyte-milliseconds taken by all reduce
>> > tasks=85512945664
>> >          Map-Reduce Framework
>> >                  Map input records=11556117
>> >                  Map output records=9192906
>> >                  Map output bytes=1171410543
>> >                  Map output materialized bytes=373113013
>> >                  Input split bytes=19740
>> >                  Combine input records=0
>> >                  Combine output records=0
>> >                  Reduce input groups=2022938
>> >                  Reduce shuffle bytes=373113013
>> >                  Reduce input records=9192906
>> >                  Reduce output records=0
>> >                  Spilled Records=18385812
>> >                  Shuffled Maps =19600
>> >                  Failed Shuffles=0
>> >                  Merged Map outputs=19600
>> >                  GC time elapsed (ms)=87351
>> >                  CPU time spent (ms)=7536820
>> >                  Physical memory (bytes) snapshot=159464194048
>> >                  Virtual memory (bytes) snapshot=2564550729728
>> >                  Total committed heap usage (bytes)=602469826560
>> >                  Peak Map Physical memory (bytes)=703803392
>> >                  Peak Map Virtual memory (bytes)=9257803776
>> >                  Peak Reduce Physical memory (bytes)=1188073472
>> >                  Peak Reduce Virtual memory (bytes)=9272107008
>> >          Generator
>> >                  MALFORMED_URL=2
>> >                  SCHEDULE_REJECTED=2363211
>> >          Shuffle Errors
>> >                  BAD_ID=0
>> >                  CONNECTION=0
>> >                  IO_ERROR=0
>> >                  WRONG_LENGTH=0
>> >                  WRONG_MAP=0
>> >                  WRONG_REDUCE=0
>> >          File Input Format Counters
>> >                  Bytes Read=1460058824
>> >          File Output Format Counters
>> >                  Bytes Written=0
>> >
>> > Looks fine, output records match with db_unfetched.
>> >
>> > Second one:
>> > 2025-03-28 07:59:55,038 INFO mapreduce.Job: Counters: 55
>> >          File System Counters
>> >                  FILE: Number of bytes read=24605358
>> >                  FILE: Number of bytes written=95719210
>> >                  FILE: Number of read operations=0
>> >                  FILE: Number of large read operations=0
>> >                  FILE: Number of write operations=0
>> >                  HDFS: Number of bytes read=66892770
>> >                  HDFS: Number of bytes written=62854154
>> >                  HDFS: Number of read operations=580
>> >                  HDFS: Number of large read operations=0
>> >                  HDFS: Number of write operations=8
>> >                  HDFS: Number of bytes read erasure-coded=0
>> >          Job Counters
>> >                  Launched map tasks=140
>> >                  Launched reduce tasks=4
>> >                  Data-local map tasks=136
>> >                  Rack-local map tasks=4
>> >                  Total time spent by all maps in occupied slots
>> (ms)=6395528
>> >                  Total time spent by all reduces in occupied slots
>> > (ms)=106208
>> >                  Total time spent by all map tasks (ms)=3197764
>> >                  Total time spent by all reduce tasks (ms)=53104
>> >                  Total vcore-milliseconds taken by all map tasks=3197764
>> >                  Total vcore-milliseconds taken by all reduce
>> tasks=53104
>> >                  Total megabyte-milliseconds taken by all map
>> > tasks=26196082688
>> >                  Total megabyte-milliseconds taken by all reduce
>> > tasks=435027968
>> >          Map-Reduce Framework
>> >                  Map input records=499940
>> >                  Map output records=499940
>> >                  Map output bytes=94641374
>> >                  Map output materialized bytes=20372766
>> >                  Input split bytes=25200
>> >                  Combine input records=0
>> >                  Combine output records=0
>> >                  Reduce input groups=499914
>> >                  Reduce shuffle bytes=20372766
>> >                  Reduce input records=499940
>> >                  Reduce output records=499940
>> >                  Spilled Records=999880
>> >                  Shuffled Maps =560
>> >                  Failed Shuffles=0
>> >                  Merged Map outputs=560
>> >                  GC time elapsed (ms)=32655
>> >                  CPU time spent (ms)=1206490
>> >                  Physical memory (bytes) snapshot=68925956096
>> >                  Virtual memory (bytes) snapshot=1315029512192
>> >                  Total committed heap usage (bytes)=309841625088
>> >                  Peak Map Physical memory (bytes)=519888896
>> >                  Peak Map Virtual memory (bytes)=9196109824
>> >                  Peak Reduce Physical memory (bytes)=517226496
>> >                  Peak Reduce Virtual memory (bytes)=9233317888
>> >          Shuffle Errors
>> >                  BAD_ID=0
>> >                  CONNECTION=0
>> >                  IO_ERROR=0
>> >                  WRONG_LENGTH=0
>> >                  WRONG_MAP=0
>> >                  WRONG_REDUCE=0
>> >          File Input Format Counters
>> >                  Bytes Read=66867570
>> >          File Output Format Counters
>> >                  Bytes Written=62854154
>> >
>> > Again output/input records matches with topN parameter.
>> >
>> > Finally, fetcher counter:
>> > 2025-03-28 08:57:25,256 INFO mapreduce.Job: Counters: 65
>> >          File System Counters
>> >                  FILE: Number of bytes read=1670508693
>> >                  FILE: Number of bytes written=2454067433
>> >                  FILE: Number of read operations=0
>> >                  FILE: Number of large read operations=0
>> >                  FILE: Number of write operations=0
>> >                  HDFS: Number of bytes read=7590522
>> >                  HDFS: Number of bytes written=778632266
>> >                  HDFS: Number of read operations=156
>> >                  HDFS: Number of large read operations=0
>> >                  HDFS: Number of write operations=840
>> >                  HDFS: Number of bytes read erasure-coded=0
>> >          Job Counters
>> >                  Launched map tasks=4
>> >                  Launched reduce tasks=140
>> >                  Other local map tasks=4
>> >                  Total time spent by all maps in occupied slots
>> (ms)=14400056
>> >                  Total time spent by all reduces in occupied slots
>> > (ms)=758192886
>> >                  Total time spent by all map tasks (ms)=7200028
>> >                  Total time spent by all reduce tasks (ms)=379096443
>> >                  Total vcore-milliseconds taken by all map tasks=7200028
>> >                  Total vcore-milliseconds taken by all reduce
>> tasks=379096443
>> >                  Total megabyte-milliseconds taken by all map
>> > tasks=58982629376
>> >                  Total megabyte-milliseconds taken by all reduce
>> > tasks=3105558061056
>> >          Map-Reduce Framework
>> >                  Map input records=59533
>> >                  Map output records=106816
>> >                  Map output bytes=3132357328
>> >                  Map output materialized bytes=743594506
>> >                  Input split bytes=636
>> >                  Combine input records=0
>> >                  Combine output records=0
>> >                  Reduce input groups=54834
>> >                  Reduce shuffle bytes=743594506
>> >                  Reduce input records=106816
>> >                  Reduce output records=106816
>> >                  Spilled Records=352386
>> >                  Shuffled Maps =560
>> >                  Failed Shuffles=0
>> >                  Merged Map outputs=560
>> >                  GC time elapsed (ms)=35203
>> >                  CPU time spent (ms)=9098390
>> >                  Physical memory (bytes) snapshot=226334003200
>> >                  Virtual memory (bytes) snapshot=1317038624768
>> >                  Total committed heap usage (bytes)=309841625088
>> >                  Peak Map Physical memory (bytes)=1892413440
>> >                  Peak Map Virtual memory (bytes)=9298661376
>> >                  Peak Reduce Physical memory (bytes)=1637662720
>> >                  Peak Reduce Virtual memory (bytes)=9212592128
>> >          FetcherStatus
>> >                  access_denied=98
>> >                  bytes_downloaded=3062491817
>> >                  exception=1302
>> >                  gone=26
>> >                  moved=2598
>> >                  notfound=723
>> >                  redirect_count_exceeded=146
>> >                  robots_denied=1145
>> >                  robots_denied_maxcrawldelay=16
>> >                  success=47650
>> >                  temp_moved=1432
>> >          Shuffle Errors
>> >                  BAD_ID=0
>> >                  CONNECTION=0
>> >                  IO_ERROR=0
>> >                  WRONG_LENGTH=0
>> >                  WRONG_MAP=0
>> >                  WRONG_REDUCE=0
>> >          File Input Format Counters
>> >                  Bytes Read=7589886
>> >          File Output Format Counters
>> >                  Bytes Written=778632266
>> > 2025-03-28 08:57:25,258 INFO fetcher.Fetcher: Fetcher: finished,
>> elapsed:
>> > 3444497 ms
>> >
>> > Input/output records do not match with previous steps. What do I miss?
>> >
>> > Best,
>> > Maciej
>> >
>> > pt., 28 mar 2025 o 08:54 Sebastian Nagel <[email protected]
>> .invalid>
>> > napisał(a):
>> >
>> >> Hi Maciek,
>> >>
>> >> there are multiple configurations which set a limit on the items
>> >> during fetch list generations.
>> >>
>> >> - topN (Ok, it's obviously not the reason)
>> >>
>> >> - a limit per host is defined by the property generate.max.count
>> >>     - default is -1 (no limit)
>> >>     - eventually you want to set a limit per host, in order to avoid
>> that
>> >>       a single host with overlong fetch lists slows down the overall
>> >> crawling
>> >>     - by generate.count.mode this limit can be applied per registered
>> >>       domain or IP address
>> >>
>> >> - generate.min.score (default: 0.0): only CrawlDatum items with a
>> higher
>> >>     score are put into fetch lists
>> >>
>> >> - the fetch scheduling: re-fetch pages after a certain amount of time
>> >>     (default: 30 days), also wait 1 day for retrying a page which
>> failed
>> >>     to fetch with an error (not a 404)
>> >>
>> >>
>> >> Running the CrawlDb statistics
>> >>
>> >>     bin/nutch readdb crawldb -stats
>> >>
>> >> shows the number of items per status, retry count, the distribution
>> >> of scores and fetch intervals.
>> >>
>> >>
>> >>   > should generate a segment with 500k pages to fetch.
>> >>
>> >> Not necessarily, see above. The final size of the fetch list
>> >> is shown by the counter
>> >>     Reduce output records=NNN
>> >> Note: the second appearance of it in the generator log because of
>> >> NUTCH-3059.
>> >>
>> >>   > fetches only around 100k pages.
>> >>
>> >> The fetcher counters also show how many items are skipped because of
>> the
>> >> fetcher timelimit and alike.
>> >>
>> >>
>> >> Let us know whether you need more information. If possible please
>> share the
>> >> CrawlDb statistics or the generator and fetcher counters. It might help
>> >> to find the reason.
>> >>
>> >> Best,
>> >> Sebastian
>> >>
>> >> On 3/27/25 18:56, Maciek Puzianowski wrote:
>> >>> Hi,
>> >>> I have a problem with topN value in Apache Nutch.
>> >>> I have 8 million+ db_unfetched pages in crawldb. I use crawl script
>> with
>> >>> following command:
>> >>> bin/crawl -i --num-fetchers 4 --num-tasks 45 --num-threads 20
>> >>> --size-fetchlist 500000 /nutch/crawl 1
>> >>> --size-fetchlist parameter is the topN for generate method, meaning
>> that
>> >> it
>> >>> should generate a segment with 500k pages to fetch. However, the
>> fetcher
>> >>> fetches only around 100k pages. Also I get around 1 million
>> >>> SCHEDULE_REJECTED counter in generate method, but I think its just
>> pages
>> >>> that I have already fetched.
>> >>>
>> >>> I have checked url filters and they affect only few pages.
>> >>>
>> >>> What can be causing such issue with such a big difference?
>> >>>
>> >>> Best,
>> >>> Maciej
>> >>>
>> >>
>> >>
>> >
>>
>>

Re: Generator or fetcher does not get topN pages

Reply via email to