Hi Maciek,
> Input/output records do not match previous steps. What am I missing?
Indeed:
- generator
> Launched reduce tasks=4
> Reduce output records=499940
- fetcher
> Job Counters
> Launched map tasks=4
> Map input records=59533
The number of fetch lists (fetcher map tasks) matches,
but generator output records and fetcher input records do not.
No idea why. I'd first count items in the segment:
bin/nutch readseg -list ...
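For example (the segment path is hypothetical, use the one created in
this crawl cycle):

  bin/nutch readseg -list /nutch/crawl/segments/20250328084500

If I remember the output right, the GENERATED column of the listing
should equal the generator's "Reduce output records".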
> PS I have run one more crawl and generator gives me
> 2025-03-28 09:44:49,629 WARN crawl.Generator: Generator: 0 records selected for fetching, exiting ...
> Generate returned 1 (no new segments created)
> even though stats still have db_unfetched pages:
Also no idea why this happens. Any hints in the last generator log?
One minor point:
> 2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.05: 0.0
This means that 5% of the records have score == 0.0.
With the default generate.min.score, they're never fetched.
These are likely redirects, because of NUTCH-2749. I definitely should fix it!
There's a trivial fix / work-around: comment out the line
  datum.setScore(0.0f);
in OPICScoringFilter's initialScore method.
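A minimal sketch of the patched method (assuming the method body of
recent Nutch versions -- please check your version's source before
editing):

  // in src/plugin/scoring-opic/.../OPICScoringFilter.java (sketch, untested)
  public void initialScore(Text url, CrawlDatum datum)
      throws ScoringFilterException {
    // datum.setScore(0.0f); // commented out: don't force the initial
    //                       // score of newly added records to 0.0
  }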
Also, you could set http.redirect.max to a reasonable value (3-5), so that
redirects are followed directly in the fetcher.
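For example, in nutch-site.xml (the value 3 is just an illustration):

  <property>
    <name>http.redirect.max</name>
    <value>3</value>
  </property>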
Best,
Sebastian
On 3/28/25 09:19, Maciek Puzianowski wrote:
Sebastian,
thank you for your quick answer,
I have set generate.max.count to 1,000,000, so I think it shouldn't be a
problem; generate.count.mode is set to the default byHost, as well as
generate.min.score to 0.
It is a fresh start and I have set the refetch time to 90 days, so I
don't blame the pages that were already fetched; I believe they will be
re-fetched within those 90 days. It's the db_unfetched pages that are
bothering me.
I was looking at the statistics you mentioned, but I can't get the
cause from them. I will share them, maybe you will notice something...
so:
readdb -stats:
2025-03-28 09:12:54,544 INFO crawl.CrawlDbReader: Statistics for CrawlDb: /nutch/crawl/crawldb
2025-03-28 09:12:54,545 INFO crawl.CrawlDbReader: TOTAL urls: 11761224
2025-03-28 09:12:54,551 INFO crawl.CrawlDbReader: shortest fetch interval: 91 days, 07:27:11
2025-03-28 09:12:54,552 INFO crawl.CrawlDbReader: avg fetch interval: 912 days, 21:15:04
2025-03-28 09:12:54,552 INFO crawl.CrawlDbReader: longest fetch interval: 24855 days, 03:14:07
2025-03-28 09:12:54,563 INFO crawl.CrawlDbReader: earliest fetch time: Wed Mar 26 10:12:00 CET 2025
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: avg of fetch times: Mon Apr 14 23:27:00 CEST 2025
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: latest fetch time: Tue Aug 12 09:01:00 CEST 2025
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 0: 11686131
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 1: 37311
2025-03-28 09:12:54,564 INFO crawl.CrawlDbReader: retry 2: 37782
2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.01: 0.0
2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.05: 0.0
2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.1: 2.8397222762258305E-9
2025-03-28 09:12:54,580 INFO crawl.CrawlDbReader: score quantile 0.2: 1.5968612389262938E-7
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.25: 3.0930187090863353E-7
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.3: 5.316786869545881E-7
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.4: 1.1587939392930727E-6
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.5: 2.5814283409185258E-6
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.6: 5.908654160550735E-6
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.7: 1.7618150654886576E-5
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.75: 2.9190806586558215E-5
2025-03-28 09:12:54,581 INFO crawl.CrawlDbReader: score quantile 0.8: 8.921349358020518E-5
2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.9: 5.628835416129725E-4
2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.95: 0.007623361698306406
2025-03-28 09:12:54,582 INFO crawl.CrawlDbReader: score quantile 0.99: 0.9327549852155366
2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: min score: 0.0
2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: avg score: 0.015449582458424396
2025-03-28 09:12:54,583 INFO crawl.CrawlDbReader: max score: 183.34019470214844
2025-03-28 09:12:54,602 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 9406711
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 2 (db_fetched): 1978175
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 3 (db_gone): 98857
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 69608
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 169813
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 6045
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: status 7 (db_duplicate): 32015
2025-03-28 09:12:54,603 INFO crawl.CrawlDbReader: CrawlDb statistics: done
Please keep in mind that "longest fetch interval: 24855 days" is not
an issue, as I have a custom plugin that sets it to prevent certain
pages from ever being fetched again.
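For reference: 24855 days, 03:14:07 is exactly Integer.MAX_VALUE
seconds, so the plugin essentially does something like this with the
CrawlDatum (a hypothetical sketch, not the actual plugin code):

  // effectively "never" re-fetch this page again
  datum.setFetchInterval(Integer.MAX_VALUE); // 2147483647 s ~= 24855 days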
Generate counters:
First one:
2025-03-28 07:59:06,502 INFO mapreduce.Job: Counters: 57
File System Counters
FILE: Number of bytes read=259336882
FILE: Number of bytes written=731826855
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1460078564
HDFS: Number of bytes written=66867570
HDFS: Number of read operations=1680
HDFS: Number of large read operations=0
HDFS: Number of write operations=561
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=140
Launched reduce tasks=140
Data-local map tasks=137
Rack-local map tasks=3
Total time spent by all maps in occupied slots (ms)=16111006
Total time spent by all reduces in occupied slots (ms)=20877184
Total time spent by all map tasks (ms)=8055503
Total time spent by all reduce tasks (ms)=10438592
Total vcore-milliseconds taken by all map tasks=8055503
Total vcore-milliseconds taken by all reduce tasks=10438592
Total megabyte-milliseconds taken by all map tasks=65990680576
Total megabyte-milliseconds taken by all reduce tasks=85512945664
Map-Reduce Framework
Map input records=11556117
Map output records=9192906
Map output bytes=1171410543
Map output materialized bytes=373113013
Input split bytes=19740
Combine input records=0
Combine output records=0
Reduce input groups=2022938
Reduce shuffle bytes=373113013
Reduce input records=9192906
Reduce output records=0
Spilled Records=18385812
Shuffled Maps =19600
Failed Shuffles=0
Merged Map outputs=19600
GC time elapsed (ms)=87351
CPU time spent (ms)=7536820
Physical memory (bytes) snapshot=159464194048
Virtual memory (bytes) snapshot=2564550729728
Total committed heap usage (bytes)=602469826560
Peak Map Physical memory (bytes)=703803392
Peak Map Virtual memory (bytes)=9257803776
Peak Reduce Physical memory (bytes)=1188073472
Peak Reduce Virtual memory (bytes)=9272107008
Generator
MALFORMED_URL=2
SCHEDULE_REJECTED=2363211
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1460058824
File Output Format Counters
Bytes Written=0
Looks fine, the output records roughly match db_unfetched.
Second one:
2025-03-28 07:59:55,038 INFO mapreduce.Job: Counters: 55
File System Counters
FILE: Number of bytes read=24605358
FILE: Number of bytes written=95719210
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=66892770
HDFS: Number of bytes written=62854154
HDFS: Number of read operations=580
HDFS: Number of large read operations=0
HDFS: Number of write operations=8
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=140
Launched reduce tasks=4
Data-local map tasks=136
Rack-local map tasks=4
Total time spent by all maps in occupied slots (ms)=6395528
Total time spent by all reduces in occupied slots (ms)=106208
Total time spent by all map tasks (ms)=3197764
Total time spent by all reduce tasks (ms)=53104
Total vcore-milliseconds taken by all map tasks=3197764
Total vcore-milliseconds taken by all reduce tasks=53104
Total megabyte-milliseconds taken by all map tasks=26196082688
Total megabyte-milliseconds taken by all reduce tasks=435027968
Map-Reduce Framework
Map input records=499940
Map output records=499940
Map output bytes=94641374
Map output materialized bytes=20372766
Input split bytes=25200
Combine input records=0
Combine output records=0
Reduce input groups=499914
Reduce shuffle bytes=20372766
Reduce input records=499940
Reduce output records=499940
Spilled Records=999880
Shuffled Maps =560
Failed Shuffles=0
Merged Map outputs=560
GC time elapsed (ms)=32655
CPU time spent (ms)=1206490
Physical memory (bytes) snapshot=68925956096
Virtual memory (bytes) snapshot=1315029512192
Total committed heap usage (bytes)=309841625088
Peak Map Physical memory (bytes)=519888896
Peak Map Virtual memory (bytes)=9196109824
Peak Reduce Physical memory (bytes)=517226496
Peak Reduce Virtual memory (bytes)=9233317888
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=66867570
File Output Format Counters
Bytes Written=62854154
Again, the output/input records match the topN parameter.
Finally, the fetcher counters:
2025-03-28 08:57:25,256 INFO mapreduce.Job: Counters: 65
File System Counters
FILE: Number of bytes read=1670508693
FILE: Number of bytes written=2454067433
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=7590522
HDFS: Number of bytes written=778632266
HDFS: Number of read operations=156
HDFS: Number of large read operations=0
HDFS: Number of write operations=840
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=4
Launched reduce tasks=140
Other local map tasks=4
Total time spent by all maps in occupied slots (ms)=14400056
Total time spent by all reduces in occupied slots (ms)=758192886
Total time spent by all map tasks (ms)=7200028
Total time spent by all reduce tasks (ms)=379096443
Total vcore-milliseconds taken by all map tasks=7200028
Total vcore-milliseconds taken by all reduce tasks=379096443
Total megabyte-milliseconds taken by all map tasks=58982629376
Total megabyte-milliseconds taken by all reduce tasks=3105558061056
Map-Reduce Framework
Map input records=59533
Map output records=106816
Map output bytes=3132357328
Map output materialized bytes=743594506
Input split bytes=636
Combine input records=0
Combine output records=0
Reduce input groups=54834
Reduce shuffle bytes=743594506
Reduce input records=106816
Reduce output records=106816
Spilled Records=352386
Shuffled Maps =560
Failed Shuffles=0
Merged Map outputs=560
GC time elapsed (ms)=35203
CPU time spent (ms)=9098390
Physical memory (bytes) snapshot=226334003200
Virtual memory (bytes) snapshot=1317038624768
Total committed heap usage (bytes)=309841625088
Peak Map Physical memory (bytes)=1892413440
Peak Map Virtual memory (bytes)=9298661376
Peak Reduce Physical memory (bytes)=1637662720
Peak Reduce Virtual memory (bytes)=9212592128
FetcherStatus
access_denied=98
bytes_downloaded=3062491817
exception=1302
gone=26
moved=2598
notfound=723
redirect_count_exceeded=146
robots_denied=1145
robots_denied_maxcrawldelay=16
success=47650
temp_moved=1432
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=7589886
File Output Format Counters
Bytes Written=778632266
2025-03-28 08:57:25,258 INFO fetcher.Fetcher: Fetcher: finished, elapsed: 3444497 ms
Input/output records do not match previous steps. What am I missing?
Best,
Maciej
On Fri, 28 Mar 2025 at 08:54 Sebastian Nagel <[email protected].invalid> wrote:
Hi Maciek,
there are multiple configurations which set a limit on the items
during fetch list generation (see the config sketch after this list):
- topN (OK, it's obviously not the reason)
- a limit per host is defined by the property generate.max.count
  - default is -1 (no limit)
  - eventually you want to set a limit per host, in order to avoid
    that a single host with overlong fetch lists slows down the
    overall crawling
  - by generate.count.mode this limit can be applied per registered
    domain or IP address
- generate.min.score (default: 0.0): only CrawlDatum items with a
  higher score are put into fetch lists
- the fetch scheduling: re-fetch pages after a certain amount of time
  (default: 30 days), also wait 1 day for retrying a page which
  failed to fetch with an error (not a 404)
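An illustrative sketch of these properties for nutch-site.xml (example
values, not recommendations -- see the nutch-default.xml of your
version for the authoritative descriptions):

  <property>
    <name>generate.max.count</name>
    <value>50000</value> <!-- hypothetical per-host limit -->
  </property>
  <property>
    <name>generate.count.mode</name>
    <value>host</value> <!-- or domain / ip -->
  </property>
  <property>
    <name>generate.min.score</name>
    <value>0.0</value>
  </property>
  <property>
    <name>db.fetch.interval.default</name>
    <value>2592000</value> <!-- 30 days in seconds -->
  </property>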
Running the CrawlDb statistics
bin/nutch readdb crawldb -stats
shows the number of items per status, retry count, the distribution
of scores and fetch intervals.
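Individual records can be inspected as well, e.g. to dump the
unfetched ones (the exact readdb options may vary between Nutch
versions):

  bin/nutch readdb crawldb -dump dump_unfetched -status db_unfetched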
> should generate a segment with 500k pages to fetch.
Not necessarily, see above. The final size of the fetch list
is shown by the counter
Reduce output records=NNN
Note: it appears a second time in the generator log because of NUTCH-3059.
> fetches only around 100k pages.
The fetcher counters also show how many items are skipped because of the
fetcher timelimit and the like.
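If I remember correctly, the crawl script caps every fetch cycle via
fetcher.timelimit.mins (180 minutes by default). A sketch of raising
the limit, assuming your crawl script version forwards -D properties:

  bin/crawl -i -D fetcher.timelimit.mins=360 --num-fetchers 4 --num-tasks 45 --num-threads 20 --size-fetchlist 500000 /nutch/crawl 1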
Let us know whether you need more information. If possible please share the
CrawlDb statistics or the generator and fetcher counters. It might help
to find the reason.
Best,
Sebastian
On 3/27/25 18:56, Maciek Puzianowski wrote:
Hi,
I have a problem with the topN value in Apache Nutch.
I have 8 million+ db_unfetched pages in the CrawlDb. I use the crawl
script with the following command:
bin/crawl -i --num-fetchers 4 --num-tasks 45 --num-threads 20 --size-fetchlist 500000 /nutch/crawl 1
The --size-fetchlist parameter is the topN for the generate step,
meaning that it should generate a segment with 500k pages to fetch.
However, the fetcher fetches only around 100k pages. Also, I get a
SCHEDULE_REJECTED counter of around 1 million in the generate step,
but I think it's just the pages that I have already fetched.
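For reference, if I understand the crawl script right, the generate
step roughly corresponds to:

  bin/nutch generate /nutch/crawl/crawldb /nutch/crawl/segments -topN 500000 -numFetchers 4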
I have checked the URL filters and they affect only a few pages.
What could be causing such a big difference?
Best,
Maciej