[
https://issues.apache.org/jira/browse/NUTCH-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18072080#comment-18072080
]
ASF GitHub Bot commented on NUTCH-3162:
---------------------------------------
sebastian-nagel commented on code in PR #906:
URL: https://github.com/apache/nutch/pull/906#discussion_r3053657797
##########
src/java/org/apache/nutch/parse/ParseSegment.java:
##########
@@ -98,15 +100,18 @@ public void setup(Mapper<WritableComparable<?>, Content,
Text, ParseImpl>.Contex
}
@Override
- public void cleanup(Mapper<WritableComparable<?>, Content, Text,
ParseImpl>.Context context)
+ public void cleanup(Mapper<WritableComparable<?>, Content, Text,
Writable>.Context context)
throws IOException, InterruptedException {
- // Emit parse latency metrics
- parseLatencyTracker.emitCounters(context);
+ parseLatencyTracker.emitCountAndSumOnly(context);
+ byte[] digestBytes = parseLatencyTracker.toBytes();
+ if (digestBytes.length > 0) {
+ context.write(new Text(NutchMetrics.LATENCY_KEY), new
BytesWritable(digestBytes));
Review Comment:
This makes the job fail:
```
2026-04-08 20:32:21,403 INFO mapreduce.Job: Task Id :
attempt_1775672934348_0005_m_000000_0, Status : FAILED
Error: java.io.IOException: Type mismatch in value from map: expected
org.apache.hadoop.io.Writable, received org.apache.hadoop.io.BytesWritable
at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1104)
at
org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:728)
at
org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at
org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.cleanup(ParseSegment.java:108)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:148)
```
The solution is to wrap the ParseImpl or the BytesWritable in a
NutchWritable object.
I've already implemented the fix. I'll push it later, but will continue
testing for now.
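
For illustration only, here is a minimal sketch of why the check rejects the value and why a wrapper avoids the problem. It uses no Hadoop or Nutch classes: `Value`, `ParseValue`, `BytesValue`, `Wrapper` and `collect` are hypothetical stand-ins, and the sketch assumes that the map output collector compares the runtime class of the value for exact equality with the declared output class, which is what the error message above suggests. It does not show the actual NutchWritable API.

```java
import java.io.IOException;

final class WrapperSketch {
  /** Hypothetical stand-in for the common value interface. */
  public interface Value {}

  /** Hypothetical stand-ins for the two concrete value types. */
  public static final class ParseValue implements Value {}
  public static final class BytesValue implements Value {}

  /** Hypothetical single wrapper type that can carry either value. */
  public static final class Wrapper implements Value {
    private final Value wrapped;
    public Wrapper(Value wrapped) { this.wrapped = wrapped; }
    public Value get() { return wrapped; }
  }

  /** Assumed collector check: exact class equality, not instanceof. */
  public static void collect(Class<?> declared, Object value) throws IOException {
    if (value.getClass() != declared) {
      throw new IOException("Type mismatch in value from map: expected "
          + declared.getName() + ", received " + value.getClass().getName());
    }
  }

  /** Returns true if the value passes the check, false if it is rejected. */
  public static boolean accepted(Class<?> declared, Object value) {
    try {
      collect(declared, value);
      return true;
    } catch (IOException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Declaring the interface as output class rejects every concrete value.
    System.out.println(accepted(Value.class, new BytesValue()));
    // Declaring the wrapper class accepts both kinds of payload.
    System.out.println(accepted(Wrapper.class, new Wrapper(new BytesValue())));
    System.out.println(accepted(Wrapper.class, new Wrapper(new ParseValue())));
  }
}
```

Under that assumption, every emitted value has the same concrete class, so the job can declare that one class as its map output value type and unwrap the payload on the reducer side.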
##########
src/java/org/apache/nutch/indexer/IndexingJob.java:
##########
@@ -155,6 +159,25 @@ public void index(Path crawlDb, Path linkDb, List<Path>
segments,
LOG.error(StringUtils.stringifyException(e));
throw e;
}
+ Path latencyDir = new Path(tmp, "_latency");
+ FileSystem fs = tmp.getFileSystem(conf);
+ if (fs.exists(latencyDir)) {
+ try (Job mergeJob = IndexerMapReduce.createLatencyMergeJob(conf,
latencyDir)) {
+ FileOutputFormat.setOutputPath(mergeJob, new Path(tmp,
"_latency_merge_out"));
+ boolean mergeSuccess = mergeJob.waitForCompletion(true);
Review Comment:
Or when running on a single-node cluster (indexing to Solr):
```
2026-04-08 21:50:01,953 ERROR indexer.IndexingJob: Indexer:
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does
not exist:
hdfs://localhost:9000/user/wastl/tmp_1775677717095--1619189953/_latency
```
This only affects the latency-merge job.
##########
src/java/org/apache/nutch/indexer/IndexingJob.java:
##########
@@ -155,6 +159,25 @@ public void index(Path crawlDb, Path linkDb, List<Path>
segments,
LOG.error(StringUtils.stringifyException(e));
throw e;
}
+ Path latencyDir = new Path(tmp, "_latency");
+ FileSystem fs = tmp.getFileSystem(conf);
+ if (fs.exists(latencyDir)) {
+ try (Job mergeJob = IndexerMapReduce.createLatencyMergeJob(conf,
latencyDir)) {
+ FileOutputFormat.setOutputPath(mergeJob, new Path(tmp,
"_latency_merge_out"));
+ boolean mergeSuccess = mergeJob.waitForCompletion(true);
Review Comment:
```
2026-04-08 21:38:08,406 ERROR o.a.n.i.IndexingJob [main] Indexer:
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does
not exist:
file:/mnt/data/wastl/proj/crawler/nutch/test/tmp_1775677086987-2006747810/_latency
at
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:342)
at
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:281)
at
org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
at
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:445)
at
org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:311)
at
org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:328)
at
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:201)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1677)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1674)
at
java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1674)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1695)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:167)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:320)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:329)
```
when running
```
bin/nutch index -Dplugin.includes='indexer-dummy|index-(basic|more)'
-nocrawldb /path/to/segments/20260408205641/
```
> Latency metrics to properly merge data from all threads and tasks
> -----------------------------------------------------------------
>
> Key: NUTCH-3162
> URL: https://issues.apache.org/jira/browse/NUTCH-3162
> Project: Nutch
> Issue Type: Bug
> Components: fetcher, indexer, parser
> Affects Versions: 1.22
> Reporter: Sebastian Nagel
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.23
>
>
> The latency metrics (NUTCH-3134) have two issues:
> 1. Only the data from one thread is used if a tool is multi-threaded.
> That's definitely the case for Fetcher. The "emitCounters" methods need to
> increment the counter values instead of calling "setValue". However, this is
> not the correct approach for the percentiles, see also the next point.
> 2. If running in full cluster mode with multiple parallel tasks, the task
> counters are summed up to the job counter value. However, the values of the
> latency percentiles then turn out to be too high.
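
The second point can be illustrated with a small, self-contained sketch. It is not Nutch code: the class and method names are hypothetical, the sample latencies are made up for the example, and an exact nearest-rank percentile stands in for whatever estimator the latency tracker uses. It compares summing per-task percentiles (the effect of adding up task counters) with computing the percentile over the merged samples.

```java
import java.util.Arrays;

final class PercentileMergeSketch {
  /** Nearest-rank percentile of the samples, for p in (0, 100]. */
  public static long percentile(long[] samples, double p) {
    long[] sorted = samples.clone();
    Arrays.sort(sorted);
    int rank = (int) Math.ceil(p / 100.0 * sorted.length);
    return sorted[Math.max(rank, 1) - 1];
  }

  /** What summing task counters amounts to: one percentile per task, added up. */
  public static long summedPercentile(long[][] tasks, double p) {
    long sum = 0;
    for (long[] task : tasks) {
      sum += percentile(task, p);
    }
    return sum;
  }

  /** Percentile over the samples of all tasks merged together. */
  public static long mergedPercentile(long[][] tasks, double p) {
    long[] all = Arrays.stream(tasks).flatMapToLong(t -> Arrays.stream(t)).toArray();
    return percentile(all, p);
  }

  public static void main(String[] args) {
    // Made-up latencies (ms) of two parallel tasks.
    long[][] tasks = { {10, 20, 30, 40}, {15, 25, 35, 45} };
    System.out.println("summed p50: " + summedPercentile(tasks, 50)); // 20 + 25 = 45
    System.out.println("merged p50: " + mergedPercentile(tasks, 50)); // 25
  }
}
```

With these two tasks the summed median is 45 ms although no task has a median above 25 ms, while the median of the merged samples is 25 ms. Counts and sums can safely be added across tasks, but percentiles need the underlying distributions to be merged first.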