[jira] [Commented] (NUTCH-3162) Latency metrics to properly merge data from all threads and tasks

ASF GitHub Bot (Jira) Wed, 08 Apr 2026 12:59:57 -0700


    [ 
https://issues.apache.org/jira/browse/NUTCH-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18072080#comment-18072080
 ]


ASF GitHub Bot commented on NUTCH-3162:
---------------------------------------

sebastian-nagel commented on code in PR #906:
URL: https://github.com/apache/nutch/pull/906#discussion_r3053657797


##########
src/java/org/apache/nutch/parse/ParseSegment.java:
##########
@@ -98,15 +100,18 @@ public void setup(Mapper<WritableComparable<?>, Content, Text, ParseImpl>.Contex
     }
 
     @Override
-    public void cleanup(Mapper<WritableComparable<?>, Content, Text, ParseImpl>.Context context)
+    public void cleanup(Mapper<WritableComparable<?>, Content, Text, Writable>.Context context)
         throws IOException, InterruptedException {
-      // Emit parse latency metrics
-      parseLatencyTracker.emitCounters(context);
+      parseLatencyTracker.emitCountAndSumOnly(context);
+      byte[] digestBytes = parseLatencyTracker.toBytes();
+      if (digestBytes.length > 0) {
+        context.write(new Text(NutchMetrics.LATENCY_KEY), new BytesWritable(digestBytes));

Review Comment:
   This makes the job fail:
   ```
   2026-04-08 20:32:21,403 INFO mapreduce.Job: Task Id : attempt_1775672934348_0005_m_000000_0, Status : FAILED
   Error: java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.Writable, received org.apache.hadoop.io.BytesWritable
           at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1104)
           at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:728)
           at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
           at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
           at org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.cleanup(ParseSegment.java:108)
           at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:148)
   ```
   
   The solution is to wrap the ParseImpl or the BytesWritable into a 
NutchWritable object.
   
   I've already implemented the fix. Will push it later, but continue testing 
now.
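
The failure above comes from the map output collector comparing the runtime class of each value with the configured map output value class by identity, so an instance of an implementing class or subclass is rejected. The following self-contained sketch mimics that check with stand-in types. `Writable`, `BytesWritable`, `ParseImpl` and `WrapperWritable` here are simplified placeholders, not the Hadoop or Nutch classes, and `WrapperWritable` only plays the role a NutchWritable would play:

```java
import java.io.IOException;

public class TypeMismatchSketch {
  /** Simplified stand-in for org.apache.hadoop.io.Writable (assumption). */
  interface Writable {}

  /** Placeholder value types, not the real Hadoop/Nutch classes. */
  static class BytesWritable implements Writable {}
  static class ParseImpl implements Writable {}

  /** Hypothetical wrapper in the role of NutchWritable: one concrete class for all values. */
  static class WrapperWritable implements Writable {
    private final Writable inner;
    WrapperWritable(Writable inner) { this.inner = inner; }
    Writable get() { return inner; }
  }

  /** Mimics the collector's exact-class comparison (no instanceof, no subclass match). */
  static void collect(Class<?> declaredValueClass, Writable value) throws IOException {
    if (value.getClass() != declaredValueClass) {
      throw new IOException("Type mismatch in value from map: expected "
          + declaredValueClass.getName() + ", received " + value.getClass().getName());
    }
  }

  /** Returns true if the value passes the check, false if it is rejected. */
  static boolean accepted(Class<?> declaredValueClass, Writable value) {
    try {
      collect(declaredValueClass, value);
      return true;
    } catch (IOException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Declaring the interface as value class rejects every concrete value.
    System.out.println(accepted(Writable.class, new BytesWritable()));
    // Declaring the wrapper class and wrapping both value types is accepted.
    System.out.println(accepted(WrapperWritable.class, new WrapperWritable(new BytesWritable())));
    System.out.println(accepted(WrapperWritable.class, new WrapperWritable(new ParseImpl())));
  }
}
```

With a single wrapper class declared as the map output value class, both the parse output and the serialized latency digest can go through the same `context.write` call, and the consumer unwraps the inner value.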



##########
src/java/org/apache/nutch/indexer/IndexingJob.java:
##########
@@ -155,6 +159,25 @@ public void index(Path crawlDb, Path linkDb, List<Path> segments,
         LOG.error(StringUtils.stringifyException(e));
         throw e;
       }
+      Path latencyDir = new Path(tmp, "_latency");
+      FileSystem fs = tmp.getFileSystem(conf);
+      if (fs.exists(latencyDir)) {
+        try (Job mergeJob = IndexerMapReduce.createLatencyMergeJob(conf, latencyDir)) {
+          FileOutputFormat.setOutputPath(mergeJob, new Path(tmp, "_latency_merge_out"));
+          boolean mergeSuccess = mergeJob.waitForCompletion(true);

Review Comment:
   Or when running on a single-node cluster (indexing to Solr):
   ```
   2026-04-08 21:50:01,953 ERROR indexer.IndexingJob: Indexer: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:9000/user/wastl/tmp_1775677717095--1619189953/_latency
   ```
   
   This only affects the latency-merge job.



##########
src/java/org/apache/nutch/indexer/IndexingJob.java:
##########
@@ -155,6 +159,25 @@ public void index(Path crawlDb, Path linkDb, List<Path> segments,
         LOG.error(StringUtils.stringifyException(e));
         throw e;
       }
+      Path latencyDir = new Path(tmp, "_latency");
+      FileSystem fs = tmp.getFileSystem(conf);
+      if (fs.exists(latencyDir)) {
+        try (Job mergeJob = IndexerMapReduce.createLatencyMergeJob(conf, latencyDir)) {
+          FileOutputFormat.setOutputPath(mergeJob, new Path(tmp, "_latency_merge_out"));
+          boolean mergeSuccess = mergeJob.waitForCompletion(true);

Review Comment:
   ```
   2026-04-08 21:38:08,406 ERROR o.a.n.i.IndexingJob [main] Indexer: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/mnt/data/wastl/proj/crawler/nutch/test/tmp_1775677086987-2006747810/_latency
           at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:342)
           at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:281)
           at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
           at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:445)
           at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:311)
           at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:328)
           at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:201)
           at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1677)
           at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1674)
           at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
           at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
           at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
           at org.apache.hadoop.mapreduce.Job.submit(Job.java:1674)
           at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1695)
           at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:167)
           at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:320)
           at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
           at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:329)
   ```
   when running
   ```
   bin/nutch index -Dplugin.includes='indexer-dummy|index-(basic|more)' -nocrawldb /path/to/segments/20260408205641/
   ```
   





> Latency metrics to properly merge data from all threads and tasks
> -----------------------------------------------------------------
>
>                 Key: NUTCH-3162
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3162
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, indexer, parser
>    Affects Versions: 1.22
>            Reporter: Sebastian Nagel
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 1.23
>
>
> The latency metrics (NUTCH-3134) have two issues:
> 1. Only the data from one thread is used if a tool is multi-threaded. 
> That's definitely the case for Fetcher. The "emitCounters" method needs to 
> increment the counter values instead of calling "setValue". However, this is 
> not the correct approach for the percentiles, see also the next point.
> 2. When running in full cluster mode with multiple parallel tasks, the task 
> counters are summed up to the job counter value. However, the values of the 
> latency percentiles then turn out to be too high.
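
To illustrate the second point: counts and sums can be added across tasks, but percentiles cannot. The sketch below uses two hypothetical tasks with made-up latency samples and a simple nearest-rank percentile over the raw values. That is an assumption chosen for brevity, it is not how the PR represents the data, since the diff above serializes a digest rather than raw samples.

```java
import java.util.Arrays;

public class PercentileMergeSketch {
  /** Nearest-rank percentile (p in 1..100) over a copy of the samples. */
  static long percentile(long[] samples, int p) {
    long[] sorted = samples.clone();
    Arrays.sort(sorted);
    int rank = (int) Math.ceil(p / 100.0 * sorted.length);
    return sorted[Math.max(rank, 1) - 1];
  }

  /** Concatenates the samples of two tasks into one array. */
  static long[] concat(long[] a, long[] b) {
    long[] merged = Arrays.copyOf(a, a.length + b.length);
    System.arraycopy(b, 0, merged, a.length, b.length);
    return merged;
  }

  public static void main(String[] args) {
    // Made-up latencies (ms) for two parallel tasks.
    long[] taskA = {10, 20, 30, 40, 50};
    long[] taskB = {15, 25, 35, 45, 55};

    // What happens when per-task percentile counters are summed up by the job.
    long summed = percentile(taskA, 95) + percentile(taskB, 95);
    // What the job-level percentile should be: computed over all samples.
    long merged = percentile(concat(taskA, taskB), 95);

    System.out.println("sum of per-task p95: " + summed); // 105
    System.out.println("p95 of merged data:  " + merged); // 55
  }
}
```

The summed value grows with the number of tasks, while the percentile of the merged data stays within the range of observed latencies. This is why the per-task distributions need to be merged before the percentiles are computed.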



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
