sebastian-nagel commented on code in PR #906:
URL: https://github.com/apache/nutch/pull/906#discussion_r3053657797
##########
src/java/org/apache/nutch/parse/ParseSegment.java:
##########
@@ -98,15 +100,18 @@ public void setup(Mapper<WritableComparable<?>, Content, Text, ParseImpl>.Contex
}
@Override
- public void cleanup(Mapper<WritableComparable<?>, Content, Text, ParseImpl>.Context context)
+ public void cleanup(Mapper<WritableComparable<?>, Content, Text, Writable>.Context context)
throws IOException, InterruptedException {
- // Emit parse latency metrics
- parseLatencyTracker.emitCounters(context);
+ parseLatencyTracker.emitCountAndSumOnly(context);
+ byte[] digestBytes = parseLatencyTracker.toBytes();
+ if (digestBytes.length > 0) {
+ context.write(new Text(NutchMetrics.LATENCY_KEY), new BytesWritable(digestBytes));
Review Comment:
This makes the job fail:
```
2026-04-08 20:32:21,403 INFO mapreduce.Job: Task Id : attempt_1775672934348_0005_m_000000_0, Status : FAILED
Error: java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.Writable, received org.apache.hadoop.io.BytesWritable
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1104)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:728)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
    at org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.cleanup(ParseSegment.java:108)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:148)
```
The solution is to wrap the ParseImpl or the BytesWritable in a
NutchWritable object.
I've already implemented the fix and will push it later; for now I'm
continuing to test.
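
To illustrate why wrapping helps, here is a minimal plain-Java sketch. It uses no Hadoop classes: `Value`, `Bytes`, `Parse`, `Wrapper` and `Collector` are hypothetical stand-ins for `Writable`, `BytesWritable`, `ParseImpl`, `NutchWritable` and the map output collector. It assumes, as the error message above suggests, that the collector compares the runtime class of each value with the configured map output value class for exact equality, so declaring an interface type cannot match any concrete value.

```java
import java.io.IOException;

/** Plain-Java stand-in for the map-output type check. All names are hypothetical. */
public class TypeMismatchSketch {

  /** Stand-in for org.apache.hadoop.io.Writable. */
  interface Value {}

  /** Stand-in for BytesWritable. */
  static final class Bytes implements Value {
    final byte[] data;
    Bytes(byte[] data) { this.data = data; }
  }

  /** Stand-in for ParseImpl. */
  static final class Parse implements Value {
    final String text;
    Parse(String text) { this.text = text; }
  }

  /** Stand-in for a wrapper such as NutchWritable: one concrete class carrying either payload. */
  static final class Wrapper implements Value {
    final Value wrapped;
    Wrapper(Value wrapped) { this.wrapped = wrapped; }
  }

  /** Mimics the collector: the runtime class must equal the declared class exactly. */
  static final class Collector {
    private final Class<? extends Value> valueClass;
    int collected = 0;

    Collector(Class<? extends Value> valueClass) { this.valueClass = valueClass; }

    void collect(Value value) throws IOException {
      if (value.getClass() != valueClass) {
        throw new IOException("Type mismatch in value from map: expected "
            + valueClass.getName() + ", received " + value.getClass().getName());
      }
      collected++;
    }
  }

  /** Returns true if the collector accepts the value, false on a type mismatch. */
  static boolean accepts(Collector collector, Value value) {
    try {
      collector.collect(value);
      return true;
    } catch (IOException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Declaring the interface as the value class rejects every concrete value.
    Collector declaredAsInterface = new Collector(Value.class);
    System.out.println(accepts(declaredAsInterface, new Bytes(new byte[] {1}))); // false

    // Declaring the wrapper class lets both payload types pass once wrapped.
    Collector declaredAsWrapper = new Collector(Wrapper.class);
    System.out.println(accepts(declaredAsWrapper, new Wrapper(new Bytes(new byte[] {1})))); // true
    System.out.println(accepts(declaredAsWrapper, new Wrapper(new Parse("text")))); // true
  }
}
```

In the real job this would presumably also mean declaring the wrapper class as the map output value class and unwrapping the payload on the consuming side; the sketch does not show that part.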
##########
src/java/org/apache/nutch/indexer/IndexingJob.java:
##########
@@ -155,6 +159,25 @@ public void index(Path crawlDb, Path linkDb, List<Path> segments,
LOG.error(StringUtils.stringifyException(e));
throw e;
}
+ Path latencyDir = new Path(tmp, "_latency");
+ FileSystem fs = tmp.getFileSystem(conf);
+ if (fs.exists(latencyDir)) {
+ try (Job mergeJob = IndexerMapReduce.createLatencyMergeJob(conf, latencyDir)) {
+ FileOutputFormat.setOutputPath(mergeJob, new Path(tmp, "_latency_merge_out"));
+ boolean mergeSuccess = mergeJob.waitForCompletion(true);
Review Comment:
Or when running on a single-node cluster (indexing to Solr):
```
2026-04-08 21:50:01,953 ERROR indexer.IndexingJob: Indexer: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:9000/user/wastl/tmp_1775677717095--1619189953/_latency
```
This only affects the latency-merge job.
##########
src/java/org/apache/nutch/indexer/IndexingJob.java:
##########
@@ -155,6 +159,25 @@ public void index(Path crawlDb, Path linkDb, List<Path> segments,
LOG.error(StringUtils.stringifyException(e));
throw e;
}
+ Path latencyDir = new Path(tmp, "_latency");
+ FileSystem fs = tmp.getFileSystem(conf);
+ if (fs.exists(latencyDir)) {
+ try (Job mergeJob = IndexerMapReduce.createLatencyMergeJob(conf, latencyDir)) {
+ FileOutputFormat.setOutputPath(mergeJob, new Path(tmp, "_latency_merge_out"));
+ boolean mergeSuccess = mergeJob.waitForCompletion(true);
Review Comment:
```
2026-04-08 21:38:08,406 ERROR o.a.n.i.IndexingJob [main] Indexer: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/mnt/data/wastl/proj/crawler/nutch/test/tmp_1775677086987-2006747810/_latency
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:342)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:281)
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:445)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:311)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:328)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:201)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1677)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1674)
    at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1674)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1695)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:167)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:320)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:329)
```
when running
```
bin/nutch index -Dplugin.includes='indexer-dummy|index-(basic|more)' -nocrawldb /path/to/segments/20260408205641/
```