Copilot commented on code in PR #905:
URL: https://github.com/apache/nutch/pull/905#discussion_r2900557404
##########
src/java/org/apache/nutch/indexer/IndexerOutputFormat.java:
##########
@@ -40,32 +33,67 @@ public RecordWriter<Text, NutchIndexAction> getRecordWriter(
Configuration conf = context.getConfiguration();
final IndexWriters writers = IndexWriters.get(conf);
- String name = getUniqueFile(context, "part", "");
- writers.open(conf, name);
+ // open writers (no temporary file output anymore)
+ writers.open(conf, "index");
Review Comment:
`writers.open(conf, "index")` uses a constant name, while the previous
implementation used a task-unique name via `getUniqueFile(...)`. If any
`IndexWriter` implementation uses this `name` to create per-task resources
(files, temp dirs, IDs, etc.), a constant can cause collisions between
reducers/attempts. Consider incorporating the task attempt ID (or task ID) into
the name to preserve the prior uniqueness without relying on `FileOutputFormat`.
```suggestion
String indexName = "index-" + context.getTaskAttemptID().toString();
writers.open(conf, indexName);
```
##########
src/java/org/apache/nutch/indexer/IndexerOutputFormat.java:
##########
@@ -40,32 +33,67 @@ public RecordWriter<Text, NutchIndexAction> getRecordWriter(
Configuration conf = context.getConfiguration();
final IndexWriters writers = IndexWriters.get(conf);
- String name = getUniqueFile(context, "part", "");
- writers.open(conf, name);
+ // open writers (no temporary file output anymore)
+ writers.open(conf, "index");
LOG.info(writers.describe());
return new RecordWriter<Text, NutchIndexAction>() {
@Override
public void close(TaskAttemptContext context) throws IOException {
- // do the commits once and for all the reducers in one go
- boolean noCommit = conf
- .getBoolean(IndexerMapReduce.INDEXER_NO_COMMIT, false);
+
+ boolean noCommit =
+ conf.getBoolean(IndexerMapReduce.INDEXER_NO_COMMIT, false);
+
if (!noCommit) {
writers.commit();
}
+
writers.close();
}
@Override
public void write(Text key, NutchIndexAction indexAction)
throws IOException {
+
if (indexAction.action == NutchIndexAction.ADD) {
writers.write(indexAction.doc);
+
} else if (indexAction.action == NutchIndexAction.DELETE) {
writers.delete(key.toString());
}
}
};
}
-}
+
+ @Override
+ public void checkOutputSpecs(JobContext context)
+ throws IOException, InterruptedException {
+ // No output specs required since we don't write files
+ }
+
+ @Override
+ public OutputCommitter getOutputCommitter(TaskAttemptContext context)
+ throws IOException, InterruptedException {
+
+ return new OutputCommitter() {
+
+ @Override
+ public void setupJob(JobContext jobContext) {}
+
+ @Override
+ public void setupTask(TaskAttemptContext taskContext) {}
+
+ @Override
+ public boolean needsTaskCommit(TaskAttemptContext taskContext) {
+ return false;
+ }
+
+ @Override
+ public void commitTask(TaskAttemptContext taskContext) {}
+
+ @Override
+ public void abortTask(TaskAttemptContext taskContext) {}
+ };
Review Comment:
The anonymous no-op `OutputCommitter` here largely duplicates Hadoop's
built-in null output commit semantics (see `NullOutputFormat` usage elsewhere,
e.g. `CleaningJob`). To reduce maintenance risk across Hadoop upgrades (API
surface changes) and keep behavior consistent, consider extending
`org.apache.hadoop.mapreduce.lib.output.NullOutputFormat` and only overriding
`getRecordWriter`, or otherwise reuse a shared no-op committer implementation
instead of defining a new anonymous one.
##########
src/java/org/apache/nutch/indexer/IndexerOutputFormat.java:
##########
@@ -1,37 +1,30 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
+ * contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
*/
Review Comment:
The ASF license header has been truncated (missing the Apache License 2.0
URL and the standard warranty/limitation text). This file should keep the full
ASF boilerplate header consistent with the rest of the codebase to avoid
licensing/compliance issues.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]