[
https://issues.apache.org/jira/browse/NUTCH-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063807#comment-18063807
]
ASF GitHub Bot commented on NUTCH-1446:
---------------------------------------
Copilot commented on code in PR #905:
URL: https://github.com/apache/nutch/pull/905#discussion_r2900557404
##########
src/java/org/apache/nutch/indexer/IndexerOutputFormat.java:
##########
@@ -40,32 +33,67 @@ public RecordWriter<Text, NutchIndexAction> getRecordWriter(
Configuration conf = context.getConfiguration();
final IndexWriters writers = IndexWriters.get(conf);
- String name = getUniqueFile(context, "part", "");
- writers.open(conf, name);
+ // open writers (no temporary file output anymore)
+ writers.open(conf, "index");
Review Comment:
`writers.open(conf, "index")` uses a constant name, while the previous
implementation used a task-unique name via `getUniqueFile(...)`. If any
`IndexWriter` implementation uses this `name` to create per-task resources
(files, temp dirs, IDs, etc.), a constant can cause collisions between
reducers/attempts. Consider incorporating the task attempt ID (or task ID) into
the name to preserve the prior uniqueness without relying on `FileOutputFormat`.
```suggestion
String indexName = "index-" + context.getTaskAttemptID().toString();
writers.open(conf, indexName);
```
##########
src/java/org/apache/nutch/indexer/IndexerOutputFormat.java:
##########
@@ -40,32 +33,67 @@ public RecordWriter<Text, NutchIndexAction> getRecordWriter(
Configuration conf = context.getConfiguration();
final IndexWriters writers = IndexWriters.get(conf);
- String name = getUniqueFile(context, "part", "");
- writers.open(conf, name);
+ // open writers (no temporary file output anymore)
+ writers.open(conf, "index");
LOG.info(writers.describe());
return new RecordWriter<Text, NutchIndexAction>() {
@Override
public void close(TaskAttemptContext context) throws IOException {
- // do the commits once and for all the reducers in one go
- boolean noCommit = conf
- .getBoolean(IndexerMapReduce.INDEXER_NO_COMMIT, false);
+
+ boolean noCommit =
+ conf.getBoolean(IndexerMapReduce.INDEXER_NO_COMMIT, false);
+
if (!noCommit) {
writers.commit();
}
+
writers.close();
}
@Override
public void write(Text key, NutchIndexAction indexAction)
throws IOException {
+
if (indexAction.action == NutchIndexAction.ADD) {
writers.write(indexAction.doc);
+
} else if (indexAction.action == NutchIndexAction.DELETE) {
writers.delete(key.toString());
}
}
};
}
-}
+
+ @Override
+ public void checkOutputSpecs(JobContext context)
+ throws IOException, InterruptedException {
+ // No output specs required since we don't write files
+ }
+
+ @Override
+ public OutputCommitter getOutputCommitter(TaskAttemptContext context)
+ throws IOException, InterruptedException {
+
+ return new OutputCommitter() {
+
+ @Override
+ public void setupJob(JobContext jobContext) {}
+
+ @Override
+ public void setupTask(TaskAttemptContext taskContext) {}
+
+ @Override
+ public boolean needsTaskCommit(TaskAttemptContext taskContext) {
+ return false;
+ }
+
+ @Override
+ public void commitTask(TaskAttemptContext taskContext) {}
+
+ @Override
+ public void abortTask(TaskAttemptContext taskContext) {}
+ };
Review Comment:
The anonymous no-op `OutputCommitter` here largely duplicates Hadoop's
built-in null output commit semantics (see `NullOutputFormat` usage elsewhere,
e.g. `CleaningJob`). To reduce maintenance risk across Hadoop upgrades (API
surface changes) and keep behavior consistent, consider extending
`org.apache.hadoop.mapreduce.lib.output.NullOutputFormat` and only overriding
`getRecordWriter`, or otherwise reuse a shared no-op committer implementation
instead of defining a new anonymous one.
##########
src/java/org/apache/nutch/indexer/IndexerOutputFormat.java:
##########
@@ -1,37 +1,30 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
+ * contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
*/
Review Comment:
The ASF license header has been truncated (missing the Apache License 2.0
URL and the standard warranty/limitation text). This file should keep the full
ASF boilerplate header consistent with the rest of the codebase to avoid
licensing/compliance issues.
> Port NUTCH-1444 to trunk (Indexing should not create temporary files)
> ---------------------------------------------------------------------
>
> Key: NUTCH-1446
> URL: https://issues.apache.org/jira/browse/NUTCH-1446
> Project: Nutch
> Issue Type: Bug
> Reporter: Ferdy
> Priority: Major
> Fix For: 1.23
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)