[jira] [Commented] (NUTCH-1446) Port NUTCH-1444 to trunk (Indexing should not create temporary files)

ASF GitHub Bot (Jira) Sat, 07 Mar 2026 13:13:07 -0800


    [ 
https://issues.apache.org/jira/browse/NUTCH-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063807#comment-18063807
 ]


ASF GitHub Bot commented on NUTCH-1446:
---------------------------------------

Copilot commented on code in PR #905:
URL: https://github.com/apache/nutch/pull/905#discussion_r2900557404


##########
src/java/org/apache/nutch/indexer/IndexerOutputFormat.java:
##########
@@ -40,32 +33,67 @@ public RecordWriter<Text, NutchIndexAction> getRecordWriter(
     Configuration conf = context.getConfiguration();
     final IndexWriters writers = IndexWriters.get(conf);
 
-    String name = getUniqueFile(context, "part", "");
-    writers.open(conf, name);
+    // open writers (no temporary file output anymore)
+    writers.open(conf, "index");

Review Comment:
   `writers.open(conf, "index")` uses a constant name, while the previous 
implementation used a task-unique name via `getUniqueFile(...)`. If any 
`IndexWriter` implementation uses this `name` to create per-task resources 
(files, temp dirs, IDs, etc.), a constant can cause collisions between 
reducers/attempts. Consider incorporating the task attempt ID (or task ID) into 
the name to preserve the prior uniqueness without relying on `FileOutputFormat`.
   ```suggestion
       String indexName = "index-" + context.getTaskAttemptID().toString();
       writers.open(conf, indexName);
   ```



##########
src/java/org/apache/nutch/indexer/IndexerOutputFormat.java:
##########
@@ -40,32 +33,67 @@ public RecordWriter<Text, NutchIndexAction> getRecordWriter(
     Configuration conf = context.getConfiguration();
     final IndexWriters writers = IndexWriters.get(conf);
 
-    String name = getUniqueFile(context, "part", "");
-    writers.open(conf, name);
+    // open writers (no temporary file output anymore)
+    writers.open(conf, "index");
     LOG.info(writers.describe());
 
     return new RecordWriter<Text, NutchIndexAction>() {
 
       @Override
       public void close(TaskAttemptContext context) throws IOException {
-        // do the commits once and for all the reducers in one go
-        boolean noCommit = conf
-            .getBoolean(IndexerMapReduce.INDEXER_NO_COMMIT, false);
+
+        boolean noCommit =
+            conf.getBoolean(IndexerMapReduce.INDEXER_NO_COMMIT, false);
+
         if (!noCommit) {
           writers.commit();
         }
+
         writers.close();
       }
 
       @Override
       public void write(Text key, NutchIndexAction indexAction)
           throws IOException {
+
         if (indexAction.action == NutchIndexAction.ADD) {
           writers.write(indexAction.doc);
+
         } else if (indexAction.action == NutchIndexAction.DELETE) {
           writers.delete(key.toString());
         }
       }
     };
   }
-}
+
+  @Override
+  public void checkOutputSpecs(JobContext context)
+      throws IOException, InterruptedException {
+    // No output specs required since we don't write files
+  }
+
+  @Override
+  public OutputCommitter getOutputCommitter(TaskAttemptContext context)
+      throws IOException, InterruptedException {
+
+    return new OutputCommitter() {
+
+      @Override
+      public void setupJob(JobContext jobContext) {}
+
+      @Override
+      public void setupTask(TaskAttemptContext taskContext) {}
+
+      @Override
+      public boolean needsTaskCommit(TaskAttemptContext taskContext) {
+        return false;
+      }
+
+      @Override
+      public void commitTask(TaskAttemptContext taskContext) {}
+
+      @Override
+      public void abortTask(TaskAttemptContext taskContext) {}
+    };

Review Comment:
   The anonymous no-op `OutputCommitter` here largely duplicates Hadoop's 
built-in null output commit semantics (see `NullOutputFormat` usage elsewhere, 
e.g. `CleaningJob`). To reduce maintenance risk across Hadoop upgrades (API 
surface changes) and keep behavior consistent, consider extending 
`org.apache.hadoop.mapreduce.lib.output.NullOutputFormat` and only overriding 
`getRecordWriter`, or otherwise reuse a shared no-op committer implementation 
instead of defining a new anonymous one.



##########
src/java/org/apache/nutch/indexer/IndexerOutputFormat.java:
##########
@@ -1,37 +1,30 @@
 /*
  * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
+ * contributor license agreements. See the NOTICE file distributed with
  * this work for additional information regarding copyright ownership.
  * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
  */

Review Comment:
   The ASF license header has been truncated (missing the Apache License 2.0 
URL and the standard warranty/limitation text). This file should keep the full 
ASF boilerplate header consistent with the rest of the codebase to avoid 
licensing/compliance issues.





> Port NUTCH-1444 to trunk (Indexing should not create temporary files)
> ---------------------------------------------------------------------
>
>                 Key: NUTCH-1446
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1446
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Ferdy
>            Priority: Major
>             Fix For: 1.23
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (NUTCH-1446) Port NUTCH-1444 to trunk (Indexing should not create temporary files)

Reply via email to