jbrennan333 commented on a change in pull request #2349:
URL: https://github.com/apache/hadoop/pull/2349#discussion_r496954586



##########
File path: 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/FileOutputCommitter.java
##########
@@ -348,6 +348,16 @@ public Path getWorkPath() throws IOException {
    * @param context the job's context
    */
   public void setupJob(JobContext context) throws IOException {
+    // Downgrade v2 to v1 with a warning.

Review comment:
       This comment is no longer accurate.

##########
File path: 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/FileOutputCommitter.java
##########
@@ -348,6 +348,16 @@ public Path getWorkPath() throws IOException {
    * @param context the job's context
    */
   public void setupJob(JobContext context) throws IOException {
+    // Downgrade v2 to v1 with a warning.
+    if (algorithmVersion == 2) {
+      Logger log = LoggerFactory.getLogger(
+          "org.apache.hadoop.mapreduce.lib.output."
+              + "FileOutputCommitter.Algorithm");
+
+      log.warn("The v2 commit algorithm is deprecated;"
+          + " please switch to the v1 algorithm");

Review comment:
       I don't think we should use the word "deprecated". That implies that 
this algorithm will be removed in a future release.

##########
File path: 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml
##########
@@ -1562,10 +1562,35 @@
 
 <property>
   <name>mapreduce.fileoutputcommitter.algorithm.version</name>
-  <value>2</value>
-  <description>The file output committer algorithm version
-  valid algorithm version number: 1 or 2
-  default to 2, which is the original algorithm
+  <value>1</value>
+  <description>The file output committer algorithm version.
+
+  There are two algorithm versions in Hadoop, "1" and "2".
+
+  The version 2 algorithm is deprecated and no longer the default
+  as task commits were not atomic.

Review comment:
       Similarly, remove "deprecated".

##########
File path: 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml
##########
@@ -1562,10 +1562,35 @@
 
 <property>
   <name>mapreduce.fileoutputcommitter.algorithm.version</name>
-  <value>2</value>
-  <description>The file output committer algorithm version
-  valid algorithm version number: 1 or 2
-  default to 2, which is the original algorithm
+  <value>1</value>
+  <description>The file output committer algorithm version.
+
+  There are two algorithm versions in Hadoop, "1" and "2".
+
+  The version 2 algorithm is deprecated and no longer the default
+  as task commits were not atomic.
+  If a first task attempt fails part-way
+  through its task commit, the output directory could end up
+  with data from that failed commit, alongside the data
+  from any subsequent attempts.
+
+  See https://issues.apache.org/jira/browse/MAPREDUCE-7282
+
+  Although no-longer the default, this algorithm is safe to use if
+  all task attempts for a single task meet the following requirements
+  -they generate exactly the same set of files
+  -the contents of each file are exactly the same in each task attempt
+
+  That is:
+  1. If a second attempt commits work, there will be no leftover files from
+  a first attempt which failed during its task commit.
+  2. If a network partition causes the first task attempt to overwrite
+  some/all of the output of a second attempt, the result will be
+  exactly the same as if it had not done so.
+
+  To avoid the warning message on job setup, set the log level of the log
+  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.Algorithm
+  to ERROR.

Review comment:
       I think this section should be moved to the end of the Algorithm 2 
section below. You can add "(see below for details)" to the end of the line 
that says why algorithm v2 is no longer the default.
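
For readers following the thread, the failure mode in the proposed description can be modeled with a toy sketch. This is not Hadoop code and does not reflect how FileOutputCommitter is implemented: the class `V2CommitSketch`, the methods `commitTask` and `simulate`, and the file names are all hypothetical, and the output directory is assumed to behave like a simple map from file name to contents. It only illustrates why a commit that moves files one at a time can leave leftovers unless every attempt writes the same files with the same contents.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a non-atomic, v2-style task commit. Not Hadoop code. */
public class V2CommitSketch {

  /**
   * Copies an attempt's files into the shared output map one at a time.
   * Stops after failAfter files to mimic a crash part-way through the commit;
   * pass a negative value to let the commit run to completion.
   */
  public static void commitTask(Map<String, String> output,
      Map<String, String> attemptFiles, int failAfter) {
    int copied = 0;
    for (Map.Entry<String, String> file : attemptFiles.entrySet()) {
      if (copied == failAfter) {
        return; // simulated failure: files copied so far stay visible
      }
      output.put(file.getKey(), file.getValue());
      copied++;
    }
  }

  /** Runs a failed first attempt followed by a complete second attempt. */
  public static Map<String, String> simulate(Map<String, String> first,
      int failAfter, Map<String, String> second) {
    Map<String, String> output = new TreeMap<>();
    commitTask(output, first, failAfter);
    commitTask(output, second, -1);
    return output;
  }

  public static void main(String[] args) {
    // Attempts that write different file names: a leftover file survives.
    Map<String, String> first = new TreeMap<>();
    first.put("part-0000-attempt1", "rows A");
    first.put("part-0001-attempt1", "rows B");
    Map<String, String> second = new TreeMap<>();
    second.put("part-0000-attempt2", "rows A");
    second.put("part-0001-attempt2", "rows B");
    System.out.println(simulate(first, 1, second).keySet());

    // Attempts that write identical files: the retry overwrites cleanly.
    Map<String, String> same = new TreeMap<>();
    same.put("part-0000", "rows A");
    same.put("part-0001", "rows B");
    System.out.println(simulate(same, 1, same).equals(same));
  }
}
```

In the first scenario the output ends up with three files, one of them from the failed attempt. In the second, the result equals what a single clean attempt would have produced, which corresponds to the two requirements listed in the description.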






