GitHub user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8687#discussion_r39344473

    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -984,6 +986,15 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
           hadoopConf.setOutputCommitter(classOf[FileOutputCommitter])
         }

    +    // When speculation is on and output committer class name contains "Direct", we should warn
    +    // users that they may loss data if they are using a direct output committer.
    +    val speculationEnabled = self.conf.getBoolean("spark.speculation", false)
    +    if (speculationEnabled &&
    +      hadoopConf.get("mapred.output.committer.class", "").contains("Direct")) {
    +      logWarning("We may loss data when use direct output committer with speculation enabled, " +
    +        "please make sure your output committer doesn't write data directly.")
    +    }
    --- End diff --

    How about
    ```
    val outputCommitterClass = hadoopConf.get("mapred.output.committer.class", "")
    if (speculationEnabled && outputCommitterClass.contains("Direct")) {
      val warningMessage =
        s"$outputCommitterClass may be an output committer that writes data directly to " +
          "the final location. Because speculation is enabled, this output committer may " +
          "cause data loss (see the case in SPARK-10063). If possible, please use an output " +
          "committer that does not have this behavior (e.g. FileOutputCommitter)."
      logWarning(warningMessage)
    }
    ```