[jira] [Commented] (SPARK-9899) JSON/Parquet writing on retry or speculation broken with direct output committer

    [ https://issues.apache.org/jira/browse/SPARK-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744026#comment-14744026 ]

Yin Huai commented on SPARK-9899:
---------------------------------

https://github.com/apache/spark/pull/8687 adds a warning message in the places where we save data through the RDD API and where we save data to Hive, advising against using a direct output committer when speculation is enabled. This change will be included in 1.6.

> JSON/Parquet writing on retry or speculation broken with direct output committer
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-9899
>                 URL: https://issues.apache.org/jira/browse/SPARK-9899
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Michael Armbrust
>            Assignee: Cheng Lian
>            Priority: Blocker
>             Fix For: 1.5.0
>
>
> If the first task fails, all subsequent tasks will too. We probably need to set a different boolean when calling create.
> {code}
> java.io.IOException: File already exists: ...
> ...
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452)
>     at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128)
>     at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.<init>(JSONRelation.scala:185)
>     at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anon$1.newInstance(JSONRelation.scala:160)
>     at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:217)
>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>     at org.apache.spark.scheduler.Task.run(Task.scala:88)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> {code}
> The reason behind this issue is that speculation shouldn't be used together with a direct output committer, because there are multiple corner cases in which this combination may cause data corruption and/or data loss. Please refer to this [GitHub comment|https://github.com/apache/spark/pull/8191#issuecomment-131598385] for more details about these corner cases.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
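
The kind of check described in the comment above (warn when speculation is enabled and a direct output committer is configured) might look roughly like the following. This is a minimal sketch, not the code from the pull request: the class and method names are hypothetical, the configuration is modeled as a plain map, and detecting a direct committer by looking for "Direct" in the class name is an assumed heuristic. `spark.speculation` is Spark's speculation switch; the committer key is passed in by the caller because it differs between write paths.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Illustrative only: a hypothetical helper, not Spark's implementation. */
final class SpeculationCommitterCheck {
    static final String SPECULATION_KEY = "spark.speculation";

    /**
     * Returns a warning when speculation is on and the committer class name
     * configured under committerKey contains "Direct" (assumed naming heuristic).
     */
    static Optional<String> warningFor(Map<String, String> conf, String committerKey) {
        boolean speculation =
            Boolean.parseBoolean(conf.getOrDefault(SPECULATION_KEY, "false"));
        String committer = conf.getOrDefault(committerKey, "");
        if (speculation && committer.contains("Direct")) {
            return Optional.of(committer + " may be a direct output committer; with "
                + SPECULATION_KEY + "=true this combination may cause data corruption "
                + "and/or data loss.");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(SPECULATION_KEY, "true");
        // Hypothetical committer class, used only for the demonstration.
        conf.put("mapred.output.committer.class", "com.example.DirectOutputCommitter");
        warningFor(conf, "mapred.output.committer.class")
            .ifPresent(msg -> System.err.println("WARN " + msg));
    }
}
```

A check like this only reports the risky combination; it does not make the write safe, so the practical remedy remains disabling one of the two settings.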
[jira] [Commented] (SPARK-9899) JSON/Parquet writing on retry or speculation broken with direct output committer

    [ https://issues.apache.org/jira/browse/SPARK-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14738328#comment-14738328 ]

Apache Spark commented on SPARK-9899:
-------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/8687
[jira] [Commented] (SPARK-9899) JSON/Parquet writing on retry or speculation broken with direct output committer

    [ https://issues.apache.org/jira/browse/SPARK-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703530#comment-14703530 ]

Apache Spark commented on SPARK-9899:
-------------------------------------

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8317
[jira] [Commented] (SPARK-9899) JSON/Parquet writing on retry or speculation broken with direct output committer

    [ https://issues.apache.org/jira/browse/SPARK-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696672#comment-14696672 ]

Apache Spark commented on SPARK-9899:
-------------------------------------

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8191
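
The "File already exists" failure in the issue's stack trace can be reproduced in miniature. The sketch below is an analogy under stated assumptions, not Spark or Hadoop code: it uses `java.nio.file` on a local directory instead of Hadoop's `FileSystem`, and the class, method, and file names are hypothetical. `writeDirect` mimics a direct committer, where every attempt writes straight to the final path and refuses to overwrite; `writeThenCommit` mimics attempt-scoped output that is moved into place on commit.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Illustrative only: local-filesystem analogy of the retry failure. */
final class DirectWriteRetryDemo {
    /** Direct style: write to the final path and fail if it already exists. */
    static void writeDirect(Path finalFile, String data) throws IOException {
        Files.write(finalFile, data.getBytes(StandardCharsets.UTF_8),
            StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
    }

    /** Attempt-scoped style: write a per-attempt file, then move it into place. */
    static void writeThenCommit(Path finalFile, int attempt, String data) throws IOException {
        Path temp = finalFile.resolveSibling(finalFile.getFileName() + ".attempt_" + attempt);
        Files.write(temp, data.getBytes(StandardCharsets.UTF_8));
        Files.move(temp, finalFile, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Simulates attempt 0 dying after creating its file; returns true if the retry is rejected. */
    static boolean retryFailsWithDirectWrite(Path dir) throws IOException {
        Path out = dir.resolve("part-00000");
        writeDirect(out, "partial output from attempt 0");
        try {
            writeDirect(out, "full output from attempt 1");
            return false;
        } catch (FileAlreadyExistsException e) {
            return true; // same symptom as "File already exists" in the trace
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("spark-9899-demo");
        System.out.println("retry rejected: " + retryFailsWithDirectWrite(dir));
        Path out = dir.resolve("part-00001");
        writeThenCommit(out, 0, "attempt 0");
        writeThenCommit(out, 1, "attempt 1"); // the later commit replaces the earlier one
    }
}
```

Allowing overwrite (presumably the "different boolean" the description mentions, since Hadoop's `FileSystem.create` has overloads taking an overwrite flag) would make the retry succeed, but two speculative attempts writing to the same final path could still clobber each other, which is why the description advises against combining speculation with a direct output committer.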