[jira] [Commented] (SPARK-9899) JSON/Parquet writing on retry or speculation broken with direct output committer

2015-09-14 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744026#comment-14744026
 ] 

Yin Huai commented on SPARK-9899:
-

https://github.com/apache/spark/pull/8687 adds a warning message in the places 
where we save data through the RDD API and where we save data to Hive, advising 
against using a direct output committer when speculation is enabled. This 
change will be included in 1.6.
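
As a rough illustration of the kind of check described above, here is a minimal Java sketch. It is not the code from the pull request. It assumes the configuration is available as a plain map, that speculation is controlled by the `spark.speculation` key, that the committer class is read from the Hadoop key `mapred.output.committer.class`, and that a class name containing "Direct" is a reasonable hint for a direct output committer. The class name `SpeculationCommitterCheck`, the method `warningFor`, and the committer `com.example.DirectOutputCommitter` are hypothetical.

```java
import java.util.Map;
import java.util.Optional;

final class SpeculationCommitterCheck {
    // Returns a warning when speculation is enabled and the configured committer
    // class name suggests that it writes directly to the final output location.
    // The "contains Direct" test is a heuristic assumed for this sketch.
    static Optional<String> warningFor(Map<String, String> conf) {
        boolean speculation =
            Boolean.parseBoolean(conf.getOrDefault("spark.speculation", "false"));
        String committer = conf.getOrDefault("mapred.output.committer.class", "");
        if (speculation && committer.contains("Direct")) {
            return Optional.of(committer
                + " may write data directly to the final location;"
                + " with speculation enabled this may cause data loss.");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical configuration: speculation on, direct committer configured.
        Map<String, String> conf = Map.of(
            "spark.speculation", "true",
            "mapred.output.committer.class", "com.example.DirectOutputCommitter");
        warningFor(conf).ifPresent(msg -> System.err.println("WARN " + msg));
    }
}
```

The sketch only warns and does not change the committer, which matches the comment's description of the change as a warning message.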

> JSON/Parquet writing on retry or speculation broken with direct output 
> committer
> 
>
>                 Key: SPARK-9899
>                 URL: https://issues.apache.org/jira/browse/SPARK-9899
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Michael Armbrust
>            Assignee: Cheng Lian
>            Priority: Blocker
>             Fix For: 1.5.0
>
>
> If the first task fails, all subsequent tasks will fail as well.  We probably 
> need to set a different boolean when calling create.
> {code}
> java.io.IOException: File already exists: ...
> ...
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452)
>   at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128)
>   at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.<init>(JSONRelation.scala:185)
>   at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anon$1.newInstance(JSONRelation.scala:160)
>   at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:217)
>   at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> {code}
> The reason behind this issue is that speculation shouldn't be used together 
> with a direct output committer, because there are multiple corner cases in 
> which this combination may cause data corruption and/or data loss. Please 
> refer to this 
> [GitHub 
> comment|https://github.com/apache/spark/pull/8191#issuecomment-131598385] for 
> more details about these corner cases.
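
The failure mode in the description above can be reproduced in miniature. The following Java sketch is an analogy only: it uses the local file system through `java.nio.file` rather than Hadoop, and the `create(path, overwrite)` helper is a hypothetical stand-in that merely mirrors the shape of a create call taking an overwrite boolean. It assumes a task attempt that writes straight to its final path (as a direct output committer would) and fails partway, followed by a retry of the same task.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class DirectWriteRetryDemo {
    // Hypothetical stand-in for a create(path, overwrite) call (local files, not Hadoop).
    // With overwrite=false, an existing file makes the call fail.
    static OutputStream create(Path path, boolean overwrite) throws IOException {
        return overwrite
            ? Files.newOutputStream(path, StandardOpenOption.CREATE,
                  StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE)
            : Files.newOutputStream(path, StandardOpenOption.CREATE_NEW,
                  StandardOpenOption.WRITE);
    }

    // One task attempt that writes directly to the final path.
    // When failMidway is true it throws after a partial write.
    static void attempt(Path finalPath, boolean overwrite, boolean failMidway)
            throws IOException {
        try (OutputStream out = create(finalPath, overwrite)) {
            out.write("{\"a\":1}\n".getBytes(StandardCharsets.UTF_8));
            if (failMidway) {
                throw new IOException("simulated task failure");
            }
            out.write("{\"a\":2}\n".getBytes(StandardCharsets.UTF_8));
        }
    }

    // Runs a failing first attempt, then a retry; returns true if the retry succeeded.
    static boolean retrySucceeds(boolean overwrite) throws IOException {
        Path dir = Files.createTempDirectory("direct-write-demo");
        Path finalPath = dir.resolve("part-00000.json");
        try {
            attempt(finalPath, overwrite, true);
        } catch (IOException expected) {
            // The partial file is left at the final location; nothing cleans it up.
        }
        try {
            attempt(finalPath, overwrite, false);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("retry with overwrite=false succeeds: " + retrySucceeds(false));
        System.out.println("retry with overwrite=true succeeds: " + retrySucceeds(true));
    }
}
```

In this sketch the overwrite flag only helps a sequential retry. It would not make two concurrent speculative attempts writing to the same final path safe, which is consistent with the note above that speculation shouldn't be combined with a direct output committer.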



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9899) JSON/Parquet writing on retry or speculation broken with direct output committer

2015-09-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14738328#comment-14738328
 ] 

Apache Spark commented on SPARK-9899:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/8687







[jira] [Commented] (SPARK-9899) JSON/Parquet writing on retry or speculation broken with direct output committer

2015-08-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703530#comment-14703530
 ] 

Apache Spark commented on SPARK-9899:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8317







[jira] [Commented] (SPARK-9899) JSON/Parquet writing on retry or speculation broken with direct output committer

2015-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696672#comment-14696672
 ] 

Apache Spark commented on SPARK-9899:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8191



