[jira] [Commented] (SPARK-17493) Spark Job hangs while DataFrame writing to HDFS path with parquet mode

2016-12-15 Thread Anbu Cheeralan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752443#comment-15752443
 ] 

Anbu Cheeralan commented on SPARK-17493:


[~sowen] I faced a similar error while writing to Google Cloud Storage. This 
issue is specific to writing to object stores, and it happens in append mode.

In org.apache.spark.sql.execution.datasources.DataSource.write(), the following 
code causes a huge number of RPC calls when the file system is an object store 
(S3, GS). 
{quote}
if (mode == SaveMode.Append) {
  val existingPartitionColumns = Try {
    resolveRelation()
      .asInstanceOf[HadoopFsRelation]
      .location
      .partitionSpec()
      .partitionColumns
      .fieldNames
      .toSeq
  }.getOrElse(Seq.empty[String])
{quote}
There should be a flag to skip the partition match check in append mode. I can 
work on the patch.
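
To illustrate the proposal, here is a minimal, self-contained sketch that does 
not use Spark. Everything in it is hypothetical: the skipCheck flag, the 
validateAppend and discoverPartitionColumns helpers, and the listingCalls 
counter are stand-ins, not Spark APIs. It assumes that partition discovery 
touches every existing partition directory and that each such access is a 
remote call on an object store.

```scala
import scala.util.Try

object AppendCheckSketch {
  // Stand-in for Spark's SaveMode; only the values needed for the sketch.
  sealed trait SaveMode
  case object Append extends SaveMode
  case object Overwrite extends SaveMode

  // Counts simulated file system calls so the cost of the check is visible.
  var listingCalls = 0

  // Hypothetical stand-in for resolveRelation()...partitionSpec(): it visits
  // every existing partition directory (one simulated remote call each) and
  // derives the partition column name from the first directory name.
  def discoverPartitionColumns(partitionDirs: Seq[String]): Seq[String] = {
    partitionDirs.foreach(_ => listingCalls += 1)
    partitionDirs.headOption.map(_.takeWhile(_ != '=')).toList
  }

  // Hypothetical flag: when skipCheck is true, append mode trusts the caller
  // and performs no discovery at all.
  def validateAppend(
      mode: SaveMode,
      requested: Seq[String],
      partitionDirs: Seq[String],
      skipCheck: Boolean): Unit = {
    if (mode == Append && !skipCheck) {
      val existing =
        Try(discoverPartitionColumns(partitionDirs)).getOrElse(Seq.empty[String])
      require(
        existing.isEmpty || existing == requested,
        s"Requested partitioning $requested does not match existing $existing")
    }
  }
}
```

With 1000 existing event_date partitions, calling validateAppend with 
skipCheck = false performs 1000 simulated listing calls, while skipCheck = true 
performs none. The trade-off is that a mismatched partitioning would no longer 
be caught before the write.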

> Spark Job hangs while DataFrame writing to HDFS path with parquet mode
> --
>
> Key: SPARK-17493
> URL: https://issues.apache.org/jira/browse/SPARK-17493
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 2.0.0
> Environment: AWS Cluster
> Reporter: Gautam Solanki
>
> While saving an RDD to an HDFS path in Parquet format with the following call: 
> rddout.write.partitionBy("event_date").mode(org.apache.spark.sql.SaveMode.Append).parquet("hdfs:tmp//rddout_parquet_full_hdfs1//")
> the Spark job hung because two write tasks with a Shuffle Read size of 0 
> could not complete, even though the executors had notified the driver that 
> these two tasks were complete. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17493) Spark Job hangs while DataFrame writing to HDFS path with parquet mode

2016-09-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15481314#comment-15481314
 ] 

Sean Owen commented on SPARK-17493:
---

I don't think this is enough info. What does 'hang' mean here? Is there a 
reproduction? What do thread dumps show is going on? 
