[jira] [Assigned] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores
[ https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18917:
------------------------------------

    Assignee:     (was: Apache Spark)

> Dataframe - Time Out Issues / Taking long time in append mode on object stores
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-18917
>                 URL: https://issues.apache.org/jira/browse/SPARK-18917
>             Project: Spark
>          Issue Type: Improvement
>          Components: EC2, SQL, YARN
>    Affects Versions: 2.0.2
>            Reporter: Anbu Cheeralan
>            Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> When using DataFrame write in append mode on object stores (S3 / Google
> Storage), writes take a long time or fail with read timeouts.
> This is because dataframe.write lists all leaf folders in the target
> directory. If there are a lot of subfolders due to partitions, this takes
> forever.
> In org.apache.spark.sql.execution.datasources.DataSource.write(), the
> following code causes a huge number of RPC calls when the file system is an
> object store (S3, GS):
>     if (mode == SaveMode.Append) {
>       val existingPartitionColumns = Try {
>         resolveRelation()
>           .asInstanceOf[HadoopFsRelation]
>           .location
>           .partitionSpec()
>           .partitionColumns
>           .fieldNames
>           .toSeq
>       }.getOrElse(Seq.empty[String])
> There should be a flag to skip the partition match check in append mode. I can
> work on the patch.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
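The cost described in the report can be illustrated with a small standalone sketch. This is not Spark code: `CountingStore`, `discoverPartitionColumns`, and the `skipPartitionCheck` flag are hypothetical names made up for illustration, and the store is an in-memory map in which every directory listing counts as one remote call. Under those assumptions, discovering the existing partition columns needs one listing per directory, while the proposed flag avoids the listings entirely.

```scala
import scala.collection.mutable

// Hypothetical stand-in for an object store: every list() counts as one remote call.
final class CountingStore(dirs: Map[String, Seq[String]]) {
  var listCalls: Int = 0
  def list(path: String): Seq[String] = {
    listCalls += 1
    dirs.getOrElse(path, Seq.empty)
  }
}

object AppendCheckSketch {
  // Walks every directory under `root` and collects partition column names
  // from path segments of the form "column=value".
  def discoverPartitionColumns(store: CountingStore, root: String): Seq[String] = {
    val columns = mutable.LinkedHashSet.empty[String]
    def walk(path: String): Unit =
      store.list(path).foreach { child =>
        val name = child.substring(child.lastIndexOf('/') + 1)
        val eq = name.indexOf('=')
        if (eq > 0) columns += name.substring(0, eq)
        walk(child)
      }
    walk(root)
    columns.toSeq
  }

  // `skipPartitionCheck` models the flag proposed in the report (an assumption,
  // not an existing Spark option): when set, no listing is performed at all.
  def existingPartitionColumns(
      store: CountingStore,
      root: String,
      skipPartitionCheck: Boolean): Seq[String] =
    if (skipPartitionCheck) Seq.empty[String]
    else discoverPartitionColumns(store, root)

  // Builds a two-level layout: bucket/table/year=YYYY/month=M
  def buildStore(years: Int, months: Int): CountingStore = {
    val root = "bucket/table"
    val yearDirs = (0 until years).map(y => s"$root/year=${2000 + y}")
    val monthDirs = yearDirs.map(y => y -> (1 to months).map(m => s"$y/month=$m"))
    new CountingStore(Map(root -> yearDirs) ++ monthDirs)
  }

  def main(args: Array[String]): Unit = {
    val checked = buildStore(years = 10, months = 12)
    val cols = existingPartitionColumns(checked, "bucket/table", skipPartitionCheck = false)
    println(s"columns=$cols listCalls=${checked.listCalls}")

    val skipped = buildStore(years = 10, months = 12)
    existingPartitionColumns(skipped, "bucket/table", skipPartitionCheck = true)
    println(s"listCalls with flag=${skipped.listCalls}")
  }
}
```

With 10 year folders and 12 month folders each, the walk issues 1 + 10 + 120 = 131 listings before any data is written. The count grows with the number of partition directories, which is consistent with the slowdown the reporter sees on heavily partitioned tables.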
[jira] [Assigned] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores
[ https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18917:
------------------------------------

    Assignee: Apache Spark

> Dataframe - Time Out Issues / Taking long time in append mode on object stores
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-18917
>                 URL: https://issues.apache.org/jira/browse/SPARK-18917
>             Project: Spark
>          Issue Type: Improvement
>          Components: EC2, SQL, YARN
>    Affects Versions: 2.0.2
>            Reporter: Anbu Cheeralan
>            Assignee: Apache Spark
>            Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h