[jira] [Commented] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores

2017-01-17 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826620#comment-15826620 ]

Apache Spark commented on SPARK-18917:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/16622

> Dataframe - Time Out Issues / Taking long time in append mode on object stores
> --
>
> Key: SPARK-18917
> URL: https://issues.apache.org/jira/browse/SPARK-18917
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2, SQL, YARN
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Anbu Cheeralan
>Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> When using DataFrame write in append mode on object stores (S3 / Google 
> Storage), writes take a long time or hit read timeouts. This is because 
> dataframe.write lists all leaf folders in the target directory. If there 
> are a lot of subfolders due to partitioning, this takes forever.
> In org.apache.spark.sql.execution.datasources.DataSource.write(), the 
> following code causes a huge number of RPC calls when the file system is 
> an object store (S3, GS):
> {code}
> if (mode == SaveMode.Append) {
>   val existingPartitionColumns = Try {
>     resolveRelation()
>       .asInstanceOf[HadoopFsRelation]
>       .location
>       .partitionSpec()
>       .partitionColumns
>       .fieldNames
>       .toSeq
>   }.getOrElse(Seq.empty[String])
>   // ... (partition match check continues)
> }
> {code}
> There should be a flag to skip the partition match check in append mode. 
> I can work on the patch.
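
The flag proposed above could be wired in roughly as follows. This is a minimal sketch only: the conf key {{spark.sql.sources.append.skipPartitionCheck}} is a hypothetical name invented for illustration, not an existing Spark option.

{code}
// Hypothetical sketch: the conf key below is invented for illustration
// and is not an existing Spark option.
val skipPartitionCheck = sparkSession.sessionState.conf
  .getConfString("spark.sql.sources.append.skipPartitionCheck", "false")
  .toBoolean

// Only pay for the listing-heavy partition check when the user
// has not opted out of it.
val existingPartitionColumns =
  if (mode == SaveMode.Append && !skipPartitionCheck) {
    Try {
      resolveRelation()
        .asInstanceOf[HadoopFsRelation]
        .location
        .partitionSpec()
        .partitionColumns
        .fieldNames
        .toSeq
    }.getOrElse(Seq.empty[String])
  } else {
    Seq.empty[String]
  }
{code}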






[jira] [Commented] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores

2017-01-09 Thread Anbu Cheeralan (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15812608#comment-15812608 ]

Anbu Cheeralan commented on SPARK-18917:


I agree the Hadoop fix will reduce the recursive calls, but that solution 
addresses only S3; it does not solve the problem for Google Storage or Azure.

Further, optionally disabling this code removes one entire stage from the 
execution. I would still like to have this as a feature.
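
From the user side, the opt-out would then be a one-line setting. Again a hypothetical sketch: the conf key is the invented name from the sketch above, not an existing option.

{code}
// Hypothetical usage: the conf key is invented for illustration.
spark.conf.set("spark.sql.sources.append.skipPartitionCheck", "true")
df.write.mode("append").partitionBy("dt").parquet("s3a://bucket/table")
{code}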




[jira] [Commented] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores

2017-01-03 Thread Steve Loughran (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15795069#comment-15795069 ]

Steve Loughran commented on SPARK-18917:


Looking at the code being optionally disabled, the underlying problem is that 
it does a recursive treewalk of the filesystem paths, which is a performance 
killer on object stores, where each directory costs 3-4 HTTPS requests.

HADOOP-13208 makes the recursive {{FileSystem.listFiles(path, recursive=true)}} 
call {{O(leaf-files/5000)}}, irrespective of directory structure. This will 
be in Hadoop 2.8 and things derived from that code.

Moving {{PartitioningAwareFileIndex.bulkListLeafFiles()}} to that method 
would deliver the speedup without having to add and test a new option.
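
For reference, the flat-listing API in question is used like this. A minimal sketch of the approach, not the actual {{bulkListLeafFiles()}} patch:

{code}
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path}
import scala.collection.mutable.ArrayBuffer

// Enumerate all leaf files under `root` with a single recursive listing
// call. With HADOOP-13208, S3A serves this as a paged flat listing of the
// key space instead of a per-directory treewalk.
def listLeafFiles(fs: FileSystem, root: Path): Seq[LocatedFileStatus] = {
  val files = ArrayBuffer.empty[LocatedFileStatus]
  val it = fs.listFiles(root, true) // recursive = true
  while (it.hasNext) {
    files += it.next()
  }
  files.toSeq
}
{code}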




[jira] [Commented] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores

2016-12-19 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15761926#comment-15761926 ]

Apache Spark commented on SPARK-18917:
--

User 'alunarbeach' has created a pull request for this issue:
https://github.com/apache/spark/pull/16339




[jira] [Commented] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores

2016-12-19 Thread Anbu Cheeralan (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15761444#comment-15761444 ]

Anbu Cheeralan commented on SPARK-18917:


Thanks [~dongjoon]. This is my first PR; I will not set target versions and 
fix versions in the future.
Does this task need to be assigned to me?




[jira] [Commented] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores

2016-12-18 Thread Dongjoon Hyun (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15759931#comment-15759931 ]

Dongjoon Hyun commented on SPARK-18917:
---

Hi, [~alunarbeach].
Sure, you can make a PR.
BTW, please don't set target versions and fix versions.
Usually, target versions are used by committers. Also, fix versions are 
recorded only when your patch is merged.
