Github user vgankidi commented on the issue:
https://github.com/apache/spark/pull/20372
I agree with @ash211. Applications shouldn't rely on the order of the files within a partition.
This optimization looks good to me.
Github user vgankidi commented on a diff in the pull request:
https://github.com/apache/spark/pull/19633#discussion_r153957431
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -424,11 +424,19 @@ case class FileSourceScanExec
Github user vgankidi commented on a diff in the pull request:
https://github.com/apache/spark/pull/19633#discussion_r151236188
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -424,11 +424,19 @@ case class FileSourceScanExec
Github user vgankidi commented on a diff in the pull request:
https://github.com/apache/spark/pull/19633#discussion_r150431620
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -424,11 +424,19 @@ case class FileSourceScanExec
Github user vgankidi commented on the issue:
https://github.com/apache/spark/pull/19633
@gatorsmile Can you please take a look? I'd like to hear your thoughts on
this.
---
-
To unsubscribe, e-mail: reviews-unsubscr
Github user vgankidi commented on the issue:
https://github.com/apache/spark/pull/19634
We will end up having fewer combined splits. That reduces the number of
files that the job produces and also reduces the number of tasks in the
downstream jobs. In some tests I have noticed about
Github user vgankidi commented on the issue:
https://github.com/apache/spark/pull/19634
@gatorsmile I also wanted to discuss whether we should consider other bin-packing algorithms. According to this handout, http://www.math.unl.edu/~s-sjessie1/203Handouts/Bin%20Packing.pdf, next fit decreasing
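As a rough illustration of the heuristic under discussion (a sketch, not the PR's actual code), next fit decreasing sorts the splits by size in descending order and then packs greedily: each split goes into the single open bin if it fits, otherwise a new bin is opened.

```python
def next_fit_decreasing(split_sizes, bin_capacity):
    """Pack split sizes (bytes) into bins of at most bin_capacity bytes
    using next fit decreasing: sort descending, keep one open bin, and
    start a new bin whenever the next split does not fit."""
    bins = []
    current, current_size = [], 0
    for size in sorted(split_sizes, reverse=True):
        if current and current_size + size > bin_capacity:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# Ten 30 MB splits with a 128 MB cap pack into 3 combined partitions
# instead of 10 single-split tasks.
packed = next_fit_decreasing([30] * 10, 128)
```

Next fit only ever considers the most recently opened bin, which makes it cheap (a single pass after sorting) but generally less tight than first fit decreasing, which scans all open bins for a fit.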
GitHub user vgankidi opened a pull request:
https://github.com/apache/spark/pull/19634
[SPARK-22412][SQL] Fix incorrect comment in DataSourceScanExec
## What changes were proposed in this pull request?
Next fit decreasing bin packing algorithm is used to combine splits
Github user vgankidi commented on the issue:
https://github.com/apache/spark/pull/19633
How about using spark.dynamicAllocation.maxExecutors for calculating bytesPerCore when dynamic allocation is enabled?
Github user vgankidi commented on the issue:
https://github.com/apache/spark/pull/19633
ping @davies
GitHub user vgankidi opened a pull request:
https://github.com/apache/spark/pull/19633
[SPARK-22411][SQL] Disable the heuristic to calculate max partition size
when dynamic allocation is enabled and use the value specified by the property
spark.sql.files.maxPartitionBytes instead
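To make the heuristic in question concrete, here is a hedged sketch of how the target split size is derived in Spark's file source: each file is charged its length plus an open cost, the total is spread over the available cores, and the result is clamped between the open cost and spark.sql.files.maxPartitionBytes. The function name and Python form are illustrative; the config names match Spark's.

```python
def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * 1024 * 1024,
                    open_cost_in_bytes=4 * 1024 * 1024):
    """Sketch of the max-partition-size heuristic: charge each file an
    open cost, divide total bytes by the parallelism, then clamp the
    result between open_cost_in_bytes and max_partition_bytes."""
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))
```

The PR's concern falls out of the formula: with dynamic allocation, the cluster's effective parallelism is not fixed, so bytes_per_core computed from a momentary core count can make splits far smaller or larger than intended, whereas honoring spark.sql.files.maxPartitionBytes directly gives a stable bound.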
Github user vgankidi commented on the issue:
https://github.com/apache/spark/pull/19425
@davies Can you please take a look?
GitHub user vgankidi opened a pull request:
https://github.com/apache/spark/pull/19425
[SPARK-22196][Core] Combine multiple input splits into a HadoopPartition
## What changes were proposed in this pull request?
Spark native read path allows tuning the partition size based