[GitHub] spark issue #20372: Improved block merging logic for partitions

2018-01-26 Thread vgankidi
Github user vgankidi commented on the issue: https://github.com/apache/spark/pull/20372 I agree with @ash211. Applications shouldn't rely on the order of the files within a partition. This optimization looks good to me.

[GitHub] spark pull request #19633: [SPARK-22411][SQL] Disable the heuristic to calcu...

2017-11-29 Thread vgankidi
Github user vgankidi commented on a diff in the pull request: https://github.com/apache/spark/pull/19633#discussion_r153957431 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -424,11 +424,19 @@ case class FileSourceScanExec

[GitHub] spark pull request #19633: [SPARK-22411][SQL] Disable the heuristic to calcu...

2017-11-15 Thread vgankidi
Github user vgankidi commented on a diff in the pull request: https://github.com/apache/spark/pull/19633#discussion_r151236188 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -424,11 +424,19 @@ case class FileSourceScanExec

[GitHub] spark pull request #19633: [SPARK-22411][SQL] Disable the heuristic to calcu...

2017-11-12 Thread vgankidi
Github user vgankidi commented on a diff in the pull request: https://github.com/apache/spark/pull/19633#discussion_r150431620 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -424,11 +424,19 @@ case class FileSourceScanExec

[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

2017-11-08 Thread vgankidi
Github user vgankidi commented on the issue: https://github.com/apache/spark/pull/19633 @gatorsmile Can you please take a look? I'd like to hear your thoughts on this. --- - To unsubscribe, e-mail: reviews-unsubscr

[GitHub] spark issue #19634: [SPARK-22412][SQL] Fix incorrect comment in DataSourceSc...

2017-11-08 Thread vgankidi
Github user vgankidi commented on the issue: https://github.com/apache/spark/pull/19634 We will end up having fewer combined splits. That reduces the number of files that the job produces and also reduces the number of tasks in the downstream jobs. In some tests I have noticed about

[GitHub] spark issue #19634: [SPARK-22412][SQL] Fix incorrect comment in DataSourceSc...

2017-11-07 Thread vgankidi
Github user vgankidi commented on the issue: https://github.com/apache/spark/pull/19634 @gatorsmile I also wanted to discuss if we should consider other bin packing algorithms. According to this http://www.math.unl.edu/~s-sjessie1/203Handouts/Bin%20Packing.pdf, next fit decreasing

[GitHub] spark pull request #19634: [SPARK-22412][SQL] Fix incorrect comment in DataS...

2017-11-01 Thread vgankidi
GitHub user vgankidi opened a pull request: https://github.com/apache/spark/pull/19634 [SPARK-22412][SQL] Fix incorrect comment in DataSourceScanExec ## What changes were proposed in this pull request? Next fit decreasing bin packing algorithm is used to combine splits
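
For readers unfamiliar with the term, here is a minimal sketch of next fit decreasing bin packing as the entry above names it. It is not taken from the Spark source: the function name `next_fit_decreasing` and the parameters `max_bytes` and `open_cost` are hypothetical, and the per-file open cost is an assumption about how a reader might account for file-open overhead.

```python
from typing import List


def next_fit_decreasing(sizes: List[int], max_bytes: int, open_cost: int = 0) -> List[List[int]]:
    """Pack file sizes into bins using next fit decreasing.

    Sizes are visited largest first. Only the current bin is ever
    considered: if the next file does not fit, the bin is closed and a
    new one is started. Closed bins are never revisited.
    """
    bins: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in sorted(sizes, reverse=True):
        # Close the current bin when adding this file would overflow it.
        if current and current_size + size > max_bytes:
            bins.append(current)
            current = []
            current_size = 0
        current.append(size)
        # Hypothetical: charge a fixed overhead for opening each file.
        current_size += size + open_cost
    if current:
        bins.append(current)
    return bins


if __name__ == "__main__":
    print(next_fit_decreasing([5, 4, 3, 2, 1], max_bytes=6))
    # prints [[5], [4], [3, 2, 1]]
```

Because closed bins are never reopened, small files that would have fit into an earlier bin end up in later ones. That is the trade-off against first fit decreasing, which revisits earlier bins at the cost of more bookkeeping.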

[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

2017-11-01 Thread vgankidi
Github user vgankidi commented on the issue: https://github.com/apache/spark/pull/19633 How about using spark.dynamicAllocation.maxExecutors for calculating bytesPerCore when dynamic allocation is enabled
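
As a rough illustration of the suggestion above, the sketch below computes a maximum split size from a bytes-per-core estimate and shows how the parallelism could be derived from `spark.dynamicAllocation.maxExecutors`. It is a sketch under assumptions, not Spark's implementation: the function names and the `cores_per_executor` parameter are hypothetical, and the defaults of 128 MB and 4 MB mirror the documented defaults of `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`.

```python
MB = 1024 * 1024


def parallelism_from_max_executors(max_executors: int, cores_per_executor: int) -> int:
    """Hypothetical: upper bound on cores available under dynamic allocation."""
    return max_executors * cores_per_executor


def max_split_bytes(total_bytes: int,
                    parallelism: int,
                    max_partition_bytes: int = 128 * MB,
                    open_cost_bytes: int = 4 * MB) -> int:
    """Cap the split size by the configured maximum, but never go below
    the file-open cost; in between, aim for one split per core."""
    bytes_per_core = total_bytes // parallelism
    return min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))


if __name__ == "__main__":
    cores = parallelism_from_max_executors(max_executors=2, cores_per_executor=4)
    print(max_split_bytes(64 * MB, cores) // MB)  # prints 8
```

With a small parallelism estimate, bytes per core is large and the split size hits the configured maximum. With a large estimate, such as one based on the maximum executor count, splits shrink toward the open cost.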

[GitHub] spark issue #19633: [SPARK-22411][SQL] Disable the heuristic to calculate ma...

2017-11-01 Thread vgankidi
Github user vgankidi commented on the issue: https://github.com/apache/spark/pull/19633 ping @davies

[GitHub] spark pull request #19633: [SPARK-22411][SQL] Disable the heuristic to calcu...

2017-11-01 Thread vgankidi
GitHub user vgankidi opened a pull request: https://github.com/apache/spark/pull/19633 [SPARK-22411][SQL] Disable the heuristic to calculate max partition size when dynamic allocation is enabled and use the value specified by the property spark.sql.files.maxPartitionBytes instead

[GitHub] spark issue #19425: [SPARK-22196][Core] Combine multiple input splits into a...

2017-10-04 Thread vgankidi
Github user vgankidi commented on the issue: https://github.com/apache/spark/pull/19425 @davies Can you please take a look?

[GitHub] spark pull request #19425: [SPARK-22196][Core] Combine multiple input splits...

2017-10-04 Thread vgankidi
GitHub user vgankidi opened a pull request: https://github.com/apache/spark/pull/19425 [SPARK-22196][Core] Combine multiple input splits into a HadoopPartition ## What changes were proposed in this pull request? Spark native read path allows tuning the partition size based