[ https://issues.apache.org/jira/browse/SPARK-23249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16342207#comment-16342207 ]
Apache Spark commented on SPARK-23249: -------------------------------------- User 'glentakahashi' has created a pull request for this issue: https://github.com/apache/spark/pull/20372 > Improve partition bin-filling algorithm to have less skew and fewer partitions > ------------------------------------------------------------------------------ > > Key: SPARK-23249 > URL: https://issues.apache.org/jira/browse/SPARK-23249 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.2.1 > Reporter: Glen Takahashi > Priority: Major > > Change DataSourceScanExec so that when grouping blocks together into > partitions, also checks the end of the sorted list of splits to more > efficiently fill out partitions. > > h2. Rationale > The current bin-packing method of next-fit descending for blocks into > partitions is sub-optimal in a lot of cases and will result in extra > partitions, un-even distribution of block-counts across partitions, and > un-even distribution of partition sizes. > As an example, 128 files ranging from 1MB, 2MB,...127MB,128MB. will result in > 82 partitions with the current algorithm, but only 64 using this algorithm. > Also in this example, the max # of blocks per partition in NFD is 13, while > in this algorithm is is 2. > More generally, running a simulation of 1000 runs using 128MB blocksize, > between 1-1000 normally distributed file sizes between 1-500Mb, you can see > an improvement of approx 5% reduction of partition counts, and a large > reduction in standard deviation of blocks per partition. > This algorithm also runs in O(n) time as NFD does, and in every case is > strictly better results than NFD. > Overall, the more even distribution of blocks across partitions and therefore > reduced partition counts should result in a small but significant performance > increase across the board -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org