[ https://issues.apache.org/jira/browse/SPARK-34910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312702#comment-17312702 ]
Jason Yarbrough commented on SPARK-34910:
-----------------------------------------

Hi [~srowen], thanks for asking.

The idea here is that if a user's data is heavily skewed toward the "right" (the upper bound), the last strides (and therefore partitions/tasks) will hold the highest density of data. Say a stage has 32 tasks and the first 30 finish at the same time; the last two tasks, the heaviest ones, would only then start. The issue is that 2 cores would keep running for a long time while the other cores sit idle (since there are no tasks left). The hope is that, with the option to reverse the order, the first 2 tasks would be the heaviest, and while 2 cores work on those, the remaining cores would work through the rest. Increasing the partition count flattens this problem out somewhat, but having a descending option could still help.

As for a random stride order, it's simply another option that may help flatten the distribution (even more so with a higher partition count). This one is more of an edge case, but it could help break apart some hot spots, and it is easy to implement within the pattern I've developed.

On somewhat of a side note (I may try to implement this in the current code, but may save it for another branch), I think it would be worth adding a ranking of partitions by density. If the column is indexed (which I'd recommend for columns people partition on anyway), the performance impact of the extra count queries may not be that bad. Once a count of records per partition is available, we could order the partition array so that the densest partitions sit at the head.

Implementing what I'm proposing here gives people more options for how their data is processed, and the pattern can also be extended to other algorithms if they make sense.
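To make the idea concrete, here is a minimal sketch of the reordering step (illustrative only; the function name and shape are assumptions, not the actual JDBCRelation code). It takes the WHERE-clause predicates that columnPartition builds in ascending stride order and reorders them according to the requested mode:

    import scala.util.Random

    // Sketch: reorder ascending stride predicates by the requested mode.
    // "descending" puts the upper-bound (densest) strides first so they are
    // scheduled earliest; "random" spreads potential hot spots across tasks.
    def reorderStrides(predicates: Seq[String], strideOrder: String): Seq[String] =
      strideOrder match {
        case "descending" => predicates.reverse
        case "random"     => Random.shuffle(predicates)
        case _            => predicates // default: current ascending behavior
      }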
> JDBC - Add an option for different stride orders
> ------------------------------------------------
>
>                 Key: SPARK-34910
>                 URL: https://issues.apache.org/jira/browse/SPARK-34910
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Jason Yarbrough
>            Priority: Trivial
>
> Currently, the JDBCRelation columnPartition function orders the strides in
> ascending order, starting from the lower bound and working its way toward
> the upper bound.
> I'm proposing leaving that as the default but adding an option (such as
> strideOrder) in JDBCOptions. Since it will default to the current behavior,
> existing code will keep working as expected. However, people with data skew
> closer to the upper bound might appreciate being able to have the strides in
> descending order, thus filling the first partition with the last stride and
> so forth. Also, people with nondeterministic data skew or sporadic data
> density might benefit from a random ordering of the strides.
> I have the code written to implement this, and it creates a pattern that can
> be extended with other algorithms people may want to add (such as counting
> the rows, ranking each stride, and then ordering from most dense to least).
> The two options I have coded so far are 'descending' and 'random.'
> The original idea was to create something closer to Spark's hash partitioner,
> but for JDBC and pushed down to the database engine for efficiency. However,
> that would require adding hashing algorithms for each dialect, and the
> performance cost of those algorithms might outweigh the benefit. The method
> I'm proposing in this ticket avoids those complexities while still giving
> some of the benefit (in the case of random ordering).
> I'll put a PR in if others feel this is a good idea.
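For illustration, a hypothetical reader-side usage of the proposal, assuming the option is added to JDBCOptions under the name strideOrder with values 'descending' and 'random' as described in this ticket (the option does not exist in Spark today, and the connection details below are made up):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("stride-order-demo").getOrCreate()

    // Hypothetical: "strideOrder" is the proposed option, not an existing one.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/appdb") // example URL
      .option("dbtable", "events")
      .option("partitionColumn", "event_id")
      .option("lowerBound", "1")
      .option("upperBound", "100000000")
      .option("numPartitions", "32")
      .option("strideOrder", "descending") // or "random"; omitted = ascending
      .load()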