[ https://issues.apache.org/jira/browse/SPARK-34910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17775922#comment-17775922 ]
Jason Yarbrough commented on SPARK-34910: ----------------------------------------- Hey guys, Sorry for commenting on such an old issue, but the PR went stale, and I just wanted to revisit this to see if anyone thinks it's worthwhile now. Thanks > JDBC - Add an option for different stride orders > ------------------------------------------------ > > Key: SPARK-34910 > URL: https://issues.apache.org/jira/browse/SPARK-34910 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.2.0 > Reporter: Jason Yarbrough > Priority: Trivial > > Currently, the JDBCRelation columnPartition function orders the strides in > ascending order, starting from the lower bound and working its way towards > the upper bound. > I'm proposing leaving that as the default, but adding an option (such as > strideOrder) in JDBCOptions. Since it will default to the current behavior, > this will keep people's current code working as expected. However, people who > may have data skew closer to the upper bound might appreciate being able to > have the strides in descending order, thus filling up the first partition > with the last stride and so forth. Also, people with nondeterministic data > skew or sporadic data density might be able to benefit from a random ordering > of the strides. > I have the code created to implement this, and it creates a pattern that can > be used to add other algorithms that people may want to add (such as counting > the rows and ranking each stride, and then ordering from most dense to > least). The current two options I have coded is 'descending' and 'random.' > The original idea was to create something closer to Spark's hash partitioner, > but for JDBC and pushed down to the database engine for efficiency. However, > that would require adding hashing algorithms for each dialect, and the > performance from those algorithms may outweigh the benefit. The method I'm > proposing in this ticket avoids those complexities while still giving some of > the benefit (in the case of random ordering). -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org