[ https://issues.apache.org/jira/browse/SPARK-34910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312702#comment-17312702 ]
Jason Yarbrough commented on SPARK-34910:
-----------------------------------------

Hi [~srowen], thanks for asking.

The idea here is that if a user's data is heavily skewed toward the "right" (the upper bound), the last strides (and therefore partitions/tasks) will hold the highest density of data. Say a stage has 32 tasks and the first 30 finish at the same time; the last two tasks, the heaviest ones, would only then start. The issue is that 2 cores would keep running for a long time while the other cores sit idle (since there are no tasks left). The hope is that, with the option to reverse the order, the first 2 tasks would be the heaviest, and while 2 cores work on those, the remaining cores would work through the rest. Increasing the partition count flattens this problem out somewhat, but having a descending option could still help.

As for a random stride order, it's simply another option that may help flatten the distribution (even more so with a higher partition count). This one is more of an edge case, but it could help break apart some hot spots, and it is easy to implement within the pattern I've developed.

On somewhat of a side note (I may try to implement this in the current code, but may save it for another branch), I think it would be worth adding a ranking of partitions by density. If the column is indexed (which I'd recommend for columns people partition on anyway), the performance impact of the extra count queries may not be that bad. Once a count of records per partition is available, we could order the partition array so that the densest partitions sit at the head.

Implementing what I'm proposing here gives people more options for how their data is processed, and the pattern can also be extended to other algorithms if they make sense.
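To make the idea concrete, here is a minimal sketch of the reordering step (illustrative only; the function name and shape are assumptions, not the actual JDBCRelation code). It takes the WHERE-clause predicates that columnPartition builds in ascending stride order and reorders them according to the requested mode:

    import scala.util.Random

    // Sketch: reorder ascending stride predicates by the requested mode.
    // "descending" puts the upper-bound (densest) strides first so they are
    // scheduled earliest; "random" spreads potential hot spots across tasks.
    def reorderStrides(predicates: Seq[String], strideOrder: String): Seq[String] =
      strideOrder match {
        case "descending" => predicates.reverse
        case "random"     => Random.shuffle(predicates)
        case _            => predicates // default: current ascending behavior
      }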
> JDBC - Add an option for different stride orders
> ------------------------------------------------
>
>                 Key: SPARK-34910
>                 URL: https://issues.apache.org/jira/browse/SPARK-34910
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Jason Yarbrough
>            Priority: Trivial
>
> Currently, the JDBCRelation columnPartition function orders the strides in
> ascending order, starting from the lower bound and working its way toward
> the upper bound.
> I'm proposing leaving that as the default but adding an option (such as
> strideOrder) in JDBCOptions. Since it will default to the current behavior,
> existing code will keep working as expected. However, people with data skew
> closer to the upper bound might appreciate being able to have the strides in
> descending order, thus filling the first partition with the last stride and
> so forth. Also, people with nondeterministic data skew or sporadic data
> density might benefit from a random ordering of the strides.
> I have the code written to implement this, and it creates a pattern that can
> be extended with other algorithms people may want to add (such as counting
> the rows, ranking each stride, and then ordering from most dense to least).
> The two options I have coded so far are 'descending' and 'random.'
> The original idea was to create something closer to Spark's hash partitioner,
> but for JDBC and pushed down to the database engine for efficiency. However,
> that would require adding hashing algorithms for each dialect, and the
> performance cost of those algorithms might outweigh the benefit. The method
> I'm proposing in this ticket avoids those complexities while still giving
> some of the benefit (in the case of random ordering).
> I'll put a PR in if others feel this is a good idea.
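For illustration, a hypothetical reader-side usage of the proposal, assuming the option is added to JDBCOptions under the name strideOrder with values 'descending' and 'random' as described in this ticket (the option does not exist in Spark today, and the connection details below are made up):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("stride-order-demo").getOrCreate()

    // Hypothetical: "strideOrder" is the proposed option, not an existing one.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/appdb") // example URL
      .option("dbtable", "events")
      .option("partitionColumn", "event_id")
      .option("lowerBound", "1")
      .option("upperBound", "100000000")
      .option("numPartitions", "32")
      .option("strideOrder", "descending") // or "random"; omitted = ascending
      .load()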