[ https://issues.apache.org/jira/browse/SPARK-34843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308978#comment-17308978 ]

Jason Yarbrough commented on SPARK-34843:
-----------------------------------------

PR created: [Spark 34843 - JDBCRelation columnPartition function improperly 
determines stride size. Upper bound is skewed due to stride alignment. by 
hanover-fiste · Pull Request #31965 · apache/spark 
(github.com)|https://github.com/apache/spark/pull/31965]

> JDBCRelation columnPartition function improperly determines stride size. 
> Upper bound is skewed due to stride alignment.
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-34843
>                 URL: https://issues.apache.org/jira/browse/SPARK-34843
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Jason Yarbrough
>            Priority: Minor
>         Attachments: SPARK-34843.patch
>
>
> Currently, in JDBCRelation (line 123), the stride size is calculated as 
> follows:
> val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
>  
> Because truncation happens on both divisions, the stride size can fall short 
> of what it should be. This can leave a large gap between the provided upper 
> bound and the actual start of the last partition.
> I propose this formula, as it is much more accurate and leads to better 
> distribution:
> val stride = (upperBound / numPartitions.toFloat - lowerBound / 
> numPartitions.toFloat).toLong
>  
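> A tiny, made-up illustration of why truncating both divisions matters (the 
> numbers here are hypothetical, not taken from the JDBC code). With Long 
> division each quotient is truncated toward zero before the subtraction, 
> while the proposed form truncates only once, at the end:
>
> val lower = -5L
> val upper = 5L
> val parts = 2
> upper / parts - lower / parts                           // 2 - (-2) = 4, though the exact stride is 5
> (upper / parts.toFloat - lower / parts.toFloat).toLong  // (2.5f - (-2.5f)).toLong = 5
>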
> An example (using a date column):
> Say you're creating 1,000 partitions. If you provide a lower bound of 
> 1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
> (translated to 18563), Spark determines the stride size as follows:
>  
> (18563L / 1000L) - (-15611 / 1000L) = 33
> Starting from the lower bound and taking strides of 33, you end up at 
> 2017-07-08. That leaves over 3 years of extra data in the last partition, 
> which, depending on the shape of the data, could cause a very long-running 
> task at the end of a job.
>  
> Using the formula I'm proposing, you'd get:
> ((18563L / 1000F) - (-15611 / 1000F)).toLong = 34
> This puts the start of the last partition at 2020-04-02, which is much closer 
> to the supplied upper bound. That is as close as you can get without adjusting 
> the number of partitions; a stride size of 35, for example, would overshoot 
> the supplied upper bound by over 2 years (2022-11-22).
>  
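> Replaying the example above with the translated epoch-day bounds makes the 
> skew concrete (a rough sketch of the arithmetic only, not the actual 
> JDBCRelation partitioning loop):
>
> val lowerBound = -15611L   // 1927-04-05, as translated in the example
> val upperBound = 18563L    // 2020-10-27, as translated in the example
> val numPartitions = 1000
>
> val strideCurrent =
>   upperBound / numPartitions - lowerBound / numPartitions                           // 33
> val strideProposed =
>   (upperBound / numPartitions.toFloat - lowerBound / numPartitions.toFloat).toLong  // 34
>
> // Start of the last partition: the lower bound plus 999 strides.
> val lastStartCurrent  = lowerBound + (numPartitions - 1) * strideCurrent    // 17356 epoch days
> val lastStartProposed = lowerBound + (numPartitions - 1) * strideProposed   // 18355 epoch days
>
> upperBound - lastStartCurrent    // 1207 days (over 3 years) left for the last partition
> upperBound - lastStartProposed   // 208 days left for the last partition
>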
> In the above example the two formulas differ in stride size by only 1, but 
> with greater distance between the lower and upper bounds, or a smaller number 
> of partitions, the difference can be much greater.



