[ 
https://issues.apache.org/jira/browse/SPARK-34843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307493#comment-17307493
 ] 

Jason Yarbrough commented on SPARK-34843:
-----------------------------------------

That's a good example! Thank you for sharing that. I have a new formula that 
will handle that while also retaining the accuracy of the other proposed 
formula.

> JDBCRelation columnPartition function improperly determines stride size
> -----------------------------------------------------------------------
>
>                 Key: SPARK-34843
>                 URL: https://issues.apache.org/jira/browse/SPARK-34843
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Jason Yarbrough
>            Priority: Minor
>         Attachments: SPARK-34843.patch
>
>
> Currently, in JDBCRelation (line 123), the stride size is calculated as 
> follows:
> val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
>  
> Due to truncation happening on both divisions, the stride size can fall short 
> of what it should be. This can lead to a big difference between the provided 
> upper bound and the actual start of the last partition.
> I propose this formula, as it is much more accurate and leads to better 
> distribution:
> val stride: Long = (upperBound - lowerBound) / (numPartitions - 2)
>  
> An example (using a date column):
> Say you're creating 1,000 partitions. If you provide a lower bound of 
> 1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
> (translated to 18563), Spark determines the stride size as follows:
>  
> (18563L / 1000L) - (-15611 / 1000L) = 33
> Starting from the lower bound, doing 998 strides of 33, you end up at 
> 2017-06-05 (17323). This is over 3 years of extra data that will go into the 
> last partition, and depending on the shape of the data could cause a very 
> long running task at the end of a job.
>  
> Using the formula I'm proposing, you'd get:
> (18563L - -15611) / (1000L - 2) = 34
> This would put the upper bound at 2020-02-28 (18321), which is much closer to 
> the original supplied upper bound. This is the best you can do to get as 
> close as possible to the upper bound (without adjusting the number of 
> partitions). For example, a stride size of 35 would go well past the supplied 
> upper bound (over 2 years, 2022-11-22).
>  
> In the above example, there is only a difference of 1 between the stride size 
> using the current formula and the stride size using the proposed formula, but 
> with greater distance between the lower and upper bounds, or a lower number 
> of partitions, the difference can be much greater. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to