[ https://issues.apache.org/jira/browse/SPARK-30399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133878#comment-17133878 ]
Hyukjin Kwon commented on SPARK-30399:
--------------------------------------

Okay, but Spark 2.3.0 is EOL. Would you mind checking whether the issue still exists in the latest Spark version? It would also be nicer if we had a reproducer.

> Bucketing is not compatible with partitioning in practice
> ----------------------------------------------------------
>
>                 Key: SPARK-30399
>                 URL: https://issues.apache.org/jira/browse/SPARK-30399
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>         Environment: HDP 2.7
>            Reporter: Shay Elbaz
>            Priority: Minor
>
> When using a bucketed table, Spark uses as many partitions as the number of buckets for the map-side join (_FileSourceScanExec.createBucketedReadRDD_). This works great for "static" tables, but is quite disastrous for _time-partitioned_ tables. In our use case, a daily-partitioned key-value table receives 100 GB of new data every day, so after 100 days there are 10 TB of data we want to join with. For this scenario we would need thousands of buckets if we want every task to successfully *read and sort* all of its data in a map-side join. But in that case, every daily increment would emit thousands of small files, leading to other big issues.
> In practice, and hoping for some hidden optimization, we set the number of buckets to 1000 and backfilled such a table with 10 TB. When trying to join with the smallest input, every executor was killed by YARN for over-allocating memory in the sorting phase. Even without such failures, it would take every executor an unreasonable amount of time to locally sort all of its data.
> A question on SO has remained unanswered for a while, so I thought I'd ask here: is it by design that buckets cannot be used with time-partitioned tables, or am I doing something wrong?
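A minimal sketch of a reproducer along the lines of the description might look like the following. The table names, data sizes, date values, and the 1000-bucket count are illustrative assumptions rather than the reporter's actual job; the point is only that the bucketed scan plans one task per bucket, so each of the 1000 tasks has to read and sort its bucket's files across every selected daily partition.

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical setup, not the reporter's actual job.
val spark = SparkSession.builder()
  .appName("bucketing-vs-partitioning-sketch")
  .enableHiveSupport()   // assumes a Hive-backed catalog for the persisted bucketed table
  .getOrCreate()

// Daily-partitioned, bucketed key-value table. In production this append
// would run once per day, so each day adds up to 1000 new bucket files.
spark.range(0L, 1000000L)
  .selectExpr("id AS key", "uuid() AS value", "'2020-01-01' AS dt")
  .write
  .mode("append")
  .partitionBy("dt")
  .bucketBy(1000, "key")
  .sortBy("key")
  .saveAsTable("big_kv")          // hypothetical table name

// Small input to join against.
spark.range(0L, 10000L)
  .selectExpr("id AS key")
  .createOrReplaceTempView("small_keys")

// The bucketed read (FileSourceScanExec.createBucketedReadRDD) uses exactly
// 1000 tasks here regardless of how many daily partitions the filter selects,
// so each task locally sorts its bucket's data across all selected days.
val joined = spark.sql(
  """SELECT b.key, b.value
    |FROM big_kv b
    |JOIN small_keys s ON b.key = s.key
    |WHERE b.dt >= '2020-01-01'""".stripMargin)

joined.explain()
{code}

With many daily partitions loaded, the plan above should show the join's scan side constrained to the bucket count rather than scaling with the amount of data read, which is the behavior the description reports.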