Shay Elbaz created SPARK-30399:
----------------------------------

             Summary: Bucketing is not compatible with partitioning in 
practice
                 Key: SPARK-30399
                 URL: https://issues.apache.org/jira/browse/SPARK-30399
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
         Environment: HDP 2.7
            Reporter: Shay Elbaz


When reading a bucketed table, Spark uses as many partitions as the number of 
buckets for the map-side join (_FileSourceScanExec.createBucketedReadRDD_). 
This works great for "static" tables, but is quite disastrous for 
_time-partitioned_ tables. In our use case, a daily partitioned key-value 
table grows by 100GB every day, so after 100 days there are 10TB of data we 
want to join with. For this scenario we would need thousands of buckets if we 
want every task to successfully *read and sort* all of its data in a map-side 
join. But in that case, every daily increment would emit thousands of small 
files, leading to other big issues. 
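For reference, the table layout described above can be sketched in Spark SQL DDL (the table name, column names, and format here are assumptions for illustration, not our real schema):

```sql
-- Hypothetical daily-partitioned, bucketed key-value table.
-- Each dt partition writes one file per bucket, i.e. 1000 files per day.
CREATE TABLE kv_events (key BIGINT, value STRING, dt DATE)
USING parquet
PARTITIONED BY (dt)
CLUSTERED BY (key) SORTED BY (key) INTO 1000 BUCKETS;
```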

In practice, and hoping for some hidden optimization, we set the number of 
buckets to 1000 and backfilled such a table with 10TB. When trying to join 
with the smallest input, every executor was killed by YARN for over-allocating 
memory in the sorting phase. Even without such failures, it would take every 
executor an unreasonable amount of time to locally sort all its data.
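To make the trade-off concrete, here is the back-of-the-envelope arithmetic behind the figures above (all numbers are the ones from this report; nothing else is assumed):

```python
# Figures from this report: 100 GB added per daily partition, 100 days kept.
daily_gb = 100
days = 100
total_gb = daily_gb * days           # 10,000 GB = ~10 TB joined in total

buckets = 1000                       # the bucket count we backfilled with
per_task_gb = total_gb / buckets     # data each bucketed-read task must sort
files_per_day = buckets              # each daily partition emits one file per bucket

print(total_gb, per_task_gb, files_per_day)  # 10000 10.0 1000
```

So with 1000 buckets every task still has to sort ~10GB locally, while every daily increment already produces 1000 small files; raising the bucket count to shrink the sort only makes the small-files problem worse.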

A question on Stack Overflow has remained unanswered for a while, so I thought 
I'd ask here - is it by design that buckets cannot be used with 
time-partitioned tables, or am I doing something wrong?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
