[ https://issues.apache.org/jira/browse/SPARK-30399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133878#comment-17133878 ]
Hyukjin Kwon commented on SPARK-30399:
--------------------------------------

Okay, but Spark 2.3.0 is EOL. Would you mind checking whether the issue still exists in the latest Spark version? It would also be nicer if we had a reproducer.

> Bucketing is not compatible with partitioning in practice
> ----------------------------------------------------------
>
>                 Key: SPARK-30399
>                 URL: https://issues.apache.org/jira/browse/SPARK-30399
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>         Environment: HDP 2.7
>            Reporter: Shay Elbaz
>            Priority: Minor
>
> When using a bucketed table, Spark uses as many partitions as the number of buckets for the map-side join (_FileSourceScanExec.createBucketedReadRDD_). This works great for "static" tables, but is quite disastrous for _time-partitioned_ tables. In our use case, a daily-partitioned key-value table receives 100 GB of new data every day, so after 100 days there are 10 TB of data we want to join with. For this scenario we would need thousands of buckets if we want every task to successfully *read and sort* all of its data in a map-side join. But in that case, every daily increment would emit thousands of small files, leading to other big issues.
> In practice, and hoping for some hidden optimization, we set the number of buckets to 1000 and backfilled such a table with 10 TB. When trying to join with the smallest input, every executor was killed by YARN for over-allocating memory in the sorting phase. Even without such failures, it would take every executor an unreasonable amount of time to locally sort all of its data.
> A question on SO has remained unanswered for a while, so I thought I'd ask here: is it by design that buckets cannot be used with time-partitioned tables, or am I doing something wrong?
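A minimal sketch of a reproducer along the lines of the description might look like the following. The table names, data sizes, date values, and the 1000-bucket count are illustrative assumptions rather than the reporter's actual job; the point is only that the bucketed scan plans one task per bucket, so each of the 1000 tasks has to read and sort its bucket's files across every selected daily partition.

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical setup, not the reporter's actual job.
val spark = SparkSession.builder()
  .appName("bucketing-vs-partitioning-sketch")
  .enableHiveSupport()   // assumes a Hive-backed catalog for the persisted bucketed table
  .getOrCreate()

// Daily-partitioned, bucketed key-value table. In production this append
// would run once per day, so each day adds up to 1000 new bucket files.
spark.range(0L, 1000000L)
  .selectExpr("id AS key", "uuid() AS value", "'2020-01-01' AS dt")
  .write
  .mode("append")
  .partitionBy("dt")
  .bucketBy(1000, "key")
  .sortBy("key")
  .saveAsTable("big_kv")          // hypothetical table name

// Small input to join against.
spark.range(0L, 10000L)
  .selectExpr("id AS key")
  .createOrReplaceTempView("small_keys")

// The bucketed read (FileSourceScanExec.createBucketedReadRDD) uses exactly
// 1000 tasks here regardless of how many daily partitions the filter selects,
// so each task locally sorts its bucket's data across all selected days.
val joined = spark.sql(
  """SELECT b.key, b.value
    |FROM big_kv b
    |JOIN small_keys s ON b.key = s.key
    |WHERE b.dt >= '2020-01-01'""".stripMargin)

joined.explain()
{code}

With many daily partitions loaded, the plan above should show the join's scan side constrained to the bucket count rather than scaling with the amount of data read, which is the behavior the description reports.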