Hi everyone,

I am working towards bringing Spark's Sort Merge join on par with Hive's
Sort-Merge-Bucket join so that it can take advantage of sorted data. So far
I have identified these main items to be addressed:

1. Make the query planner use sorted-ness information for sort merge join
(SPARK-15453, SPARK-17271)
2. Configurable hashing and bucketing functions in Spark. There needs to be
a separate implementation for Hive, as the default one in Spark is not
compatible with Hive's (see the sketch after this list).
3. Propagate bucketing information for Hive tables to/from the Catalog
4. Ensure that writes to Hive bucketed tables produce a single file per bucket
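
To make the incompatibility in #2 concrete, here is a minimal sketch of the
two bucket-id schemes as I understand them. The details are assumptions on
my part (Spark's actual implementation hashes its internal row format with
its own Murmur3 code; I am using Scala's stdlib MurmurHash3 as a stand-in),
so treat this as illustrative only:

import java.nio.ByteBuffer
import scala.util.hashing.MurmurHash3

object BucketIdSketch {
  // Hive-style bucket id (assumption): Java hashCode of the key,
  // masked to non-negative, modulo the bucket count. For an int
  // column the hashCode is the value itself.
  def hiveBucketId(key: Int, numBuckets: Int): Int =
    (key.hashCode() & Int.MaxValue) % numBuckets

  // Spark-style bucket id (assumption): Murmur3 hash of the key's
  // bytes with a fixed seed, then a positive modulus. Spark hashes
  // its UnsafeRow representation; stdlib MurmurHash3 is a stand-in.
  def sparkBucketId(key: Int, numBuckets: Int): Int = {
    val bytes = ByteBuffer.allocate(4).putInt(key).array()
    val hash = MurmurHash3.bytesHash(bytes, 42)
    ((hash % numBuckets) + numBuckets) % numBuckets  // positive mod
  }

  def main(args: Array[String]): Unit = {
    // The two schemes generally place the same key in different
    // buckets, which is why a Hive-compatible implementation has
    // to be pluggable rather than reusing Spark's default.
    val k = 12345
    println(s"hive  bucket: ${hiveBucketId(k, 8)}")
    println(s"spark bucket: ${sparkBucketId(k, 8)}")
  }
}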

I have started off with #1 and have a few questions as I move on to the
next items:

(a) The current Spark way of creating bucketed tables (ie. `bucketBy`)
produces multiple files per bucket, which makes producing bucketed data
faster [1]. But on the read side, one cannot exploit the sorted-ness
because the files for a given bucket-id are only locally sorted (not
globally). Was this by design? I read the design doc for "Bucketed Tables"
[0] and there is no mention of this (in fact, the plan seemed to be to do
the same as Hive; see sections "Writing Bucketed Tables" and "Reading
persisted bucketed tables").

(b) In one of the TODO / future work sections, the design doc [0] says:
"Take advantage of sorted files. We will do this later as it has many
subtle issues". I could not find much information on what these issues
are. Can anyone shed light on the known issues?
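
One subtlety I can guess at (purely my speculation, not from the doc):
since each bucket is spread over several locally-sorted files, a reader
that wants sorted output per bucket cannot simply concatenate the files;
it has to merge-sort them, roughly like this:

import scala.collection.mutable

object MergeSortedFiles {
  // K-way merge of per-file sorted iterators into one sorted stream,
  // the kind of reader a bucket made of locally-sorted files needs.
  def merge[T](iters: Seq[Iterator[T]])(implicit ord: Ordering[T]): Iterator[T] =
    new Iterator[T] {
      // Min-heap keyed on each iterator's current head element.
      private val heap = mutable.PriorityQueue.empty[(T, Iterator[T])](
        Ordering.by[(T, Iterator[T]), T](_._1).reverse)
      iters.foreach(it => if (it.hasNext) heap.enqueue((it.next(), it)))

      def hasNext: Boolean = heap.nonEmpty
      def next(): T = {
        val (head, it) = heap.dequeue()
        if (it.hasNext) heap.enqueue((it.next(), it))
        head
      }
    }

  def main(args: Array[String]): Unit = {
    // Three "files" of the same bucket, each sorted on its own.
    val files = Seq(Iterator(1, 4, 9), Iterator(2, 3, 10), Iterator(5, 6, 7))
    println(merge(files).toList)  // List(1, 2, 3, 4, 5, 6, 7, 9, 10)
  }
}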

[0]: https://issues.apache.org/jira/secure/attachment/12778890/BucketedTables.pdf
[1]: https://issues.apache.org/jira/browse/SPARK-12394
