Hi everyone, I am working towards bringing Spark's sort merge join on par with Hive's Sort-Merge-Bucket join by making use of sortedness. So far I have identified these main items to be addressed:
1. Make the query planner use `sorted`ness information for sort merge joins (SPARK-15453, SPARK-17271)
2. Make the hashing and bucketing functions configurable in Spark. A separate implementation is needed for Hive, as Spark's default is not compatible with Hive's.
3. Propagate bucketing information for Hive tables to / from the catalog
4. Ensure that writes to Hive bucketed tables produce a single file per bucket

I have started off with #1 and have a few questions as I move on to the next items:

(a) Spark's current way of creating bucketed tables (i.e. `bucketBy`) produces multiple files per bucket, which makes producing bucketed data faster [1]. On the read side, however, one cannot utilize the benefits of sortedness, because the files for a given bucket-id are only locally sorted (not globally). Was this by design? I read the design doc for "Bucketed Tables" [0] and there is no mention of this (in fact, the plan seemed to do the same as Hive; see the sections "Writing Bucketed Tables" and "Reading persisted bucketed tables").

(b) In one of the TODO / future work sections, the design doc [0] says: "Take advantage of sorted files. We will do this later as it has many subtle issues". I could not find much information on what these issues are. Can anyone highlight the known issues?

[0] : https://issues.apache.org/jira/secure/attachment/12778890/BucketedTables.pdf
[1] : https://issues.apache.org/jira/browse/SPARK-12394
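
To make the incompatibility in #2 concrete: as far as I understand, Hive assigns a row to a bucket via (hashCode & Integer.MAX_VALUE) % numBuckets, where the hash of a primitive int is the value itself, whereas Spark's `bucketBy` hashes with Murmur3, so the same key can land in different bucket ids under the two schemes. A minimal Python sketch of the Hive-side computation (the function name `hive_bucket_id` is mine, and this only models integer keys):

```python
def hive_bucket_id(key: int, num_buckets: int) -> int:
    """Hive-style bucket assignment for an integer key.

    Hive computes (hashCode & Integer.MAX_VALUE) % numBuckets;
    for a primitive int the hashCode is the value itself.
    Spark's bucketBy uses Murmur3 instead, so its bucket ids
    generally differ from these.
    """
    return (key & 0x7FFFFFFF) % num_buckets

# Example: key 10 with 4 buckets lands in bucket 2.
bucket = hive_bucket_id(10, 4)  # -> 2
```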