It's not supported yet, and I'm not sure whether there is a ticket for it. I don't think there is anything fundamentally hard here either.
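
For reference, here is a rough sketch of what such a join does per partition, written in plain Python rather than against any Spark API. The function names (merge_join_partition, merge_join) are made up for illustration. It assumes both inputs were partitioned by the same function on the key into the same number of partitions, and that each partition is already sorted by key, so partition i on one side only needs to be merged with partition i on the other, with no shuffle or re-sort.

```python
from typing import Any, List, Tuple

Row = Tuple[Any, Any]  # (key, value)


def merge_join_partition(left: List[Row], right: List[Row]) -> List[Tuple[Any, Any, Any]]:
    """Inner-join two lists of (key, value) rows that are already sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1  # left key has no match yet, advance left
        elif lk > rk:
            j += 1  # right key has no match yet, advance right
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit the cross product of the two runs.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out


def merge_join(left_parts: List[List[Row]], right_parts: List[List[Row]]) -> List[List[Tuple[Any, Any, Any]]]:
    """Join co-partitioned inputs partition by partition (hypothetical helper)."""
    if len(left_parts) != len(right_parts):
        raise ValueError("both sides must have the same number of partitions")
    return [merge_join_partition(l, r) for l, r in zip(left_parts, right_parts)]
```

Each partition pair is read once, in order, which is why pre-partitioned and pre-sorted inputs could be joined without repartitioning them.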

On Wed, Nov 4, 2015 at 6:37 AM, Alex Nastetsky <alex.nastet...@vervemobile.com> wrote:

> (this is kind of a cross-post from the user list)
>
> Does Spark support doing a sort merge join on two datasets on the file
> system that have already been partitioned the same with the same number of
> partitions and sorted within each partition, without needing to
> repartition/sort them again?
>
> This functionality exists in
> - Hive (hive.optimize.bucketmapjoin.sortedmerge)
> - Pig (USING 'merge')
> - MapReduce (CompositeInputFormat)
>
> If this is not supported in Spark, is a ticket already open for it? Does
> the Spark architecture present unique difficulties to having this feature?
>
> It is very useful to have this ability, as you can prepare dataset A to be
> joined with dataset B before B even exists, by pre-processing A with a
> partition/sort.
>
> Thanks.