Re: Sort Merge Join from the filesystem

Reynold Xin Wed, 04 Nov 2015 09:37:19 -0800

It's not supported yet, and not sure if there is a ticket for it. I don't
think there is anything fundamentally hard here either.



On Wed, Nov 4, 2015 at 6:37 AM, Alex Nastetsky <
alex.nastet...@vervemobile.com> wrote:

> (this is kind of a cross-post from the user list)
>
> Does Spark support doing a sort merge join on two datasets on the file
> system that have already been partitioned the same with the same number of
> partitions and sorted within each partition, without needing to
> repartition/sort them again?
>
> This functionality exists in
> - Hive (hive.optimize.bucketmapjoin.sortedmerge)
> - Pig (USING 'merge')
> - MapReduce (CompositeInputFormat)
>
> If this is not supported in Spark, is a ticket already open for it? Does
> the Spark architecture present unique difficulties to having this feature?
>
> It is very useful to have this ability, as you can prepare dataset A to be
> joined with dataset B before B even exists, by pre-processing A with a
> partition/sort.
>
> Thanks.
>

Re: Sort Merge Join from the filesystem

Reply via email to