Yes, we probably need more change for the data source API if we need to implement it in a generic way. BTW, I create the JIRA by copy most of words from Alex. ☺
https://issues.apache.org/jira/browse/SPARK-11512 From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, November 5, 2015 1:36 AM To: Alex Nastetsky Cc: dev@spark.apache.org Subject: Re: Sort Merge Join from the filesystem It's not supported yet, and not sure if there is a ticket for it. I don't think there is anything fundamentally hard here either. On Wed, Nov 4, 2015 at 6:37 AM, Alex Nastetsky <alex.nastet...@vervemobile.com<mailto:alex.nastet...@vervemobile.com>> wrote: (this is kind of a cross-post from the user list) Does Spark support doing a sort merge join on two datasets on the file system that have already been partitioned the same with the same number of partitions and sorted within each partition, without needing to repartition/sort them again? This functionality exists in - Hive (hive.optimize.bucketmapjoin.sortedmerge) - Pig (USING 'merge') - MapReduce (CompositeInputFormat) If this is not supported in Spark, is a ticket already open for it? Does the Spark architecture present unique difficulties to having this feature? It is very useful to have this ability, as you can prepare dataset A to be joined with dataset B before B even exists, by pre-processing A with a partition/sort. Thanks.