Yes. In my original question, when I said I wanted to pre-sort the master file, I should have said "pre-sort and pre-partition the file".
Years ago, I did this with Hadoop MapReduce. I pre-sorted/partitioned the master file into N partitions. Then, when a transaction file would arrive, I would sort/partition the transaction file on the join key into N partitions. Then I could perform what was called a mapside join. Basically, I want to do the same thing in Spark. And it looks like all the pieces to accomplish this exist, but I can't figure out how to connect all the dots. It seems like this functionality is pretty new so there aren't a lot of examples available. On Thu, Nov 10, 2016 at 7:33 PM, Jörn Franke <jornfra...@gmail.com> wrote: > Can you split the files beforehand in several files (e.g. By the column > you do the join on?) ? > > On 10 Nov 2016, at 23:45, Stuart White <stuart.whi...@gmail.com> wrote: > > I have a large "master" file (~700m records) that I frequently join > smaller "transaction" files to. (The transaction files have 10's of > millions of records, so too large for a broadcast join). > > I would like to pre-sort the master file, write it to disk, and then, in > subsequent jobs, read the file off disk and join to it without having to > re-sort it. I'm using Spark SQL, and my understanding is that the Spark > Catalyst Optimizer will choose an optimal join algorithm if it is aware > that the datasets are sorted. So, the trick is to make the optimizer aware > that the master file is already sorted. > > I think SPARK-12394 <https://issues.apache.org/jira/browse/SPARK-12394> > provides this functionality, but I can't seem to put the pieces together > for how to use it. > > Could someone possibly provide a simple example of how to: > > 1. Sort a master file by a key column and write it to disk in such a > way that its "sorted-ness" is preserved. > 2. In a later job, read a transaction file, sort/partition it as > necessary. Read the master file, preserving its sorted-ness. Join the two > DataFrames in such a way that the master rows are not sorted again. > > Thanks! > >