I have a large "master" file (~700M records) to which I frequently join smaller "transaction" files. (The transaction files have tens of millions of records, so they are too large for a broadcast join.)
I would like to pre-sort the master file, write it to disk, and then, in subsequent jobs, read it back from disk and join to it without having to re-sort it. I'm using Spark SQL, and my understanding is that the Catalyst optimizer will choose an optimal join algorithm if it is aware that the datasets are sorted. So the trick is to make the optimizer aware that the master file is already sorted. I think SPARK-12394 <https://issues.apache.org/jira/browse/SPARK-12394> provides this functionality, but I can't seem to put the pieces together for how to use it.

Could someone possibly provide a simple example of how to:

1. Sort a master file by a key column and write it to disk in such a way that its "sorted-ness" is preserved.
2. In a later job, read a transaction file and sort/partition it as necessary; read the master file, preserving its sorted-ness; and join the two DataFrames in such a way that the master rows are not sorted again.

For reference, I've pasted below what I've pieced together so far, in case that helps pinpoint what I'm missing.
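For step 1, my reading of the JIRA is that bucketBy/sortBy/saveAsTable is the intended mechanism, since the bucketing and sort metadata have to live in the catalog rather than in the files themselves. Here is my sketch (the path, table name, key column, and bucket count are all placeholders I made up):

    import org.apache.spark.sql.SparkSession

    // Hive support so that saveAsTable records the bucketing/sort metadata
    // in a persistent catalog that later jobs can read.
    val spark = SparkSession.builder()
      .appName("pre-sort master")
      .enableHiveSupport()
      .getOrCreate()

    // "/data/master" and "key" stand in for my actual path and join column.
    val master = spark.read.parquet("/data/master")

    // Bucket and sort by the join key. As far as I can tell, bucketBy
    // requires saveAsTable rather than save(path), because the bucket/sort
    // info is stored in the catalog, not the data files.
    master.write
      .bucketBy(200, "key")   // bucket count is just a guess for ~700M rows
      .sortBy("key")
      .format("parquet")
      .saveAsTable("master_bucketed")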
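For step 2, I'm guessing the later job should read the table back through the catalog (so the optimizer sees the metadata) and then just join, something like:

    // Read the bucketed table via the catalog, not the raw files, so that
    // the optimizer knows about its bucketing and sort order.
    val master = spark.table("master_bucketed")

    // "/data/txn" stands in for one of my transaction files.
    val txn = spark.read.parquet("/data/txn")

    val joined = txn.join(master, Seq("key"))

    // Inspect the physical plan: ideally the master side of the
    // SortMergeJoin shows no Exchange (and no extra Sort), while the
    // transaction side gets shuffled to match the master's bucketing.
    joined.explain(true)

Is that roughly the intended usage? In particular, I'm not sure whether the master side actually avoids the re-sort here, or whether the transaction file needs to be bucketed the same way as well. Thanks!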