I have a large "master" file (~700M records) that I frequently join smaller
"transaction" files to.  (The transaction files have tens of millions of
records, so they are too large for a broadcast join.)

I would like to pre-sort the master file, write it to disk, and then, in
subsequent jobs, read the file back off disk and join to it without having
to re-sort it.  I'm using Spark SQL, and my understanding is that the
Catalyst optimizer will pick a sort-merge join and skip the sort step if it
knows that the datasets are already sorted.  So, the trick is to make the
optimizer aware that the master file is already sorted.
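
For context, this is what I see today with plain parquet reads (the paths
and the "key" join column below are just placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("join-demo").getOrCreate()

    // The master data on disk is sorted, but Spark has no way to know
    // that, so the plan shuffles and sorts both sides on every join.
    val master = spark.read.parquet("/data/master_sorted")   // hypothetical path
    val txns   = spark.read.parquet("/data/txns")            // hypothetical path
    txns.join(master, Seq("key")).explain()
    // explain() shows a SortMergeJoin with an Exchange and a Sort under
    // both inputs, including the already-sorted master side.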

I think SPARK-12394 <https://issues.apache.org/jira/browse/SPARK-12394>
provides this functionality, but I can't seem to put the pieces together
to actually use it.
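
From what I can tell, the piece SPARK-12394 added is the bucketBy/sortBy
writer API, which only takes effect with saveAsTable.  Here is my rough
sketch of the write side (the bucket count, column name, and table name
are placeholders I made up):

    // One-time job: bucket and sort the master data by the join key, and
    // persist it as a table so the layout is recorded in the catalog.
    spark.read.parquet("/data/master")     // hypothetical input path
      .write
      .bucketBy(512, "key")                // bucket count is a tuning guess
      .sortBy("key")
      .format("parquet")
      .saveAsTable("master_bucketed")      // hypothetical table name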

Could someone possibly provide a simple example of how to:

   1. Sort a master file by a key column and write it to disk in such a way
   that its "sorted-ness" is preserved.
   2. In a later job, read a transaction file and sort/partition it as
   necessary.  Read the master file, preserving its sorted-ness, and join
   the two DataFrames in such a way that the master rows are not sorted
   again.  (My guess at this second step is sketched below.)
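
Here's what I'm imagining for that second job, again with placeholder
names.  I'd expect explain() to show no Exchange on the master side of the
SortMergeJoin; whether the residual Sort also disappears seems to depend on
the Spark version and on whether each bucket is a single file, so please
correct me if I have that wrong:

    // Later job: spark.table picks up the bucketing/sort metadata from
    // the catalog, so the master side should not need a shuffle.
    val master = spark.table("master_bucketed")
    val txns   = spark.read.parquet("/data/txns")    // hypothetical path

    // The transaction side is still shuffled (and sorted) to line up
    // with the master's buckets; only the master side avoids the work.
    val joined = txns.join(master, Seq("key"))
    joined.explain()   // check the plan under the master side of the join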

Thanks!
