Re: Joining to a large, pre-sorted file

2016-11-15 Thread Rohit Verma
From: Stuart White <stuart.whi...@gmail.com> Sent: Saturday, November 12, 2016 11:20:28 AM To: Silvio Fiorito Cc: user@spark.apache.org Subject: Re: Joining to a large, pre-sorted file Hi Silvio, Thanks v

Re: Joining to a large, pre-sorted file

2016-11-15 Thread Stuart White
> Thanks, > Silvio > -- > From: Stuart White <stuart.whi...@gmail.com> > Sent: Saturday, November 12, 2016 11:20:28 AM > To: Silvio Fiorito > Cc: user@spark.apache.org > Subject: Re: Joining to a large, pre-sorted file > > Hi Silvio, > > T

Re: Joining to a large, pre-sorted file

2016-11-13 Thread Silvio Fiorito
Hi Silvio, Thanks very much for the response! I'm pretty new at reading explain plans, so maybe I'm misunderstanding what I'm seeing. Remember my goal is to sort master, write it out, later read it back in and
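A minimal sketch of the kind of plan check being discussed, assuming a SparkSession named spark, a bucketed table named "master", and a join key column named "key" (all of these names are assumptions, not taken from the thread):

    import spark.implicits._

    // Read the pre-written master back in and join a small transactions set to it.
    val master = spark.table("master")
    val txns = Range(0, 100)
      .map(i => (i, s"txn_$i"))
      .toDF("key", "txnValue")

    txns.join(master, "key").explain()
    // In the printed physical plan, an Exchange operator on the master side of
    // the SortMergeJoin would mean master is still being shuffled; ideally only
    // the transactions side of the join shows an Exchange.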

Re: Joining to a large, pre-sorted file

2016-11-12 Thread Silvio Fiorito
Hi Stuart, You don’t need the sortBy or sortWithinPartitions. https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/594325853373464/879901972425732/6861830365114179/latest.html This is what the job should look like:
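The linked notebook isn't reproduced in the archive; as a rough sketch of what a bucketBy-only version of the job might look like (the DataFrame names, format, and bucket count are assumptions):

    // Write the master data bucketed by the join key; no sortBy or
    // sortWithinPartitions. bucketBy requires saveAsTable rather than a
    // plain path-based save.
    masterDF.write
      .format("parquet")
      .bucketBy(4, "key")
      .mode("overwrite")
      .saveAsTable("master")

    // Later jobs join against the bucketed table; Spark reuses the bucketing,
    // so only the transactions side of the join needs to be shuffled.
    val joined = transactionsDF.join(spark.table("master"), "key")

Here masterDF and transactionsDF stand in for whatever DataFrames actually hold the data.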

Re: Joining to a large, pre-sorted file

2016-11-12 Thread Stuart White
Thanks for the reply. I understand that I need to use bucketBy() to write my master file, but I still can't seem to make it work as expected. Here's a code example for how I'm writing my master file: Range(0, 100) .map(i => (i, s"master_$i")) .toDF("key", "value") .write
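The snippet above is cut off at .write; a guessed continuation, in the spirit of what the mail seems to be doing (the bucket count, format, and table name are illustrative, not from the original message), might be:

    import spark.implicits._

    // Build a tiny stand-in "master" dataset and write it bucketed by the
    // join key. Per the follow-up reply in this thread, the sortBy call is
    // not actually required for the shuffle-free join.
    Range(0, 100)
      .map(i => (i, s"master_$i"))
      .toDF("key", "value")
      .write
      .format("parquet")
      .bucketBy(4, "key")
      .sortBy("key")
      .saveAsTable("master")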

Re: Joining to a large, pre-sorted file

2016-11-10 Thread Silvio Fiorito
Yes. In my original question, when I said I wanted to pre-sort the master file, I should have said "pre-sort and pre-partition the file". Years ago

Re: Joining to a large, pre-sorted file

2016-11-10 Thread Stuart White
Yes. In my original question, when I said I wanted to pre-sort the master file, I should have said "pre-sort and pre-partition the file". Years ago, I did this with Hadoop MapReduce. I pre-sorted/partitioned the master file into N partitions. Then, when a transaction file would arrive, I would

Re: Joining to a large, pre-sorted file

2016-11-10 Thread Jörn Franke
Can you split the files beforehand into several files (e.g. by the column you do the join on)? > On 10 Nov 2016, at 23:45, Stuart White wrote: > > I have a large "master" file (~700m records) that I frequently join smaller > "transaction" files to. (The transaction
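One way to read that suggestion, purely as an illustration (the derived split column, the modulus, and the output path are all assumptions; the join key itself likely has far too many distinct values to partition on directly):

    import org.apache.spark.sql.functions.col

    // Derive a coarse split column from the join key and write one directory
    // per split value, so later jobs can read back already-grouped pieces.
    val numSplits = 256
    masterDF
      .withColumn("split", col("key") % numSplits)
      .write
      .partitionBy("split")
      .parquet("/data/master_split")   // hypothetical path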

Joining to a large, pre-sorted file

2016-11-10 Thread Stuart White
I have a large "master" file (~700m records) that I frequently join smaller "transaction" files to. (The transaction files have tens of millions of records, so they are too large for a broadcast join.) I would like to pre-sort the master file, write it to disk, and then, in subsequent jobs, read the file
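For context, a bare-bones version of the recurring join (the paths and the join column name "key" are assumptions). Without any pre-partitioning or bucketing, each run re-shuffles and re-sorts the ~700m-row master file for the sort-merge join, which is the cost the question is trying to avoid:

    // Assumes a SparkSession named spark; all paths are hypothetical.
    val master = spark.read.parquet("/data/master")
    val txns   = spark.read.parquet("/data/transactions")

    val joined = txns.join(master, Seq("key"))
    joined.write.parquet("/data/joined")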