From: Stuart White <stuart.whi...@gmail.com>
Sent: Saturday, November 12, 2016 11:20:28 AM
To: Silvio Fiorito
Cc: user@spark.apache.org
Subject: Re: Joining to a large, pre-sorted file
Hi Silvio,

Thanks very much for the response!

I'm pretty new at reading explain plans, so maybe I'm misunderstanding
what I'm seeing.

Remember my goal is to sort master, write it out, later read it back in and
Hi Stuart,

You don’t need the sortBy or sortWithinPartitions.

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/594325853373464/879901972425732/6861830365114179/latest.html

This is what the job should look like:
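The plan Silvio pasted is not preserved in this archive. As a hedged sketch only (table and column names are assumptions, not his actual code), reading a bucketed master table back and joining against it should look roughly like this:

```scala
// Assumes an active SparkSession named `spark`, and that a "master"
// table was previously written with
// .bucketBy(16, "key").sortBy("key").saveAsTable("master").
import spark.implicits._

val master = spark.table("master")
val transactions = Seq((1, "txn_a"), (2, "txn_b")).toDF("key", "txn")

// If the bucketing lines up with the join key, the physical plan should
// show no Exchange (shuffle) on the master side of the sort-merge join.
master.join(transactions, "key").explain()
```

Checking `.explain()` output for an `Exchange` operator on the master side is the quickest way to confirm the shuffle was actually avoided.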
Thanks for the reply.
I understand that I need to use bucketBy() to write my master file,
but I still can't seem to make it work as expected. Here's a code
example for how I'm writing my master file:
Range(0, 100)
  .map(i => (i, s"master_$i"))
  .toDF("key", "value")
  .write
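The snippet above is cut off after `.write` in the archive. Assuming the intent stated earlier (a bucketed write of the master file), a self-contained completion might look like the following; the bucket count and table name are illustrative assumptions:

```scala
import spark.implicits._  // assumes an active SparkSession named `spark`

Range(0, 100)
  .map(i => (i, s"master_$i"))
  .toDF("key", "value")
  .write
  .bucketBy(16, "key")    // bucket count is an illustrative choice
  .sortBy("key")
  .saveAsTable("master")  // bucketBy is only supported with saveAsTable, not save()/parquet()
```

Note that `bucketBy` requires writing through the table catalog via `saveAsTable`; calling `save()` or a format shortcut like `parquet(path)` after `bucketBy` fails.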
To: <jornfra...@gmail.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Joining to a large, pre-sorted file
Yes. In my original question, when I said I wanted to pre-sort the master
file, I should have said "pre-sort and pre-partition the file".

Years ago, I did this with Hadoop MapReduce. I pre-sorted/partitioned the
master file into N partitions. Then, when a transaction file would arrive,
I would
Can you split the files beforehand into several files (e.g., by the column
you do the join on)?
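Jörn's suggestion can be expressed in Spark as a partitioned write on the join column. A minimal sketch, with an assumed output path and the same illustrative key/value schema as above:

```scala
// Assumes an active SparkSession named `spark`. partitionBy writes one
// directory per distinct key value, so it suits low-cardinality join
// columns; for a high-cardinality key, bucketBy is usually the better fit.
import spark.implicits._

Range(0, 100)
  .map(i => (i, s"master_$i"))
  .toDF("key", "value")
  .write
  .partitionBy("key")
  .parquet("/tmp/master_by_key")  // illustrative output path

// Later jobs can prune to just the partitions they need:
val slice = spark.read.parquet("/tmp/master_by_key").where($"key" === 42)
```

Unlike `bucketBy`, `partitionBy` works with plain file output (no table catalog needed), and the filter above is served by partition pruning rather than a full scan.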
> On 10 Nov 2016, at 23:45, Stuart White wrote:
>
> I have a large "master" file (~700m records) that I frequently join smaller
> "transaction" files to. (The transaction files have 10's of millions of
> records, so too large for a broadcast join.)
>
> I would like to pre-sort the master file, write it to disk, and then, in
> subsequent jobs, read the file