Ah, that's right, I didn't mention it: I have 10 executors in my cluster,
and so when I do .coalesce(10) and save to ORC on S3 right after that, does
coalescing really affect parallelism? To me it looks like no, because we
went from 100 tasks executed in parallel by 10 executors to 10 tasks.
Spark performs operations on each partition in parallel. If you decrease the
number of partitions, you're potentially doing less work in parallel,
depending on your cluster setup.
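To make that concrete, here is a toy plain-Python sketch (not Spark API code) of why coalesce narrows the whole preceding stage: with no shuffle boundary, Spark fuses the upstream map work into the 10 post-coalesce tasks, so at most 10 cores are busy no matter how many the cluster has. The core count of 40 (10 executors x 4 cores) is an assumed figure for illustration only.

```python
import math

def stage_time(num_tasks, total_cores, work_per_task):
    """Wall-clock time of a stage: tasks run in waves of `total_cores`."""
    waves = math.ceil(num_tasks / total_cores)
    return waves * work_per_task

TOTAL_WORK = 100   # units of map work, one per original partition
CORES = 40         # assumption: 10 executors x 4 cores each

# Without coalesce: 100 tasks of 1 unit each -> 3 waves of 40 cores.
before = stage_time(100, CORES, TOTAL_WORK / 100)

# With coalesce(10) and no shuffle: the same map work is fused into
# 10 tasks of 10 units each; only 10 of the 40 cores ever do work.
after = stage_time(10, CORES, TOTAL_WORK / 10)

print(before, after)  # 3.0 10.0
```

Same total work, but the coalesced stage takes longer because the fused tasks cap the usable parallelism at 10.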
> On May 23, 2017, at 4:23 PM, Andrii Biletskyi wrote:
No, I didn't try to use repartition. How exactly does it impact the
parallelism? In my understanding, coalesce simply "unions" multiple
partitions located on the same executor, one on top of the other, while
repartition does a hash-based shuffle, decreasing the number of output
partitions. So how exactly does this affect the parallelism?
coalesce is nice because it does not shuffle, but the consequence of
avoiding a shuffle is that it will also reduce the parallelism of the
preceding computation. Have you tried using repartition instead?
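The distinction can be sketched in plain Python (a toy model, not Spark internals — Spark's actual coalesce grouping also considers executor locality): coalesce merges whole existing partitions into fewer buckets without moving individual records between executors, while repartition hashes every record into a new partition.

```python
from collections import defaultdict

partitions = list(range(100))  # ids of the 100 upstream partitions

# coalesce(10): group whole existing partitions into 10 buckets;
# no per-record movement, partitions are merged "one on top of the other".
coalesced = defaultdict(list)
for p in partitions:
    coalesced[p % 10].append(p)

# repartition(10): a full shuffle -- every *record* is hashed to a new
# partition, so data from any upstream partition can land anywhere.
records = [(p, p * 5 + r) for p in partitions for r in range(5)]
repartitioned = defaultdict(list)
for src, value in records:
    repartitioned[hash(value) % 10].append((src, value))

print(len(coalesced), len(repartitioned))  # both yield 10 output partitions
```

Both end up with 10 output partitions, but repartition pays for the shuffle and in exchange lets the preceding stage keep its full 100-task parallelism.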
On Tue, May 23, 2017 at 12:14 PM, Andrii Biletskyi <
andrii.bilets...@yahoo.com.invalid> wrote:
Hi all,
I'm trying to understand the impact of the coalesce operation on Spark job
performance.
As a side note: we are using EMRFS (i.e. AWS S3) as the source and target
for the job.
Omitting unnecessary details, the job can be described as: join a
200M-record DataFrame stored in ORC format on EMRFS with