ok thanks.

hmm, that seems odd. shouldn't coalesce introduce a new map phase with
fewer tasks instead of changing the previous shuffle?
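
for what it's worth, here is a minimal way to see the behavior (the
format, column, and path names below are placeholders, not my real ones):

  val counts = spark.read.format("parquet").load("/some/path")
    .groupBy("k").count()
  counts.rdd.getNumPartitions               // 2048, from spark.sql.shuffle.partitions
  counts.coalesce(100).rdd.getNumPartitions // 100: the post-shuffle stage runs with only 100 tasks
  counts.coalesce(100).explain()            // a single Exchange; coalesce adds no new shuffle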

using repartition seems too expensive just to keep the number of files
down. so i guess i am back to looking for another solution.
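
for reference, the repartition version would look roughly like this (same
placeholder names as above); the second exchange is exactly the extra cost
i want to avoid:

  spark.read.format("parquet").load("/some/path")
    .groupBy("k").count()   // aggregation runs with spark.sql.shuffle.partitions = 2048 tasks
    .repartition(100)       // second shuffle, just to get 100 output files
    .write.format("parquet").save("/some/other/path")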



On Wed, Aug 8, 2018 at 4:13 PM, Vadim Semenov <va...@datadoghq.com> wrote:

> `coalesce` sets the number of partitions for the last stage, so you
> have to use `repartition` instead, which is going to introduce an extra
> shuffle stage.
>
> On Wed, Aug 8, 2018 at 3:47 PM Koert Kuipers <ko...@tresata.com> wrote:
> >
> > one small correction: lots of files lead to pressure on the spark
> > driver program when reading this data in spark.
> >
> > On Wed, Aug 8, 2018 at 3:39 PM, Koert Kuipers <ko...@tresata.com> wrote:
> >>
> >> hi,
> >>
> >> i am reading data from files into a dataframe, then doing a groupBy for
> >> a given column with a count, and finally i coalesce to a smaller number
> >> of partitions before writing out to disk. so roughly:
> >>
> >> spark.read.format(...).load(...)
> >>   .groupBy(column).count()
> >>   .coalesce(100)
> >>   .write.format(...).save(...)
> >>
> >> i have this setting: spark.sql.shuffle.partitions=2048
> >>
> >> i expect to see 2048 partitions in the shuffle. what i am seeing instead
> >> is a shuffle with only 100 partitions. it's like the coalesce has taken
> >> over the partitioning of the groupBy.
> >>
> >> any idea why?
> >>
> >> i am doing coalesce because it is not helpful to write out 2048 files:
> >> lots of files lead to pressure down the line on executors reading this
> >> data (i am writing to just one partition of a larger dataset), and since
> >> i have fewer than 100 executors i expect it to be efficient. so it sounds
> >> like a good idea, no?
> >>
> >> but i do need 2048 partitions in my shuffle due to the operation i am
> >> doing in the groupBy (in my real problem i am not just doing a count...).
> >>
> >> thanks!
> >> koert
> >>
> >
>
>
> --
> Sent from my iPhone
>
