Hi Pedro,

What is your use case, and why did you use coalesce()? A plain coalesce() is a narrow transformation (no shuffle), so Spark pushes it upstream and the preceding reduceByKey stage runs with only that many tasks, which destroys the stage's parallelism. Repartitioning with a shuffle is also expensive, since it moves data across partitions, so try to minimize repartitioning as much as possible.
Regards,
Vaquar Khan

On Thu, Mar 18, 2021, 5:47 PM Pedro Tuero <tuerope...@gmail.com> wrote:

> I was reviewing a Spark Java application running on AWS EMR.
>
> The code was like:
> RDD.reduceByKey(func).coalesce(number).saveAsTextFile()
>
> That stage took hours to complete.
> I changed it to:
> RDD.reduceByKey(func, number).saveAsTextFile()
> and it now takes less than 2 minutes, and the final output is the same.
>
> So, is it a bug or a feature?
> Why doesn't Spark treat a coalesce after a reduce like a reduce with
> the output partitions parameterized?
>
> Just for understanding,
> Thanks,
> Pedro.
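To make the difference concrete, here is a toy plain-Python model of the two variants (this is a conceptual sketch, not Spark's actual implementation; the function names mirror the RDD API but are hypothetical stand-ins). Both produce the same records, which matches what Pedro observed; the performance difference in real Spark comes from coalesce() being a narrow dependency that caps the reduce stage's task count, while reduceByKey(func, n) keeps the shuffle at full parallelism and only writes n output partitions.

```python
import operator

def reduce_by_key(partitions, func, num_partitions=None):
    """Toy reduceByKey: shuffle key/value pairs into hash buckets, then reduce.
    Passing num_partitions models RDD.reduceByKey(func, number)."""
    n = num_partitions if num_partitions is not None else len(partitions)
    buckets = [dict() for _ in range(n)]
    for part in partitions:
        for k, v in part:
            b = buckets[hash(k) % n]
            b[k] = func(b[k], v) if k in b else v
    return [sorted(b.items()) for b in buckets]

def coalesce(partitions, n):
    """Toy coalesce: merge existing partitions without a shuffle
    (a narrow dependency, like RDD.coalesce(number))."""
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i % n].extend(part)
    return out

# Three input partitions of (key, value) pairs.
data = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)], [("b", 5)]]

# Variant 1: reduce first, then coalesce the result down to 2 partitions.
# In real Spark the narrow coalesce is collapsed into the reduce stage,
# so the whole reduceByKey runs with only 2 tasks.
out1 = coalesce(reduce_by_key(data, operator.add), 2)

# Variant 2: ask the reduce itself for 2 output partitions.
# The map side still runs at full parallelism; only the output is 2-way.
out2 = reduce_by_key(data, operator.add, 2)

# Same records either way, as Pedro saw.
merged1 = sorted(kv for p in out1 for kv in p)
merged2 = sorted(kv for p in out2 for kv in p)
print(merged1)  # [('a', 4), ('b', 7), ('c', 4)]
```

The model shows why the outputs match; it cannot show the runtime gap itself, which in Spark depends on how many tasks the reduce stage gets.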