Re: [Spark Core]: Adding support for size based partition coalescing

2021-03-30 Thread mhawes
Hi Pol, I had considered repartitioning, but the main issue for me is that it triggers a shuffle, which could significantly slow down the query/application. Thanks for contributing that as an alternative suggestion though :)

Re: [Spark Core]: Adding support for size based partition coalescing

2021-03-30 Thread Pol Santamaria
Hi Matt, I have encountered the same issue several times, so I totally agree with you that it would be a useful addition to Spark. I frequently solve the imbalance by coding a custom partitioner, which is far from ideal, since I then have to drop down to RDDs. I don't know the Spark code base well enough to

[Spark Core]: Adding support for size based partition coalescing

2021-03-30 Thread mhawes
Hi all, Sending this first before creating a JIRA issue in an effort to start a discussion :) Problem: We have a situation where we end up with a very large number (O(10K)) of partitions, with very little data in most partitions but a lot of data in some of them. This not only causes slow execution
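
To make the problem statement above concrete, here is a minimal sketch of what size-based coalescing could look like: adjacent partitions are merged greedily until a group reaches a target size, so the many tiny partitions collapse into a few groups while the large ones stay on their own. This is plain Python and not Spark code; the function name `coalesce_by_size`, the `target_bytes` parameter, and the sample sizes are all hypothetical, and the approach is an assumption for illustration rather than the design proposed in the thread.

```python
from typing import List


def coalesce_by_size(partition_sizes: List[int], target_bytes: int) -> List[List[int]]:
    """Group adjacent partition indices so each group stays near target_bytes.

    Hypothetical helper for illustration only; it is not part of any Spark API.
    """
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")

    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0

    for index, size in enumerate(partition_sizes):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current = []
            current_size = 0
        current.append(index)
        current_size += size

    # Flush whatever is left after the last partition.
    if current:
        groups.append(current)
    return groups


# Made-up sizes: mostly tiny partitions with two large outliers.
sizes = [5, 5, 900, 10, 20, 30, 1200, 1]
print(coalesce_by_size(sizes, target_bytes=100))
```

Because only neighbouring partitions are merged and no records are redistributed by key, a grouping like this does not in itself require a shuffle. An oversized partition still ends up alone in its group, so this sketch reduces the partition count but does not split skewed partitions.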

Re: Welcoming six new Apache Spark committers

2021-03-30 Thread Jacek Laskowski
Hi, Congrats to all of you committers! Wishing you all the best (commits)! Regards, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books Follow me on https://twitter.com/jaceklaskowski On Fri,