Re: [Spark Core]: Adding support for size based partition coalescing

2021-03-30 Thread mhawes
Hi Pol, I had considered repartitioning, but the main issue for me there is
that it will trigger a shuffle and could significantly slow down the
query/application as a result. Thanks for contributing that as an
alternative suggestion, though :)
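
For context, a minimal sketch of the trade-off (df stands in for whichever
DataFrame ends up badly partitioned):

    // repartition(n) performs a full shuffle, so the output partitions come
    // out evenly sized, but the entire dataset is moved across the network.
    val evenButExpensive = df.repartition(50)

    // coalesce(n) avoids the shuffle by merging existing partitions in place,
    // but it only balances the *count* of input partitions per output
    // partition, not their size, so skewed inputs stay skewed.
    val cheapButSkewed = df.coalesce(50)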






Re: [Spark Core]: Adding support for size based partition coalescing

2021-03-30 Thread Pol Santamaria
Hi Matt,

I have encountered the same issue several times, so I totally agree with you
that it would be a useful addition to Spark. I frequently work around the
imbalance by writing a custom partitioner, which is far from ideal since it
forces me down to the RDD API. I don't know the Spark code base well enough
to judge the complexity or how feasible it is, though.
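
(Not my exact code, but roughly the shape of that workaround; the names
SizeAwarePartitioner and bytesPerKey are just illustrative:)

    import org.apache.spark.Partitioner

    // Greedily place each key in the currently smallest bucket so the buckets
    // end up holding roughly equal byte counts (largest keys first).
    def packKeys(bytesPerKey: Map[String, Long], numBuckets: Int): Map[String, Int] = {
      val bucketSizes = Array.fill(numBuckets)(0L)
      bytesPerKey.toSeq.sortBy { case (_, bytes) => -bytes }.map { case (key, bytes) =>
        val bucket = bucketSizes.zipWithIndex.minBy(_._1)._2
        bucketSizes(bucket) += bytes
        key -> bucket
      }.toMap
    }

    // Partitioner that simply looks up the precomputed key -> bucket mapping.
    class SizeAwarePartitioner(buckets: Int, keyToBucket: Map[String, Int])
        extends Partitioner {
      override def numPartitions: Int = buckets
      override def getPartition(key: Any): Int =
        keyToBucket.getOrElse(key.asInstanceOf[String], 0)
    }

    // Usage needs a sizing pass plus a drop down to the RDD API, which is
    // exactly the "far from ideal" part:
    // val keyToBucket = packKeys(bytesPerKey, 50)
    // val balanced = df.rdd
    //   .keyBy(_.getString(0))
    //   .partitionBy(new SizeAwarePartitioner(50, keyToBucket))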

Bests,
Pol Santamaria

On Tue, Mar 30, 2021 at 1:30 PM mhawes  wrote:

> Hi all, I'm sending this before creating a Jira issue in an effort to
> start a discussion :)
>
> Problem:
>
> We have a situation where we end up with a very large number (O(10K)) of
> partitions, with very little data in most partitions but a lot of data in
> some of them. This not only causes slow execution, but we also run into
> issues like SPARK-12837.
> Naturally the solution here is to coalesce these partitions; however, it's
> impossible for us to know the size and number of partitions upfront, and
> this can change on each run. If we naively try coalescing to, say, 50
> partitions, we're likely to end up with very poorly distributed data and
> potentially executor OOMs. This is because coalesce tries to balance the
> number of partitions rather than partition size.
>
> Proposed Solution:
>
> Therefore I was thinking about the possibility of contributing a feature
> whereby a user could specify a target partition size and Spark would aim
> to get as close to that as possible. Something along these lines was
> partially attempted in SPARK-14042, but it falls short as there is no
> generic way (AFAIK) of getting access to the partition sizes before query
> execution. Thus I would suggest that we try to do something similar to
> AQE's post-shuffle partition coalescing, whereby we allow the user to
> trigger an adaptive coalesce even without a shuffle. The optimal number of
> partitions would be calculated dynamically at runtime. This way we don't
> pay the price of shuffling but can still get reasonably sized partitions.
>
> What do people think? Happy to clarify things upon request. :)
>
> Best,
> Matt
>
>
>


[Spark Core]: Adding support for size based partition coalescing

2021-03-30 Thread mhawes
Hi all, I'm sending this before creating a Jira issue in an effort to start
a discussion :)

Problem:

We have a situation where we end up with a very large number (O(10K)) of
partitions, with very little data in most partitions but a lot of data in
some of them. This not only causes slow execution, but we also run into
issues like SPARK-12837.
Naturally the solution here is to coalesce these partitions; however, it's
impossible for us to know the size and number of partitions upfront, and
this can change on each run. If we naively try coalescing to, say, 50
partitions, we're likely to end up with very poorly distributed data and
potentially executor OOMs. This is because coalesce tries to balance the
number of partitions rather than partition size.
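
(As a quick illustration of the skew, this kind of diagnostic shows it
clearly; df is whichever DataFrame ends up over-partitioned:)

    // Count rows per partition: with O(10K) partitions, most counts come back
    // near zero while a handful of partitions hold almost everything.
    val rowsPerPartition = df.rdd
      .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
      .collect()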

Proposed Solution:

Therefore I was thinking about the possibility of contributing a feature
whereby a user could specify a target partition size and Spark would aim to
get as close to that as possible. Something along these lines was partially
attempted in SPARK-14042, but it falls short as there is no generic way
(AFAIK) of getting access to the partition sizes before query execution.
Thus I would suggest that we try to do something similar to AQE's
post-shuffle partition coalescing, whereby we allow the user to trigger an
adaptive coalesce even without a shuffle. The optimal number of partitions
would be calculated dynamically at runtime. This way we don't pay the price
of shuffling but can still get reasonably sized partitions.
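
For comparison, the existing AQE behaviour this borrows from only applies
after a shuffle (a sketch using the Spark 3.x config names, assuming the
usual spark session variable; the coalesceBySize call at the end is purely
hypothetical):

    // Existing: AQE coalesces *post-shuffle* partitions towards an advisory
    // target size, but only once a shuffle boundary exists.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
    spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")

    // Proposed (hypothetical API, name purely illustrative): a size-based,
    // adaptive coalesce that doesn't require paying for a shuffle first.
    // val resized = df.coalesceBySize("128m")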

What do people think? Happy to clarify things upon request. :) 

Best,
Matt






Re: Welcoming six new Apache Spark committers

2021-03-30 Thread Jacek Laskowski
Hi,

Congrats to all of you committers! Wishing you all the best (commits)!

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books 
Follow me on https://twitter.com/jaceklaskowski




On Fri, Mar 26, 2021 at 9:21 PM Matei Zaharia 
wrote:

> Hi all,
>
> The Spark PMC recently voted to add several new committers. Please join me
> in welcoming them to their new role! Our new committers are:
>
> - Maciej Szymkiewicz (contributor to PySpark)
> - Max Gekk (contributor to Spark SQL)
> - Kent Yao (contributor to Spark SQL)
> - Attila Zsolt Piros (contributor to decommissioning and Spark on
> Kubernetes)
> - Yi Wu (contributor to Spark Core and SQL)
> - Gabor Somogyi (contributor to Streaming and security)
>
> All six of them contributed to Spark 3.1 and we’re very excited to have
> them join as committers.
>
> Matei and the Spark PMC