Re: [EXTERNAL] Re: Stage level scheduling - lower the number of executors when using GPUs

2022-11-02 Thread Shay Elbaz
Thanks Artemis. We are not using Rapids, but rather using GPUs through the 
Stage Level Scheduling feature with ResourceProfile. In Kubernetes you have to 
turn on shuffle tracking for dynamic allocation, anyhow.
The question is how we can limit the number of executors when building a new 
ResourceProfile, either directly through the API or indirectly through some 
advanced workaround.
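
For reference, this is roughly the shape of the direct API in question (a 
minimal Scala sketch; the resource amounts, the discovery script path, and the 
inputRdd / runGpuInference names are placeholders, not code from our actual 
job). Notably, ResourceProfileBuilder takes per-executor and per-task resource 
requests but has no field for an executor count or a per-profile max:

    import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

    // Per-executor resources for the GPU stage (placeholder amounts).
    val execReqs = new ExecutorResourceRequests()
      .cores(4)
      .memory("8g")
      .resource("gpu", 1, "/opt/spark/getGpusResources.sh", "nvidia.com")

    // Per-task resources: one GPU per task (placeholder amounts).
    val taskReqs = new TaskResourceRequests()
      .cpus(4)
      .resource("gpu", 1)

    // The builder only exposes executor/task resource requests -- there is
    // no executor-count or per-profile maxExecutors knob here.
    val gpuProfile = new ResourceProfileBuilder()
      .require(execReqs)
      .require(taskReqs)
      .build()

    // inputRdd and runGpuInference are hypothetical stand-ins for our pipeline.
    val gpuStage = inputRdd.withResources(gpuProfile).mapPartitions(runGpuInference)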

Thanks,
Shay



From: Artemis User 
Sent: Thursday, November 3, 2022 1:16 AM
To: user@spark.apache.org 
Subject: [EXTERNAL] Re: Stage level scheduling - lower the number of executors 
when using GPUs



  Are you using Rapids for GPU support in Spark?  A couple of options you may 
want to try:

  1.  In addition to having dynamic allocation turned on, you may also need to 
turn on the external shuffle service.
  2.  Sounds like you are using Kubernetes.  In that case, you may also need to 
turn on shuffle tracking.
  3.  The "stages" are controlled by the APIs.  The APIs for dynamic resource 
requests (change of stage) do exist, but only for RDDs (e.g. TaskResourceRequest 
and ExecutorResourceRequest).

On 11/2/22 11:30 AM, Shay Elbaz wrote:
Hi,

Our typical applications need fewer executors for a GPU stage than for a CPU 
stage. We are using dynamic allocation with stage level scheduling, and Spark 
tries to maximize the number of executors also during the GPU stage, causing a 
bit of resource chaos in the cluster. This forces us to use a lower value for 
'maxExecutors' in the first place, at the cost of CPU stage performance. The 
alternative is to try to solve this at the Kubernetes scheduler level, which is 
not straightforward and doesn't feel like the right way to go.

Is there a way to effectively use fewer executors in Stage Level Scheduling? The 
API does not seem to include such an option, but maybe there is some more 
advanced workaround?

Thanks,
Shay








Re: Stage level scheduling - lower the number of executors when using GPUs

2022-11-02 Thread Artemis User
Are you using Rapids for GPU support in Spark?  A couple of options you 
may want to try:


1. In addition to having dynamic allocation turned on, you may also
   need to turn on the external shuffle service.
2. Sounds like you are using Kubernetes.  In that case, you may also
   need to turn on shuffle tracking (see the sketch below).
3. The "stages" are controlled by the APIs.  The APIs for dynamic
   resource requests (change of stage) do exist, but only for RDDs (e.g.
   TaskResourceRequest and ExecutorResourceRequest).
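
For reference, a minimal sketch of those settings expressed through the 
SparkSession builder (the app name and executor cap are placeholder values; 
on Kubernetes there is no external shuffle service, so shuffle tracking is 
the piece that lets dynamic allocation release executors safely):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("gpu-stage-example")                       // placeholder name
      .config("spark.dynamicAllocation.enabled", "true")
      // No external shuffle service on Kubernetes, so track shuffle data
      // instead to allow executors to be released.
      .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      .config("spark.dynamicAllocation.maxExecutors", "40") // placeholder cap
      .getOrCreate()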


On 11/2/22 11:30 AM, Shay Elbaz wrote:

Hi,

Our typical applications need fewer *executors* for a GPU stage than 
for a CPU stage. We are using dynamic allocation with stage level 
scheduling, and Spark tries to maximize the number of executors also 
during the GPU stage, causing a bit of resource chaos in the cluster. 
This forces us to use a lower value for 'maxExecutors' in the first 
place, at the cost of CPU stage performance. The alternative is to try 
to solve this at the Kubernetes scheduler level, which is not 
straightforward and doesn't feel like the right way to go.


Is there a way to effectively use fewer executors in Stage Level 
Scheduling? The API does not seem to include such an option, but maybe 
there is some more advanced workaround?


Thanks,
Shay







Stage level scheduling - lower the number of executors when using GPUs

2022-11-02 Thread Shay Elbaz
Hi,

Our typical applications need fewer executors for a GPU stage than for a CPU 
stage. We are using dynamic allocation with stage level scheduling, and Spark 
tries to maximize the number of executors also during the GPU stage, causing a 
bit of resource chaos in the cluster. This forces us to use a lower value for 
'maxExecutors' in the first place, at the cost of CPU stage performance. The 
alternative is to try to solve this at the Kubernetes scheduler level, which is 
not straightforward and doesn't feel like the right way to go.

Is there a way to effectively use fewer executors in Stage Level Scheduling? The 
API does not seem to include such an option, but maybe there is some more 
advanced workaround?
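
One possible indirect handle, sketched below under the assumption that dynamic 
allocation sizes its request by the number of pending tasks for a resource 
profile: if the GPU stage's input is coalesced to N partitions and each task 
needs a whole executor (one GPU per executor and per task), then at most roughly 
N executors should be requested for that stage. The RDD, function, and values 
are placeholders, and this couples executor count to partition count rather 
than expressing it directly:

    import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

    // One GPU per executor and one GPU per task, so each executor runs a
    // single GPU task at a time (discovery script / vendor omitted for brevity).
    val gpuProfile = new ResourceProfileBuilder()
      .require(new ExecutorResourceRequests().cores(4).resource("gpu", 1))
      .require(new TaskResourceRequests().cpus(4).resource("gpu", 1))
      .build()

    // Coalescing to 8 partitions bounds the pending GPU tasks at 8, so dynamic
    // allocation should not request more than about 8 executors for this stage.
    // inputRdd, runGpuInference, and the value 8 are placeholders.
    val cappedGpuStage = inputRdd
      .coalesce(8)
      .withResources(gpuProfile)
      .mapPartitions(runGpuInference)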

Thanks,
Shay







[*IMPORTANT*] update Streaming Query Statistics url

2022-11-02 Thread Priyanshi Shahu
Hello Team,
I am writing to report that I'm using the Spark 3.3.0 release, and when I open
the Spark monitoring UI and go to the Streaming Query Statistics tab (which
lists all the running query IDs), clicking on any running ID redirects me to
this URL:
*/prspark-4-wsmsetsibatchpm-1666174298886-driver/StreamingQuery/statistics?id=ef071381-7586-45fe-a8ba-53ac5838d835*
and this URL returns a 500 response.
But after appending a trailing "/" to this URL I get a result, so please update
the URL. This issue is also present in the Spark 3.0.1 and Spark 3.2.0 releases.

The correct URL is:
prspark-4-wsmsetsibatchpm-1666174298886-driver/StreamingQuery/statistics/?id=ef071381-7586-45fe-a8ba-53ac5838d835


Thanks & Regards
Priyanshi Sahu


should one ever make a spark streaming job in pyspark

2022-11-02 Thread Joris Billen
Dear community, 
I have a general question about the use of Scala vs PySpark for Spark Streaming.
I believe Spark Streaming works most efficiently when written in Scala, but that
it can also be implemented in PySpark. My questions:
1) Is it completely dumb to write a streaming job in PySpark?
2) What are the technical reasons that it is done best in Scala (and is it easy
to understand why)?
3) Has anyone seen good links with numbers on the performance difference, under
what circumstances, and an explanation?
4) Are there scenarios where the use of PySpark can be justified (maybe when
someone doesn't feel comfortable writing a job in Scala and the number of
messages per minute isn't gigantic, so performance isn't that crucial)?

Thanks for any input!
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org