SaurabhChawla100 commented on pull request #29413:
URL: https://github.com/apache/spark/pull/29413#issuecomment-673593703


   >I don't get your point here. it can already be configured on a per queue 
basis (set "spark.scheduler.listenerbus.eventqueue.$name.capacity") and if user 
sets spark.set.optmized.event.queue why not just set 
spark.scheduler.listenerbus.eventqueue.$name.capacity to be larger to match 
whatever driver memory set at? I'm not necessarily against a change here, I get 
the issues with dropping events, but this just feels like extra code to do what 
user can already do. If I'm missing something and you have specific use case 
I'm missing, please explain in more detail.
   
   Yes, I agree we can use "spark.scheduler.listenerbus.eventqueue.$name.capacity" to set the value. The whole idea is to make the queue behave well when it overflows its initial capacity (which can be set via spark.scheduler.listenerbus.eventqueue.$name.capacity): an additional threshold capacity reduces event drops, and events are only dropped once the queue crosses that threshold. In the best case the threshold absorbs the overflow and no event is dropped at all. To handle this I used VariableLinkedBlockingQueue, so the capacity can be increased at run time, with validation that takes the configured driver memory into consideration.
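
   To illustrate the idea, here is a minimal sketch (plain Scala, not the PR's actual VariableLinkedBlockingQueue; the class and method names below are made up for illustration) of a bounded queue whose bound can be raised at run time, with the upper limit expected to be validated against the configured driver memory:

```scala
import java.util.concurrent.{LinkedBlockingQueue, Semaphore}

// Sketch only: a bounded queue whose capacity can be grown at run time.
class ResizableBoundedQueue[E](initialCapacity: Int) {
  // Unbounded backing queue; the semaphore enforces the (adjustable) bound.
  private val events = new LinkedBlockingQueue[E]()
  private val freeSlots = new Semaphore(initialCapacity)
  @volatile private var capacity = initialCapacity

  /** Non-blocking offer: returns false when the queue is full, i.e. the event would be dropped. */
  def offer(event: E): Boolean = {
    if (freeSlots.tryAcquire()) {
      events.put(event) // never blocks, the backing queue is unbounded
      true
    } else {
      false
    }
  }

  /** Blocking take, as used by a single dispatcher thread. */
  def take(): E = {
    val event = events.take()
    freeSlots.release()
    event
  }

  /** Raise the capacity at run time (e.g. when drops are detected), up to a limit
    * that the caller derives from the configured driver memory. */
  def growTo(newCapacity: Int): Unit = synchronized {
    if (newCapacity > capacity) {
      freeSlots.release(newCapacity - capacity)
      capacity = newCapacity
    }
  }
}
```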
   
   This could also be done with a plain LinkedBlockingQueue by adding the extra threshold to the queue size when the queue is created at the start of the Spark application.
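
   For reference, that fixed-capacity approach is what the existing conf already allows, for example (the queue name appStatus and the value 30000 are only illustrative and would have to be sized against the driver memory):

```scala
import org.apache.spark.SparkConf

// Per-queue capacity chosen once, up front, here for the appStatus queue.
val conf = new SparkConf()
  .set("spark.scheduler.listenerbus.eventqueue.appStatus.capacity", "30000")
```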
   
   It is also true that there is no single queue size that works for all Spark jobs, or even for the same Spark job run daily on different input sets. So the extra threshold helps in the common case where only the initial queue size is set, and it prevents abrupt behaviour of the Spark application caused by event drops on the critical queues (appStatus, executorManagement, etc.).
   
   If something still fails or behaves abruptly after this extra threshold is exhausted, then some manual tuning of the queue size is needed, just as it is today with "spark.scheduler.listenerbus.eventqueue.$name.capacity". But if we rely only on "spark.scheduler.listenerbus.eventqueue.$name.capacity", we have to change the queue size every time the application misbehaves, which means more manual effort, even for the same job, compared to the cases the threshold already handles.
   
   Here is a real-world scenario from a multi-tenant environment, where multiple Spark applications run on the same Spark cluster with a fixed maximum number of nodes, and therefore a fixed number of cores and a fixed amount of memory. Some of the abrupt behaviour we have seen in this setup:
   
   1) With dynamic allocation enabled, an event drop in the executor management queue can stop one Spark application from downscaling its executors even though they have been idle for a long time. That application keeps holding resources, while another application asking for more resources cannot get them because the cluster's resources are limited.
   
   2) We have also seen a scenario where a Zeppelin notebook depends on the appStatus queue, and an event drop prevented the Spark application from shutting down after its idle timeout, even though no Spark jobs had run for some time. Here too resources were held, the driver plus the configured minimum number of executors, which could have been used by some other Spark application.
   
   There are many more scenarios where event drops can impact a Spark application.
   
   This change is just to reduce, to some extent, the manual intervention of changing the conf.
   
   >Another thing we can do is if there are certain events critical we can look 
at putting them into its own queue as well. Perhaps another question is, is 
your driver just running out of CPU to process these fast enough? 
   - Completely agree on this.

