SaurabhChawla100 edited a comment on pull request #29413: URL: https://github.com/apache/spark/pull/29413#issuecomment-673593703
> I don't get your point here. it can already be configured on a per queue basis (set "spark.scheduler.listenerbus.eventqueue.$name.capacity") and if user sets spark.set.optmized.event.queue why not just set spark.scheduler.listenerbus.eventqueue.$name.capacity to be larger to match whatever driver memory set at? I'm not necessarily against a change here, I get the issues with dropping events, but this just feels like extra code to do what user can already do. If I'm missing something and you have specific use case I'm missing, please explain in more detail.

Yes, I agree we can use `spark.scheduler.listenerbus.eventqueue.$name.capacity` to set the value. The idea here is to make the queue perform well when it overflows its initial capacity (i.e. the value set via `spark.scheduler.listenerbus.eventqueue.$name.capacity`): some additional threshold capacity absorbs the overflow and reduces event drops, and events are only dropped once the queue crosses that threshold. In the best case the overflow is handled entirely by the threshold capacity and no event is dropped.

To handle this I used VariableLinkedBlockingQueue, so the initial capacity can be increased at run time, with validation that keeps the available driver memory in consideration. With a plain LinkedBlockingQueue this could only be done by baking the extra threshold into the queue size when the queue is created at the start of the Spark application. And since there is no single queue size that works for all Spark jobs, or even for the same Spark job run on different input sets from day to day, the extra threshold helps in the cases where the initial queue size turns out to be too small. This prevents abrupt behaviour of the Spark application due to event drops on the critical queues (appStatus, executorManagement, etc.).

If something still fails or behaves abruptly after this extra threshold is exhausted, then some manual effort in tuning the queue size is needed, just as it is today when setting `spark.scheduler.listenerbus.eventqueue.$name.capacity`. But if we rely only on that conf, we have to resize the queue by hand every time the application misbehaves, even for the same job, which is more manual effort than in the cases the threshold already handles.
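A minimal sketch of that growth behaviour, assuming a hypothetical `GrowableEventQueue` wrapper rather than the PR's actual implementation (which uses VariableLinkedBlockingQueue): `initialCapacity` plays the role of `spark.scheduler.listenerbus.eventqueue.$name.capacity`, and `maxCapacity` stands in for a ceiling derived from driver memory.

```scala
import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical sketch, not Spark's actual code. The backing queue is
// unbounded; the bound is enforced here so it can be raised at run time,
// similar to what VariableLinkedBlockingQueue allows.
class GrowableEventQueue[T](initialCapacity: Int, maxCapacity: Int) {
  private val queue = new LinkedBlockingQueue[T]() // unbounded backing queue
  private val capacity = new AtomicInteger(initialCapacity)

  /** Enqueue an event, growing the bound toward maxCapacity instead of
   *  dropping. Returns false only once the threshold is exhausted. */
  def offer(event: T): Boolean = synchronized {
    if (queue.size() < capacity.get()) {
      queue.offer(event)
    } else if (capacity.get() < maxCapacity) {
      // Overflow: raise the bound (here by 20%) rather than drop the event.
      // maxCapacity stands in for a limit validated against driver memory.
      val grown = math.max(capacity.get() + 1, (capacity.get() * 1.2).toInt)
      capacity.set(math.min(maxCapacity, grown))
      queue.offer(event)
    } else {
      false // threshold exhausted: fall back to dropping, as today
    }
  }

  def take(): T = queue.take()
}
```

With something like this, a burst that exceeds the configured capacity is absorbed up to the memory-derived ceiling, and dropping only resumes beyond it.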
To give real-world context: this comes from a multi-tenant environment where multiple Spark applications run on the same Spark cluster with a fixed maximum number of nodes, so there is a fixed amount of cores and memory available. Some of the abrupt behaviour we have seen in this setup:

1) With dynamic allocation enabled, an event drop in the executor management queue can prevent executors from being downscaled for one of the Spark applications, even though an executor has been idle for a long time. That application keeps holding resources while another application is waiting for more, since the cluster's resources are limited.

2) We have also seen a Zeppelin notebook that depends on the appStatus queue, where an event drop prevented the Spark application from going down after its idle timeout when no Spark jobs had been running for some time. Here again resources were held: the driver plus the configured minimum number of executors, which could have been used by some other Spark application.

There are many more scenarios where event drops can impact a Spark application. This change is just meant to reduce, to some extent, the manual intervention of changing the conf.

> Another thing we can do is if there are certain events critical we can look at putting them into its own queue as well. Perhaps another question is, is your driver just running out of CPU to process these fast enough?

Completely agree on this.
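For reference, the manual tuning discussed throughout this thread is the existing per-queue capacity conf. A sketch of setting it is below; the queue names (`appStatus`, `executorManagement`) are Spark's internal listener bus queue names, and the values are illustrative, not recommendations:

```scala
import org.apache.spark.SparkConf

// Existing manual workaround: size the queues up front. The right numbers
// depend on the job and on available driver memory.
val conf = new SparkConf()
  .set("spark.scheduler.listenerbus.eventqueue.capacity", "20000") // default for all queues
  .set("spark.scheduler.listenerbus.eventqueue.appStatus.capacity", "40000") // per-queue override
  .set("spark.scheduler.listenerbus.eventqueue.executorManagement.capacity", "40000")
```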