TakawaAkirayo opened a new pull request, #45367:
URL: https://github.com/apache/spark/pull/45367

   
   ### What changes were proposed in this pull request?
   Add a new config `spark.scheduler.listenerbus.eventqueue.waitForEventDispatchExitOnStop` (default `true`).
   Before this PR: on stop, each event queue waits for its remaining events to be drained completely.
   After this PR: users can control this behavior (wait for a complete drain or not) via the Spark config.
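
   As a minimal, hedged sketch (the config key is taken from this PR as proposed and may still change during review), a service that wants fast shutdown could set the flag when building the session:

   ```scala
   import org.apache.spark.sql.SparkSession

   // Opt out of waiting for event dispatch to finish on stop.
   // The config key below is the one proposed in this PR.
   val spark = SparkSession.builder()
     .appName("fast-stop-example")
     .master("local[*]")
     .config("spark.scheduler.listenerbus.eventqueue.waitForEventDispatchExitOnStop", "false")
     .getOrCreate()

   // ... run jobs ...

   // With the flag set to false, stop() no longer blocks on slow listeners draining the queue.
   spark.stop()
   ```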
   
   
   ### Why are the changes needed?
   #### Problem statement
   SparkContext.stop() can hang for a long time in LiveListenerBus.stop() when listeners are slow.
   
   #### User scenarios
   We run a centralized service, with multiple instances, that regularly executes users' scheduled tasks.
   For each user task within one service instance, the process is as follows (a sketch of this flow is shown after the list):

   1. Create a Spark session directly within the service process, using an account defined in the task.
   2. Instantiate listeners by class name and register them with the SparkContext. The JARs containing the listener classes are uploaded to the service by the user.
   3. Prepare resources.
   4. Run the user logic (Spark SQL).
   5. Stop the Spark session by invoking SparkSession.stop().
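
   A hedged sketch of that lifecycle, assuming the listener is registered via SparkContext.addSparkListener; the class name `com.example.UserMetricsListener` and the SQL are placeholders, not names from this PR:

   ```scala
   import org.apache.spark.scheduler.SparkListenerInterface
   import org.apache.spark.sql.SparkSession

   // Step 1: create a Spark session inside the service process.
   val spark = SparkSession.builder()
     .appName("scheduled-user-task")
     .getOrCreate()

   // Step 2: instantiate the user-provided listener by class name (the JAR was
   // uploaded by the user) and register it with the SparkContext.
   val listenerClassName = "com.example.UserMetricsListener"  // hypothetical placeholder
   val listener = Class.forName(listenerClassName)
     .getDeclaredConstructor()
     .newInstance()
     .asInstanceOf[SparkListenerInterface]
   spark.sparkContext.addSparkListener(listener)

   // Step 4: run the user's Spark SQL.
   spark.sql("SELECT 1").collect()

   // Step 5: stop the session; this is where the service can block while the
   // listener bus drains the remaining events through the user's listener.
   spark.stop()
   ```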
   
   In step 5, Spark waits for the LiveListenerBus to stop, which requires the remaining events to be completely drained by every listener.

   Since the listeners are implemented by users, we cannot prevent them from doing heavy work on each event. There are cases where a single heavy job has over 30,000 tasks, and the listener can take around 30 minutes to process all the remaining events, because for every event it acquires a coarse-grained global lock and updates its internal status in a remote database (a listener of roughly this shape is sketched below).

   This kind of delay affects other user tasks in the queue. Therefore, from the server-side perspective, we need a guarantee that the stop operation finishes quickly.
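
   For illustration only, a listener of roughly the shape described above; `SlowUserListener` and `updateRemoteStatus` are made-up names standing in for the user's code, not anything in this PR:

   ```scala
   import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

   // Illustrative user listener: every task-end event takes a coarse-grained
   // global lock and writes status to a remote store.
   class SlowUserListener extends SparkListener {
     private val globalLock = new Object

     override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = globalLock.synchronized {
       updateRemoteStatus(taskEnd.stageId, taskEnd.taskInfo.taskId)
     }

     private def updateRemoteStatus(stageId: Int, taskId: Long): Unit = {
       // Stand-in for a round trip to the remote database; with ~30,000 queued
       // task events, even ~60 ms per event adds up to roughly 30 minutes of draining.
       Thread.sleep(60)
     }
   }
   ```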
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. It adds the config `spark.scheduler.listenerbus.eventqueue.waitForEventDispatchExitOnStop`.
   The default is `true`, which keeps the current behavior of waiting for the events to drain completely. If set to `false`, the LiveListenerBus stops without waiting for the events to drain on each queue.
   
   
   ### How was this patch tested?
   Covered by unit tests, and the feature was verified in our production environment.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No

