Rather than using a separate thread pool, perhaps we can just move the prep code to the call site thread?
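Something along these lines (a sketch only, with simplified stand-ins rather than the actual DAGScheduler classes): the expensive stage/partition resolution runs on the thread that calls submitJob, and the single-threaded event loop only ever sees cheap, already-prepared events.

// Sketch only: simplified stand-ins, not Spark's actual scheduler classes.
import java.util.concurrent.LinkedBlockingQueue

sealed trait SchedulerEvent
case class JobPrepared(stages: Seq[String]) extends SchedulerEvent  // already prepared, cheap to handle
case class TaskCompleted(taskId: Long) extends SchedulerEvent

object CallSitePrepSketch {
  private val eventQueue = new LinkedBlockingQueue[SchedulerEvent]()

  // Stands in for the expensive work (parent stages, shuffle deps, partitions).
  private def prepareStages(jobDesc: String): Seq[String] = {
    Thread.sleep(1000)
    Seq(s"ResultStage($jobDesc)")
  }

  // Runs on the user's (call site) thread: do the heavy prep here,
  // then post only a lightweight, already-prepared event to the loop.
  def submitJob(jobDesc: String): Unit = {
    val stages = prepareStages(jobDesc)  // blocks the caller, not the event loop
    eventQueue.put(JobPrepared(stages))
  }

  def main(args: Array[String]): Unit = {
    val loop = new Thread(() => {        // single-threaded event loop stand-in
      while (true) {
        eventQueue.take() match {
          case JobPrepared(stages)   => println(s"scheduling $stages")
          case TaskCompleted(taskId) => println(s"task $taskId completed")
        }
      }
    })
    loop.setDaemon(true)
    loop.start()
    eventQueue.put(TaskCompleted(42L))   // handled immediately, never queued behind prep work
    submitJob("job-1")                   // heavy work happens here, on the calling thread
    Thread.sleep(200)                    // give the loop time to drain before exiting
  }
}

The trade-off versus a separate pool is that the submitting thread blocks until the prep is done, but the event loop stays free for TaskCompletion and other events.
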
On Sun, Mar 4, 2018 at 11:15 PM, Ajith shetty <ajith.she...@huawei.com> wrote:

> DAGScheduler becomes a bottleneck in the cluster when multiple JobSubmitted
> events have to be processed, because DAGSchedulerEventProcessLoop is single
> threaded and blocks the other events in the queue, such as TaskCompletion.
>
> A JobSubmitted event can be time consuming depending on the nature of the
> job (for example: calculating parent stage dependencies, shuffle
> dependencies, partitions), and thus it blocks all the events behind it.
>
> I see multiple JIRAs referring to this behavior:
>
> https://issues.apache.org/jira/browse/SPARK-2647
> https://issues.apache.org/jira/browse/SPARK-4961
>
> Similarly, in my cluster the partition calculation for some jobs is time
> consuming (similar to the stack at SPARK-2647), which slows down the
> DAGSchedulerEventProcessLoop and causes user jobs to slow down even when
> their tasks finish within seconds, because TaskCompletion events are
> processed at a slower rate due to the blockage.
>
> I think we can split a JobSubmitted event into 2 events:
>
> Step 1. JobSubmittedPreparation - runs in a separate thread on job
> submission and covers org.apache.spark.scheduler.DAGScheduler#createResultStage.
>
> Step 2. JobSubmittedExecution - if Step 1 succeeds, fire an event to
> DAGSchedulerEventProcessLoop and let it process the output of
> org.apache.spark.scheduler.DAGScheduler#createResultStage.
>
> One effect of doing this is that job submissions may no longer be FIFO,
> depending on how much time Step 1 takes.
>
> Does the above solution suffice for the problem described? And is there any
> other side effect of this solution?
>
> Regards
> Ajith
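
For comparison, a minimal sketch of the two-event split proposed above, using the hypothetical JobSubmittedPreparation/JobSubmittedExecution names and simplified stand-ins rather than the actual DAGScheduler API: Step 1 runs on a worker pool and, when it succeeds, posts a lightweight Step 2 event back to the single-threaded loop.

// Sketch only: hypothetical event names, not Spark's actual DAGScheduler code.
import java.util.concurrent.{Executors, LinkedBlockingQueue}

sealed trait DagEvent
case class JobSubmittedPreparation(jobId: Int) extends DagEvent               // Step 1: expensive
case class JobSubmittedExecution(jobId: Int, stage: String) extends DagEvent  // Step 2: cheap
case class CompletionEvent(taskId: Long) extends DagEvent

object TwoEventSplitSketch {
  private val mainLoopQueue = new LinkedBlockingQueue[DagEvent]()
  private val prepPool = Executors.newFixedThreadPool(4)  // Step 1 runs here, off the main loop

  // Route JobSubmittedPreparation to the pool; everything else to the single-threaded loop.
  def post(event: DagEvent): Unit = event match {
    case JobSubmittedPreparation(jobId) =>
      prepPool.execute(() => {
        Thread.sleep(500)                                  // stands in for createResultStage cost
        post(JobSubmittedExecution(jobId, s"ResultStage($jobId)"))  // Step 2, fired only on success
      })
    case other => mainLoopQueue.put(other)
  }

  def main(args: Array[String]): Unit = {
    val loop = new Thread(() => {                          // DAGSchedulerEventProcessLoop stand-in
      while (true) {
        mainLoopQueue.take() match {
          case JobSubmittedExecution(id, stage) => println(s"job $id: submitting $stage")
          case CompletionEvent(taskId)          => println(s"task $taskId completed")
          case other                            => println(s"unexpected event: $other")
        }
      }
    })
    loop.setDaemon(true)
    loop.start()
    post(JobSubmittedPreparation(1))   // expensive prep runs on the pool
    post(CompletionEvent(7L))          // processed right away, not blocked behind the prep
    Thread.sleep(1000)                 // let the demo finish
    prepPool.shutdown()
  }
}

As noted above, with a pool the preparation steps can finish out of order, so strict FIFO job submission would need extra bookkeeping if it matters.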