[ https://issues.apache.org/jira/browse/SPARK-26927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16777836#comment-16777836 ]

liupengcheng commented on SPARK-26927:
--------------------------------------

[~Ngone51]

Let's say we have the following dynamic allocation settings:

min: 20, initial: 20, max: 50

1. We finish 50 tasks on 50 executors and there are no more tasks to execute, so all 50 executors go idle. allocationManager then tries to remove the 50 idle executors. If everything goes well, the minimum-executors guard (20) keeps 20 executors from being killed.
2. However, imagine such a case: the `SparkListenerExecutorRemoved` event arrives before the `SparkListenerTaskStart` event. This is possible because `SparkListenerTaskStart` is posted by the `DAGSchedulerEventLoop` thread, while `SparkListenerExecutorRemoved` is posted by `Netty` threads.

   In this case, we can end up with a wrong set of `executorIds` because of the following logic:
   [https://github.com/apache/spark/blob/bc03c8b3faacd23edf40b8e75ffd9abb5881c50c/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L718]

   Explanation: allocationManager.executorIds no longer contains the executorId, because it was already removed in the onExecutorRemoved callback, so onTaskStart re-adds the removed executorId to allocationManager.executorIds (see the sketch after this list).

3. allocationManager may now believe it already has 21 or more executors. We then submit 20 tasks on 20 executors; they finish and the executors go idle. This time allocationManager does not keep the minimum of 20 executors: it removes one or more of them.

4. And so on, back and forth...

5. Eventually there may be no live executors at all, while allocationManager still believes it holds at least the minimum number. An extreme case is when the bogus count exceeds the maximum number of executors: allocationManager will then never request more executors and the application hangs forever.
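
To make step 2 concrete, here is a minimal, self-contained Scala sketch (illustrative names and logic only, not the actual ExecutorAllocationManager code) of how the out-of-order callbacks leave a phantom executor ID behind and inflate the tracked count:

{code:scala}
import scala.collection.mutable

// Simplified model of the race described above. All names are illustrative.
object AllocationRaceSketch {
  private val minExecutors = 20
  private val executorIds = mutable.Set[String]()   // manager's view of the cluster

  // Roughly what the onExecutorRemoved callback does: forget the executor.
  def onExecutorRemoved(executorId: String): Unit = synchronized {
    executorIds -= executorId
  }

  // Roughly what the onTaskStart callback does: if the executor is unknown,
  // assume its "added" event simply has not arrived yet and start tracking it.
  // When the removal was processed first, this re-adds a dead executor.
  def onTaskStart(executorId: String): Unit = synchronized {
    if (!executorIds.contains(executorId)) {
      executorIds += executorId   // phantom executor from here on
    }
  }

  def main(args: Array[String]): Unit = {
    (1 to 50).foreach(i => executorIds += s"exec-$i")   // 50 live executors

    // Events for exec-31 are delivered out of order:
    onExecutorRemoved("exec-31")   // removal posted by a Netty thread, handled first
    onTaskStart("exec-31")         // task-start for the same executor handled later

    // Only 49 executors are really alive, but the manager tracks 50, so its
    // "keep at least minExecutors" check is off by one and it will later
    // allow one live executor too many to be removed.
    println(s"tracked = ${executorIds.size}, really alive = 49, min = $minExecutors")
  }
}
{code}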

> Race condition may cause dynamic allocation not working
> -------------------------------------------------------
>
>                 Key: SPARK-26927
>                 URL: https://issues.apache.org/jira/browse/SPARK-26927
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0, 2.4.0
>            Reporter: liupengcheng
>            Priority: Major
>         Attachments: Selection_042.jpg, Selection_043.jpg, Selection_044.jpg, 
> Selection_045.jpg, Selection_046.jpg
>
>
> Recently, we caught a bug that caused our production Spark thriftserver to hang:
> There is a race condition in the ExecutorAllocationManager: the 
> `SparkListenerExecutorRemoved` event can be posted before the 
> `SparkListenerTaskStart` event, which corrupts `executorIds`. Later, when some 
> executor idles, real executors are removed even though the executor count 
> equals `minNumExecutors`, because `newExecutorTotal` is computed incorrectly 
> (it may be greater than `minNumExecutors`). This can finally lead to zero 
> available executors while a wrong set of executorIds is still kept in memory.
> What's more, even the `SparkListenerTaskEnd` event cannot release the fake 
> `executorIds`: a later idle event for a fake executor cannot trigger its real 
> removal, because the executor is already gone and no longer exists in the 
> `executorDataMap` of `CoarseGrainedSchedulerBackend`.
> Logs:
> !Selection_042.jpg!
> !Selection_043.jpg!
> !Selection_044.jpg!
> !Selection_045.jpg!
> !Selection_046.jpg!  
> Event logs (out-of-order events):
> {code:java}
> {"Event":"SparkListenerExecutorRemoved","Timestamp":1549936077543,"Executor 
> ID":"131","Removed Reason":"Container 
> container_e28_1547530852233_236191_02_000180 exited from explicit termination 
> request."}
> {"Event":"SparkListenerTaskStart","Stage ID":136689,"Stage Attempt 
> ID":0,"Task Info":{"Task ID":448048,"Index":2,"Attempt":0,"Launch 
> Time":1549936032872,"Executor 
> ID":"131","Host":"mb2-hadoop-prc-st474.awsind","Locality":"RACK_LOCAL", 
> "Speculative":false,"Getting Result Time":0,"Finish 
> Time":1549936032906,"Failed":false,"Killed":false,"Accumulables":[{"ID":12923945,"Name":"internal.metrics.executorDeserializeTime","Update":10,"Value":13,"Internal":true,"Count
>  Faile d 
> Values":true},{"ID":12923946,"Name":"internal.metrics.executorDeserializeCpuTime","Update":2244016,"Value":4286494,"Internal":true,"Count
>  Failed 
> Values":true},{"ID":12923947,"Name":"internal.metrics.executorRunTime","Update":20,"Val
>  ue":39,"Internal":true,"Count Failed 
> Values":true},{"ID":12923948,"Name":"internal.metrics.executorCpuTime","Update":13412614,"Value":26759061,"Internal":true,"Count
>  Failed Values":true},{"ID":12923949,"Name":"internal.metrics.resultS 
> ize","Update":3578,"Value":7156,"Internal":true,"Count Failed 
> Values":true},{"ID":12923954,"Name":"internal.metrics.peakExecutionMemory","Update":33816576,"Value":67633152,"Internal":true,"Count
>  Failed Values":true},{"ID":12923962,"Na 
> me":"internal.metrics.shuffle.write.bytesWritten","Update":1367,"Value":2774,"Internal":true,"Count
>  Failed 
> Values":true},{"ID":12923963,"Name":"internal.metrics.shuffle.write.recordsWritten","Update":23,"Value":45,"Internal":true,"Cou
>  nt Failed 
> Values":true},{"ID":12923964,"Name":"internal.metrics.shuffle.write.writeTime","Update":3259051,"Value":6858121,"Internal":true,"Count
>  Failed Values":true},{"ID":12921550,"Name":"number of output 
> rows","Update":"158","Value" :"289","Internal":true,"Count Failed 
> Values":true,"Metadata":"sql"},{"ID":12921546,"Name":"number of output 
> rows","Update":"23","Value":"45","Internal":true,"Count Failed 
> Values":true,"Metadata":"sql"},{"ID":12921547,"Name":"peak memo ry total 
> (min, med, 
> max)","Update":"33816575","Value":"67633149","Internal":true,"Count Failed 
> Values":true,"Metadata":"sql"},{"ID":12921541,"Name":"data size total (min, 
> med, max)","Update":"551","Value":"1077","Internal":true,"Count Failed 
> Values":true,"Metadata":"sql"}]}}
> {code}
>  

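Regarding the last paragraph of the quoted description (the fake `executorIds` can never be released): below is a minimal sketch of why a later idle timeout cannot clean up a phantom executor ID. The names are hypothetical stand-ins for `executorDataMap` in `CoarseGrainedSchedulerBackend` and the idle-removal path; this illustrates the reported behaviour, not the real Spark code:

{code:scala}
import scala.collection.mutable

// Illustrative sketch only: simplified stand-ins for the scheduler backend's
// executor map and the allocation manager's idle-removal path.
object PhantomExecutorSketch {
  // Backend's view: executor "131" was already removed for real.
  private val executorDataMap = mutable.Map("130" -> "alive", "132" -> "alive")

  // Manager's view: the race re-added "131", so it is still tracked.
  private val executorIds = mutable.Set("130", "131", "132")

  // Sketch of a kill request: the backend ignores executors it does not know.
  def killExecutor(id: String): Boolean = executorDataMap.remove(id).isDefined

  def onExecutorIdle(id: String): Unit = {
    if (killExecutor(id)) {
      executorIds -= id   // normal path: the executor is really torn down
    } else {
      // Phantom executor: the kill is a no-op, no removal event will ever
      // arrive for it, and the stale id stays in executorIds forever.
      println(s"executor $id unknown to the backend; stale id kept: $executorIds")
    }
  }

  def main(args: Array[String]): Unit = onExecutorIdle("131")
}
{code}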

