[ 
https://issues.apache.org/jira/browse/SPARK-26927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liupengcheng updated SPARK-26927:
---------------------------------
    Description: 
Recently, we caught a bug that caused our production Spark Thrift Server to hang:

There is a race condition in the ExecutorAllocationManager: the 
`SparkListenerExecutorRemoved` event for an executor can be posted before the 
`SparkListenerTaskStart` event that references it, which corrupts `executorIds`. 
When some executor later becomes idle, real executors get removed even though the 
executor count is already at `minNumExecutors`, because the stale entry makes the 
computed `newExecutorTotal` appear greater than `minNumExecutors`. This can 
eventually leave zero available executors while a wrong set of `executorIds` is 
still kept in memory.

What's more, even the `SparkListenerTaskEnd` event cannot release the fake 
`executorIds`: a later idle event for a fake executor cannot trigger its real 
removal, because that executor has already been removed and no longer exists in 
the `executorDataMap` of `CoarseGrainedSchedulerBackend`.
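
To make the sequence concrete, here is a minimal, self-contained Scala sketch that only models the behaviour described above; it is not the actual Spark source. The object name `AllocationRaceSketch`, the helper methods, the `backendExecutors` set standing in for the backend's `executorDataMap`, and the chosen `minNumExecutors = 1` are all illustrative assumptions.

{code:scala}
import scala.collection.mutable

// Minimal model of the race described above (illustrative only, not Spark's real code).
object AllocationRaceSketch {

  // Stand-in for the executors the scheduler backend actually knows about
  // (CoarseGrainedSchedulerBackend.executorDataMap). A kill request for an id
  // that is not in here is a no-op.
  private val backendExecutors = mutable.Set("130", "131", "132")

  // Stand-in for ExecutorAllocationManager.executorIds.
  private val executorIds = mutable.Set("130", "131", "132")

  private val minNumExecutors = 1 // assumed value, just for the demonstration

  def onExecutorRemoved(id: String): Unit = {
    executorIds -= id
    backendExecutors -= id
  }

  // If the TaskStart event arrives after the ExecutorRemoved event for the same
  // executor (the disorder shown in the event log below), the already-removed id
  // is re-added to executorIds, creating a phantom executor.
  def onTaskStart(executorId: String): Unit = {
    if (!executorIds.contains(executorId)) {
      executorIds += executorId
    }
  }

  // Simplified idle handling: the target count is computed from the inflated
  // executorIds, so real executors keep being removed past minNumExecutors.
  def onExecutorIdle(id: String): Unit = {
    val newExecutorTotal = executorIds.size - 1
    if (newExecutorTotal < minNumExecutors) {
      println(s"keep $id: removal would drop total to $newExecutorTotal")
    } else if (!backendExecutors.remove(id)) {
      // The backend no longer knows this executor, so the kill request does
      // nothing and the stale id is never dropped from executorIds.
      println(s"kill of $id ignored by backend; stale id stays in executorIds")
    } else {
      executorIds -= id
      println(s"killed idle $id: manager tracks ${executorIds.size} ids, " +
        s"backend has ${backendExecutors.size} real executors")
    }
  }

  def main(args: Array[String]): Unit = {
    onExecutorRemoved("131") // container exits; backend and manager drop 131
    onTaskStart("131")       // late TaskStart re-adds the phantom id 131

    onExecutorIdle("130")    // real executor killed (total looks fine due to phantom)
    onExecutorIdle("131")    // no-op: backend cannot kill an executor it already lost
    onExecutorIdle("132")    // last real executor killed -> zero available executors

    println(s"executorIds = $executorIds, real executors = $backendExecutors")
  }
}
{code}

Running this sketch ends with zero real executors while the phantom id "131" survives in `executorIds`, which matches the zero-available-executors state we observed.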

Logs:

!Selection_042.jpg!

!Selection_043.jpg!

!Selection_044.jpg!

!Selection_045.jpg!

!Selection_046.jpg!  

Event logs (showing the out-of-order events):
{code:java}
{"Event":"SparkListenerExecutorRemoved","Timestamp":1549936077543,"Executor 
ID":"131","Removed Reason":"Container 
container_e28_1547530852233_236191_02_000180 exited from explicit termination 
request."}

{"Event":"SparkListenerTaskStart","Stage ID":136689,"Stage Attempt ID":0,"Task 
Info":{"Task ID":448048,"Index":2,"Attempt":0,"Launch 
Time":1549936032872,"Executor 
ID":"131","Host":"mb2-hadoop-prc-st474.awsind","Locality":"RACK_LOCAL", 
"Speculative":false,"Getting Result Time":0,"Finish 
Time":1549936032906,"Failed":false,"Killed":false,"Accumulables":[{"ID":12923945,"Name":"internal.metrics.executorDeserializeTime","Update":10,"Value":13,"Internal":true,"Count
 Faile d 
Values":true},{"ID":12923946,"Name":"internal.metrics.executorDeserializeCpuTime","Update":2244016,"Value":4286494,"Internal":true,"Count
 Failed 
Values":true},{"ID":12923947,"Name":"internal.metrics.executorRunTime","Update":20,"Val
 ue":39,"Internal":true,"Count Failed 
Values":true},{"ID":12923948,"Name":"internal.metrics.executorCpuTime","Update":13412614,"Value":26759061,"Internal":true,"Count
 Failed Values":true},{"ID":12923949,"Name":"internal.metrics.resultS 
ize","Update":3578,"Value":7156,"Internal":true,"Count Failed 
Values":true},{"ID":12923954,"Name":"internal.metrics.peakExecutionMemory","Update":33816576,"Value":67633152,"Internal":true,"Count
 Failed Values":true},{"ID":12923962,"Na 
me":"internal.metrics.shuffle.write.bytesWritten","Update":1367,"Value":2774,"Internal":true,"Count
 Failed 
Values":true},{"ID":12923963,"Name":"internal.metrics.shuffle.write.recordsWritten","Update":23,"Value":45,"Internal":true,"Cou
 nt Failed 
Values":true},{"ID":12923964,"Name":"internal.metrics.shuffle.write.writeTime","Update":3259051,"Value":6858121,"Internal":true,"Count
 Failed Values":true},{"ID":12921550,"Name":"number of output 
rows","Update":"158","Value" :"289","Internal":true,"Count Failed 
Values":true,"Metadata":"sql"},{"ID":12921546,"Name":"number of output 
rows","Update":"23","Value":"45","Internal":true,"Count Failed 
Values":true,"Metadata":"sql"},{"ID":12921547,"Name":"peak memo ry total (min, 
med, max)","Update":"33816575","Value":"67633149","Internal":true,"Count Failed 
Values":true,"Metadata":"sql"},{"ID":12921541,"Name":"data size total (min, 
med, max)","Update":"551","Value":"1077","Internal":true,"Count Failed 
Values":true,"Metadata":"sql"}]}}

{code}
 


> Race condition may cause dynamic allocation not working
> -------------------------------------------------------
>
>                 Key: SPARK-26927
>                 URL: https://issues.apache.org/jira/browse/SPARK-26927
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.1.0, 2.4.0
>            Reporter: liupengcheng
>            Priority: Major
>         Attachments: Selection_042.jpg, Selection_043.jpg, Selection_044.jpg, 
> Selection_045.jpg, Selection_046.jpg
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
