[ https://issues.apache.org/jira/browse/SPARK-26927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marcelo Vanzin resolved SPARK-26927. ------------------------------------ Resolution: Fixed Fix Version/s: 2.3.4 2.4.2 3.0.0 Issue resolved by pull request 23842 [https://github.com/apache/spark/pull/23842] > Race condition may cause dynamic allocation not working > ------------------------------------------------------- > > Key: SPARK-26927 > URL: https://issues.apache.org/jira/browse/SPARK-26927 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.1.0, 2.4.0 > Reporter: liupengcheng > Assignee: liupengcheng > Priority: Major > Fix For: 3.0.0, 2.4.2, 2.3.4 > > Attachments: Selection_042.jpg, Selection_043.jpg, Selection_044.jpg, > Selection_045.jpg, Selection_046.jpg > > > Recently, we catch a bug that caused our production spark thriftserver hangs: > There is a race condition in the ExecutorAllocationManager that the > `SparkListenerExecutorRemoved` event is posted before the > `SparkListenerTaskStart` event, which will cause the incorrect result of > `executorIds`, then when some executor idles, the real executors will be > removed even executor number is equal to `minNumExecutors` due to the > incorrect computation of `newExecutorTotal`(may greater than the > `minNumExecutors`), thus may finally causing zero available executors but a > wrong number of executorIds was kept in memory. > What's more, even the `SparkListenerTaskEnd` event can not make the fake > `executorIds` released, because later idle event for the fake executors can > not cause the real removal of these executors, as they are already removed > and they are not exist in the `executorDataMap` of > `CoaseGrainedSchedulerBackend`. > Logs: > !Selection_042.jpg! > !Selection_043.jpg! > !Selection_044.jpg! > !Selection_045.jpg! > !Selection_046.jpg! > EventLogs(DisOrder of events): > {code:java} > {"Event":"SparkListenerExecutorRemoved","Timestamp":1549936077543,"Executor > ID":"131","Removed Reason":"Container > container_e28_1547530852233_236191_02_000180 exited from explicit termination > request."} > {"Event":"SparkListenerTaskStart","Stage ID":136689,"Stage Attempt > ID":0,"Task Info":{"Task ID":448048,"Index":2,"Attempt":0,"Launch > Time":1549936032872,"Executor > ID":"131","Host":"mb2-hadoop-prc-st474.awsind","Locality":"RACK_LOCAL", > "Speculative":false,"Getting Result Time":0,"Finish > Time":1549936032906,"Failed":false,"Killed":false,"Accumulables":[{"ID":12923945,"Name":"internal.metrics.executorDeserializeTime","Update":10,"Value":13,"Internal":true,"Count > Faile d > Values":true},{"ID":12923946,"Name":"internal.metrics.executorDeserializeCpuTime","Update":2244016,"Value":4286494,"Internal":true,"Count > Failed > Values":true},{"ID":12923947,"Name":"internal.metrics.executorRunTime","Update":20,"Val > ue":39,"Internal":true,"Count Failed > Values":true},{"ID":12923948,"Name":"internal.metrics.executorCpuTime","Update":13412614,"Value":26759061,"Internal":true,"Count > Failed Values":true},{"ID":12923949,"Name":"internal.metrics.resultS > ize","Update":3578,"Value":7156,"Internal":true,"Count Failed > Values":true},{"ID":12923954,"Name":"internal.metrics.peakExecutionMemory","Update":33816576,"Value":67633152,"Internal":true,"Count > Failed Values":true},{"ID":12923962,"Na > me":"internal.metrics.shuffle.write.bytesWritten","Update":1367,"Value":2774,"Internal":true,"Count > Failed > Values":true},{"ID":12923963,"Name":"internal.metrics.shuffle.write.recordsWritten","Update":23,"Value":45,"Internal":true,"Cou > nt Failed > Values":true},{"ID":12923964,"Name":"internal.metrics.shuffle.write.writeTime","Update":3259051,"Value":6858121,"Internal":true,"Count > Failed Values":true},{"ID":12921550,"Name":"number of output > rows","Update":"158","Value" :"289","Internal":true,"Count Failed > Values":true,"Metadata":"sql"},{"ID":12921546,"Name":"number of output > rows","Update":"23","Value":"45","Internal":true,"Count Failed > Values":true,"Metadata":"sql"},{"ID":12921547,"Name":"peak memo ry total > (min, med, > max)","Update":"33816575","Value":"67633149","Internal":true,"Count Failed > Values":true,"Metadata":"sql"},{"ID":12921541,"Name":"data size total (min, > med, max)","Update":"551","Value":"1077","Internal":true,"Count Failed > Values":true,"Metadata":"sql"}]}} > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org