Github user mccheah commented on the issue:

    https://github.com/apache/spark/pull/19468
  
    @foxish @mridulm Heads up - since the last review iteration, I wrote an
extra test in `KubernetesClusterSchedulerBackend` that exposed a bug: if
executors never register with the driver but end up in the error state, the
driver doesn't attempt to replace them in subsequent batches. The test case is
[here](https://github.com/apache/spark/pull/19468/commits/4b3213422e6e67b11de7b627ad46d4031043be0e#diff-e56a211862434414dd307a6366d793f0R362).
The fix is to ensure that all executors that hit an error state in our watch
are counted as "disconnected" executors, regardless of whether they were ever
marked as disconnected by the driver endpoint through any other path - see
[here](https://github.com/apache/spark/pull/19468/commits/4b3213422e6e67b11de7b627ad46d4031043be0e#diff-cb0798d511ec5504fc282407c993d5d4R334).
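
    To illustrate the intent (not the exact code in the PR), here is a minimal
Scala sketch of the approach: the pod watch records any errored executor as
disconnected unconditionally, so the allocator's next batch treats it as lost
and requests a replacement. The names `disconnectedPodsByExecutorId` and
`onExecutorPodError` are illustrative assumptions, not the actual fields.

    ```scala
    import java.util.concurrent.ConcurrentHashMap

    class ErrorTrackingSketch {
      // executorId -> pod name, for executors we consider disconnected.
      private val disconnectedPodsByExecutorId = new ConcurrentHashMap[String, String]()

      // Called from the pod watch whenever a pod transitions to an error state,
      // even if that executor never registered with the driver endpoint.
      def onExecutorPodError(executorId: String, podName: String): Unit = {
        disconnectedPodsByExecutorId.putIfAbsent(executorId, podName)
      }

      // Consulted by the allocator when sizing the next batch of executor requests.
      def numExecutorsToReplace: Int = disconnectedPodsByExecutorId.size
    }
    ```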
    
    Additionally, I was able to remove one of the data structures, which mapped
[executor pod names to executor
IDs](https://github.com/apache/spark/pull/19468/commits/4b3213422e6e67b11de7b627ad46d4031043be0e#diff-cb0798d511ec5504fc282407c993d5d4L58).
Instead, whenever we get a Pod object, we can look up its executor ID via the
executor ID label on the pod. This reduces the amount of state we have to keep
track of.
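
    As a rough sketch of that lookup (assuming a label key like
`spark-exec-id` is stamped on each executor pod at creation - the actual label
constant in the PR may differ), using the fabric8 client's Pod model:

    ```scala
    import io.fabric8.kubernetes.api.model.Pod

    object ExecutorIdFromPod {
      // Assumed label key; the PR defines its own constant for this.
      private val ExecutorIdLabel = "spark-exec-id"

      // Read the executor ID directly off the pod's labels instead of keeping
      // a separate podName -> executorId map in the scheduler backend.
      def executorIdFor(pod: Pod): Option[String] =
        Option(pod.getMetadata.getLabels)
          .flatMap(labels => Option(labels.get(ExecutorIdLabel)))
    }
    ```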

