Github user mccheah commented on the issue: https://github.com/apache/spark/pull/19468

@foxish @mridulm Heads up: since the last review iteration, I wrote an extra test in `KubernetesClusterSchedulerBackend` that exposed a bug where, if executors end up in an error state without ever having registered with the driver, the driver does not attempt to replace them in subsequent batches. The test case is [here](https://github.com/apache/spark/pull/19468/commits/4b3213422e6e67b11de7b627ad46d4031043be0e#diff-e56a211862434414dd307a6366d793f0R362). The fix is to count every executor that hits an error state in our watch as a "disconnected" executor, regardless of whether the driver endpoint ever marked it as disconnected - see [here](https://github.com/apache/spark/pull/19468/commits/4b3213422e6e67b11de7b627ad46d4031043be0e#diff-cb0798d511ec5504fc282407c993d5d4R334).

Additionally, I was able to remove one of the data structures, which mapped [executor pod names to executor IDs](https://github.com/apache/spark/pull/19468/commits/4b3213422e6e67b11de7b627ad46d4031043be0e#diff-cb0798d511ec5504fc282407c993d5d4L58). Instead, whenever we get a Pod object, we can look up its executor ID via the executor ID label on the pod. This reduces the amount of state we have to track.
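To illustrate the shape of the fix, here is a minimal, hypothetical Scala sketch (not the actual PR code; the label key, names, and structure are all assumptions for illustration). The key point is that the pod watch's error handler records the executor as disconnected directly, so an executor that errors before ever registering with the driver endpoint is still counted and replaced in the next allocation batch. It also shows recovering the executor ID from a pod label rather than a separate pod-name-to-ID map:

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical sketch; identifiers below are illustrative, not Spark's.
object ExecutorWatchSketch {
  // Assumed label key stamped on every executor pod at creation time.
  private val ExecutorIdLabel = "spark-exec-id"

  // Executor ID -> pod name, for executors the allocator must replace.
  // Populated from the watch, so registration with the driver endpoint
  // is not a prerequisite for being counted as disconnected.
  private val disconnectedExecutors = new ConcurrentHashMap[String, String]()

  // Called from the pod watch whenever a pod reaches an error state.
  def onPodError(podName: String, labels: Map[String, String]): Unit = {
    // Look up the executor ID from the pod's labels instead of a
    // separate podName -> executorId map.
    labels.get(ExecutorIdLabel).foreach { execId =>
      disconnectedExecutors.putIfAbsent(execId, podName)
    }
  }

  // The allocator consults this when sizing the next request batch.
  def numPendingReplacements: Int = disconnectedExecutors.size
}
```

A pod that errors twice, or lacks the executor ID label entirely, is counted at most once or not at all, which keeps the replacement count consistent.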