EnricoMi opened a new pull request, #51203: URL: https://github.com/apache/spark/pull/51203
### What changes were proposed in this pull request?

This allows executors to register their block managers with the driver via a Kubernetes service name rather than the pod IP. The driver and other executors then connect to an executor's block manager through that service.

### Why are the changes needed?

In Kubernetes, connecting to an evicted (decommissioned) executor times out after 2 minutes (default). Executors connect to other executors synchronously (one at a time), so this timeout accumulates for each executor peer. An executor that reads from many decommissioned executors blocks for a multiple of the timeout until it fails with a fetch failure.

This can be fixed by binding the block manager to a fixed port, defining a Kubernetes service for that block manager port, and having the executor register that K8S service port with the driver. The driver and other executors then connect to the service name and fail instantly with a connection refused if the executor has been decommissioned and the service removed.

Setting `spark.kubernetes.executor.enableService=true` and defining `spark.blockManager.port` performs this setup for each executor.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.
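To illustrate the accumulation described under "Why are the changes needed?", here is a minimal back-of-the-envelope sketch. It is not Spark code: the function name and the assumption that a refused connection costs roughly zero seconds are purely illustrative, and only the 2-minute timeout comes from the description above.

```python
# Illustrative model only; this is not part of Spark or of this PR.

TIMEOUT_SECONDS = 120.0  # the 2-minute default mentioned in the description
REFUSED_SECONDS = 0.0    # assumption: "connection refused" fails almost instantly


def worst_case_block_seconds(dead_peers: int, per_peer_wait_seconds: float) -> float:
    """Total wait when dead peers are contacted one at a time (synchronously)."""
    return dead_peers * per_peer_wait_seconds


if __name__ == "__main__":
    for peers in (1, 5, 20):
        via_pod_ip = worst_case_block_seconds(peers, TIMEOUT_SECONDS)
        via_service = worst_case_block_seconds(peers, REFUSED_SECONDS)
        print(f"{peers} dead peers: pod IP ~{via_pod_ip:.0f}s, service ~{via_service:.0f}s")
```

Under these assumptions the wait grows linearly with the number of decommissioned peers when connecting by pod IP, while it stays near zero when the removed service refuses the connection.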
