[
https://issues.apache.org/jira/browse/SPARK-23182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16352870#comment-16352870
]
Petar Petrov commented on SPARK-23182:
--------------------------------------
We run a cluster of ~1000 cores in GCE, using preemptible VMs for the
executors / workers and a standard (non-preemptible) VM for the master. The
cluster runs 24/7 and processes about 20000 jobs / day. Over time, many workers
join the cluster and later drop out of it. GCE evicts preemptible VMs without a
graceful shutdown.
GCE does support setting a shutdown script on preemptible VMs, but it's not
always invoked (from https://cloud.google.com/compute/docs/shutdownscript):
{noformat}
Compute Engine only executes shutdown scripts on a best-effort basis and does
not guarantee that the shutdown script will be run in all cases.{noformat}
When a worker joins the cluster and its VM is then stopped without the executor
being shut down gracefully, the master keeps the connection open (although
inactive) indefinitely. After some time the master fails with "Too many open
files" and cannot accept new connections. Hence the need to enable TCP keep
alive: with it enabled, when a worker disappears, the master's OS probes the
other side and closes the connection if it does not respond.
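
To illustrate the mechanism (this is not Spark code), here is a minimal Python
sketch of enabling keep alive on a plain TCP socket. The helper name
enable_keepalive and the timing values are assumptions chosen for the example.
The TCP_KEEPIDLE / TCP_KEEPINTVL / TCP_KEEPCNT tuning options are
platform-specific (available on Linux), so the sketch applies them only where
the socket module exposes them.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Enable TCP keep alive on sock (hypothetical helper, illustrative values).

    Returns the approximate number of seconds after which the OS gives up on
    a silent peer when the tuning options below are applied:
    idle time + interval * number of probes.
    """
    # Portable switch: ask the OS to probe idle connections.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket tuning is platform-specific; without it the OS-wide
    # defaults apply (on Linux the default idle time is two hours).
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between consecutive probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is dropped.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

    return idle_s + interval_s * probes


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        budget = enable_keepalive(s)
        enabled = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
        print("keep alive enabled:", enabled, "detection budget (s):", budget)
```

Once the probes go unanswered, the OS closes the connection and the file
descriptor can be released, which is the effect needed on the master when a
preempted worker vanishes.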
> Allow enabling of TCP keep alive for master RPC connections
> -----------------------------------------------------------
>
> Key: SPARK-23182
> URL: https://issues.apache.org/jira/browse/SPARK-23182
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: Petar Petrov
> Priority: Major
>
> We rely heavily on preemptible worker machines in GCP/GCE. These machines
> disappear without closing their TCP connections to the master, which increases
> the number of established connections until new workers cannot connect because
> of "Too many open files" on the master.
> To solve the problem we need to enable TCP keep alive for the RPC connections
> to the master, but it's not possible to do so via configuration.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)