Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18320#discussion_r123440068 --- Diff: R/pkg/inst/worker/daemon.R --- @@ -30,8 +30,40 @@ port <- as.integer(Sys.getenv("SPARKR_WORKER_PORT")) inputCon <- socketConnection( port = port, open = "rb", blocking = TRUE, timeout = connectionTimeout) +# Waits indefinitely for a socket connecion by default. +selectTimeout <- NULL + while (TRUE) { - ready <- socketSelect(list(inputCon)) + ready <- socketSelect(list(inputCon), timeout = selectTimeout) + + # Note that the children should be terminated in the parent. If each child terminates + # itself, it appears that the resource is not released properly, that causes an unexpected + # termination of this daemon due to, for example, running out of file descriptors + # (see SPARK-21093). Therefore, the current implementation tries to retrieve children + # that are exited (but not terminated) and then sends a kill signal to terminate them properly + # in the parent. + # + # There are two paths that it attempts to send a signal to terminate the children in the parent. + # + # 1. Every second if any socket connection is not available and if there are child workers + # running. + # 2. Right after a socket connection is available. + # + # In other words, the parent attempts to send the signal to the children every second if + # any worker is running or right before launching other worker children from the following + # new socket connection. + + # Only the process IDs of exited children are returned and the termination is attempted below. + children <- parallel:::selectChildren(timeout = 0) + if (is.integer(children)) { + # If it is PIDs, there are workers exited but not terminated. Attempts to terminate them --- End diff -- I think [this line](https://github.com/s-u/multicore/blob/e9d9bf21e6cf08e24cfe54d762379b4fa923765b/src/fork.c#L440) checks if any data from children is available via file descriptors (pipes I believe). The child do not send any data back to the parent (via `parallel:::sendMaster` API and etc.). So, `FD_ISSET` will only be true on the close of the pipe (as this is a event to the pipe), which I guess happens when `exit()` is called properly.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org