GitHub user HyukjinKwon reopened a pull request: https://github.com/apache/spark/pull/18320
[SPARK-21093][R] Terminate R's worker processes in the parent of R's daemon to prevent a leak

## What changes were proposed in this pull request?

`mcfork` in R appears to open a pipe before forking, but the existing logic does not properly close it when it is executed hot. The leaked pipes eventually reach the limit on the number of open files, after which further forks fail. This hot execution appears to occur particularly with `gapply`/`gapplyCollect`. For an unknown reason, this happens more easily on CentOS, and it can be reproduced on Mac too.

All the details are described in https://issues.apache.org/jira/browse/SPARK-21093

This PR proposes simply terminating R's worker processes in the parent of R's daemon to prevent the leak.

## How was this patch tested?

I ran the code below on both CentOS and Mac with that configuration disabled/enabled.

```r
df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
collect(gapply(df, "a", function(key, x) { x }, schema(df)))
collect(gapply(df, "a", function(key, x) { x }, schema(df)))
...  # 30 times
```

Also, it now passes the R tests on CentOS as below:

```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark
..............................................................................................................................................................
..............................................................................................................................................................
..............................................................................................................................................................
..............................................................................................................................................................
..............................................................................................................................................................
....................................................................................................................................
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-21093

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18320.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18320

----
commit 6e57ed2931afc5aec8c4b4bef72c157abcb68c46
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2017-06-16T02:37:53Z

    Terminates forked processed in the parent process

commit 4eadafe3f009b1c70956c08c99302c1da34db6d4
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2017-06-17T09:21:37Z

    Fix typo (renaming missed)

commit 18b3ee9a66df40658074511558f0cd36fc102df7
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2017-06-17T10:54:57Z

    Rename x to c in lapply

commit 72ab1f2e8cafa1d5249a09279825444e3ca38b39
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2017-06-19T12:21:08Z

    Update comments to describe the behaviour change

commit 6cba54c243123d25f479363d3dfd7eb92bb25599
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2017-06-20T01:02:26Z

    Do not check every second if there is no worker running

commit 4954008884ff02a9eae9ea50586e86e8923fc593
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2017-06-20T09:04:03Z

    Address comment

commit f3f57e46868e66b8f50268910c1eff494638059d
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2017-06-20T09:30:56Z

    Fix comments

commit 04bb37a6d8d4387365f6d46cb8e2c6fbe912351d
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2017-06-20T09:32:29Z

    Fix comments

commit d6f0ff275abd3b7641210427eea955c9f0ea8d86
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2017-06-20T09:39:21Z

    Add more comments

commit 8b48274dc565dc5c6722e983c55494b0067bda72
Author: Hyukjin Kwon <gurwls...@gmail.com>
Date:   2017-06-20T10:00:05Z

    Fix a typo

----