Re: [PR] [SPARK-47565][PYTHON] PySpark worker pool crash resilience [spark]
HyukjinKwon closed pull request #45635: [SPARK-47565][PYTHON] PySpark worker pool crash resilience
URL: https://github.com/apache/spark/pull/45635
Re: [PR] [SPARK-47565][PYTHON] PySpark worker pool crash resilience [spark]
HyukjinKwon commented on PR #45635:
URL: https://github.com/apache/spark/pull/45635#issuecomment-2036882209

Merged to master.
Re: [PR] [SPARK-47565][PYTHON] PySpark worker pool crash resilience [spark]
sebastianhillig-db commented on code in PR #45635:
URL: https://github.com/apache/spark/pull/45635#discussion_r1542043094

core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala:

@@ -95,10 +95,20 @@ private[spark] class PythonWorkerFactory(
   def create(): (PythonWorker, Option[Long]) = {
     if (useDaemon) {
       self.synchronized {
-        if (idleWorkers.nonEmpty) {
+        // Pull from idle workers until we find one that is alive, otherwise create a new one.
+        while (idleWorkers.nonEmpty) {
           val worker = idleWorkers.dequeue()
-          worker.selectionKey.interestOps(SelectionKey.OP_READ | SelectionKey.OP_WRITE)
-          return (worker, daemonWorkers.get(worker))
+          val workerHandle = daemonWorkers(worker)

Review Comment:
Since the idleWorkers queue keeps the reference alive, there should never be a case where the daemonWorkers entry disappears before the worker is retrieved.
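To make that lifetime argument concrete, here is a minimal sketch of the invariant being described; the names and types (Worker, Handle, register) are illustrative, not the actual PythonWorkerFactory code. Because the idle queue holds a strong reference to every queued worker, a weak-keyed map cannot drop an entry while its worker is still in the queue:

    import scala.collection.mutable

    object PoolInvariantSketch {
      final case class Worker(id: Int)
      final case class Handle(pid: Long)

      // Weak keys: an entry vanishes only once nothing else references the Worker.
      private val daemonWorkers = mutable.WeakHashMap.empty[Worker, Handle]
      private val idleWorkers = mutable.Queue.empty[Worker]

      def register(w: Worker, h: Handle): Unit = synchronized {
        daemonWorkers(w) = h
      }

      def releaseWorker(w: Worker): Unit = synchronized {
        // The queue's strong reference keeps the WeakHashMap entry alive.
        idleWorkers.enqueue(w)
      }

      def create(): Option[(Worker, Handle)] = synchronized {
        if (idleWorkers.nonEmpty) {
          val w = idleWorkers.dequeue()
          Some((w, daemonWorkers(w))) // safe lookup under the invariant above
        } else None
      }
    }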
Re: [PR] [SPARK-47565][PYTHON] PySpark worker pool crash resilience [spark]
sebastianhillig-db commented on code in PR #45635:
URL: https://github.com/apache/spark/pull/45635#discussion_r1542043889

core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala:

@@ -95,10 +95,20 @@ private[spark] class PythonWorkerFactory(
   def create(): (PythonWorker, Option[Long]) = {
     if (useDaemon) {
       self.synchronized {
-        if (idleWorkers.nonEmpty) {
+        // Pull from idle workers until we find one that is alive, otherwise create a new one.
+        while (idleWorkers.nonEmpty) {

Review Comment:
I promise not to force push again: https://github.com/apache/spark/pull/45635/files#diff-1bd846874b06327e6abd0803aa74eed890352dfa974d5c1da1a12dc7477e20d0L411
Re: [PR] [SPARK-47565][PYTHON] PySpark worker pool crash resilience [spark]
sebastianhillig-db commented on code in PR #45635:
URL: https://github.com/apache/spark/pull/45635#discussion_r1542041741

core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala:

@@ -95,10 +95,20 @@ private[spark] class PythonWorkerFactory(
   def create(): (PythonWorker, Option[Long]) = {
     if (useDaemon) {
       self.synchronized {
-        if (idleWorkers.nonEmpty) {
+        // Pull from idle workers until we find one that is alive, otherwise create a new one.
+        while (idleWorkers.nonEmpty) {

Review Comment:
Ugh, sorry - the force push broke that link. I'm referring to "releaseWorker", which uses the same synchronization, so we should not be adding new workers while this code runs.
Re: [PR] [SPARK-47565][PYTHON] PySpark worker pool crash resilience [spark]
ueshin commented on code in PR #45635:
URL: https://github.com/apache/spark/pull/45635#discussion_r1541986448

core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala:

@@ -95,10 +95,20 @@ private[spark] class PythonWorkerFactory(
   def create(): (PythonWorker, Option[Long]) = {
     if (useDaemon) {
       self.synchronized {
-        if (idleWorkers.nonEmpty) {
+        // Pull from idle workers until we find one that is alive, otherwise create a new one.
+        while (idleWorkers.nonEmpty) {
           val worker = idleWorkers.dequeue()
-          worker.selectionKey.interestOps(SelectionKey.OP_READ | SelectionKey.OP_WRITE)
-          return (worker, daemonWorkers.get(worker))
+          val workerHandle = daemonWorkers(worker)

Review Comment:
Is there a chance that this throws an exception? If so, should we use `daemonWorkers.get(worker)` as before, just in case?
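For context on the question: on a Scala mutable map, `apply` throws NoSuchElementException when the key is absent, while `get` returns an Option. A tiny, self-contained illustration with hypothetical values (not Spark code):

    import scala.collection.mutable

    object ApplyVsGet extends App {
      val daemonWorkers = mutable.Map("worker-1" -> 101L)

      println(daemonWorkers.get("worker-1")) // Some(101): present key
      println(daemonWorkers.get("worker-2")) // None: absence is an ordinary value

      // daemonWorkers("worker-2") would throw NoSuchElementException;
      // apply is only safe when the key is guaranteed to exist.
      println(daemonWorkers("worker-1"))     // 101
    }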
Re: [PR] [SPARK-47565][PYTHON] PySpark worker pool crash resilience [spark]
dongjoon-hyun commented on code in PR #45635:
URL: https://github.com/apache/spark/pull/45635#discussion_r1541298473

core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala:

@@ -95,10 +95,20 @@ private[spark] class PythonWorkerFactory(
   def create(): (PythonWorker, Option[Long]) = {
     if (useDaemon) {
       self.synchronized {
-        if (idleWorkers.nonEmpty) {
+        // Pull from idle workers until we find one that is alive, otherwise create a new one.
+        while (idleWorkers.nonEmpty) {

Review Comment:
> On each iteration, a worker is pulled from `idleWorkers`; this will end up "emptying" the pool. The synchronization around this ensures that no other workers are added while this happens. (see https://github.com/apache/spark/pull/45635/files/ba3c6f6ee19762278004594735f25ab4f6fafb3e#diff-1bd846874b06327e6abd0803aa74eed890352dfa974d5c1da1a12dc7477e20d0L411-L413)

The link seems to be broken.
Re: [PR] [SPARK-47565][PYTHON] PySpark worker pool crash resilience [spark]
sebastianhillig-db commented on code in PR #45635:
URL: https://github.com/apache/spark/pull/45635#discussion_r1540623063

core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala:

@@ -95,10 +95,20 @@ private[spark] class PythonWorkerFactory(
   def create(): (PythonWorker, Option[Long]) = {
     if (useDaemon) {
       self.synchronized {
-        if (idleWorkers.nonEmpty) {
+        // Pull from idle workers until we find one that is alive, otherwise create a new one.
+        while (idleWorkers.nonEmpty) {

Review Comment:
On each iteration, a worker is pulled from `idleWorkers`; this will end up "emptying" the pool. The synchronization around this ensures that no other workers are added while this happens. (see https://github.com/apache/spark/pull/45635/files/ba3c6f6ee19762278004594735f25ab4f6fafb3e#diff-1bd846874b06327e6abd0803aa74eed890352dfa974d5c1da1a12dc7477e20d0L411-L413)
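A hedged sketch of the drain behavior described in that comment, with illustrative names (the alive flag stands in for the real process-handle liveness check): create() and releaseWorker() synchronize on the same monitor, so the while loop can empty the idle queue, dropping dead workers as it goes, without racing against a concurrent re-enqueue.

    import scala.collection.mutable
    import scala.util.Random

    class DrainSketch {
      final case class Worker(id: Int, alive: Boolean)

      private val idleWorkers = mutable.Queue.empty[Worker]

      def create(): Worker = synchronized {
        while (idleWorkers.nonEmpty) {
          val w = idleWorkers.dequeue() // each iteration removes one entry
          if (w.alive) return w         // reuse the first live worker found
          // dead workers fall out of the pool here; nothing re-adds them
        }
        Worker(Random.nextInt(), alive = true) // queue drained: start a fresh worker
      }

      def releaseWorker(w: Worker): Unit = synchronized {
        idleWorkers.enqueue(w) // blocked while create() holds the lock
      }
    }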