Re: [PR] [SPARK-47565][PYTHON] PySpark worker pool crash resilience [spark]

2024-04-04 Thread via GitHub


HyukjinKwon closed pull request #45635: [SPARK-47565][PYTHON] PySpark worker 
pool crash resilience
URL: https://github.com/apache/spark/pull/45635





Re: [PR] [SPARK-47565][PYTHON] PySpark worker pool crash resilience [spark]

2024-04-04 Thread via GitHub


HyukjinKwon commented on PR #45635:
URL: https://github.com/apache/spark/pull/45635#issuecomment-2036882209

   Merged to master.





Re: [PR] [SPARK-47565][PYTHON] PySpark worker pool crash resilience [spark]

2024-03-27 Thread via GitHub


sebastianhillig-db commented on code in PR #45635:
URL: https://github.com/apache/spark/pull/45635#discussion_r1542043094


##
core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala:
##
@@ -95,10 +95,20 @@ private[spark] class PythonWorkerFactory(
   def create(): (PythonWorker, Option[Long]) = {
     if (useDaemon) {
       self.synchronized {
-        if (idleWorkers.nonEmpty) {
+        // Pull from idle workers until we find one that is alive, otherwise create a new one.
+        while (idleWorkers.nonEmpty) {
           val worker = idleWorkers.dequeue()
-          worker.selectionKey.interestOps(SelectionKey.OP_READ | SelectionKey.OP_WRITE)
-          return (worker, daemonWorkers.get(worker))
+          val workerHandle = daemonWorkers(worker)

Review Comment:
   Since the idleWorkers queue keeps the reference alive, the daemonWorkers 
entry should never disappear before the worker is retrieved.
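
   For context, a minimal sketch of the invariant being claimed. It assumes 
`daemonWorkers` is a weak-keyed map (e.g. `scala.collection.mutable.WeakHashMap`), 
which is what would make "keeping the reference alive" matter; the `Worker` and 
`Handle` types are illustrative stand-ins, not quotes from the PR:

       import scala.collection.mutable

       final class Worker
       final class Handle

       // Weak keys: an entry can vanish once nothing strongly references its key.
       val daemonWorkers = new mutable.WeakHashMap[Worker, Handle]()
       // Strong references: while a worker sits in this queue, its key stays
       // reachable, so its daemonWorkers entry cannot be dropped by the GC.
       val idleWorkers = new mutable.Queue[Worker]()

       val w = new Worker
       daemonWorkers(w) = new Handle
       idleWorkers.enqueue(w)
       // Safe under the invariant: the queue guarantees the entry still exists.
       val handle = daemonWorkers(idleWorkers.dequeue())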






Re: [PR] [SPARK-47565][PYTHON] PySpark worker pool crash resilience [spark]

2024-03-27 Thread via GitHub


sebastianhillig-db commented on code in PR #45635:
URL: https://github.com/apache/spark/pull/45635#discussion_r1542043889


##
core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala:
##
@@ -95,10 +95,20 @@ private[spark] class PythonWorkerFactory(
   def create(): (PythonWorker, Option[Long]) = {
     if (useDaemon) {
       self.synchronized {
-        if (idleWorkers.nonEmpty) {
+        // Pull from idle workers until we find one that is alive, otherwise create a new one.
+        while (idleWorkers.nonEmpty) {

Review Comment:
   Promise not to force push again: 
https://github.com/apache/spark/pull/45635/files#diff-1bd846874b06327e6abd0803aa74eed890352dfa974d5c1da1a12dc7477e20d0L411






Re: [PR] [SPARK-47565][PYTHON] PySpark worker pool crash resilience [spark]

2024-03-27 Thread via GitHub


sebastianhillig-db commented on code in PR #45635:
URL: https://github.com/apache/spark/pull/45635#discussion_r1542041741


##
core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala:
##
@@ -95,10 +95,20 @@ private[spark] class PythonWorkerFactory(
   def create(): (PythonWorker, Option[Long]) = {
     if (useDaemon) {
       self.synchronized {
-        if (idleWorkers.nonEmpty) {
+        // Pull from idle workers until we find one that is alive, otherwise create a new one.
+        while (idleWorkers.nonEmpty) {

Review Comment:
   Ugh, sorry - the force push broke that link. I'm referring to `releaseWorker` 
using the same synchronization, so no new workers can be added while this code 
runs.
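
   To make that concrete, a minimal sketch of the locking pattern, with a 
placeholder worker type and trimmed-down method bodies (the real `create` and 
`releaseWorker` do considerably more):

       import scala.collection.mutable

       class WorkerPool { self =>
         private val idleWorkers = new mutable.Queue[AnyRef]()

         def create(): Option[AnyRef] = self.synchronized {
           // While this block runs, releaseWorker blocks on the same monitor,
           // so idleWorkers cannot grow mid-drain and the loop terminates.
           if (idleWorkers.nonEmpty) Some(idleWorkers.dequeue()) else None
         }

         def releaseWorker(worker: AnyRef): Unit = self.synchronized {
           idleWorkers.enqueue(worker)
         }
       }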






Re: [PR] [SPARK-47565][PYTHON] PySpark worker pool crash resilience [spark]

2024-03-27 Thread via GitHub


ueshin commented on code in PR #45635:
URL: https://github.com/apache/spark/pull/45635#discussion_r1541986448


##
core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala:
##
@@ -95,10 +95,20 @@ private[spark] class PythonWorkerFactory(
   def create(): (PythonWorker, Option[Long]) = {
     if (useDaemon) {
       self.synchronized {
-        if (idleWorkers.nonEmpty) {
+        // Pull from idle workers until we find one that is alive, otherwise create a new one.
+        while (idleWorkers.nonEmpty) {
           val worker = idleWorkers.dequeue()
-          worker.selectionKey.interestOps(SelectionKey.OP_READ | SelectionKey.OP_WRITE)
-          return (worker, daemonWorkers.get(worker))
+          val workerHandle = daemonWorkers(worker)

Review Comment:
   Is there a chance that this throws an exception?
   If so, should we use `daemonWorkers.get(worker)` as before, just in case?
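
   For reference, the distinction the question turns on: on a Scala `Map`, 
`map(key)` throws `NoSuchElementException` when the key is absent, while 
`map.get(key)` returns an `Option`. This is generic library behavior, not code 
from the PR:

       import scala.collection.mutable

       val handles = mutable.Map("w1" -> 101L)

       handles("w1")      // 101L
       handles.get("w1")  // Some(101L)
       handles.get("w2")  // None: absence becomes a value the caller can branch on
       // handles("w2")   // would throw NoSuchElementException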






Re: [PR] [SPARK-47565][PYTHON] PySpark worker pool crash resilience [spark]

2024-03-27 Thread via GitHub


dongjoon-hyun commented on code in PR #45635:
URL: https://github.com/apache/spark/pull/45635#discussion_r1541298473


##
core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala:
##
@@ -95,10 +95,20 @@ private[spark] class PythonWorkerFactory(
   def create(): (PythonWorker, Option[Long]) = {
     if (useDaemon) {
       self.synchronized {
-        if (idleWorkers.nonEmpty) {
+        // Pull from idle workers until we find one that is alive, otherwise create a new one.
+        while (idleWorkers.nonEmpty) {

Review Comment:
   > On each iteration, a worker is pulled from `idleWorkers`, which will end up 
"emptying" the pool. The synchronization around this ensures that no other 
workers are added while this happens. (see 
https://github.com/apache/spark/pull/45635/files/ba3c6f6ee19762278004594735f25ab4f6fafb3e#diff-1bd846874b06327e6abd0803aa74eed890352dfa974d5c1da1a12dc7477e20d0L411-L413)
   
   The link seems to be broken.






Re: [PR] [SPARK-47565][PYTHON] PySpark worker pool crash resilience [spark]

2024-03-27 Thread via GitHub


sebastianhillig-db commented on code in PR #45635:
URL: https://github.com/apache/spark/pull/45635#discussion_r1540623063


##
core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala:
##
@@ -95,10 +95,20 @@ private[spark] class PythonWorkerFactory(
   def create(): (PythonWorker, Option[Long]) = {
     if (useDaemon) {
       self.synchronized {
-        if (idleWorkers.nonEmpty) {
+        // Pull from idle workers until we find one that is alive, otherwise create a new one.
+        while (idleWorkers.nonEmpty) {

Review Comment:
   On each iteration, a worker is pulled from `idleWorkers`, which will end up 
"emptying" the pool. The synchronization around this ensures that no other 
workers are added while this happens. (see 
https://github.com/apache/spark/pull/45635/files/ba3c6f6ee19762278004594735f25ab4f6fafb3e#diff-1bd846874b06327e6abd0803aa74eed890352dfa974d5c1da1a12dc7477e20d0L411-L413)
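
   As an illustration of the drain-and-reuse behavior described above, a minimal 
sketch; the `Worker` and `Handle` types and the `alive` liveness test are 
stand-ins, not the PR's actual signatures:

       import scala.collection.mutable

       case class Worker(id: Int)
       case class Handle(alive: Boolean)

       def takeLiveWorker(
           idleWorkers: mutable.Queue[Worker],
           handles: mutable.Map[Worker, Handle]): Option[Worker] = {
         // Dead workers are dequeued and discarded; in the worst case this
         // empties the pool and the caller falls through to spawning a new one.
         while (idleWorkers.nonEmpty) {
           val worker = idleWorkers.dequeue()
           if (handles(worker).alive) return Some(worker)
         }
         None
       }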


