[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...

2018-01-10 Thread vanzin
GitHub user vanzin opened a pull request:

https://github.com/apache/spark/pull/20223

[SPARK-23020][core] Fix races in launcher code, test.

The race in the code is because the handle might update
its state to the wrong state if the connection handling
thread is still processing incoming data; so the handle
needs to wait for the connection to finish up before
checking the final state.

The race in the test is because when waiting for a handle
to reach a final state, the waitFor() method needs to wait
until all handle state is updated (which also includes
waiting for the connection thread above to finish).
Otherwise, waitFor() may return too early, which would cause
a bunch of different races (like the listener not being yet
notified of the state change, or being in the middle of
being notified, or the handle not being properly disposed
and causing postChecks() to assert).

On top of that I found, by code inspection, a couple of
potential races that could make a handle end up in the
wrong state when being killed.

Tested by running the existing unit tests a lot (and not
seeing the errors I was seeing before).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vanzin/spark SPARK-23020

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20223.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20223


commit 5139f605904996667d8d97941172bbe9d366a579
Author: Marcelo Vanzin 
Date:   2018-01-10T17:56:17Z

[SPARK-23020][core] Fix races in launcher code, test.

The race in the code is because the handle might update
its state to the wrong state if the connection handling
thread is still processing incoming data; so the handle
needs to wait for the connection to finish up before
checking the final state.

The race in the test is because when waiting for a handle
to reach a final state, the waitFor() method needs to wait
until all handle state is updated (which also includes
waiting for the connection thread above to finish).
Otherwise, waitFor() may return too early, which would cause
a bunch of different races (like the listener not being yet
notified of the state change, or being in the middle of
being notified, or the handle not being properly disposed
and causing postChecks() to assert).

On top of that I found, by code inspection, a couple of
potential races that could make a handle end up in the
wrong state when being killed.

Tested by running the existing unit tests a lot (and not
seeing the errors I was seeing before).




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...

2018-01-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20223#discussion_r160952457
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/AbstractAppHandle.java ---
@@ -91,10 +92,15 @@ LauncherConnection getConnection() {
 return connection;
   }
 
-  boolean isDisposed() {
+  synchronized boolean isDisposed() {
 return disposed;
--- End diff --

can we simply mark `disposed` as `transient`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...

2018-01-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20223#discussion_r160956968
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java ---
@@ -95,15 +95,15 @@ protected synchronized void send(Message msg) throws 
IOException {
   }
 
   @Override
-  public void close() throws IOException {
+  public synchronized void close() throws IOException {
 if (!closed) {
-  synchronized (this) {
--- End diff --

what's wrong with this? It's a classic "double-checked locking" in java.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...

2018-01-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20223#discussion_r160957301
  
--- Diff: 
core/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java ---
@@ -137,7 +139,9 @@ public void testInProcessLauncher() throws Exception {
   // Here DAGScheduler is stopped, while 
SparkContext.clearActiveContext may not be called yet.
   // Wait for a reasonable amount of time to avoid creating two active 
SparkContext in JVM.
--- End diff --

why is `500 ms` not a reasonable waiting time anymore?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...

2018-01-11 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/20223#discussion_r160989352
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/AbstractAppHandle.java ---
@@ -91,10 +92,15 @@ LauncherConnection getConnection() {
 return connection;
   }
 
-  boolean isDisposed() {
+  synchronized boolean isDisposed() {
--- End diff --

why do we need `synchronized` here


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...

2018-01-11 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/20223#discussion_r160991236
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/AbstractAppHandle.java ---
@@ -71,15 +71,16 @@ public void stop() {
   @Override
   public synchronized void disconnect() {
 if (!disposed) {
--- End diff --

`if(!isDisposed())` ?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...

2018-01-11 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/20223#discussion_r160989105
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/AbstractAppHandle.java ---
@@ -91,10 +92,15 @@ LauncherConnection getConnection() {
 return connection;
   }
 
-  boolean isDisposed() {
+  synchronized boolean isDisposed() {
 return disposed;
   }
 
+  synchronized void markDisposed() {
--- End diff --

why do we need `synchronized` here? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...

2018-01-11 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/20223#discussion_r161006918
  
--- Diff: 
core/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java ---
@@ -137,7 +139,9 @@ public void testInProcessLauncher() throws Exception {
   // Here DAGScheduler is stopped, while 
SparkContext.clearActiveContext may not be called yet.
   // Wait for a reasonable amount of time to avoid creating two active 
SparkContext in JVM.
--- End diff --

Here Vanzin is waiting for the active SparkContext being empty.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...

2018-01-11 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20223#discussion_r161022567
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/AbstractAppHandle.java ---
@@ -91,10 +92,15 @@ LauncherConnection getConnection() {
 return connection;
   }
 
-  boolean isDisposed() {
+  synchronized boolean isDisposed() {
 return disposed;
--- End diff --

The `synchronized` is not protecting the variable, but all the actions that 
happen around the code that sets the variable.

So this method returning guarantees that either all the disposal tasks have 
run or none have.

That being said `ServerConnection.close` is not really holding the handle 
lock as it should, so I might have to make some changes here.

These are not really contended locks. In the worst case there will be 2 
threads looking the them. So trying to come up with a finer-grained locking 
scheme here sounds like more trouble than it's worth.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...

2018-01-11 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20223#discussion_r161022671
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java ---
@@ -95,15 +95,15 @@ protected synchronized void send(Message msg) throws 
IOException {
   }
 
   @Override
-  public void close() throws IOException {
+  public synchronized void close() throws IOException {
 if (!closed) {
-  synchronized (this) {
--- End diff --

The sub-class needs coarser-grained locking so this wasn't really doing 
anything useful.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...

2018-01-11 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20223#discussion_r161022921
  
--- Diff: 
core/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java ---
@@ -137,7 +139,9 @@ public void testInProcessLauncher() throws Exception {
   // Here DAGScheduler is stopped, while 
SparkContext.clearActiveContext may not be called yet.
   // Wait for a reasonable amount of time to avoid creating two active 
SparkContext in JVM.
--- End diff --

It's not that 500ms is not reasonable, it's that this code will return more 
quickly when the context shuts down in under 500ms, have a longer grace period 
in case it ever takes more than 500ms, and fail the test here if the context 
does not shut down.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...

2018-01-15 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/20223#discussion_r161564385
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java ---
@@ -48,14 +48,16 @@ public synchronized void disconnect() {
 
   @Override
   public synchronized void kill() {
-disconnect();
-if (childProc != null) {
-  if (childProc.isAlive()) {
-childProc.destroyForcibly();
+if (!isDisposed()) {
+  setState(State.KILLED);
--- End diff --

Why put `setState` in the front? It should be OK but in previous code some 
of the `setState` is called in the end.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...

2018-01-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20223#discussion_r161651289
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java ---
@@ -48,14 +48,16 @@ public synchronized void disconnect() {
 
   @Override
   public synchronized void kill() {
-disconnect();
-if (childProc != null) {
-  if (childProc.isAlive()) {
-childProc.destroyForcibly();
+if (!isDisposed()) {
+  setState(State.KILLED);
--- End diff --

I have the same question, but should be OK as we always synchronize when 
touching the state.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...

2018-01-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20223


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...

2018-01-16 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20223#discussion_r161834649
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java ---
@@ -48,14 +48,16 @@ public synchronized void disconnect() {
 
   @Override
   public synchronized void kill() {
-disconnect();
-if (childProc != null) {
-  if (childProc.isAlive()) {
-childProc.destroyForcibly();
+if (!isDisposed()) {
+  setState(State.KILLED);
--- End diff --

I changed this because the old disconnect code, at least, might change the 
handle's state. It's easier to put this call first and not have to reason about 
whether that will happen.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org