GitHub user vanzin opened a pull request:

    https://github.com/apache/spark/pull/20462

    [SPARK-23020][core] Fix another race in the in-process launcher test.

    First, the bad news: there is a race in the launcher code that this
    change does not fix, because fixing it would take far more effort than
    this change does. The good news is that it should only affect very
    short-lived applications, such as the one run by the flaky test, so it
    is possible to work around it in our test.
    
    The fix also uncovered an issue with the recently added "closeAndWait()"
    method: closing the connection could still cause data loss. This change
    therefore waits for the connection to finish on its own, and closes the
    socket only if that wait times out. The existing connection timeout is
    reused, so it remains possible to control how long to wait.
    
    As part of that, I also restored the old behavior in which disconnect()
    forces a disconnection from the child app; the "wait for data to arrive"
    approach is only taken when disposing of the handle.
    
    I tested this by inserting sleeps in the test and in the socket-handling
    code in the launcher library; with those in place I was able to reproduce
    the error from the Jenkins jobs. With the changes, even with all the
    sleeps still in place, all tests pass.
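
The difference between the two close paths described above (wait with a
timeout when disposing of the handle, force the close on disconnect) can be
sketched roughly as follows. This is a minimal illustration, not the code in
this pull request: the class name, the method signatures, and the use of a
latch in place of a blocking socket read are all assumptions made for the
example.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch: these names are illustrative and are not the
// launcher library's actual classes or method signatures.
class WaitThenClose {

  // Dispose path: give the reader thread up to timeoutMs to finish on its
  // own, then close the connection either way. Returns true if the reader
  // finished before the timeout, false if the close had to cut it short.
  static boolean closeAndWait(Thread reader, Closeable conn, long timeoutMs)
      throws IOException, InterruptedException {
    if (timeoutMs > 0) {
      reader.join(timeoutMs); // join(0) would wait forever, so skip it
    }
    boolean finished = !reader.isAlive();
    conn.close();
    return finished;
  }

  // Disconnect path: force the close immediately, without waiting.
  static void disconnect(Closeable conn) throws IOException {
    conn.close();
  }

  // Runs both scenarios: a reader that ends by itself, and one that only
  // stops once the connection is closed.
  static boolean[] demo() {
    try {
      Thread quick = new Thread(() -> { });
      quick.start();
      boolean quickResult = closeAndWait(quick, () -> { }, 5000);

      CountDownLatch closed = new CountDownLatch(1);
      Thread stuck = new Thread(() -> {
        try {
          closed.await(); // stands in for a blocking socket read
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      stuck.start();
      boolean stuckResult = closeAndWait(stuck, closed::countDown, 100);
      stuck.join();
      return new boolean[] { quickResult, stuckResult };
    } catch (IOException | InterruptedException e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    boolean[] r = demo();
    System.out.println("quick reader finished in time: " + r[0]);
    System.out.println("stuck reader finished in time: " + r[1]);
  }
}
```

In this sketch the timeout is passed in by the caller, which mirrors the idea
of reusing an existing, configurable timeout rather than hard-coding a wait.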

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vanzin/spark SPARK-23020

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20462.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20462
    
----
commit daa5b70d66b32d582dc7c2cdba79ce748ca5cc66
Author: Marcelo Vanzin <vanzin@...>
Date:   2018-01-30T19:35:54Z

    [SPARK-23020][core] Fix another race in the in-process launcher test.

----

