GitHub user aarondav opened a pull request:

    https://github.com/apache/spark/pull/640

    SPARK-1579: Clean up PythonRDD and avoid swallowing IOExceptions

    This patch includes several cleanups to PythonRDD, focused on fixing
    [SPARK-1579](https://issues.apache.org/jira/browse/SPARK-1579) cleanly.
    Listed in order of approximate importance:
    
    - The Python daemon waits for Spark to close the socket before exiting,
      in order to avoid causing spurious IOExceptions in Spark's
      `PythonRDD::WriterThread` (see the first sketch after this list).
    - Removes the Python Monitor Thread, which polled for task cancellations
      in order to kill the Python worker. Instead, we do this in the
      onCompleteCallback, since this is guaranteed to be called during
      cancellation.
    - Adds a "completed" variable to TaskContext to avoid the issue noted in
      [SPARK-1019](https://issues.apache.org/jira/browse/SPARK-1019), where
      onCompleteCallbacks may be execution-order dependent (see the second
      sketch after this list). Along with this, I removed the
      "context.interrupted = true" flag in the onCompleteCallback.
    - Extracts PythonRDD::WriterThread to its own class.
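
    As a rough illustration of the first point, here is a minimal sketch of
    the worker-side handshake. It is not the actual daemon.py change; the
    helper name and buffer size are made up:

    ```
    def wait_for_spark_to_close(sock):
        # Hypothetical helper: after the worker has finished writing its
        # results, block until Spark closes its end of the connection.
        # recv() returning b"" means EOF, so the worker never exits (and
        # never closes the socket) while the JVM-side writer thread might
        # still be using it.
        try:
            while sock.recv(4096):
                pass
        except OSError:
            pass          # peer already gone; nothing left to wait for
        finally:
            sock.close()  # only now is it safe for the worker to exit
    ```

    The point is simply that the Python side only closes its socket after
    Spark has already closed its end, so the close can never race against a
    write in `PythonRDD::WriterThread`.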
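
    Similarly, a toy sketch of the "completed" flag from the third point;
    ToyTaskContext and its method names are illustrative only, not the real
    TaskContext API:

    ```
    class ToyTaskContext(object):
        # Illustrative only: the flag is set before any on-complete callback
        # runs, so callbacks can consult it instead of depending on the
        # order in which they happen to run.
        def __init__(self):
            self.completed = False
            self._callbacks = []

        def add_on_complete_callback(self, fn):
            self._callbacks.append(fn)

        def mark_task_completed(self):
            self.completed = True      # visible to every callback below
            for fn in self._callbacks:
                fn(self)               # e.g. a callback that kills the Python worker
    ```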
    
    Since this patch provides an alternative solution to
    [SPARK-1019](https://issues.apache.org/jira/browse/SPARK-1019), I did
    test it with
    
    ```
    sc.textFile("latlon.tsv").take(5)
    ```
    
    many times without error.
    
    Additionally, in order to test the unswallowed exceptions, I performed
    
    ```
    sc.textFile("s3n://<big file>").count()
    ```
    
    and cut my internet during execution. Prior to this patch, we got the
    "stdin writer exited early" message, which was unhelpful. Now, we get
    the SocketExceptions propagated through Spark to the user and get
    proper (though unsuccessful) task retries.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aarondav/spark pyspark-io

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/640.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #640
    
----
commit c0c49da3754d664ab1de76ff8e6fc86a4d5d8714
Author: Aaron Davidson <aa...@databricks.com>
Date:   2014-05-05T02:26:03Z

    SPARK-1579: Clean up PythonRDD and avoid swallowing IOExceptions
    
    This patch includes several cleanups to PythonRDD, focused around
    fixing SPARK-1579 cleanly. Listed in order of importance:
    
    - The Python daemon waits for Spark to close the socket before exiting,
      in order to avoid causing spurious IOExceptions in Spark's
      PythonRDD::WriterThread.
    
    - Removes the Python Monitor Thread, which polled for task cancellations
      in order to kill the Python worker. Instead, we do this in the
      onCompleteCallback, since this is guaranteed to be called during
      cancellation.
    
    - Adds a "completed" variable to TaskContext to avoid the issue noted in
      SPARK-1019, where onCompleteCallbacks may be execution-order dependent.
      Along with this, I removed the "context.interrupted = true" flag in
      the onCompleteCallback.
    
    - Extracts PythonRDD::WriterThread to its own class.
    
    Since this patch provides an alternative solution to SPARK-1019, I did
    test it with
    
    ```sc.textFile("latlon.tsv").take(5)```
    
    many times without error.
    
    Additionally, in order to test the unswallowed exceptions, I performed
    
    ```sc.textFile("s3n://<big file>").count()```
    
    and cut my internet. Prior to this patch, we got the "stdin writer
    exited early" message, which is unhelpful. Now, we get the
    SocketExceptions propagated through Spark to the user and get
    proper (but unsuccessful) task retries.

----

