GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/22480

    [SPARK-25473][PYTHON][SS][TEST] ForeachWriter tests failed on Python 3.6 
and macOS High Sierra

    ## What changes were proposed in this pull request?
    
    This PR does not fix the problem itself; it only adds a few comments about 
running PySpark tests on Python 3.6 and macOS High Sierra, since the issue 
currently blocks running the tests on a Mac.
    
    It does not attempt to fix the problem yet. I am fairly sure others are 
already debugging this.
    
    The problem appears to be that we fork Python workers, and the workers end 
up calling into Objective-C libraries somewhere inside CPython's 
implementation. I suspect `pickle` in Python 3.6 changed in a way that 
triggers this:
    
    
https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L577
    
    After some debugging, the problem appears to occur in the forked worker.
    
    This link 
(http://sealiesoftware.com/blog/archive/2017/6/5/Objective-C_and_fork_in_macOS_1013.html)
 and this link 
(https://blog.phusion.nl/2017/10/13/why-ruby-app-servers-break-on-macos-high-sierra-and-what-can-be-done-about-it/)
 helped me understand what is going on.
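    
    As a minimal illustration of the failure mode (this is not Spark code; the 
`getproxies()` call is just one known way to pull Objective-C frameworks into 
the parent process on macOS, via CPython's `_scproxy` module):
    
    ```python
    import os
    import urllib.request
    
    # On macOS, urllib's proxy lookup goes through the _scproxy C extension,
    # which calls into Objective-C system frameworks in the parent process.
    urllib.request.getproxies()
    
    # Once Objective-C state exists in the parent, a plain fork() without
    # exec() can abort the child on macOS 10.13+ with:
    #   +[__NSPlaceholderDictionary initialize] may have been in progress
    #   in another thread when fork() was called.
    # On Linux this script runs fine, which is why CI does not catch it.
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child exits immediately
    _, status = os.waitpid(pid, 0)
    ```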
    
    I am still debugging this, but my gut says it will be difficult to fix or 
work around on the Spark side.
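    
    In the meantime, the workaround suggested in the linked posts is to disable 
the Objective-C runtime's fork-safety check before running the tests (note 
this silences the safety check rather than fixing the underlying problem):
    
    ```shell
    # Tell the Objective-C runtime not to crash when +initialize may have
    # raced with fork(). macOS-only; harmless (a no-op) on other platforms.
    export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
    ```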
    
    ## How was this patch tested?
    
    Manually tested:
    
    Before:
    
    ```
    
/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py:766:
 ResourceWarning: subprocess 27563 is still running
      ResourceWarning, source=self)
    [Stage 0:>                                                          (0 + 1) 
/ 1]objc[27586]: +[__NSPlaceholderDictionary initialize] may have been in 
progress in another thread when fork() was called.
    objc[27586]: +[__NSPlaceholderDictionary initialize] may have been in 
progress in another thread when fork() was called. We cannot safely call it or 
ignore it in the fork() child process. Crashing instead. Set a breakpoint on 
objc_initializeAfterForkError to debug.
    ERROR
    
    ======================================================================
    ERROR: test_streaming_foreach_with_simple_function 
(pyspark.sql.tests.SQLTests)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
        return f(*a, **kw)
      File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
328, in get_return_value
        format(target_id, ".", name), value)
    py4j.protocol.Py4JJavaError: An error occurred while calling 
o54.processAllAvailable.
    : org.apache.spark.sql.streaming.StreamingQueryException: Writing job 
aborted.
    === Streaming Query ===
    Identifier: [id = f508d634-407c-4232-806b-70e54b055c42, runId = 
08d1435b-5358-4fb6-b167-811584a3163e]
    Current Committed Offsets: {}
    Current Available Offsets: 
{FileStreamSource[file:/var/folders/71/484zt4z10ks1vydt03bhp6hr0000gp/T/tmpolebys1s]:
 {"logOffset":0}}
    
    Current State: ACTIVE
    Thread State: RUNNABLE
    
    Logical Plan:
    
FileStreamSource[file:/var/folders/71/484zt4z10ks1vydt03bhp6hr0000gp/T/tmpolebys1s]
        at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
        at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
    Caused by: org.apache.spark.SparkException: Writing job aborted.
        at 
org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:91)
        at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
        at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
        at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
        at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    ```
    
    After:
    
    ```
    test_streaming_foreach_with_simple_function (pyspark.sql.tests.SQLTests) ...
    ok
    ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-25473

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22480.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22480
    
----
commit 97e95afeba368dd06f747665c41f96a50141305a
Author: hyukjinkwon <gurwls223@...>
Date:   2018-09-20T03:03:42Z

    Add a note for streaming foreach tests

----


---
