GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/22480
[SPARK-25473][PYTHON][SS][TEST] ForeachWriter tests failed on Python 3.6
and macOS High Sierra
## What changes were proposed in this pull request?
This PR does not fix the problem itself. It only adds a few comments about
running the PySpark tests on Python 3.6 and macOS High Sierra, since the
problem currently blocks running the tests on Mac.
A fix is out of scope here; I am fairly sure some people are already
debugging this.
The problem appears to occur because we fork Python workers, and the forked
workers somehow end up calling Objective-C libraries from somewhere in
CPython's implementation. I suspect `pickle` in Python 3.6 has some changes:
https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L577
After debugging, the problem does appear to be in the forked worker.
This link
(http://sealiesoftware.com/blog/archive/2017/6/5/Objective-C_and_fork_in_macOS_1013.html)
and this link
(https://blog.phusion.nl/2017/10/13/why-ruby-app-servers-break-on-macos-high-sierra-and-what-can-be-done-about-it/)
helped me understand the issue.
I am still debugging this, but my gut feeling is that it is difficult to fix
or work around on the Spark side.
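For anyone who needs to run the tests locally in the meantime, the workaround usually cited for this macOS behaviour is the `OBJC_DISABLE_INITIALIZE_FORK_SAFETY` environment variable. The sketch below is only an illustration of setting it for a test subprocess; it is not part of this PR, and the helper name `env_for_forking_tests` and its `system` parameter are hypothetical.

```python
import os
import platform


def env_for_forking_tests(base_env=None, system=None):
    """Return a copy of the environment with the fork-safety check relaxed on macOS.

    `system` is a hypothetical override used for illustration; by default the
    current platform is detected.
    """
    env = dict(os.environ if base_env is None else base_env)
    system = platform.system() if system is None else system
    if system == "Darwin":
        # Assumption: this variable asks the Objective-C runtime not to abort
        # the fork() child. It only takes effect if set before the process
        # that forks the Python workers starts.
        env.setdefault("OBJC_DISABLE_INITIALIZE_FORK_SAFETY", "YES")
    return env
```

The returned dictionary could be passed as the `env=` argument of `subprocess.run` when launching the test runner. Whether this is an acceptable setting beyond local testing is a separate question, since it disables a safety check rather than removing the underlying cause.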
## How was this patch tested?
Manually tested:
Before:
```
/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py:766:
ResourceWarning: subprocess 27563 is still running
ResourceWarning, source=self)
[Stage 0:> (0 + 1)
/ 1]objc[27586]: +[__NSPlaceholderDictionary initialize] may have been in
progress in another thread when fork() was called.
objc[27586]: +[__NSPlaceholderDictionary initialize] may have been in
progress in another thread when fork() was called. We cannot safely call it or
ignore it in the fork() child process. Crashing instead. Set a breakpoint on
objc_initializeAfterForkError to debug.
ERROR
==
ERROR: test_streaming_foreach_with_simple_function
(pyspark.sql.tests.SQLTests)
--
Traceback (most recent call last):
File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line
328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling
o54.processAllAvailable.
: org.apache.spark.sql.streaming.StreamingQueryException: Writing job
aborted.
=== Streaming Query ===
Identifier: [id = f508d634-407c-4232-806b-70e54b055c42, runId =
08d1435b-5358-4fb6-b167-811584a3163e]
Current Committed Offsets: {}
Current Available Offsets:
{FileStreamSource[file:/var/folders/71/484zt4z10ks1vydt03bhp6hrgp/T/tmpolebys1s]:
{"logOffset":0}}
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan:
FileStreamSource[file:/var/folders/71/484zt4z10ks1vydt03bhp6hrgp/T/tmpolebys1s]
at
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
at
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: org.apache.spark.SparkException: Writing job aborted.
at
org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:91)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
```
After:
```
test_streaming_foreach_with_simple_function (pyspark.sql.tests.SQLTests) ...
ok
```
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark SPARK-25473
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22480.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22480
commit 97e95afeba368dd06f747665c41f96a50141305a
Author: hyukjinkwon
Date: 2018-09-20T03:03:42Z
Add a note for streaming forech tests