[GitHub] spark pull request #22480: [SPARK-25473][PYTHON][SS][TEST] ForeachWriter tes...

2018-10-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22480#discussion_r222486276
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -1962,6 +1962,9 @@ def __getstate__(self):
 def __setstate__(self, state):
 self.open_events_dir, self.process_events_dir, self.close_events_dir = state
 
+# Those foreach tests are failed in Python 3.6 and macOS High Sierra by defined rules
+# at http://sealiesoftware.com/blog/archive/2017/6/5/Objective-C_and_fork_in_macOS_1013.html
+# To work around this, OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES.
--- End diff --

cc @jose-torres @tdas 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22480: [SPARK-25473][PYTHON][SS][TEST] ForeachWriter tes...

2018-09-22 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22480





[GitHub] spark pull request #22480: [SPARK-25473][PYTHON][SS][TEST] ForeachWriter tes...

2018-09-19 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/22480

[SPARK-25473][PYTHON][SS][TEST] ForeachWriter tests failed on Python 3.6 and macOS High Sierra

## What changes were proposed in this pull request?

This PR does not fix the problem itself. It only adds a few comments explaining how to run the PySpark tests on Python 3.6 and macOS High Sierra, since the failure currently blocks running the tests on a Mac.

A proper fix is not targeted yet. I am fairly sure others are already debugging this.

The problem appears to arise because we fork Python workers, and the forked workers somehow end up calling Objective-C libraries somewhere in CPython's implementation. I suspect `pickle` in Python 3.6 has some changes:


https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L577

After debugging, the problem appears to be in the forked worker.

This link (http://sealiesoftware.com/blog/archive/2017/6/5/Objective-C_and_fork_in_macOS_1013.html) and this link (https://blog.phusion.nl/2017/10/13/why-ruby-app-servers-break-on-macos-high-sierra-and-what-can-be-done-about-it/) were helpful for me to understand this.

I am still debugging this, but my gut feeling is that it is difficult to fix or work around on the Spark side.
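
The diff comment for this pull request names `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES` as the workaround. As a minimal sketch of how that variable could be supplied to a test run, the snippet below builds an environment that includes it and launches a child process with that environment. The helper names `env_with_fork_safety_disabled` and `run_with_workaround` are hypothetical and not part of Spark; the sketch assumes the variable has to be present in the environment before the Python process that forks the workers is started.

```python
import os
import subprocess
import sys


def env_with_fork_safety_disabled(base_env=None):
    """Return a copy of the environment with the workaround variable set.

    Hypothetical helper: the caller's own environment is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OBJC_DISABLE_INITIALIZE_FORK_SAFETY"] = "YES"
    return env


def run_with_workaround(command):
    """Run `command` in a child process that inherits the workaround variable."""
    return subprocess.run(command, env=env_with_fork_safety_disabled(), check=False)


if __name__ == "__main__":
    # Illustration only: a child interpreter prints the variable it inherited.
    # A real run would pass the test command here instead.
    result = run_with_workaround(
        [
            sys.executable,
            "-c",
            "import os; print(os.environ['OBJC_DISABLE_INITIALIZE_FORK_SAFETY'])",
        ]
    )
    sys.exit(result.returncode)
```

Exporting the variable in the shell before starting the tests should have the same effect. The wrapper only makes the setting explicit and scoped to a single run.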

## How was this patch tested?

Manually tested:

Before:

```

/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py:766:
 ResourceWarning: subprocess 27563 is still running
  ResourceWarning, source=self)
[Stage 0:>  (0 + 1) 
/ 1]objc[27586]: +[__NSPlaceholderDictionary initialize] may have been in 
progress in another thread when fork() was called.
objc[27586]: +[__NSPlaceholderDictionary initialize] may have been in 
progress in another thread when fork() was called. We cannot safely call it or 
ignore it in the fork() child process. Crashing instead. Set a breakpoint on 
objc_initializeAfterForkError to debug.
ERROR

==
ERROR: test_streaming_foreach_with_simple_function 
(pyspark.sql.tests.SQLTests)
--
Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
  File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling 
o54.processAllAvailable.
: org.apache.spark.sql.streaming.StreamingQueryException: Writing job 
aborted.
=== Streaming Query ===
Identifier: [id = f508d634-407c-4232-806b-70e54b055c42, runId = 
08d1435b-5358-4fb6-b167-811584a3163e]
Current Committed Offsets: {}
Current Available Offsets: 
{FileStreamSource[file:/var/folders/71/484zt4z10ks1vydt03bhp6hrgp/T/tmpolebys1s]:
 {"logOffset":0}}

Current State: ACTIVE
Thread State: RUNNABLE

Logical Plan:

FileStreamSource[file:/var/folders/71/484zt4z10ks1vydt03bhp6hrgp/T/tmpolebys1s]
at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: org.apache.spark.SparkException: Writing job aborted.
at 
org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:91)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
```

After:

```
test_streaming_foreach_with_simple_function (pyspark.sql.tests.SQLTests) ... ok
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-25473

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22480.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22480


commit 97e95afeba368dd06f747665c41f96a50141305a
Author: hyukjinkwon 
Date:   2018-09-20T03:03:42Z

Add a note for streaming forech tests



