[GitHub] spark pull request #22242: Branch 2.3

2018-09-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22242


---




[GitHub] spark pull request #22242: Branch 2.3

2018-08-27 Thread ArunkumarRamanan
GitHub user ArunkumarRamanan opened a pull request:

https://github.com/apache/spark/pull/22242

Branch 2.3

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/spark branch-2.3

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22242.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22242


commit a4eb1e47ad2453b41ebb431272c92e1ac48bb310
Author: hyukjinkwon 
Date:   2018-02-28T15:44:13Z

[SPARK-23517][PYTHON] Make `pyspark.util._exception_message` produce the 
trace from Java side by Py4JJavaError

## What changes were proposed in this pull request?

This PR proposes that `pyspark.util._exception_message` produce the trace 
from the Java side of a `Py4JJavaError`.

Currently, in Python 2, it uses the `message` attribute, which `Py4JJavaError` 
happens not to set (Python 2's `BaseException` initializes a deprecated 
`message` attribute to `''` unless the exception is raised with exactly one 
argument, so the lookup silently returns an empty string):

```python
>>> from pyspark.util import _exception_message
>>> try:
...     sc._jvm.java.lang.String(None)
... except Exception as e:
...     pass
...
>>> e.message
''
```

It seems we should use `str` instead for now:

https://github.com/bartdag/py4j/blob/aa6c53b59027925a426eb09b58c453de02c21b7c/py4j-python/src/py4j/protocol.py#L412

but this doesn't address the problem of non-ASCII strings coming from the Java 
side (see https://github.com/bartdag/py4j/issues/306).

So, we could directly call `__str__()`:

```python
>>> e.__str__()
u'An error occurred while calling None.java.lang.String.\n: java.lang.NullPointerException\n\tat java.lang.String.<init>(String.java:588)\n\tat sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)\n\tat sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)\n\tat sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\n\tat java.lang.reflect.Constructor.newInstance(Constructor.java:422)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:238)\n\tat py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)\n\tat py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:214)\n\tat java.lang.Thread.run(Thread.java:745)\n'
```

which does not implicitly coerce the unicode result to `str` in Python 2.
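
For context, a minimal Python 2 sketch of why that coercion matters (the 
`JavaErrorLike` class is illustrative, not part of py4j):

```python
# Python 2 only: str(obj) calls obj.__str__() and then encodes any unicode
# result with the ASCII codec, which fails on non-ASCII characters.
class JavaErrorLike(object):
    def __str__(self):
        # Non-ASCII text, as a Java-side error message might contain.
        return u'erreur c\xf4t\xe9 Java'

e = JavaErrorLike()
e.__str__()  # u'erreur c\xf4t\xe9 Java' -- the unicode survives intact
str(e)       # UnicodeEncodeError: 'ascii' codec can't encode character ...
```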

This can actually be a problem:

```python
from pyspark.sql.functions import udf
spark.conf.set("spark.sql.execution.arrow.enabled", True)
spark.range(1).select(udf(lambda x: [[]])()).toPandas()
```

**Before**

```
Traceback (most recent call last):
  File "", line 1, in 
  File "/.../spark/python/pyspark/sql/dataframe.py", line 2009, in toPandas
raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
RuntimeError:
Note: toPandas attempted Arrow optimization because 
'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to 
disable this.
```

**After**

```
Traceback (most recent call last):
  File "", line 1, in 
  File "/.../spark/python/pyspark/sql/dataframe.py", line 2009, in toPandas
raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
RuntimeError: An error occurred while calling o47.collectAsArrowToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 
in stage 0.0 failed 1 times, most recent failure: Lost task 7.0 in stage 0.0 
(TID 7, localhost, executor driver): 
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/.../spark/python/pyspark/worker.py", line 245, in main
process()
  File "/.../spark/python/pyspark/worker.py", line 240, in process
...
Note: toPandas attempted Arrow optimization because 
'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to 
disable this.
```
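
Putting these pieces together, a minimal sketch of what the helper looks like 
after a change along these lines (paraphrased from the description above, not 
the exact `pyspark.util` source):

```python
from py4j.protocol import Py4JJavaError


def _exception_message(excp):
    """Return the message of an exception, reaching into the Java side
    for a Py4JJavaError."""
    if isinstance(excp, Py4JJavaError):
        # Calling __str__() directly returns the full Java stack trace as
        # unicode and sidesteps both the empty `message` attribute and
        # Python 2's implicit unicode-to-str coercion.
        return excp.__str__()
    if hasattr(excp, "message") and excp.message:
        # Python 2 exceptions may carry a useful `message` attribute.
        return excp.message
    return str(excp)
```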

## How was this patch tested?

Manually tested and unit tests were added.
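
A hedged sketch of the kind of unit test this implies, assuming a pytest-style 
fixture `sc` providing a `SparkContext` (names are illustrative, not 
necessarily the test added by this commit):

```python
from py4j.protocol import Py4JJavaError
from pyspark.util import _exception_message


def test_exception_message_contains_java_trace(sc):
    # Constructing java.lang.String(None) raises a NullPointerException
    # on the Java side, surfaced in Python as a Py4JJavaError.
    try:
        sc._jvm.java.lang.String(None)
    except Py4JJavaError as e:
        msg = _exception_message(e)
        # After the change the message carries the Java trace, not ''.
        assert 'NullPointerException' in msg
```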

Author: hyukjinkwon 

Closes #20680 from HyukjinKwon/SPARK-23517.

(cherry picked from commit fab563b9bd1581112462c0fc0b299ad6510b6564)
Signed-off-by: hyukjinkwon 
