GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/18521
[SPARK-19507][SPARK-21296][PYTHON] Avoid per-record type dispatch in schema
verification and improve exception message
## What changes were proposed in this pull request?
**Context**
While reviewing https://github.com/apache/spark/pull/17227, I realised that we
type-dispatch per record there. The PR itself is fine performance-wise as is,
but it prints an `"obj"` prefix in the exception message, as below:
```
from pyspark.sql.types import *
schema = StructType([StructField('s', IntegerType(), nullable=False)])
spark.createDataFrame([["1"]], schema)
...
TypeError: obj.s: IntegerType can not accept object '1' in type <type 'str'>
```
I suggested getting rid of this prefix, but while investigating I realised that
my approach might introduce a performance regression, since this is a hot path.
Addressing only SPARK-19507 and https://github.com/apache/spark/pull/17227
would need more changes to cleanly get rid of the prefix, so I decided to fix
both issues together.
**Proposal**

This PR tries to:
- get rid of per-record type dispatch, as we do in many code paths in
Scala - SPARK-21296
- improve the error message, which also improves performance (roughly a ~25%
improvement) - SPARK-19507
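The idea behind the first point can be sketched roughly as follows. This is a
hypothetical, simplified illustration (the class and function names here are
made up and are not PySpark's actual internals): the `isinstance` chain over
the data type runs once when the verifier is built per schema field, instead of
once per record, and the error-message prefix is precomputed.

```python
# Hypothetical sketch of per-schema (not per-record) type dispatch.
# These stand-in type classes are illustrative only, not PySpark's API.

class IntegerType:
    pass

class StringType:
    pass

def make_verifier(data_type, name=None):
    """Build a closure that verifies one value against data_type.

    The type dispatch (the if/elif chain below) happens once here,
    rather than for every record, and the message prefix is computed
    up front.
    """
    prefix = ("field %s: " % name) if name is not None else ""
    if isinstance(data_type, IntegerType):
        def verify(obj):
            if obj is not None and not isinstance(obj, int):
                raise TypeError(
                    "%sIntegerType can not accept object %r in type %s"
                    % (prefix, obj, type(obj)))
    elif isinstance(data_type, StringType):
        def verify(obj):
            if obj is not None and not isinstance(obj, str):
                raise TypeError(
                    "%sStringType can not accept object %r in type %s"
                    % (prefix, obj, type(obj)))
    else:
        raise TypeError("unsupported data type: %r" % data_type)
    return verify

# Build once per field, then call cheaply per record.
verify_s = make_verifier(IntegerType(), name="s")
verify_s(1)       # accepted
try:
    verify_s("1")  # rejected with a field-name prefix instead of "obj"
except TypeError as e:
    print(e)
```

Calling the prebuilt closure per record avoids re-inspecting the schema type on
the hot path, which is where the performance improvement comes from.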
## How was this patch tested?
Manually tested and unit tests were added in `python/pyspark/sql/tests.py`.
Benchmark code:
https://gist.github.com/HyukjinKwon/c3397469c56cb26c2d7dd521ed0bc5a3
Error-message code:
https://gist.github.com/HyukjinKwon/b1b2c7f65865444c4a8836435100e398

**Before**

Benchmark results:
https://gist.github.com/HyukjinKwon/4a291dab45542106301a0c1abcdca924
Error-message results:
https://gist.github.com/HyukjinKwon/57b1916395794ce924faa32b14a3fe19

**After**

Benchmark results:
https://gist.github.com/HyukjinKwon/21496feecc4a920e50c4e455f836266e
Error-message results:
https://gist.github.com/HyukjinKwon/7a494e4557fe32a652ce1236e504a395
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark python-type-dispatch
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18521.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18521
----
commit a400d5534641813c3ebe75ca553361337766f04b
Author: David Gingrich <[email protected]>
Date: 2017-02-28T08:05:00Z
Show field name in _verify_type error
commit 22311f1b4a1a51813904e0be1673b9e8466cac13
Author: hyukjinkwon <[email protected]>
Date: 2017-07-03T05:41:01Z
Fix default obj parent name issue
commit d7f677830cb423a5da5e428bc3211f348795a2b2
Author: hyukjinkwon <[email protected]>
Date: 2017-07-03T16:03:50Z
Fix type dispatch
----