GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/18521

    [SPARK-19507][SPARK-21296][PYTHON] Avoid per-record type dispatch in schema 
verification and improve exception message

    ## What changes were proposed in this pull request?
    **Context**
    
    While reviewing https://github.com/apache/spark/pull/17227, I realised that we type-dispatch per record there. That PR itself is fine in terms of performance as-is, but it prints an `"obj"` prefix in the exception message, as below:
    
    ```
    from pyspark.sql.types import *
    schema = StructType([StructField('s', IntegerType(), nullable=False)])
    spark.createDataFrame([["1"]], schema)
    ...
    TypeError: obj.s: IntegerType can not accept object '1' in type <type 'str'>
    ```
    
    I suggested getting rid of this, but while investigating it I realised my approach might cause a performance regression, since this is a hot path.
    
    Cleanly removing the prefix for SPARK-19507 alone (on top of https://github.com/apache/spark/pull/17227) needs more changes, so I decided to fix both issues together.
    
    **Proposal**
    
    This PR proposes to:
    
      - get rid of per-record type dispatch, as we do in many code paths in Scala, which also improves performance (roughly ~25%) - SPARK-21296
    
      - improve the exception message - SPARK-19507
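    A minimal sketch of the dispatch-once idea (hypothetical helper names, not the actual PR code): instead of inspecting the field's data type for every record, build a verifier closure per field up front and reuse it across all records.
    
    ```python
    # Hypothetical sketch of avoiding per-record type dispatch.
    # make_verifier and its dtype strings are illustrative names,
    # not PySpark's real API.

    def make_verifier(dtype, name):
        """Dispatch on the data type ONCE and return a closure
        that only performs the per-record check."""
        if dtype == "int":
            def verify(obj):
                if not isinstance(obj, int):
                    raise TypeError(
                        "field %s: IntegerType can not accept object %r in type %s"
                        % (name, obj, type(obj)))
        elif dtype == "str":
            def verify(obj):
                if not isinstance(obj, str):
                    raise TypeError(
                        "field %s: StringType can not accept object %r in type %s"
                        % (name, obj, type(obj)))
        else:
            def verify(obj):  # unchecked type: per-record no-op
                pass
        return verify

    # Build one verifier per field, then run a tight loop over records.
    verifiers = [make_verifier("int", "s")]
    for row in [[1], [2], [3]]:
        for verify, value in zip(verifiers, row):
            verify(value)
    ```
    
    Note how the error message can name the field (`field s: ...`) without any `obj.` prefix, since the field name is captured when the verifier is built.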
    
    ## How was this patch tested?
    
    Manually tested and unit tests were added in `python/pyspark/sql/tests.py`.
    
    Benchmark code: https://gist.github.com/HyukjinKwon/c3397469c56cb26c2d7dd521ed0bc5a3
    Error message code: https://gist.github.com/HyukjinKwon/b1b2c7f65865444c4a8836435100e398
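    For a rough local comparison (a simplified stand-in for the linked gists, not the actual benchmark code), the shape of such a micro-benchmark with `timeit` might look like:
    
    ```python
    # Simplified micro-benchmark sketch: type check inside the loop vs.
    # a loop whose check was resolved up front. Numbers vary by machine,
    # so no speedup figure is claimed here.
    import timeit

    data = list(range(100000))

    def per_record(values):
        for v in values:
            # the type test runs once per record
            if isinstance(v, int):
                pass

    def dispatch_once(values):
        # the type check was resolved up front from the schema;
        # the per-record loop body is minimal
        for v in values:
            pass

    t_dispatch = timeit.timeit(lambda: per_record(data), number=20)
    t_once = timeit.timeit(lambda: dispatch_once(data), number=20)
    print("per-record: %.4fs, dispatch-once: %.4fs" % (t_dispatch, t_once))
    ```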
    
    **Before**
    
    Benchmark:
      - Results: 
https://gist.github.com/HyukjinKwon/4a291dab45542106301a0c1abcdca924
    
    Error message:
      - Results: 
https://gist.github.com/HyukjinKwon/57b1916395794ce924faa32b14a3fe19
     
    **After**
    
    Benchmark:
      - Results: 
https://gist.github.com/HyukjinKwon/21496feecc4a920e50c4e455f836266e
    
    Error message:
      - Results: 
https://gist.github.com/HyukjinKwon/7a494e4557fe32a652ce1236e504a395


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark python-type-dispatch

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18521.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18521
    
----
commit a400d5534641813c3ebe75ca553361337766f04b
Author: David Gingrich <[email protected]>
Date:   2017-02-28T08:05:00Z

    Show field name in _verify_type error

commit 22311f1b4a1a51813904e0be1673b9e8466cac13
Author: hyukjinkwon <[email protected]>
Date:   2017-07-03T05:41:01Z

    Fix default obj parent name issue

commit d7f677830cb423a5da5e428bc3211f348795a2b2
Author: hyukjinkwon <[email protected]>
Date:   2017-07-03T16:03:50Z

    Fix type dispatch

----

