[GitHub] spark pull request #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning wh...

viirya Tue, 02 Oct 2018 01:25:28 -0700

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/22610


    [WIP][SPARK-25461][PySpark][SQL] Print warning when return type of 
Pandas.Series mismatches the arrow return type of pandas udf

    ## What changes were proposed in this pull request?
    
    For Pandas UDFs, we derive the arrow type from the Catalyst return data
    type declared for the UDF, and we use this arrow type to serialize the
    data. If the declared return data type doesn't match the actual dtype of
    the Pandas.Series returned by the Pandas UDF, there is a risk of returning
    incorrect data from the Python side.
    
    This WIP work proposes to check whether the dtype of the returned
    Pandas.Series matches the declared return type of the Pandas UDF.
    
    We could disallow such a mismatch by throwing an exception, so that users
    know they might need to set the correct return type. However, it looks
    like the current codebase relies on this behavior. For example, there is
    a test `test_vectorized_udf_null_short`:
    
    ```python
    data = [(None,), (2,), (3,), (4,)]
    schema = StructType().add("short", ShortType())
    df = self.spark.createDataFrame(data, schema)
    short_f = pandas_udf(lambda x: x, ShortType())
    res = df.select(short_f(col('short')))
    self.assertEquals(df.collect(), res.collect())
    ```
    So instead, for now this work just prints a warning message when such a
    mismatch is detected. Users can then see this message when debugging why
    their Pandas UDFs don't produce the expected results.
    
    ## How was this patch tested?
    
    Tested manually by running:
    
    ```python
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import BooleanType
    import pandas as pd
    
    values = [1.0] * 5 + [2.0] * 5
    pdf = pd.DataFrame({'A': values})
    df = spark.createDataFrame(pdf)
    @pandas_udf(returnType=BooleanType())
    def to_boolean(column):
        return column
    df.select(['A']).withColumn('to_boolean', to_boolean('A')).show()
    ```
    
    Output:
    
    ```
    WARN: Arrow type double of return Pandas.Series of the user-defined function's dtype float64 doesn't match the arrow type bool of defined return type BooleanType
    +---+----------+
    |  A|to_boolean|
    +---+----------+
    |1.0|     false|
    |1.0|     false|
    |1.0|     false|
    |1.0|     false|
    |1.0|     false|
    |2.0|     false|
    |2.0|     false|
    |2.0|     false|
    |2.0|     false|
    |2.0|     false|
    +---+----------+
    ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-25461

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22610.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22610
    
----
commit 2fa15bda48ba64a102f114dc9119cb3c310200c4
Author: Liang-Chi Hsieh <viirya@...>
Date:   2018-09-26T09:01:40Z

    Ensure return type of Pandas.Series matches the arrow return type of pandas 
udf.

commit d206b7cf78f898e622f539a15e45515fcbd9e54a
Author: Liang-Chi Hsieh <viirya@...>
Date:   2018-10-02T05:29:44Z

    Print warning message instead of throwing exception.

----


