[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

HyukjinKwon Wed, 07 Feb 2018 05:37:10 -0800

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/20531


    [SPARK-23352][PYTHON] Explicitly specify supported types in Pandas UDFs

    ## What changes were proposed in this pull request?
    
    This PR explicitly specifies the supported types in Pandas UDFs.
    The main change is to add a deduplicated, explicit type check on
    `returnType` up front, along with documenting it; in the process, it
    also fixes several things.
    
    1. Currently, we don't support `BinaryType` in Pandas UDFs. For example:
    
        ```python
        from pyspark.sql.functions import pandas_udf
        pudf = pandas_udf(lambda x: x, "binary")
        df = spark.createDataFrame([[bytearray(b"a")]])
        df.select(pudf("_1")).show()
        ```
        ```
        ...
        TypeError: Unsupported type in conversion to Arrow: BinaryType
        ```
    
        We can document this behaviour in the guide.
    
    2. Also, the grouped aggregate Pandas UDF fails fast on `ArrayType`, but
    it seems we can support this case.
    
        ```python
        from pyspark.sql.functions import pandas_udf, PandasUDFType
        foo = pandas_udf(lambda v: v.mean(), 'array<double>',
                         PandasUDFType.GROUPED_AGG)
        df = spark.range(100).selectExpr("id", "array(id) as value")
        df.groupBy("id").agg(foo("value")).show()
        ```
    
        ```
        ...
        NotImplementedError: ArrayType, StructType and MapType are not supported with PandasUDFType.GROUPED_AGG
        ```
    
    3. Since we can check the return type ahead of time, we can fail fast
    before actual execution.
    
        ```python
        from pyspark.sql.functions import pandas_udf
        from pyspark.sql.types import BinaryType
        # we can fail fast at this stage because we know the schema ahead of time
        pandas_udf(lambda x: x, BinaryType())
        ```
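    
    As a rough illustration of the fail-fast idea in item 3, the sketch below
    validates a return type when the UDF is defined rather than when it runs.
    `checked_udf` and `UNSUPPORTED_RETURN_TYPES` are hypothetical names made up
    for this example, not PySpark APIs, and the unsupported set is an assumption
    based only on the `BinaryType` case above.
    
    ```python
    # Illustrative only: these names are hypothetical, not PySpark APIs.
    # Assumption: "binary" is the only rejected type, taken from item 1 above.
    UNSUPPORTED_RETURN_TYPES = {"binary"}

    def checked_udf(func, return_type):
        """Validate return_type at definition time, before any job runs."""
        if return_type in UNSUPPORTED_RETURN_TYPES:
            raise NotImplementedError(
                "Invalid return type: %s is not supported" % return_type)
        return func

    identity = checked_udf(lambda x: x, "double")  # accepted

    try:
        checked_udf(lambda x: x, "binary")         # rejected immediately
        message = None
    except NotImplementedError as e:
        message = str(e)
    ```
    
    Checking at definition time means the error surfaces where the UDF is
    declared, instead of inside a running job.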
    
    ## How was this patch tested?
    
    Tested manually; unit tests for `BinaryType` and `ArrayType(...)` were
    added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark pudf-cleanup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20531.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20531
    
----
commit ec708d58001be1382cabbe4357cbc68e2d51a8b6
Author: hyukjinkwon <gurwls223@...>
Date:   2018-02-07T00:18:20Z

    Explicitly specify supported types with Pandas UDFs

----


