[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

HyukjinKwon Mon, 09 Oct 2017 22:09:22 -0700

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18732#discussion_r143629848
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2181,30 +2187,66 @@ def udf(f=None, returnType=StringType()):
     @since(2.3)
     def pandas_udf(f=None, returnType=StringType()):
         """
    -    Creates a :class:`Column` expression representing a user defined function (UDF) that accepts
    -    `Pandas.Series` as input arguments and outputs a `Pandas.Series` of the same length.
    +    Creates a vectorized user defined function (UDF).
     
    -    :param f: python function if used as a standalone function
    +    :param f: user-defined function. A python function if used as a standalone function
         :param returnType: a :class:`pyspark.sql.types.DataType` object
     
    -    >>> from pyspark.sql.types import IntegerType, StringType
    -    >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
    -    >>> @pandas_udf(returnType=StringType())
    -    ... def to_upper(s):
    -    ...     return s.str.upper()
    -    ...
    -    >>> @pandas_udf(returnType="integer")
    -    ... def add_one(x):
    -    ...     return x + 1
    -    ...
    -    >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age"))
    -    >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\
    -    ...     .show()  # doctest: +SKIP
    -    +----------+--------------+------------+
    -    |slen(name)|to_upper(name)|add_one(age)|
    -    +----------+--------------+------------+
    -    |         8|      JOHN DOE|          22|
    -    +----------+--------------+------------+
    +    The user-defined function can define one of the following transformations:
    +
    +    1. One or more `pandas.Series` -> A `pandas.Series`
    +
    +       This udf is used with :meth:`pyspark.sql.DataFrame.withColumn` and
    +       :meth:`pyspark.sql.DataFrame.select`.
    +       The returnType should be a primitive data type, e.g., `DoubleType()`.
    +       The length of the returned `pandas.Series` must be of the same as the input `pandas.Series`.
    +
    +       >>> from pyspark.sql.types import IntegerType, StringType
    +       >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
    +       >>> @pandas_udf(returnType=StringType())
    +       ... def to_upper(s):
    +       ...     return s.str.upper()
    +       ...
    +       >>> @pandas_udf(returnType="integer")
    +       ... def add_one(x):
    +       ...     return x + 1
    +       ...
    +       >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age"))
    +       >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\
    +       ...     .show()  # doctest: +SKIP
    +       +----------+--------------+------------+
    +       |slen(name)|to_upper(name)|add_one(age)|
    +       +----------+--------------+------------+
    +       |         8|      JOHN DOE|          22|
    +       +----------+--------------+------------+
    +
    +    2. A `pandas.DataFrame` -> A `pandas.DataFrame`
    +
    +       This udf is used with :meth:`pyspark.sql.GroupedData.apply`.
    --- End diff --
    
    Maybe `This udf is used with` -> `This udf is only used with`, or we should
    probably add a `note` here. If I didn't know the context, I'd wonder why it
    does not work the way a normal pandas udf does.
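
To make the distinction behind this comment concrete, here is a minimal sketch in plain pandas, not Spark. It only illustrates the two function shapes the docstring lists. The names `subtract_mean` and the sample data are hypothetical, and the explicit loop over groups is a stand-in for what `GroupedData.apply` is described as doing, not its implementation.

```python
import pandas as pd


# Shape 1: Series -> Series of the same length.
# This is the contract the docstring gives for select/withColumn usage.
def add_one(x: pd.Series) -> pd.Series:
    return x + 1


# Shape 2: DataFrame -> DataFrame.
# The function sees all columns of one group at once (hypothetical example).
def subtract_mean(group: pd.DataFrame) -> pd.DataFrame:
    return group.assign(v=group["v"] - group["v"].mean())


# Hypothetical sample data: two groups keyed by "id".
pdf = pd.DataFrame({"id": [1, 1, 2, 2, 2], "v": [1.0, 2.0, 3.0, 5.0, 10.0]})

# Shape 1 is applied to a single column.
plus_one = add_one(pdf["v"])

# Shape 2 is applied once per group; the results are concatenated.
# This loop only emulates a grouped apply in plain pandas.
centered = pd.concat([subtract_mean(g) for _, g in pdf.groupby("id")])

print(plus_one.tolist())
print(centered["v"].tolist())
```

A function of the second shape expects a whole group as a `pandas.DataFrame`, so it cannot be dropped into a column expression that hands it one `pandas.Series` per argument. That is the context a reader would be missing without the word "only" or a note.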


