[GitHub] spark pull request #19505: [SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby()....

ueshin Mon, 16 Oct 2017 01:17:01 -0700

Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19505#discussion_r144780130
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2192,67 +2195,82 @@ def pandas_udf(f=None, returnType=StringType()):
         :param f: user-defined function. A python function if used as a 
standalone function
         :param returnType: a :class:`pyspark.sql.types.DataType` object
     
    -    The user-defined function can define one of the following 
transformations:
    -
    -    1. One or more `pandas.Series` -> A `pandas.Series`
    -
    -       This udf is used with :meth:`pyspark.sql.DataFrame.withColumn` and
    -       :meth:`pyspark.sql.DataFrame.select`.
    -       The returnType should be a primitive data type, e.g., 
`DoubleType()`.
    -       The length of the returned `pandas.Series` must be of the same as 
the input `pandas.Series`.
    -
    -       >>> from pyspark.sql.types import IntegerType, StringType
    -       >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
    -       >>> @pandas_udf(returnType=StringType())
    -       ... def to_upper(s):
    -       ...     return s.str.upper()
    -       ...
    -       >>> @pandas_udf(returnType="integer")
    -       ... def add_one(x):
    -       ...     return x + 1
    -       ...
    -       >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", 
"name", "age"))
    -       >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), 
add_one("age")) \\
    -       ...     .show()  # doctest: +SKIP
    -       +----------+--------------+------------+
    -       |slen(name)|to_upper(name)|add_one(age)|
    -       +----------+--------------+------------+
    -       |         8|      JOHN DOE|          22|
    -       +----------+--------------+------------+
    -
    -    2. A `pandas.DataFrame` -> A `pandas.DataFrame`
    -
    -       This udf is only used with :meth:`pyspark.sql.GroupedData.apply`.
    -       The returnType should be a :class:`StructType` describing the 
schema of the returned
    -       `pandas.DataFrame`.
    -
    -       >>> df = spark.createDataFrame(
    -       ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    -       ...     ("id", "v"))
    -       >>> @pandas_udf(returnType=df.schema)
    -       ... def normalize(pdf):
    -       ...     v = pdf.v
    -       ...     return pdf.assign(v=(v - v.mean()) / v.std())
    -       >>> df.groupby('id').apply(normalize).show()  # doctest: +SKIP
    -       +---+-------------------+
    -       | id|                  v|
    -       +---+-------------------+
    -       |  1|-0.7071067811865475|
    -       |  1| 0.7071067811865475|
    -       |  2|-0.8320502943378437|
    -       |  2|-0.2773500981126146|
    -       |  2| 1.1094003924504583|
    -       +---+-------------------+
    -
    -       .. note:: This type of udf cannot be used with functions such as 
`withColumn` or `select`
    -                 because it defines a `DataFrame` transformation rather 
than a `Column`
    -                 transformation.
    -
    -       .. seealso:: :meth:`pyspark.sql.GroupedData.apply`
    +    The user-defined function can define the following transformation:
    +
    +    One or more `pandas.Series` -> A `pandas.Series`
    +
    +    This udf is used with :meth:`pyspark.sql.DataFrame.withColumn` and
    +    :meth:`pyspark.sql.DataFrame.select`.
    +    The returnType should be a primitive data type, e.g., `DoubleType()`.
    +    The length of the returned `pandas.Series` must be of the same as the 
input `pandas.Series`.
    +
    +    >>> from pyspark.sql.types import IntegerType, StringType
    +    >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
    +    >>> @pandas_udf(returnType=StringType())
    +    ... def to_upper(s):
    +    ...     return s.str.upper()
    +    ...
    +    >>> @pandas_udf(returnType="integer")
    +    ... def add_one(x):
    +    ...     return x + 1
    +    ...
    +    >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", 
"age"))
    +    >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), 
add_one("age")) \\
    +    ...     .show()  # doctest: +SKIP
    +    +----------+--------------+------------+
    +    |slen(name)|to_upper(name)|add_one(age)|
    +    +----------+--------------+------------+
    +    |         8|      JOHN DOE|          22|
    +    +----------+--------------+------------+
    +
    +    .. note:: The user-defined function must be deterministic.
    +    """
    +    return _create_udf(f, returnType=returnType, vectorized=True, 
grouped=False)
    +
    +
    +@since(2.3)
    +def pandas_grouped_udf(f=None, returnType=StructType()):
    --- End diff --
    
    The fields of the return type is used as the output of the plan. I guess 
the field names are also useful for users.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19505: [SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby()....

Reply via email to