[jira] [Updated] (SPARK-18358) Multiple Aggregation Using 'countDistinct' and 'first' result in error

Herman van Hovell (JIRA) Wed, 16 Nov 2016 08:03:23 -0800

     [ 
https://issues.apache.org/jira/browse/SPARK-18358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Herman van Hovell updated SPARK-18358:
--------------------------------------
    Target Version/s: 2.1.0

> Multiple Aggregation Using 'countDistinct' and 'first' result in error 
> -----------------------------------------------------------------------
>
>                 Key: SPARK-18358
>                 URL: https://issues.apache.org/jira/browse/SPARK-18358
>             Project: Spark
>          Issue Type: Bug
>         Environment: Mac OS X 10.9.5
> Apache Spark 2.0.1
> Hadoop 1.4
>            Reporter: Chris Nasrallah
>
> Using pyspark, when I perform multiple aggregations on the same groupBy 
> object, combining 'first' with two 'countDistinct' aggregations results in 
> a Py4JJavaError. Either function on its own, or 'first' with a single 
> 'countDistinct', works as expected.
> {code:borderStyle=solid}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as sfn
> # Bind the session to 'spark', the name used by createDataFrame below
> spark = SparkSession.builder.master('local').getOrCreate()
> df = spark.createDataFrame([
>         (1, 'a', 'z'),
>         (1, 'b', 'x'),
>         (1, 'a', 'y'),
>         (1, 'a', 'x'),
>         (2, 'b', 'z'),
>         (2, 'b', 'z')
>     ], ['id', 'var1', 'var2'])
> ## Using two 'first' and one 'countDistinct' aggregations works
> df.groupby('id')    \
>         .agg(sfn.first('var1'),  \
>                 sfn.first('var2'),  \
>                 sfn.countDistinct('var1')).show()
>                          
> ## Using one 'max' with both 'countDistinct' works:
> df.groupby('id')    \
>          .agg(sfn.max('var2'),                \
>                  sfn.countDistinct('var1'),   \
>                  sfn.countDistinct('var2')).show()
> ## But using both 'countDistinct' with at least one 'first' crashes
> df.groupby('id')    \
>         .agg(sfn.first('var1'),   \
>                 sfn.first('var2'),   \
>                 sfn.countDistinct('var1'), \
>                 sfn.countDistinct('var2')) \
>         .show()
> {code}
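
For reference, the result the failing query is presumably meant to produce can be worked out by hand. The following is a minimal plain-Python sketch, not Spark code: the helper name `expected_aggregation` and the output column order are illustrative assumptions, and it assumes 'first' returns the first row in input order, which Spark does not guarantee in general.

```python
from collections import OrderedDict

# Same rows as the DataFrame in the report: (id, var1, var2)
rows = [
    (1, 'a', 'z'),
    (1, 'b', 'x'),
    (1, 'a', 'y'),
    (1, 'a', 'x'),
    (2, 'b', 'z'),
    (2, 'b', 'z'),
]


def expected_aggregation(rows):
    """Hypothetical helper: emulate first/first/countDistinct/countDistinct per id."""
    groups = OrderedDict()
    for id_, var1, var2 in rows:
        # The first row seen for an id supplies the 'first' values (assumes input order).
        group = groups.setdefault(
            id_,
            {'first_var1': var1, 'first_var2': var2, 'var1': set(), 'var2': set()},
        )
        # Sets keep only distinct values, so their sizes match countDistinct.
        group['var1'].add(var1)
        group['var2'].add(var2)
    return [
        (id_, g['first_var1'], g['first_var2'], len(g['var1']), len(g['var2']))
        for id_, g in groups.items()
    ]


result = expected_aggregation(rows)
for row in result:
    print(row)
# (1, 'a', 'z', 2, 3)
# (2, 'b', 'z', 1, 1)
```

Under those assumptions, a fixed Spark build would be expected to return these two rows for the third query instead of raising the error.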



