[ https://issues.apache.org/jira/browse/SPARK-18358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Herman van Hovell updated SPARK-18358:
--------------------------------------
    Target Version/s: 2.1.0

> Multiple Aggregation Using 'countDistinct' and 'first' result in error
> -----------------------------------------------------------------------
>
>                 Key: SPARK-18358
>                 URL: https://issues.apache.org/jira/browse/SPARK-18358
>             Project: Spark
>          Issue Type: Bug
>         Environment: Mac OS X 10.9.5
> Apache Spark 2.0.1
> Hadoop 1.4
>            Reporter: Chris Nasrallah
>
> In pyspark, performing multiple aggregations on the same groupBy object
> with both 'first' and 'countDistinct' raises a Py4JJavaError once two
> 'countDistinct' aggregations are combined with at least one 'first'.
> {code:borderStyle=solid}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as sfn
> # Named 'spark' so that it matches the createDataFrame call below
> spark = SparkSession.builder.master('local').getOrCreate()
> df = spark.createDataFrame([
>     (1, 'a', 'z'),
>     (1, 'b', 'x'),
>     (1, 'a', 'y'),
>     (1, 'a', 'x'),
>     (2, 'b', 'z'),
>     (2, 'b', 'z')
> ], ['id', 'var1', 'var2'])
> ## Using two 'first' and one 'countDistinct' aggregations works
> df.groupby('id') \
>     .agg(sfn.first('var1'), \
>          sfn.first('var2'), \
>          sfn.countDistinct('var1')).show()
>
> ## Using one 'max' with both 'countDistinct' works:
> df.groupby('id') \
>     .agg(sfn.max('var2'), \
>          sfn.countDistinct('var1'), \
>          sfn.countDistinct('var2')).show()
> ## But using both 'countDistinct' with at least one 'first' crashes
> df.groupby('id') \
>     .agg(sfn.first('var1'), \
>          sfn.first('var2'), \
>          sfn.countDistinct('var1'), \
>          sfn.countDistinct('var2')) \
>     .show()
> {code}
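
For reference, here is a minimal pure-Python sketch of the result the failing aggregation would be expected to produce on the sample rows from the report. It does not use Spark and says nothing about the cause of the error. The helper name `expected_aggregation` is hypothetical, and it assumes 'first' means the first value in input order, which Spark itself does not guarantee in general.

```python
# Sample rows copied from the report: (id, var1, var2)
rows = [
    (1, 'a', 'z'),
    (1, 'b', 'x'),
    (1, 'a', 'y'),
    (1, 'a', 'x'),
    (2, 'b', 'z'),
    (2, 'b', 'z'),
]


def expected_aggregation(rows):
    """Hypothetical helper: per id, return
    (id, first(var1), first(var2), countDistinct(var1), countDistinct(var2)).

    Assumption: 'first' is taken in input order.
    """
    groups = {}
    for id_, var1, var2 in rows:
        if id_ not in groups:
            # The first row seen for this id supplies the 'first' values.
            groups[id_] = {
                'first_var1': var1,
                'first_var2': var2,
                'distinct_var1': set(),
                'distinct_var2': set(),
            }
        groups[id_]['distinct_var1'].add(var1)
        groups[id_]['distinct_var2'].add(var2)
    return [
        (id_, g['first_var1'], g['first_var2'],
         len(g['distinct_var1']), len(g['distinct_var2']))
        for id_, g in groups.items()
    ]


result = expected_aggregation(rows)
for row in result:
    print(row)
```

Under these assumptions the two groups come out as (1, 'a', 'z', 2, 3) and (2, 'b', 'z', 1, 1), which is what a fixed build could be checked against.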