Wes McKinney created SPARK-13943:
------------------------------------

             Summary: The behavior of sum(booleantype) in Spark DataFrames is not intuitive
                 Key: SPARK-13943
                 URL: https://issues.apache.org/jira/browse/SPARK-13943
             Project: Spark
          Issue Type: Bug
          Components: PySpark
            Reporter: Wes McKinney

In NumPy and pandas, summing boolean data produces an integer indicating the number of True values:

{code}
In [1]: import numpy as np

In [2]: arr = np.random.randn(1000000)

In [3]: (arr > 0).sum()
Out[3]: 499065
{code}

In PySpark, calling {{sql.functions.sum(expr)}} on a boolean expression results in an error:

{code}
AnalysisException: u"cannot resolve 'sum((`data0` > CAST(0 AS DOUBLE)))' due to data type mismatch: function sum requires numeric types, not BooleanType;"
{code}

FWIW, R is the same:

{code}
> sum(rnorm(1000000) > 0)
[1] 499139
{code}

Spark should consider emulating the behavior of R and Python in those environments.
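
For illustration only, here is a minimal pure-Python sketch of the semantics being requested: summing booleans counts the True values, and casting each boolean to an integer first gives the same result. The list {{data0}} is a hypothetical stand-in for the column named in the error message, and the sample size and seed are arbitrary choices; none of this comes from the Spark code base.

```python
import random

# Hypothetical stand-in for the `data0` column: a plain list of floats.
random.seed(0)
data0 = [random.gauss(0.0, 1.0) for _ in range(1000)]

# Element-wise boolean predicate, analogous to `arr > 0` in the NumPy example.
mask = [x > 0 for x in data0]

# Python's bool is a subclass of int, so summing the mask counts True values.
count_implicit = sum(mask)

# Explicit cast first: the shape a cast-then-sum workaround takes when
# a sum() implementation accepts only numeric types.
count_explicit = sum(int(flag) for flag in mask)

print(count_implicit, count_explicit)
```

Both counts agree. In PySpark, the analogous step would presumably be to cast the boolean column to an integer type before summing, for example {{sum((df.data0 > 0).cast("int"))}}; that is an assumed workaround, not something stated in this issue.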