[ https://issues.apache.org/jira/browse/SPARK-13943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204219#comment-15204219 ]
Liang-Chi Hsieh commented on SPARK-13943:
-----------------------------------------

Yes, I think so.

> The behavior of sum(booleantype) in Spark DataFrames is not intuitive
> ---------------------------------------------------------------------
>
>                 Key: SPARK-13943
>                 URL: https://issues.apache.org/jira/browse/SPARK-13943
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>            Reporter: Wes McKinney
>
> In NumPy and pandas, summing boolean data produces an integer indicating the
> number of True values:
> {code}
> In [1]: import numpy as np
> In [2]: arr = np.random.randn(1000000)
> In [3]: (arr > 0).sum()
> Out[3]: 499065
> {code}
> In PySpark, {{sql.functions.sum(expr)}} results in an error:
> {code}
> AnalysisException: u"cannot resolve 'sum((`data0` > CAST(0 AS DOUBLE)))' due
> to data type mismatch: function sum requires numeric types, not BooleanType;"
> {code}
> FWIW, R is the same:
> {code}
> > sum(rnorm(1000000) > 0)
> [1] 499139
> {code}
> Spark should consider emulating the behavior of R and Python in those
> environments.
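
For readers following the thread, here is a minimal plain-Python sketch of the semantics the issue asks for: summing booleans counts the True values, and casting each boolean to an integer first gives the same count. The list `data0` is hypothetical sample data named after the column in the error message; nothing here is taken from Spark itself.

```python
import random

# Hypothetical sample data standing in for the `data0` column in the report.
random.seed(0)
data0 = [random.gauss(0.0, 1.0) for _ in range(1000)]

# Python's bool is a subclass of int, so summing comparisons counts the
# True values, matching the NumPy/pandas/R behavior described in the issue.
count_implicit = sum(x > 0 for x in data0)

# Casting each boolean to an integer before summing gives the same count;
# this is the shape a sum() restricted to numeric types would require.
count_explicit = sum(int(x > 0) for x in data0)

print(count_implicit == count_explicit)
```

In PySpark the analogous workaround would presumably be to cast the boolean expression to an integer type before aggregating, along the lines of `sum((col("data0") > 0).cast("int"))`. That is an assumption on my part, not something stated in this thread.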