[ https://issues.apache.org/jira/browse/SPARK-13946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15270151#comment-15270151 ]
Niranjan Molkeri commented on SPARK-13946:
-------------------------------------------

Hi, I ran the following code:

{noformat}
import numpy as np
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="fooAPP")
sqlContext = SQLContext(sc)

df = pd.DataFrame({'foo': np.random.randn(1000000),
                   'bar': np.random.randn(1000000)})
sdf = sqlContext.createDataFrame(df)
sdf2 = sdf[sdf.bar > 0]
# sdf.agg(F.count(sdf2.foo)).show()
sdfCount = sdf.count()
sdf2Count = sdf2.count()
{noformat}

sdf.count() returns 1000000, and sdf2.count() returns roughly 500000 on average. Can you tell me what "F" is in

{noformat}
sdf.agg(F.count(sdf2.foo)).show()
{noformat}

so that I can test further and look into the issue? Thank you.

> PySpark DataFrames allows you to silently use aggregate expressions derived
> from different table expressions
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-13946
>                 URL: https://issues.apache.org/jira/browse/SPARK-13946
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>            Reporter: Wes McKinney
>
> In my opinion, this code should raise an exception rather than silently
> discarding the predicate:
>
> {code}
> import numpy as np
> import pandas as pd
>
> df = pd.DataFrame({'foo': np.random.randn(1000000),
>                    'bar': np.random.randn(1000000)})
>
> sdf = sqlContext.createDataFrame(df)
> sdf2 = sdf[sdf.bar > 0]
>
> sdf.agg(F.count(sdf2.foo)).show()
> +----------+
> |count(foo)|
> +----------+
> |   1000000|
> +----------+
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
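The behavior the report asks for is that an aggregate expression built from a column of a *different* DataFrame should raise rather than be silently accepted. The sketch below is not Spark's actual internals; it is a toy pure-Python illustration (all class and attribute names here are invented for the example) of how provenance tracking on columns could reject such a call:

```python
class Column:
    """A column that remembers which DataFrame it came from."""
    def __init__(self, name, source):
        self.name = name
        self.source = source  # provenance: the owning DataFrame

class DataFrame:
    """A minimal stand-in DataFrame holding named columns."""
    def __init__(self, name, column_names):
        self.name = name
        self.columns = {c: Column(c, self) for c in column_names}

    def agg(self, col):
        # Proposed behavior from SPARK-13946: reject expressions
        # derived from another DataFrame instead of silently
        # dropping that DataFrame's predicates.
        if col.source is not self:
            raise ValueError(
                "column '%s' belongs to DataFrame '%s', not '%s'"
                % (col.name, col.source.name, self.name))
        return "count(%s)" % col.name

sdf = DataFrame("sdf", ["foo", "bar"])
sdf2 = DataFrame("sdf2", ["foo", "bar"])

print(sdf.agg(sdf.columns["foo"]))   # same-DataFrame column: accepted
try:
    sdf.agg(sdf2.columns["foo"])     # foreign column: should raise
except ValueError as e:
    print("rejected:", e)
```

The key design choice is that each column expression carries a reference to its originating table expression, so an aggregation can check identity (`is`) against itself before building a plan.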