For what it's worth, I get the expected result (the "filter" count agrees with the "group by" count) when I run the same experiment against a DataFrame loaded from a relational store:
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val df = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1",
      "dbtable" -> "app.outcomes")).load()

df.select("OUTCOME").groupBy("OUTCOME").count.show
//
// returns:
//
// +-------+-----+
// |OUTCOME|count|
// +-------+-----+
// |      A|  128|
// |      B|  256|
// +-------+-----+

df.filter("OUTCOME = 'A'").count
//
// returns:
//
// res1: Long = 128

df.registerTempTable("test_data")
sqlContext.sql("select OUTCOME, count( OUTCOME ) from test_data group by OUTCOME").show
//
// returns:
//
// +-------+---+
// |OUTCOME|_c1|
// +-------+---+
// |      A|128|
// |      B|256|
// +-------+---+

Thanks,
-Rick

Michael Kelly <michaelkellycl...@gmail.com> wrote on 09/21/2015 08:06:29 AM:

> From: Michael Kelly <michaelkellycl...@gmail.com>
> To: user@spark.apache.org
> Date: 09/21/2015 08:08 AM
> Subject: Count for select not matching count for group by
>
> Hi,
>
> I'm seeing some strange behaviour with Spark 1.5. I have a DataFrame
> that I built by loading and joining some Hive tables stored in S3.
>
> The DataFrame is cached in memory, using df.cache.
>
> What I'm seeing is that the counts I get when I do a group by on a
> column are different from what I get when I filter/select and count.
>
> df.select("outcome").groupBy("outcome").count.show
>
> outcome | count
> ---------------
> 'A'     |   100
> 'B'     |   200
>
> df.filter("outcome = 'A'").count
> // 50
>
> df.filter(df("outcome") === "A").count
> // 50
>
> I expect the count of rows that match 'A' in the groupBy result to
> match the count when filtering. Any ideas what might be happening?
>
> Thanks,
>
> Michael
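
Since the problem report mentions that the DataFrame is cached with df.cache, one way to narrow things down is to compare the filtered count computed fresh from the source tables with the count answered from the cached copy. A minimal sketch, not from the thread itself, assuming a spark-shell session and a DataFrame named df with an "outcome" column as in the original mail:

df.unpersist()                                     // drop any cached copy
val fromSource = df.filter("outcome = 'A'").count  // recomputed from the underlying tables
df.cache()
df.count                                           // materialize the cache
val fromCache = df.filter("outcome = 'A'").count   // answered from the cached copy
println(s"from source = $fromSource, from cache = $fromCache")

If the two counts disagree, the cached copy itself is suspect; if they agree, the discrepancy lies elsewhere, for example in the group-by path.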