For what it's worth, I get the expected result that the "filter" count
matches the "group by" count when I run the same experiment against a
DataFrame loaded from a relational store:

import org.apache.spark.sql._
import org.apache.spark.sql.types._

// Load the app.OUTCOMES table from a local Derby database over JDBC.
val df = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1",
      "dbtable" -> "app.outcomes")).load()

df.select("OUTCOME").groupBy("OUTCOME").count.show
#
# returns:
#
# +-------+-----+
# |OUTCOME|count|
# +-------+-----+
# |      A|  128|
# |      B|  256|
# +-------+-----+

df.filter("OUTCOME = 'A'").count
#
# returns:
#
# res1: Long = 128
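
For completeness, here is the Column-expression form of the same filter from
the original report (a sketch against the df above; it should agree with the
string-expression filter):

// Same predicate expressed via the Column API rather than a SQL string.
df.filter(df("OUTCOME") === "A").count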


df.registerTempTable("test_data")
sqlContext.sql("select OUTCOME, count( OUTCOME ) from test_data group by
OUTCOME").show
#
# returns:
#
# +-------+---+
# |OUTCOME|_c1|
# +-------+---+
# |      A|128|
# |      B|256|
# +-------+---+
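
If it helps to automate the comparison, a minimal sketch (assuming the df
defined above) that collects the group-by counts into a map and checks the
filter count against it:

// Collect the group-by counts into a Map keyed by OUTCOME value.
val grouped = df.groupBy("OUTCOME").count.collect()
  .map(r => r.getString(0) -> r.getLong(1)).toMap

// The filter count for 'A' should equal the grouped count for 'A'.
val filteredA = df.filter("OUTCOME = 'A'").count
assert(grouped("A") == filteredA, s"mismatch: ${grouped("A")} vs $filteredA")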

Thanks,
-Rick

Michael Kelly <michaelkellycl...@gmail.com> wrote on 09/21/2015 08:06:29 AM:

> From: Michael Kelly <michaelkellycl...@gmail.com>
> To: user@spark.apache.org
> Date: 09/21/2015 08:08 AM
> Subject: Count for select not matching count for group by
>
> Hi,
>
> I'm seeing some strange behaviour with Spark 1.5. I have a dataframe
> that I have built from loading and joining some Hive tables stored in
> S3.
>
> The dataframe is cached in memory, using df.cache.
>
> What I'm seeing is that the counts I get when I do a group by on a
> column are different from what I get when I filter/select and count.
>
> df.select("outcome").groupBy("outcome").count.show
> outcome | count
> ----------------------
> 'A'           |  100
> 'B'           |  200
>
> df.filter("outcome = 'A'").count
> # 50
>
> df.filter(df("outcome") === "A").count
> # 50
>
> I expect the count of rows that match 'A' in the groupBy to match
> the count when filtering. Any ideas what might be happening?
>
> Thanks,
>
> Michael
>
