Re: Count for select not matching count for group by

2015-09-22 Thread Michael Armbrust
This looks like something is wrong with predicate pushdown. Can you include the output of calling explain, and tell us what format the data is stored in?

On Mon, Sep 21, 2015 at 8:06 AM, Michael Kelly wrote:
> Hi,
>
> I'm seeing some strange behaviour with spark

Re: Count for select not matching count for group by

2015-09-21 Thread Richard Hillegas
#
# returns:
#
# +-------+---+
# |OUTCOME|_c1|
# +-------+---+
# |      A|128|
# |      B|256|
# +-------+---+

Thanks,
-Rick

Michael Kelly <michaelkellycl...@gmail.com> wrote on 09/21/2015 08:06:29 AM:
> From: Michael Kelly <michaelkellycl...@gmail.com>
> To: user@spark.

Count for select not matching count for group by

2015-09-21 Thread Michael Kelly
Hi,

I'm seeing some strange behaviour with Spark 1.5. I have a dataframe that I have built by loading and joining some Hive tables stored in S3. The dataframe is cached in memory using df.cache. What I'm seeing is that the counts I get when I do a group by on a column are different from what