Thanks, it should be
“select id, time, min(x1), min(x2), … from data group by id, time order by time”

(“min” or another aggregate function to pick the other fields)

Forgot to mention that (id, time) is my primary key, so I took it for granted that 
it worked in my MySQL example.
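
In Spark 1.3 this comes out roughly as the following (a sketch; `data` is the registered table name and x1, x2 stand in for the remaining columns, as in my example):

    val results = sqlContext.sql(
      "SELECT id, time, min(x1), min(x2) FROM data GROUP BY id, time ORDER BY time")
    results.show()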

Best regards, Alexander


From: Richard Marscher [mailto:rmarsc...@localytics.com]
Sent: Monday, April 27, 2015 12:47 PM
To: Ulanov, Alexander
Cc: user@spark.apache.org
Subject: Re: Group by order by

It's not related to Spark, but to the concept of what you are trying to do with 
the data. Grouping by ID means consolidating the data for each ID down to 1 row 
per ID. You can sort by time after that point, yes, but you would need to either 
group by each (ID, time) pair OR apply some aggregate function to the time. 
That's what the error message is explaining. Maybe you can describe what you 
want your results to look like?

Here is some detail about the underlying operations here:

Example Data:
ID | Time     | SomeVal
1  | 02-02-15 | 4
1  | 02-03-15 | 5
2  | 02-02-15 | 4
2  | 02-02-15 | 5
2  | 02-05-15 | 2
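
(To follow along, something like this sets the sample up in a Spark 1.3 shell; the case class and the table name `sample` are just my choices for the queries below:)

    import sqlContext.implicits._
    case class Sample(id: Int, time: String, someVal: Int)
    // time is kept as a string; these MM-dd-yy values within one year sort correctly as text
    val df = sc.parallelize(Seq(
      Sample(1, "02-02-15", 4),
      Sample(1, "02-03-15", 5),
      Sample(2, "02-02-15", 4),
      Sample(2, "02-02-15", 5),
      Sample(2, "02-05-15", 2))).toDF()
    df.registerTempTable("sample")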

A.

So if you do GROUP BY id, this means 1 row per ID, like below:

ID
1
2

To include time in this projection, you need to aggregate it down to a single 
value with a function. Then, and only then, can you use it in the projection and 
sort on it.

"SELECT id, max(time) FROM sample GROUP BY id SORT BY max(time) desc;"

ID | max_time
2  | 02-05-15
1  | 02-03-15
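
The same query with the 1.3 DataFrame API would be something like (using the `df` from the setup sketch above):

    import org.apache.spark.sql.functions._
    df.groupBy("id")
      .agg(max("time").as("max_time"))  // one time value per id
      .orderBy(desc("max_time"))
      .show()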

B.

Or if you do GROUP BY id, time, then you get 1 row per (ID, time) pair:

ID | Time
1  | 02-02-15
1  | 02-03-15
2  | 02-02-15
2  | 02-05-15

Notice that the two rows with ID `2` and time `02-02-15` group down to 1 row in 
the results here. In this case you can sort the results by time without using an 
aggregate function.

"SELECT id, time FROM sample GROUP BY id, time SORT BY time desc;"

ID | Time
2  | 02-05-15
1  | 02-03-15
1  | 02-02-15
2  | 02-02-15
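
With the DataFrame API, B amounts to selecting the distinct (id, time) pairs, something like (desc comes from the same functions import as above):

    df.select("id", "time")
      .distinct()              // same effect as GROUP BY id, time with no aggregates
      .orderBy(desc("time"))
      .show()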

On Mon, Apr 27, 2015 at 3:28 PM, Ulanov, Alexander 
<alexander.ula...@hp.com> wrote:
Hi Richard,

There are several values of time per id. Is there a way to perform group by id 
and sort by time in Spark?

Best regards, Alexander

From: Richard Marscher [mailto:rmarsc...@localytics.com]
Sent: Monday, April 27, 2015 12:20 PM
To: Ulanov, Alexander
Cc: user@spark.apache.org
Subject: Re: Group by order by

Hi,

That error seems to indicate the basic query is not properly expressed. If you 
group by just ID, then Spark needs to aggregate all the time values into one 
value per ID, so you can't sort by the raw time column. That is why it suggests 
an aggregate function for time: with 1 value per ID you can project it and 
properly sort on it.
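
Concretely, the two fixes the error message hints at would look something like this, against a hypothetical table `sample`:

    // 1) wrap time in an aggregate, e.g. first(), to get one value per id
    sqlContext.sql("SELECT id, first(time) FROM sample GROUP BY id")
    // 2) or make time part of the grouping key, so it can be projected and sorted
    sqlContext.sql("SELECT id, time FROM sample GROUP BY id, time ORDER BY time")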

On Mon, Apr 27, 2015 at 3:07 PM, Ulanov, Alexander 
<alexander.ula...@hp.com> wrote:
Hi,

Could you suggest the best way to do “group by x order by y” in Spark?

When I try to perform it with Spark SQL I get the following error (Spark 1.3):

val results = sqlContext.sql("select * from sample group by id order by time")
org.apache.spark.sql.AnalysisException: expression 'time' is neither present in 
the group by, nor is it an aggregate function. Add to group by or wrap in 
first() if you don't care which value you get.;
        at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:37)

Is there a way to do it with just RDDs?
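
For example, would something like this rough sketch be the way to go (assuming an RDD of (id, time, value) rows)?

    val rows = sc.parallelize(Seq((1, "02-02-15", 4.0), (1, "02-03-15", 5.0)))
    val grouped = rows
      .map { case (id, time, v) => (id, (time, v)) }
      .groupByKey()                      // one entry per id
      .mapValues(_.toSeq.sortBy(_._1))   // each id's rows sorted by time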

Best regards, Alexander

