This isn't specific to Spark; it's about the concept of what you are trying
to do with the data. Grouping by ID means consolidating the data down to
1 row per ID. You can sort by time after that point, yes, but you would need
to either group by each (ID, time) pair OR apply some aggregate function
to the time. That's what the error message is explaining. Maybe you can
describe what you want your results to look like?

Here is some detail about the underlying operations:

Example Data:

ID | Time     | SomeVal
1  | 02-02-15 | 4
1  | 02-03-15 | 5
2  | 02-02-15 | 4
2  | 02-02-15 | 5
2  | 02-05-15 | 2

A.

So if you do GROUP BY id, this means 1 row per ID, like below:

ID

1
2

To include time in this projection you need to aggregate it with a function
down to a single value per group. Then, and only then, can you use it in the
projection and sort on it:

"SELECT id, max(time) FROM sample GROUP BY id SORT BY max(time) desc;"

ID | max(time)
2  | 02-05-15
1  | 02-03-15
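
For completeness, the same query can be written with the DataFrame API that
shipped in Spark 1.3. A minimal sketch, assuming the sample table has
already been loaded as a DataFrame named `df` (that name is my own, not
from the thread):

import org.apache.spark.sql.functions.{col, max}

// Collapse to one row per id, reducing time to a single value
// with max, then do a total sort on that aggregated value.
val result = df.groupBy("id")
  .agg(max("time").as("maxTime"))
  .orderBy(col("maxTime").desc)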

B.

Or if you do GROUP BY id, time, then you get 1 row per (ID, time) pair:

ID | Time
1  | 02-02-15
1  | 02-03-15
2  | 02-02-15
2  | 02-05-15

Notice both rows with ID `2` and time `02-02-15` group down to 1 row in the
results here. In this case you can sort the results by time without using
an aggregate function.

"SELECT id, time FROM sample GROUP BY id, time SORT BY time desc;"

ID | Time
2  | 02-05-15
1  | 02-03-15
1  | 02-02-15
2  | 02-02-15
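
And since the original question asked about doing this with just an RDD:
if what you actually want is all the time values per ID, kept in time
order, one way is to group and then sort each group's values in memory.
A rough sketch (every name here is made up for illustration, and it
assumes each group fits in memory):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("group-then-sort").setMaster("local[*]"))

// (id, time) pairs matching the sample data above.
val pairs = sc.parallelize(Seq(
  (1, "02-02-15"), (1, "02-03-15"),
  (2, "02-02-15"), (2, "02-02-15"), (2, "02-05-15")))

// Pull all times for each id together, then sort each group in memory.
// Note: these MM-dd-yy strings only sort correctly as plain strings
// because they all share the same year.
val sortedPerId = pairs.groupByKey().mapValues(_.toSeq.sorted)

sortedPerId.collect().foreach(println)

Be aware that groupByKey ships every value for a key to a single executor,
so for large or skewed groups a secondary-sort approach via
repartitionAndSortWithinPartitions scales better.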


On Mon, Apr 27, 2015 at 3:28 PM, Ulanov, Alexander <alexander.ula...@hp.com>
wrote:

>  Hi Richard,
>
>
>
> There are several values of time per id. Is there a way to perform group
> by id and sort by time in Spark?
>
>
>
> Best regards, Alexander
>
>
>
> From: Richard Marscher [mailto:rmarsc...@localytics.com]
> Sent: Monday, April 27, 2015 12:20 PM
> To: Ulanov, Alexander
> Cc: user@spark.apache.org
> Subject: Re: Group by order by
>
>
>
> Hi,
>
>
>
> That error seems to indicate the basic query is not properly expressed. If
> you group by just ID, then Spark would need to aggregate all the time
> values into one value per ID, so you can't sort by the raw time column.
> That's why it suggests an aggregate function for time, so you can have 1
> value per ID and properly sort on it.
>
>
>
> On Mon, Apr 27, 2015 at 3:07 PM, Ulanov, Alexander <
> alexander.ula...@hp.com> wrote:
>
>  Hi,
>
>
>
> Could you suggest what is the best way to do “group by x order by y” in
> Spark?
>
>
>
> When I try to perform it with Spark SQL I get the following error (Spark
> 1.3):
>
>
>
> val results = sqlContext.sql("select * from sample group by id order by
> time")
>
> org.apache.spark.sql.AnalysisException: expression 'time' is neither
> present in the group by, nor is it an aggregate function. Add to group by
> or wrap in first() if you don't care which value you get.;
>
>         at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:37)
>
>
>
> Is there a way to do it with just RDD?
>
>
>
> Best regards, Alexander
>
>
>
