I have to take the first 25 IDs ranked by count(*). But the following is
not allowed in Hive

select id from T order by count(*) desc limit 25;

Which yields a "NOt yet supported place for UDAF count". The way around it
is the following

select id, count(*) as cnt from T group by id order by cnt desc limit 25;

However I need to put this query in a subquery like so

select id, XXX from T t join .... where t.id in (select id, count(*) as cnt
from T group by id order by cnt desc limit 25) group by ... ;

which does NOT work because a subquery is allowed to return only one thing.
Here XXX are complex constructs like distance calculation and binning.
These are time consuming and complex operations.

The only way I see this would work is if I use the following sub-sub query

select id, XXX from T t join ... where t.id in (select sub.id from (select
id, count(*) as cnt from T group by id order by cnt desc limit 25)sub)
group by ... ;

The reason I don't want to put the limit in the outermost query is because
those XXX queries are expensive and I don't want them to be performed on
the entire result only to retain the top 25 and throw away the rest of the
results. The count(*) operation of the rest of the IDs is not very
expensive or time consuming.

Is there some other elegant way of handling this without using a
sub-sub-query approach?

Reply via email to