Hi Siva,

Take a look at https://issues.apache.org/jira/browse/HIVE-3562.

It is on my todo list, but I have not been able to review it yet.

I think this addresses a very similar problem. If so, can you also review
the above patch?


Thanks,
-namit


On 11/19/12 3:10 PM, "Sivaramakrishnan Narayanan" <tarb...@gmail.com>
wrote:

>Hi All,
>
>I'm a developer at Qubole (http://www.qubole.com) looking at Hadoop and
>Hive. In my past life, I was on the optimizer team of Greenplum Parallel
>Database. I'm a newbie to the Hive mailing list, so apologies for any
>missteps. I've done some searching in the Hive mailing list and JIRA and
>have not found any discussions around this topic - please feel free to
>redirect me to any old discussions I might've missed.
>
>A class of queries we're interested in optimizing are top-k queries i.e.
>queries of the form:
>
>(1) SELECT x, y from T order by z limit 10
>
>You can imagine a similar query with aggregates:
>
>(2) SELECT x, y, count(*) as c from T group by x, y order by c desc limit
>10
>
>I'll continue my discussion with example (1) for simplicity. The way such
>a query is currently executed, every mapper sorts all rows from T and
>writes them to local files. The reducers (a single one in this example)
>read these files and merge them. The merged rows are fed to the limit
>operator, which stops after 10 rows.
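To make the baseline concrete, here is a minimal, self-contained Java sketch (not Hive code; the class and method names are mine) of the current plan for example (1): every mapper sorts and emits its entire split, and the single reducer merges everything before the limit operator discards all but the first 10 rows.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simulation of the current plan (illustrative only, not Hive code):
// each "mapper" sorts and ships ALL of its rows; the single "reducer"
// merges them, and the limit operator keeps only the first k.
public class BaselineTopK {
    public static List<Integer> run(List<List<Integer>> mapperInputs, int k) {
        List<Integer> merged = new ArrayList<>();
        for (List<Integer> split : mapperInputs) {
            List<Integer> sorted = new ArrayList<>(split);
            Collections.sort(sorted);   // every mapper sorts its whole split
            merged.addAll(sorted);      // and every row reaches the reducer
        }
        Collections.sort(merged);       // reducer-side merge (simplified)
        return merged.subList(0, Math.min(k, merged.size()));  // limit
    }
}
```

Note that the map-side cost here is proportional to the full input, even though only k rows survive the limit.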
>
>The change I'm proposing is a combination of Hive and Hadoop changes
>which will greatly improve the performance of such queries:
>
>Hadoop change:
>       - A new parameter, map.sort.limitrecords, which determines how many
>records each mapper in a job will send to each reducer
>       - When writing out local files after sorting, the map task stops after
>map.sort.limitrecords records for each reducer
>       - Effectively, each mapper sends out its top-K records
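The per-reducer bound described above can be sketched as a size-bounded heap (illustrative Java only, not actual Hadoop code; the class name is mine): for an ascending sort, the mapper keeps the K smallest keys seen so far in a max-heap and evicts the current largest whenever the buffer exceeds K.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch of the proposed map.sort.limitrecords behaviour:
// per reducer, retain at most k records. For an ascending sort we keep
// the k smallest keys, so the buffer is a max-heap and we evict the
// largest element whenever the heap grows past k.
public class TopKBuffer {
    private final int k;
    private final PriorityQueue<Integer> heap =
            new PriorityQueue<>(Collections.reverseOrder());  // max-heap

    public TopKBuffer(int k) { this.k = k; }

    public void collect(int key) {
        heap.offer(key);
        if (heap.size() > k) {
            heap.poll();  // drop the current largest; it cannot be in the top-k
        }
    }

    // Emit the surviving records in sorted order, as the spill would.
    public List<Integer> spill() {
        List<Integer> out = new ArrayList<>(heap);
        Collections.sort(out);
        return out;
    }
}
```

With this in place, each mapper spills at most K records per reducer regardless of input size, which is where the I/O savings discussed below come from.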
>
>Hive change:
>       - Determining when the Top-K optimization is applicable and setting K in
>ReduceSinkDesc
>       - Passing the K value along to MapredWork
>       - ExecDriver sets map.sort.limitrecords before executing the job
>corresponding to the MapredWork
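The three Hive-side steps above could be plumbed together roughly as follows. This is a hypothetical sketch only: `detectTopK`, `configureJob`, and the use of java.util.Properties as a stand-in for the Hadoop job configuration are all mine; the real change would touch ReduceSinkDesc, MapredWork, and ExecDriver.

```java
import java.util.Properties;

// Hypothetical sketch of the Hive-side plumbing. Properties stands in
// for the Hadoop JobConf; method names here are illustrative only.
public class TopKPlumbing {
    // Step 1 (optimizer / ReduceSinkDesc): K as determined by the plan,
    // or -1 if the top-K optimization does not apply to this reduce sink.
    static int detectTopK(boolean hasOrderBy, boolean hasLimit, int limit) {
        return (hasOrderBy && hasLimit) ? limit : -1;
    }

    // Step 3 (ExecDriver): set the proposed Hadoop parameter before
    // submitting the job, but only when the optimization applies.
    static void configureJob(Properties jobConf, int topK) {
        if (topK > 0) {
            jobConf.setProperty("map.sort.limitrecords",
                                Integer.toString(topK));
        }
    }
}
```

For query (1), detection would yield K = 10 and the parameter would be set; for a query with no ORDER BY or no LIMIT, the parameter would be left unset and the job would run unchanged.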
>
>This change will reduce the amount of I/O that happens on the map side
>(each mapper writes only 10 rows per reducer as opposed to its entire
>input) and can have a big effect on performance. For example, with 100
>mappers, a single reducer, and K = 10, the map side spills about 1,000
>rows in total rather than every row of T. Furthermore, it is possible to
>make the map-side sort a top-K sort, which can further improve
>performance - but the deep pocket is really the I/O savings. In my
>experiments, I see a 5x performance improvement for such queries.
>
>Please let me know if this is of general interest - I'll be happy to
>contribute this back to the community. I'll also be mailing the Hadoop
>mailing list about this.
>
>Thanks
>Siva
