Hi All,
I'm a developer at Qubole (http://www.qubole.com) looking at Hadoop and Hive.
In my past life, I was on the optimizer team of Greenplum Parallel Database.
I'm a newbie to the Hive mailing list, so apologies for any missteps. I've done
some searching in the Hive mailing list and JIRA and have not found any
discussions around this topic - please feel free to redirect me to any old
discussions I might've missed.
A class of queries we're interested in optimizing are top-k queries i.e.
queries of the form:
(1) SELECT x, y from T order by z limit 10
You can imagine similar query with aggregates:
(2) SELECT x, y, count(*) as c from T group by x, y order by c desc limit 10
I'll continue my discussion with example (1) for simplicity. The way such a
query is executed, every mapper sorts all rows from T and writes it to local
files. Reducers (in this example, singular) read these files and merge them.
These rows are fed to the limit operator which stops after 10 rows.
The change I'm proposing is a combination of Hive and Hadoop changes which will
greatly improve the performance of such queries:
Hadoop change:
- New parameter map.sort.limitrecords which determines how many records
each mapper in a job will send to every reducer
- When writing out local files after sorting, map-task stops after
map.sort.limitrecords records for each reducer
- Effectively, each mapper sends out its top-K records
Hive change:
- Determining when the Top-K optimization is applicable and setting K
in ReduceSinkDesc
- Passing the K value along to MapredWork
- ExecDriver sets map.sort.limitrecords before executing the job
corresponding to the MapredWork
This change will reduce the amount of I/O that happens on the map-side (writing
only 10 rows per reducer as opposed to entire table) and can have a big effect
on performance. Furthermore, it is possible to make the sort on the mapper side
a top-k sort which can further improve performance - but the deep pocket is
really the I/O savings. In my experiments, I see a 5x performance improvement
for such queries.
Please let me know if this is of general interest - I'll be happy to contribute
this back to the community. I'll also be mailing the Hadoop mailing list about
this.
Thanks
Siva