If someone is interested in adding parallel ORDER BY to Hive (using TotalOrderPartitioner), here's a good starting point:
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad

The goal would be to take that manual two-step sample-then-sort process and turn it into an automatic plan within Hive. I have a better example for the sampling query which I haven't published yet.

We would also need to name the final output files in such a way that the total order could be iterated via the filenames.

JVS

________________________________________
From: Ning Zhang [nzh...@facebook.com]
Sent: Friday, June 11, 2010 12:40 PM
To: 'hive-u...@hadoop.apache.org'
Cc: 'hive-dev@hadoop.apache.org'
Subject: Re: Is anybody working on the globally "order by" of hive ?

Good idea Edward. It would definitely be better if it is what it sounds like.

Btw Jeff, order by is supported in trunk with certain limitations in strict mode (it has to have a limit). I will be able to update the wiki when I come back.

Thanks,
Ning
------
Sent from my blackberry

________________________________
From: Edward Capriolo <edlinuxg...@gmail.com>
To: hive-u...@hadoop.apache.org <hive-u...@hadoop.apache.org>
Cc: hive-dev@hadoop.apache.org <hive-dev@hadoop.apache.org>
Sent: Fri Jun 11 11:13:57 2010
Subject: Re: Is anybody working on the globally "order by" of hive ?

On Fri, Jun 11, 2010 at 5:24 AM, Jeff Zhang <zjf...@gmail.com> wrote:

Hi all,

From the wiki of Hive, Hive does not have the feature of a global "order by"; the sort by of Hive is per reducer. Our team thinks the global "order by" is an important feature for users, so I am wondering whether anybody is working on it. I am very interested in being involved.

--
Best Regards

Jeff Zhang

Jeff,

I was wondering if TotalOrderPartitioner in hadoop 20 could play a role in this. As of now order by sets reduce tasks to 1 :)

Edward
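
For readers new to the idea, the sample-then-sort approach described at the top of this thread can be sketched in a few lines. This is only a toy illustration in plain Python, not Hive or Hadoop code and not the actual TotalOrderPartitioner implementation. The function names, the sample size, and the zero-padded `part-NNNNN` naming scheme are all assumptions made for the example.

```python
import bisect
import random


def choose_split_points(sample, num_partitions):
    """Step 1 (sample): pick num_partitions - 1 boundary keys from a sample."""
    ordered = sorted(sample)
    if not ordered:
        return []
    # Evenly spaced quantiles of the sample approximate evenly sized ranges.
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def partition_for(key, split_points):
    """Range-partition a key: partition i holds keys in [split[i-1], split[i])."""
    return bisect.bisect_right(split_points, key)


def output_name(partition, width=5):
    """Hypothetical naming scheme: zero-padded so filename order is key order."""
    return f"part-{partition:0{width}d}"


def parallel_order_by(rows, num_partitions, sample_size=100, seed=0):
    """Step 2 (sort): route rows by range, then sort each range independently."""
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    splits = choose_split_points(sample, num_partitions)
    buckets = {p: [] for p in range(num_partitions)}
    for row in rows:
        buckets[partition_for(row, splits)].append(row)
    # Each bucket stands in for one reducer, which sorts only its own range.
    return {output_name(p): sorted(b) for p, b in buckets.items()}


rng = random.Random(42)
rows = [rng.randrange(1000) for _ in range(500)]
files = parallel_order_by(rows, num_partitions=4)

# Reading the "files" in filename order yields the global order.
merged = [r for name in sorted(files) for r in files[name]]
assert merged == sorted(rows)
```

Because every key in one range is smaller than every key in the next, no final merge step is needed. Concatenating the per-reducer outputs in filename order is enough, which is why the output naming matters.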