[ https://issues.apache.org/jira/browse/HIVE-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717840#comment-13717840 ]
Edward Capriolo commented on HIVE-4660:
---------------------------------------
Thanks for uploading that. I am still getting up to speed a bit, so a silly question:
I am looking through the Tez source code and attempting to understand its
basic optimizations.
I am looking at GroupByOrderByMRRTest.
/**
* Simple example that does a GROUP BY ORDER BY in an MRR job
* Consider a query such as
* Select DeptName, COUNT(*) as cnt FROM EmployeeTable
* GROUP BY DeptName ORDER BY cnt;
I notice that this test essentially runs the job with a single reducer:
job.setNumReduceTasks(1);
/**
* Shuffle ensures ordering based on count of employees per department
* hence the final reducer is a no-op and just emits the department name
* with the employee count per department.
*/
What mechanism makes the above optimization happen? Do all shuffles have a
natural total order sort with Tez?
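To make the question concrete, here is how I currently read the test, written as a plain-Java sketch with no Hadoop or Tez dependencies. The class name, method names and sample data are all made up for illustration, and the TreeMap only stands in for the sort-by-key that a MapReduce-style shuffle performs on each reducer's input. My assumption is that the ordering comes from re-keying on the count and sending everything to one reducer, not from any Tez-specific total-order mechanism.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the MRR job; not code from the Tez tree.
class MrrSketch {

    // Map + first reduce stage: GROUP BY DeptName, COUNT(*).
    static Map<String, Long> countPerDept(List<String> deptOfEachEmployee) {
        Map<String, Long> counts = new TreeMap<>();
        for (String dept : deptOfEachEmployee) {
            counts.merge(dept, 1L, Long::sum);
        }
        return counts;
    }

    static List<String> groupByOrderBy(List<String> deptOfEachEmployee) {
        Map<String, Long> counts = countPerDept(deptOfEachEmployee);

        // Second shuffle: the first reducer emits (cnt, DeptName), so the
        // shuffle key is the count. A single reducer sees all keys in sorted
        // order, which is what yields ORDER BY cnt. The TreeMap models that sort.
        TreeMap<Long, List<String>> shuffled = new TreeMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            shuffled.computeIfAbsent(e.getValue(), k -> new ArrayList<>())
                    .add(e.getKey());
        }

        // Final reduce stage: a no-op that just emits what it receives.
        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, List<String>> e : shuffled.entrySet()) {
            for (String dept : e.getValue()) {
                out.add(dept + "\t" + e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> employees =
                Arrays.asList("eng", "ops", "eng", "hr", "eng", "ops");
        for (String line : groupByOrderBy(employees)) {
            System.out.println(line);
        }
    }
}
```

With the sample input this emits hr, ops, eng with counts 1, 2, 3. If that reading is right, with more than one final reducer each reducer's output would be sorted only within its own partition, which is why I am asking whether Tez adds anything beyond this.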
> Let there be Tez
> ----------------
>
> Key: HIVE-4660
> URL: https://issues.apache.org/jira/browse/HIVE-4660
> Project: Hive
> Issue Type: New Feature
> Reporter: Gunther Hagleitner
> Assignee: Gunther Hagleitner
>
> Tez is a new application framework built on Hadoop Yarn that can execute
> complex directed acyclic graphs of general data processing tasks. Here's the
> project's page: http://incubator.apache.org/projects/tez.html
> The interesting thing about Tez from Hive's perspective is that it will over
> time allow us to overcome inefficiencies in query processing due to having to
> express every algorithm in the map-reduce paradigm.
> The barrier to entry is pretty low as well: Tez can actually run unmodified
> MR jobs. But as a first step we can, without much trouble, start using more of
> Tez's features by taking advantage of the MRR pattern.
> MRR simply means that there can be any number of reduce stages following a
> single map stage - without having to write intermediate results to HDFS and
> re-read them in a new job. This is common when queries require multiple
> shuffles on keys without correlation (e.g., join - group by - window function -
> order by).
> For more details see the design doc here:
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez