[ 
https://issues.apache.org/jira/browse/HIVE-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040986#comment-14040986
 ] 

Gopal V commented on HIVE-5775:
-------------------------------

[~xuefuz]: The CBO model rewrites queries using cardinality statistics.

The tuple count and distinct value count do not depend on which physical layer 
the query runs on - having the CBO split up/reorder a 3-way map-join into 2 
phases (or vertices) should generate identical plans on both MR and Tez.

MR would run 2 Map-only phases with their own local tasks and hashtable 
uploads; Tez would run 2 vertices with their own broadcast tasks.

Tez can reduce runtimes further by removing the intermediate IO cost & 
co-scheduling the second vertex in the same container as the first - but that 
is not assumed, as it is not a strong guarantee in a busy cluster.

The Tez runtime model is faster, but the logical cost does not change, as the 
number of rows read off disk, the number of rows written to disk and the 
number of distinct keys remain the same.
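
As a rough illustration of that point, here is a toy sketch (not the actual 
Hive CBO cost model). The names PhaseStats, logical_cost and plan_cost, the 
row counts and the unweighted sum are all made up for this example. The only 
thing it shows is that a cost built purely from cardinality statistics comes 
out the same whichever engine runs the plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseStats:
    """Hypothetical per-phase statistics (one MR map-only phase or one Tez vertex)."""
    rows_read: int       # rows read off disk by this phase
    rows_written: int    # rows written to disk by this phase
    distinct_keys: int   # distinct join keys seen by this phase


def logical_cost(phases):
    """Toy cost: an unweighted sum of the cardinality statistics of every phase."""
    return sum(p.rows_read + p.rows_written + p.distinct_keys for p in phases)


def plan_cost(phases, engine):
    """Cost of a plan on a given engine; the engine only gets validated, never priced."""
    if engine not in ("mr", "tez"):
        raise ValueError(f"unknown engine: {engine}")
    return logical_cost(phases)


# A 3-way map-join split into 2 phases (or vertices); the numbers are invented.
two_phase_join = [
    PhaseStats(rows_read=1_000_000, rows_written=200_000, distinct_keys=5_000),
    PhaseStats(rows_read=200_000, rows_written=50_000, distinct_keys=1_200),
]

mr_cost = plan_cost(two_phase_join, "mr")
tez_cost = plan_cost(two_phase_join, "tez")
print(mr_cost, tez_cost)  # both are 1456200
```

Anything Tez gains at runtime (no intermediate IO, container reuse) would sit 
outside a function like logical_cost, which is why it does not show up here.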

In fact, as it exists today, the cost model applies equally to both Tez & MR, 
so it ignores a lot of Tez's opportunistic/runtime optimizations like 
container-reuse - e.g. it assumes that "Each vertex in Tez is a different 
process".

It is up to the Tez DAG planner to attend to such runtime optimization details.

> Introduce Cost Based Optimizer to Hive
> --------------------------------------
>
>                 Key: HIVE-5775
>                 URL: https://issues.apache.org/jira/browse/HIVE-5775
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Laljo John Pullokkaran
>            Assignee: Laljo John Pullokkaran
>         Attachments: CBO-2.pdf, HIVE-5775.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)
