[jira] [Commented] (HIVE-5775) Introduce Cost Based Optimizer to Hive
[ https://issues.apache.org/jira/browse/HIVE-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215900#comment-14215900 ]

Lefty Leverenz commented on HIVE-5775:

Thanks [~jpullokkaran], I removed the TODOC14 label on the assumption that no updates are needed at this time.

Introduce Cost Based Optimizer to Hive

Key: HIVE-5775
URL: https://issues.apache.org/jira/browse/HIVE-5775
Project: Hive
Issue Type: New Feature
Components: CBO
Reporter: Laljo John Pullokkaran
Assignee: Laljo John Pullokkaran
Fix For: 0.14.0
Attachments: CBO-2.pdf, HIVE-5775.1.patch

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-5775) Introduce Cost Based Optimizer to Hive
[ https://issues.apache.org/jira/browse/HIVE-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214977#comment-14214977 ]

Laljo John Pullokkaran commented on HIVE-5775:

[leftylev] - Moved DS spec from In Progress to Completed.
[jira] [Commented] (HIVE-5775) Introduce Cost Based Optimizer to Hive
[ https://issues.apache.org/jira/browse/HIVE-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213886#comment-14213886 ]

Lefty Leverenz commented on HIVE-5775:

Doc note: The design doc should be moved from the In Progress section to the Completed section. Does the design doc also need to be updated?

* [Design Docs -- In Progress | https://cwiki.apache.org/confluence/display/Hive/DesignDocs#DesignDocs-InProgress]
* [Cost-based optimization in Hive | https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive]
[jira] [Commented] (HIVE-5775) Introduce Cost Based Optimizer to Hive
[ https://issues.apache.org/jira/browse/HIVE-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040970#comment-14040970 ]

Laljo John Pullokkaran commented on HIVE-5775:

The cost model as described in the doc assumes TEZ as the execution layer.
[jira] [Commented] (HIVE-5775) Introduce Cost Based Optimizer to Hive
[ https://issues.apache.org/jira/browse/HIVE-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040986#comment-14040986 ]

Gopal V commented on HIVE-5775:

[~xuefuz]: The CBO model rewrites queries using cardinality statistics. The tuple count and distinct value count should not affect which physical layer it runs on: having the CBO split up/reorder a 3-way map-join into 2 phases (or vertices) should generate identical plans in both.

MR would run 2 map-only phases with their own local tasks and hashtable uploads; Tez would run 2 vertices with their own broadcast tasks. Tez can reduce runtimes further by removing the intermediate IO cost and co-scheduling the second vertex in the same container as the first, but that is not assumed, as it is not a strong guarantee in a busy cluster.

The Tez runtime model is faster, but the logical cost does not change, as the number of rows read off disk, the number of rows written to disk, and the number of distinct keys remain the same.

In fact, as it exists today, because it applies equally to both Tez and MR, it ignores a lot of Tez's opportunistic/runtime optimizations like container reuse, e.g. it treats each vertex in Tez as a different process. It is up to the Tez DAG planner to attend to such runtime optimization details.
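Editorial illustration of the point above, that a logical cost built only from row counts and distinct-value counts has no execution-engine input. This is a minimal sketch, not Hive's implementation: the estimator is the common textbook equi-join estimate (|L| * |R| / max(NDV_L, NDV_R)), the cost is simply the sum of intermediate row counts, and all function names and statistics are hypothetical.

```python
from itertools import permutations


def estimate_join_rows(left_rows, right_rows, left_ndv, right_ndv):
    """Textbook equi-join estimate (an assumption, not taken from the design doc)."""
    return left_rows * right_rows / max(left_ndv, right_ndv, 1)


def order_cost(fact_rows, fact_ndv, dims, order):
    """Sum of intermediate row counts when joining a fact table to dims in `order`.

    fact_ndv maps dim name -> NDV of the fact table's join key for that dim.
    dims maps dim name -> (row count, NDV of the dim's join key).
    Simplification: fact-side NDVs are not adjusted after each join.
    """
    rows = fact_rows
    total = 0.0
    for name in order:
        dim_rows, dim_ndv = dims[name]
        rows = estimate_join_rows(rows, dim_rows, fact_ndv[name], dim_ndv)
        total += rows
    return total


def best_order(fact_rows, fact_ndv, dims):
    """Pick the cheapest join order; note there is no engine (MR/Tez) parameter."""
    return min(permutations(dims),
               key=lambda order: order_cost(fact_rows, fact_ndv, dims, order))


# Hypothetical statistics: "item" is filtered down to 10 of 1000 keys.
fact_rows = 1_000_000
fact_ndv = {"store": 100, "item": 1000}
dims = {"store": (100, 100), "item": (10, 10)}

print(best_order(fact_rows, fact_ndv, dims))
```

Joining the selective "item" dimension first shrinks the intermediate result early, so that order wins under these assumed statistics whichever engine later executes the plan.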
[jira] [Commented] (HIVE-5775) Introduce Cost Based Optimizer to Hive
[ https://issues.apache.org/jira/browse/HIVE-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041053#comment-14041053 ]

Laljo John Pullokkaran commented on HIVE-5775:

The following may help reduce the confusion:

1. In the design doc, the cost formula is for choosing the Join Algorithm. The cost formula as described in the doc assumes Tez execution.
2. However, current work on CBO doesn't include Join Algorithm selection. Instead, it rearranges joins based on join cardinality and NDV. In other words, join reordering does not depend on the physical execution layer (Tez or MR).
3. When we decide to do Join Algorithm selection, we can fit in cost formulas for both a) MR and b) Tez. This way, based on the physical execution layer, we can select the best Join Algorithm/Order.
4. The cost formula for Join Algorithm selection is not that different between MR and Tez (except for intermediate HDFS writes). So I assume that CBO can support both execution layers rather easily.
5. The CBO framework allows you to plug and play any cost model. There is no hard coupling.
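Editorial illustration of points 3 to 5 above: a cost model chosen by execution engine behind one small interface. This is a hedged sketch only. The class names, the cost formula, and the `hdfs_write_weight` parameter are hypothetical; the single difference between the two models (an extra term for intermediate HDFS writes under MR) follows point 4, not the formulas in the design doc.

```python
from dataclasses import dataclass
from typing import Protocol


class JoinCostModel(Protocol):
    """Anything with a join_cost method can be plugged in (hypothetical interface)."""

    def join_cost(self, rows_read: float, rows_out: float) -> float: ...


@dataclass(frozen=True)
class TezCostModel:
    def join_cost(self, rows_read: float, rows_out: float) -> float:
        # Rows read plus rows produced; no intermediate HDFS write is charged.
        return rows_read + rows_out


@dataclass(frozen=True)
class MRCostModel:
    hdfs_write_weight: float = 1.0  # hypothetical tuning knob

    def join_cost(self, rows_read: float, rows_out: float) -> float:
        # Same base cost, plus writing the intermediate result to HDFS
        # between jobs.
        return rows_read + rows_out + self.hdfs_write_weight * rows_out


# Keyed by execution engine name ("mr" or "tez").
COST_MODELS = {"mr": MRCostModel(), "tez": TezCostModel()}


def cost_model_for(engine: str) -> JoinCostModel:
    """Select the cost model for the configured execution engine."""
    return COST_MODELS[engine]


print(cost_model_for("tez").join_cost(100, 10))
print(cost_model_for("mr").join_cost(100, 10))
```

Because callers depend only on `join_cost`, a new model can be registered in `COST_MODELS` without touching the code that compares plans, which is the "no hard coupling" property described in point 5.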
[jira] [Commented] (HIVE-5775) Introduce Cost Based Optimizer to Hive
[ https://issues.apache.org/jira/browse/HIVE-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041056#comment-14041056 ]

Xuefu Zhang commented on HIVE-5775:

Thanks for the clarification, [~gopalv]. We are in total agreement if what is put in the logical layer is the optimization that's applicable to either execution engine and if execution engine specific optimization is put in the execution layer. Maybe the document can be updated to make this explicit to avoid confusion/misunderstanding from others.

{quote}
The cost model as described in the doc assumes TEZ as the execution layer.
{quote}

Not sure if I understand [~jpullokkaran] correctly. If the cost model is based on Tez, then we shall only use a model that's common for both Tez and MR when rewriting the query, right?
[jira] [Commented] (HIVE-5775) Introduce Cost Based Optimizer to Hive
[ https://issues.apache.org/jira/browse/HIVE-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041085#comment-14041085 ]

Laljo John Pullokkaran commented on HIVE-5775:

The cost model described doesn't apply to the current CBO work or to the proposed branch. It will apply only to Join Algorithm selection, which is not part of the current work.

IMO moving join reordering to the physical optimizer is not the correct solution. I would rather leave it in the logical layer, since after doing join reordering you may be able to do other optimizations, such as new predicate pushdown and transitive inferences.

When we get around to doing Join Algorithm selection, there will be two cost formulas, one for MR and one for Tez. I think the best solution is to support both cost models and decide which one to apply based on the physical execution layer.

I will update the doc.
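Editorial illustration of one follow-on optimization named above, transitive inference: once join equalities are known, a constant filter on one column can be copied to every column that is transitively equal to it. This is a minimal sketch under stated assumptions, not Hive's implementation; the function name and the column-name strings are hypothetical, and conflicting constants on equal columns are not detected.

```python
def infer_transitive_filters(equalities, filters):
    """Propagate `column = constant` filters across join equalities.

    equalities: iterable of (col_a, col_b) pairs from equi-join conditions.
    filters: dict mapping column -> constant it is compared to.
    Returns a dict with the filter copied to every transitively equal column.
    """
    parent = {}

    def find(col):
        # Union-find lookup with path halving.
        parent.setdefault(col, col)
        while parent[col] != col:
            parent[col] = parent[parent[col]]
            col = parent[col]
        return col

    # Merge the equivalence classes of joined columns.
    for a, b in equalities:
        parent[find(a)] = find(b)

    # One constant per equivalence class, then fan it out to all members.
    const_by_root = {find(col): value for col, value in filters.items()}
    return {col: const_by_root[find(col)]
            for col in list(parent)
            if find(col) in const_by_root}


joins = [("a.k", "b.k"), ("b.k", "c.k")]
print(infer_transitive_filters(joins, {"a.k": 5}))
```

With the inferred filters, a predicate such as `c.k = 5` can be pushed down to the scan of `c` even though the query only stated it for `a`.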
[jira] [Commented] (HIVE-5775) Introduce Cost Based Optimizer to Hive
[ https://issues.apache.org/jira/browse/HIVE-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041115#comment-14041115 ]

Xuefu Zhang commented on HIVE-5775:

Cool. Thanks for the clarifications.
[jira] [Commented] (HIVE-5775) Introduce Cost Based Optimizer to Hive
[ https://issues.apache.org/jira/browse/HIVE-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039987#comment-14039987 ]

Xuefu Zhang commented on HIVE-5775:

Thanks to all for working on this. I'm not sure if this has ever surfaced, but I'm wondering if this cost based optimization is specific to Tez. From the design doc it seems that this new optimizer is plugged into the logical layer, while certain cost estimations are based on Tez concepts such as vertices. Obviously the cost for a given query would be different for MapReduce vs Tez, but cost based optimization is equally valuable to both MR and Tez. However, applying an optimization based on one execution engine may cause adverse results when the configured engine is of another type. Therefore, I'd like to know if any thought has been given to this and what the plan is to address it.
[jira] [Commented] (HIVE-5775) Introduce Cost Based Optimizer to Hive
[ https://issues.apache.org/jira/browse/HIVE-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13971936#comment-13971936 ]

Vaibhav Gumashta commented on HIVE-5775:

Hi [~jpullokkaran]; wanted to go through the code - can you please upload to review board? Thanks!
[jira] [Commented] (HIVE-5775) Introduce Cost Based Optimizer to Hive
[ https://issues.apache.org/jira/browse/HIVE-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13971970#comment-13971970 ]

Laljo John Pullokkaran commented on HIVE-5775:

I don't think this should go into trunk yet. I need to remove some of the limitations (outer join, union) before it can go into trunk. A better algorithm for join permutations is also being worked on.
[jira] [Commented] (HIVE-5775) Introduce Cost Based Optimizer to Hive
[ https://issues.apache.org/jira/browse/HIVE-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970241#comment-13970241 ]

Laljo John Pullokkaran commented on HIVE-5775:

First rev of CBO. This is a limited version with the following restrictions:

1. Outer joins are not supported
2. Union is not supported
3. Not all of the UDFs are supported
4. Not all permutations of joins are tried
[jira] [Commented] (HIVE-5775) Introduce Cost Based Optimizer to Hive
[ https://issues.apache.org/jira/browse/HIVE-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970417#comment-13970417 ]

Laljo John Pullokkaran commented on HIVE-5775:

Thanks Julian Hyde, Harish Bhutani for help with CBO V1.
[jira] [Commented] (HIVE-5775) Introduce Cost Based Optimizer to Hive
[ https://issues.apache.org/jira/browse/HIVE-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13816237#comment-13816237 ]

Laljo John Pullokkaran commented on HIVE-5775:

Attached is the first version of the CBO spec.
[jira] [Commented] (HIVE-5775) Introduce Cost Based Optimizer to Hive
[ https://issues.apache.org/jira/browse/HIVE-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13816299#comment-13816299 ]

Brock Noland commented on HIVE-5775:

Hi,

Thanks for the design document! The document should also be uploaded to this location: https://cwiki.apache.org/confluence/display/Hive/DesignDocs

Brock
[jira] [Commented] (HIVE-5775) Introduce Cost Based Optimizer to Hive
[ https://issues.apache.org/jira/browse/HIVE-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13816325#comment-13816325 ]

Laljo John Pullokkaran commented on HIVE-5775:

sure will do.

Thanks
John