[ 
https://issues.apache.org/jira/browse/HIVE-29257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18029429#comment-18029429
 ] 

Stamatis Zampetakis commented on HIVE-29257:
--------------------------------------------

[~thomas.rebele] Since we already have options to display the CBO plan (as text 
and json) via other EXPLAIN commands, I would refrain from adding more 
redundancy in EXPLAIN EXTENDED. By doing this we also avoid too much coupling 
between "logical" and "physical" plans.

> Remove OPTIMIZED SQL entry from EXPLAIN EXTENDED
> ------------------------------------------------
>
>                 Key: HIVE-29257
>                 URL: https://issues.apache.org/jira/browse/HIVE-29257
>             Project: Hive
>          Issue Type: Task
>          Components: CBO
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>
> The {{EXPLAIN EXTENDED}} statement among other things prints a serialization 
> of the CBO plan in the form of SQL (so called OPTIMIZED SQL). This feature 
> was introduced in HIVE-19360 for diagnosability purposes to assist in the 
> understanding of the DAG plan with respect to join ordering.
> However, in most cases when developers want to understand the join order they 
> look into the CBO plan and not in the OPTIMIZED SQL output. Determining the 
> join order from SQL becomes challenging for queries that contain more than a 
> few tables and nested SQL constructs make the situation worse. The CBO 
> RelNode tree is usually easier to navigate.
> In various cases the SQL serialization is an invalid SQL query. This led 
> people to believe that the CBO plan is wrong. As a consequence, they attempt 
> to fix the CBO plan while in fact the plan is perfectly valid. The bug 
> usually lies only in the serialization part but the latter does not have any 
> effect on the execution of the query.
> Users (and sometimes developers) believe that the OPTIMIZED SQL is the query 
> that is actually run by Hive. There have been various cases where people 
> copy-paste the OPTIMIZED SQL entry and try to run it when they have issues 
> with the original query and when that fails as well it adds up to the 
> confusion.
> In addition, since the serializer is buggy it sometimes raises exceptions and 
> the OPTIMIZED SQL does not appear in the output. People will raise bug fixes 
> and invest effort in a feature that is not very useful.
> Currently, there are 282 .q.out files that use {{EXPLAIN EXTENDED}} and thus 
> contain an entry of OPTIMIZED SQL. Even minor changes in the CBO plan do 
> affect the OPTIMIZED SQL necessitating updates and reviews on multiple files. 
> Typically calcite upgrades trigger many changes in OPTIMIZED SQL.
> {noformat}
> $ find . -name "*.q.out" -exec grep -a "OPTIMIZED SQL" {} \; | wc -l
> 820
> $ find . -name "*.q.out" -exec grep -l "OPTIMIZED SQL" {} \; | wc -l
> 282
> {noformat}
> I propose to remove the {{OPTIMIZED SQL}} feature and related code to reduce:
> * maintenance overhead
> * plan/file (e.g., .q.out) size
> * users/dev confusion about its meaning/usage



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to