[ 
https://issues.apache.org/jira/browse/HIVE-29257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18028760#comment-18028760
 ] 

Thomas Rebele commented on HIVE-29257:
--------------------------------------

I agree with the reasoning, printing an invalid SQL query leads to confusion.

Do you think it makes sense to replace OPTIMIZED SQL by something else, e.g., 
OPTIMIZED PLAN, printing the CBO plan? Not sure whether that's helpful, as the 
CBO plan can already be obtained by EXPLAIN CBO.

> Remove OPTIMIZED SQL entry from EXPLAIN EXTENDED
> ------------------------------------------------
>
>                 Key: HIVE-29257
>                 URL: https://issues.apache.org/jira/browse/HIVE-29257
>             Project: Hive
>          Issue Type: Task
>          Components: CBO
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>
> The {{EXPLAIN EXTENDED}} statement among other things prints a serialization 
> of the CBO plan in the form of SQL (so called OPTIMIZED SQL). This feature 
> was introduced in HIVE-19360 for diagnosability purposes to assist in the 
> understanding of the DAG plan with respect to join ordering.
> However, in most cases when developers want to understand the join order they 
> look into the CBO plan and not in the OPTIMIZED SQL output. Determining the 
> join order from SQL becomes challenging for queries that contain more than a 
> few tables and nested SQL constructs make the situation worse. The CBO 
> RelNode tree is usually easier to navigate.
> In various cases the SQL serialization is an invalid SQL query. This led 
> people to believe that the CBO plan is wrong. As a consequence, they attempt 
> to fix the CBO plan while in fact the plan is perfectly valid. The bug 
> usually lies only in the serialization part but the latter does not have any 
> effect on the execution of the query.
> Users (and sometimes developers) believe that the OPTIMIZED SQL is the query 
> that is actually run by Hive. There have been various cases where people 
> copy-paste the OPTIMIZED SQL entry and try to run it when they have issues 
> with the original query and when that fails as well it adds up to the 
> confusion.
> In addition, since the serializer is buggy it sometimes raises exceptions and 
> the OPTIMIZED SQL does not appear in the output. People will raise bug fixes 
> and invest effort in a feature that is not very useful.
> Currently, there are 282 .q.out files that use {{EXPLAIN EXTENDED}} and thus 
> contain an entry of OPTIMIZED SQL. Even minor changes in the CBO plan do 
> affect the OPTIMIZED SQL necessitating updates and reviews on multiple files. 
> Typically calcite upgrades trigger many changes in OPTIMIZED SQL.
> {noformat}
> $ find . -name "*.q.out" -exec grep -a "OPTIMIZED SQL" {} \; | wc -l
> 820
> $ find . -name "*.q.out" -exec grep -l "OPTIMIZED SQL" {} \; | wc -l
> 282
> {noformat}
> I propose to remove the {{OPTIMIZED SQL}} feature and related code to reduce:
> * maintenance overhead
> * plan/file (e.g., .q.out) size
> * users/dev confusion about its meaning/usage



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to