[ https://issues.apache.org/jira/browse/HIVE-24606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
okumin reassigned HIVE-24606: ----------------------------- > Multi-stage materialized CTEs can lost intermediate data > -------------------------------------------------------- > > Key: HIVE-24606 > URL: https://issues.apache.org/jira/browse/HIVE-24606 > Project: Hive > Issue Type: Bug > Components: Query Planning > Affects Versions: 3.1.2, 2.3.7, 4.0.0 > Reporter: okumin > Assignee: okumin > Priority: Major > > With complex multi-stage CTEs, Hive can start a latter stage before its > previous stage finishes. > That's because `SemanticAnalyzer#toRealRootTasks` can fail to resolve > dependency between multistage materialized CTEs when a non-materialized CTE > cuts in. > > [https://github.com/apache/hive/blob/425e1ff7c054f87c4db87e77d004282d529599ae/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L1414] > > For example, when submitting this query, > {code:sql} > SET hive.optimize.cte.materialize.threshold=2; > SET hive.optimize.cte.materialize.full.aggregate.only=false; > WITH x AS ( SELECT 'x' AS id ), -- not materialized > a1 AS ( SELECT 'a1' AS id ), -- materialized by a2 and the root > a2 AS ( SELECT 'a2 <- ' || id AS id FROM a1) -- materialized by the root > SELECT * FROM a1 > UNION ALL > SELECT * FROM x > UNION ALL > SELECT * FROM a2 > UNION ALL > SELECT * FROM a2; > {code} > `toRealRootTask` will traverse the CTEs in order of `a1`, `x`, and `a2`. It > means the dependency between `a1` and `a2` will be ignored and `a2` can start > without waiting for `a1`. As a result, the above query returns the following > result. > {code:java} > +-----+ > | id | > +-----+ > | a1 | > | x | > +-----+ > {code} > For your information, I ran this test with revision = > 425e1ff7c054f87c4db87e77d004282d529599ae. -- This message was sent by Atlassian Jira (v8.3.4#803005)