[ https://issues.apache.org/jira/browse/HIVE-28408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875517#comment-17875517 ]
Stamatis Zampetakis commented on HIVE-28408: -------------------------------------------- To better understand the purpose of this ticket let's use a simpler and more readable example. {code:sql} CREATE TABLE book ( bid int, title string, author struct< aid:int, name:string, addresses: array<struct< street:string, num:int, gcs:struct<latitude:double, longitude: double>>> > ) STORED AS PARQUET; INSERT INTO TABLE book VALUES ( 1, 'Les Miserables', named_struct('aid', 100, 'name', 'Victor-Hugo', 'addresses', array( named_struct('street', 'Avenue Champs-Elysees', 'num', 42, 'gcs', named_struct('latitude', 48.8701431D, 'longitude', 2.3051376D)), named_struct('street', 'Rue de Rivoli', 'num', 8, 'gcs', named_struct('latitude', 48.8554165D, 'longitude', 2.3582763D)) ))); SELECT author.addresses.gcs.latitude FROM book; {code} The query returns the following result. {noformat} [48.8701431,48.8554165] {noformat} Observe that the "addresses" is a complex/struct ARRAY type. The addresses.gcs.latitude expression aims to drill in/navigate/extract specific fields from the ARRAY while keeping the structure intact. This operation is similar to XPath and JSON navigational patterns and is not part of the SQL standard although supported by some DBMS. Currently, when a query contains such expressions CBO fails and we fallback to legacy optimizer. {noformat} org.apache.hadoop.hive.ql.optimizer.calcite.CalciteSemanticException: Unexpected rexnode : org.apache.calcite.rex.RexFieldAccess at org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.createNestedColumnRefExpr(RexNodeExprFactory.java:629) at org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.createNestedColumnRefExpr(RexNodeExprFactory.java:97) at org.apache.hadoop.hive.ql.parse.type.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:903) at org.apache.hadoop.hive.ql.parse.type.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1481) at org.apache.hadoop.hive.ql.lib.CostLessRuleDispatcher.dispatch(CostLessRuleDispatcher.java:66) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89) at org.apache.hadoop.hive.ql.lib.ExpressionWalker.walk(ExpressionWalker.java:101) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120) at org.apache.hadoop.hive.ql.parse.type.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:231) at org.apache.hadoop.hive.ql.parse.type.RexNodeTypeCheck.genExprNode(RexNodeTypeCheck.java:40) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.genAllRexNode(CalcitePlanner.java:5376) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.genRexNode(CalcitePlanner.java:5333) at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.internalGenSelectLogicalPlan(CalcitePlanner.java:4660) at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.genSelectLogicalPlan(CalcitePlanner.java:4418) at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.genLogicalPlan(CalcitePlanner.java:5087) at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1629) at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1572) at org.apache.calcite.tools.Frameworks.lambda$withPlanner$0(Frameworks.java:131) at org.apache.calcite.prepare.CalcitePrepareImpl.perform(CalcitePrepareImpl.java:914) at org.apache.calcite.tools.Frameworks.withPrepare(Frameworks.java:180) at org.apache.calcite.tools.Frameworks.withPlanner(Frameworks.java:126) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.logicalPlan(CalcitePlanner.java:1324) {noformat} Since the physical layer of Hive is able to handle field access over ARRAY types we have to find a CBO (RexNode) expression that allows to express field access over ARRAY expressions. This can be achieved by introducing a new Hive-specific operator (e.g. COMPONENT_ACCESS) that takes an expression of an ARRAY type and alters its type so that we can perform field access as if it was a regular STRUCT. The CBO plan for the query above would look like the following: {noformat} CBO PLAN: HiveProject(latitude=[COMPONENT_ACCESS($2.addresses).gcs.latitude]) HiveTableScan(table=[[default, book]], table:alias=[book]) {noformat} The new operator acts mainly as syntactic sugar so when we translate it back to AST we can treat it as NOOP. A proof of concept using the COMPONENT_ACCESS operator is attached in [^HIVE-28408.patch]. The proposal may need a bit of fine-tuning (documentation, naming, etc.) but as it is it allows all queries in nested_column_pruning.q file to pass through CBO. > Support ARRAY field access in CBO > --------------------------------- > > Key: HIVE-28408 > URL: https://issues.apache.org/jira/browse/HIVE-28408 > Project: Hive > Issue Type: Sub-task > Reporter: Ramesh Kumar Thangarajan > Assignee: Ramesh Kumar Thangarajan > Priority: Major > Attachments: CBO Fallback - Nested column pruning item.docx, > HIVE-28408.patch > > > fname=nested_column_pruning.q > {code:sql} > EXPLAIN > SELECT count(s1.f6), s5.f16.f18.f19 > FROM nested_tbl_1_n1 > GROUP BY s5.f16.f18.f19 > {code} > {noformat} > org.apache.hadoop.hive.ql.optimizer.calcite.CalciteSemanticException: > Unexpected rexnode : org.apache.calcite.rex.RexFieldAccess{noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)