[ https://issues.apache.org/jira/browse/BEAM-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989162#comment-16989162 ]
Kirill Kozlov commented on BEAM-8896: ------------------------------------- After some investigating I have a feeling that this problem has to do with the way leaves (Array in SqlToRelConverter class) are registered in Calcite. Before joins are created left and right paths are pared first. For the 1st query above they are as follows: {code:java} Left: LogicalProject with RecordType(VARBINARY id, VARCHAR fA1, VARCHAR fA1_2) LogicalTableScan with RecordType(VARBINARY id, VARCHAR fA1) Right: LogicalTableScan with RecordType(VARCHAR id, VARCHAR fB1){code} As they are processed - they are registered as leaves (added to the Array). When Join node is being created it knows what the `condition expressions` is: {code:java} =(TO_HEX($0), $3) {code} Since TO_HEX is not computed anywhere - it modifies the left input to be as follows: {code:java} LogicalProject with RecordType(VARBINARY id, VARCHAR fA1, VARCHAR fA1_2, VARCHAR $f3) {code} where `VARCHAR $f3` is a result of TO_HEX. Note that the list of leaves is not updated. Finally, when identifier "query.fA1_2" is being converted (via SqlToRelConverter#convertIdentifier) for the top-most node {code:java} top-most node: LogicalProject with RecordType(VARBINARY id, VARCHAR fA1, VARCHAR fA1_2, VARCHAR id0, VARCHAR fB1) LogicalJoin with RecordType(VARBINARY id, VARCHAR fA1, VARCHAR fA1_2, VARCHAR $f3, VARCHAR id0, VARCHAR fB1) LogicalProject with RecordType(VARBINARY id, VARCHAR fA1, VARCHAR fA1_2, VARCHAR $f3) LogicalTableScan with RecordType(VARBINARY id, VARCHAR fA1) LogicalTableScan with RecordType(VARCHAR id, VARCHAR fB1){code} Blackboard perform a lookup (via SqlToRelConverter#lookupExp), in process of which LookupContext is created. In a constructor, LookupContext performs flatten, which recursively traverses tree of node (from above codeblock) and checks the leaves to see if they contain such expression. When it does get to the modified left input of a join it does not get a match on it and continues further down to a TableScan. When it finally flattens the result, TableScan's RecordType knows nothing about a duplicated field `fA1_2`, casing an error above. I think a viable solution would be to modify Join creation to register a resulting join inputs as leaves (when they are modified ). Alternative approach would be to add an additional Project when a Join needs to modify an input. > WITH query AS + SELECT query JOIN other throws invalid type > ----------------------------------------------------------- > > Key: BEAM-8896 > URL: https://issues.apache.org/jira/browse/BEAM-8896 > Project: Beam > Issue Type: Bug > Components: dsl-sql > Affects Versions: 2.16.0 > Reporter: fdiazgon > Assignee: Andrew Pilloud > Priority: Major > > The first one of the three following queries fails, despite queries being > equivalent: > {code:java} > Pipeline p = Pipeline.create(); > Schema schemaA = > Schema.of( > Schema.Field.of("id", Schema.FieldType.BYTES), > Schema.Field.of("fA1", Schema.FieldType.STRING)); > Schema schemaB = > Schema.of( > Schema.Field.of("id", Schema.FieldType.STRING), > Schema.Field.of("fB1", Schema.FieldType.STRING)); > PCollection<Row> inputA = > > p.apply(Create.of(ImmutableList.<Row>of()).withCoder(SchemaCoder.of(schemaA))); > PCollection<Row> inputB = > > p.apply(Create.of(ImmutableList.<Row>of()).withCoder(SchemaCoder.of(schemaB))); > // Fails > String query1 = > "WITH query AS " > + "( " > + " SELECT id, fA1, fA1 AS fA1_2 " > + " FROM tblA" > + ") " > + "SELECT fA1, fB1, fA1_2 " > + "FROM query " > + "JOIN tblB ON (TO_HEX(query.id) = tblB.id)"; > // Ok > String query2 = > "WITH query AS " > + "( " > + " SELECT fA1, fB1, fA1 AS fA1_2 " > + " FROM tblA " > + " JOIN tblB " > + " ON (TO_HEX(tblA.id) = tblB.id) " > + ")" > + "SELECT fA1, fB1, fA1_2 " > + "FROM query "; > // Ok > String query3 = > "WITH query AS " > + "( " > + " SELECT TO_HEX(id) AS id, fA1, fA1 AS fA1_2 " > + " FROM tblA" > + ") " > + "SELECT fA1, fB1, fA1_2 " > + "FROM query " > + "JOIN tblB ON (query.id = tblB.id)"; > Schema transform3 = > PCollectionTuple.of("tblA", inputA) > .and("tblB", inputB) > .apply(SqlTransform.query(query3)) > .getSchema(); > System.out.println(transform3); > Schema transform2 = > PCollectionTuple.of("tblA", inputA) > .and("tblB", inputB) > .apply(SqlTransform.query(query2)) > .getSchema(); > System.out.println(transform2); > Schema transform1 = > PCollectionTuple.of("tblA", inputA) > .and("tblB", inputB) > .apply(SqlTransform.query(query1)) > .getSchema(); > System.out.println(transform1); > {code} > > The error is: > {noformat} > Exception in thread "main" java.lang.AssertionError: Field ordinal 2 is > invalid for type 'RecordType(VARBINARY id, VARCHAR fA1)'Exception in thread > "main" java.lang.AssertionError: Field ordinal 2 is invalid for type > 'RecordType(VARBINARY id, VARCHAR fA1)' at > org.apache.beam.repackaged.sql.org.apache.calcite.rex.RexBuilder.makeFieldAccess(RexBuilder.java:197){noformat} > > If I change `schemaB.id` to `BYTES` (while also avoid using `TO_HEX`), all > queries work fine. -- This message was sent by Atlassian Jira (v8.3.4#803005)