[ https://issues.apache.org/jira/browse/IMPALA-10839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525776#comment-17525776 ]
ASF subversion and git services commented on IMPALA-10839:
----------------------------------------------------------

Commit c802be42b66cf69a362f7d2c89175932bbe226cf in impala's branch refs/heads/master from Daniel Becker
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=c802be42b ]

IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

When querying a non-toplevel nested struct from an ORC file, the NULL
values are displayed at an incorrect level. E.g.:

select id, outer_struct.inner_struct3 from
  functional_orc_def.complextypes_nested_structs where id >= 4;
+----+----------------------------+
| id | outer_struct.inner_struct3 |
+----+----------------------------+
| 4  | {"s":{"i":null,"s":null}}  |
| 5  | {"s":null}                 |
+----+----------------------------+

However, in the first row it is expected that 's' should be null and not
its members; in the second row the result should be 'NULL', i.e.
'outer_struct.inner_struct3' is null.

For reference see what is returned when querying 'outer_struct' instead
of 'outer_struct.inner_struct3':
+----+-------------------------------------------------------------------------------------------------------------------------------+
| 4  | {"str":"","inner_struct1":{"str":"somestr2","de":12345.12},"inner_struct2":{"i":1,"str":"string"},"inner_struct3":{"s":null}} |
| 5  | {"str":null,"inner_struct1":null,"inner_struct2":null,"inner_struct3":null}                                                   |
+----+-------------------------------------------------------------------------------------------------------------------------------+

The problem comes from the incorrect handling of the different depths of
the following trees:
 - the ORC type hierarchy (schema)
 - the tuple descriptor / slot descriptor hierarchy
as the ORC type hierarchy contains a node for every level in the schema
but the tuple/slot descriptor hierarchy omits the levels of structs that
are not in the select list (but an ancestor of theirs is), as these
structs are not materialised.
In the case of the example query, the two hierarchies are the following:

ORC:
 root --> outer_struct -> inner_struct3 -> s --> i
      |                                      \-> s
      \-> id

Tuple/slot descriptors:
 main_tuple --> inner_struct3 -> s --> i
            |                      \-> s
            \-> id

We create 'OrcColumnReader's for each node in the ORC type tree. Each
OrcColumnReader is assigned an ORC type node and a slot descriptor. The
incorrect behaviour comes from the incorrect pairing of ORC type nodes
with slot descriptors.

The old behaviour is described below:
Starting from the root, going along a path in both trees (for example
the path leading to outer_struct.inner_struct3.s.i), for each step we
consume a level in both trees until no more nodes remain in the
tuple/slot desc tree, and then we pair the last element from that tree
with the remaining ORC type node(s).

In the example, we get the following pairs:
(root, main_tuple) -> (outer_struct, inner_struct3) ->
(inner_struct3, s) -> (s, i) -> (i, i)

When we run out of structs in the tuple/slot desc tree, we still create
OrcStructReaders (because the ORC type is still a struct, but the slot
descriptor now refers to an INT), but we mark them incorrectly as
non-materialised.

Also, the OrcStructReaders for non-materialised structs do not need to
check for null-ness as they are not present in the select list, only
their descendants, and the ORC batch object stores null information also
for the descendants of null values.

Let's look at the row with id 4 in the example:
Because of the bug, the non-materialising OrcStructReader appears at the
level of the (s, i) pair, so the 's' struct is not checked for
null-ness, although it is actually null. One level lower, for 'i' (and
the inner 's' string field), the ORC batch object tells us that the
values are null (because their parent is). Therefore the nulls appear
one level lower than they should.
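
The pairing walk described above can be illustrated with a small sketch. This is a toy model in Python, not Impala's C++ code: the `Node` type, the path tuples and the two function names are hypothetical and exist only for illustration. `pair_old` mimics the level-by-level pairing of the old behaviour; `pair_fixed` is one possible reading of the corrected rule that the commit message states next (pair an ORC type node with a descriptor that refers to the same schema node, otherwise keep the current descriptor because the next one is a descendant).

```python
from collections import namedtuple

# Hypothetical model, not Impala's data structures: each node carries the
# schema path it refers to, so "same schema node" is a path comparison.
Node = namedtuple("Node", ["name", "path"])

# One root-to-leaf path in the ORC type tree (a node for every schema level).
ORC_PATH = [
    Node("root", ()),
    Node("outer_struct", ("outer_struct",)),
    Node("inner_struct3", ("outer_struct", "inner_struct3")),
    Node("s", ("outer_struct", "inner_struct3", "s")),
    Node("i", ("outer_struct", "inner_struct3", "s", "i")),
]

# The same path in the tuple/slot descriptor tree. outer_struct is absent
# because it is not in the select list and therefore not materialised.
DESC_PATH = [
    Node("main_tuple", ()),
    Node("inner_struct3", ("outer_struct", "inner_struct3")),
    Node("s", ("outer_struct", "inner_struct3", "s")),
    Node("i", ("outer_struct", "inner_struct3", "s", "i")),
]


def pair_old(orc_path, desc_path):
    """Consume one level of both trees per step and reuse the last
    descriptor once the descriptor path is exhausted."""
    last = len(desc_path) - 1
    return [(orc.name, desc_path[min(depth, last)].name)
            for depth, orc in enumerate(orc_path)]


def pair_fixed(orc_path, desc_path):
    """Advance to the next descriptor only when it refers to the same
    schema node as the current ORC node. Otherwise the next descriptor is
    a descendant of the ORC node, so the current descriptor is kept."""
    pairs = []
    cur = 0
    for orc in orc_path:
        nxt = cur + 1
        if nxt < len(desc_path) and desc_path[nxt].path == orc.path:
            cur = nxt
        pairs.append((orc.name, desc_path[cur].name))
    return pairs


print(pair_old(ORC_PATH, DESC_PATH))
print(pair_fixed(ORC_PATH, DESC_PATH))
```

In this toy model `pair_old` yields the mismatched sequence ending in (s, i) -> (i, i), while `pair_fixed` keeps main_tuple for outer_struct and then pairs inner_struct3, s and i with their own descriptors.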
The correct behaviour is that ORC type nodes are paired with slot
descriptors if either
 - the ORC type node matches the slot descriptor (they refer to the same
   node in the schema) or
 - the slot descriptor is a descendant of the schema node that the ORC
   type node refers to.

This patch fixes the incorrect pairing of ORC types and slot
descriptors, so we have the following pairs:
(root, main_tuple) -> (outer_struct, main_tuple) ->
(inner_struct3, inner_struct3) -> (s, s) -> (i, i)

In this case the OrcStructReader for the pair (outer_struct, main_tuple)
becomes non-materialising and the one for (s, s) will be materialising,
so the 's' struct will also be null-checked, recognising null-ness at
the correct level.

This commit also fixes some comments in be/src/exec/orc-column-readers.h
and be/src/exec/hdfs-orc-scanner.h mentioning the field
HdfsOrcScanner::col_id_path_map_, which has been removed by
"IMPALA-10485: part(1): make ORC column reader creation independent of
schema resolution".

Testing:
 - added tests to
   testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test
   that query various levels of the struct 'outer_struct' to check that
   NULLs are at the correct level.

Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Reviewed-on: http://gerrit.cloudera.org:8080/18403
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>

> NULL values are displayed on a wrong level for nested structs (ORC)
> -------------------------------------------------------------------
>
>                 Key: IMPALA-10839
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10839
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Gabor Kaszab
>            Assignee: Daniel Becker
>            Priority: Major
>              Labels: ORC, complextype, correctness, nested_types, scanner
>
> When querying a non-toplevel nested struct then the NULL values are displayed
> in an incorrect level.
> E.g.:
> {code:java}
> select id, outer_struct.inner_struct3 from
> functional_orc_def.complextypes_nested_structs where id >= 4;
> {code}
> {code:java}
> +----+----------------------------+
> | id | outer_struct.inner_struct3 |
> +----+----------------------------+
> | 4  | {"s":{"i":null,"s":null}}  |
> | 5  | {"s":null}                 |
> +----+----------------------------+
> {code}
> However, in the first row the expected result is that 's' is null, not
> its members, and in the second row the result should be 'NULL'.
> For reference see what is returned when querying 'outer_struct' instead of
> 'outer_struct.inner_struct3':
> {code:java}
> +----+-------------------------------------------------------------------------------------------------------------------------------+
> | 4  | {"str":"","inner_struct1":{"str":"somestr2","de":12345.12},"inner_struct2":{"i":1,"str":"string"},"inner_struct3":{"s":null}} |
> | 5  | {"str":null,"inner_struct1":null,"inner_struct2":null,"inner_struct3":null}                                                   |
> +----+-------------------------------------------------------------------------------------------------------------------------------+
> {code}
> Note, this issue is with the ORC format.
> After some digging I found that these incorrect null values are already
> present in the ORC scanner, where OrcStructReader reads the rows in the
> ReadValue() and ReadValueBatch() functions.
> As a first step it would be nice to verify that the external ORC reader we
> use for reading the actual values from the files gives correct results.