Hello Quanlong Huang, Qifan Chen, Daniel Becker, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17638

to look at the new patch set (#13).

Change subject: IMPALA-9495: Support struct in select list for ORC tables
......................................................................

IMPALA-9495: Support struct in select list for ORC tables

This patch implements the functionality to allow structs in the select
list of inline views and topmost blocks. When displaying the value of a
struct it is formatted into a JSON value and returned as a string. An
example of such a value:

SELECT struct_col FROM some_table;
'{"int_struct_member":12,"string_struct_member":"string value"}'

Another example where we query a nested struct:

SELECT outer_struct_col FROM some_table;
'{"inner_struct":{"string_member":"string value","int_member":12}}'

Note, the conversion from struct to JSON happens on the server side
before sending out the value in HS2 to the client. However, HS2 is
capable of handling struct values as well, so in a later change we might
want to add functionality to send the struct in thrift to the client
so that the client can use the struct directly.

-- Internal representation of a struct:
When scanning a struct the rowbatch will hold the values of the
struct's children as if they were queried one by one directly in the
select list.

E.g. Taking the following table:
CREATE TABLE tbl (id int, s struct<a:int,b:string>) STORED AS ORC

And running the following query:
SELECT id, s FROM tbl;

After scanning, a row in a row batch will hold the following values:
(note the biggest size comes first)
 1: The pointer for the string in s.b
 2: The length for the string in s.b
 3: The int value for s.a
 4: The int value of id
 5: A single null byte for all the slots: id, s, s.a, s.b

The size of a struct has an effect on the order of the memory layout of
a row batch.
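
As a rough illustration of the ordering in the example above, the
following Python sketch places slots largest-first, treats a struct's
size as the sum of its fields, and keeps the struct's fields together.
It is not taken from the patch: the Slot class, the layout() helper and
the byte sizes (4 for an int, 12 for a string made of an 8-byte pointer
plus a 4-byte length) are assumptions made for illustration only, and
the null indicator byte is left out.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed slot sizes in bytes (illustrative only): a 4-byte int, and a
# string stored as an 8-byte pointer followed by a 4-byte length.
PRIMITIVE_SIZES = {"int": 4, "string": 12}


@dataclass
class Slot:
    """Hypothetical stand-in for a slot; not Impala's SlotDescriptor."""
    name: str
    type_name: str = "struct"
    children: List["Slot"] = field(default_factory=list)

    def size(self) -> int:
        # A struct's size is the sum of the sizes of its fields.
        if self.children:
            return sum(child.size() for child in self.children)
        return PRIMITIVE_SIZES[self.type_name]


def layout(slots: List[Slot], offset: int = 0) -> List[Tuple[str, int, int]]:
    """Return (name, offset, size) for every primitive slot, largest first."""
    placed = []
    # sorted() is stable, so equally sized slots keep their declared order.
    for slot in sorted(slots, key=lambda s: s.size(), reverse=True):
        if slot.children:
            # Struct fields stay consecutive and are ordered by size too.
            placed.extend(layout(slot.children, offset))
        else:
            placed.append((slot.name, offset, slot.size()))
        offset += slot.size()
    return placed


# The table from the example: tbl (id int, s struct<a:int,b:string>).
tbl = [
    Slot("id", "int"),
    Slot("s", children=[Slot("s.a", "int"), Slot("s.b", "string")]),
]

for name, off, size in layout(tbl):
    print(f"{name}: offset {off}, {size} bytes")
# s.b: offset 0, 12 bytes
# s.a: offset 12, 4 bytes
# id: offset 16, 4 bytes
```

Under these assumed sizes the struct s (16 bytes in total) is placed
before id (4 bytes), and inside s the string b comes before the int a,
which matches the order listed in the example.
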
The struct size is calculated by summing the sizes of its fields, and
the struct is then placed in the row batch ahead of all slots that are
smaller than it. Note, all the fields of a struct are consecutive to
each other in the row batch. Inside a struct the order of the fields is
also based on their size, as in the regular case for primitives.

When evaluating a struct as a SlotRef a newly introduced StructVal will
be used to refer to the actual values of a struct in the row batch.
This StructVal holds a vector of pointers where each pointer represents
a member of the struct. Following the above example the StructVal would
keep two pointers, one to point to an IntVal and one to point to a
StringVal.

-- Changes related to tuple and slot descriptors:
When providing a struct in the select list there is going to be a
SlotDescriptor for the struct slot in the topmost TupleDescriptor.
Additionally, another TupleDescriptor is created to hold
SlotDescriptors for each of the struct's children. The struct
SlotDescriptor points to the newly introduced TupleDescriptor using
'itemTupleId'.
The offsets for the children of the struct are calculated from the
beginning of the topmost TupleDescriptor and not from the
TupleDescriptor that directly holds the struct's children. The null
indicator bytes are also stored on the level of the topmost
TupleDescriptor.

-- Changes related to scalar expressions:
A struct in the select list is translated into an expression tree where
the top of this tree is a SlotRef for the struct itself and its
children in the tree are SlotRefs for the members of the struct. When
evaluating a struct SlotRef, after the null checks the evaluation is
delegated to the children SlotRefs.

-- Restrictions:
  - Codegen support is not included in this patch.
  - Only ORC file format is supported by this patch.
  - Only HS2 client supports returning structs. Beeswax support is not
    implemented as it is going to be deprecated anyway. Currently we
    receive an error when trying to query a struct through Beeswax.

-- Tests added:
  - The ORC and Parquet functional database is extended with 2 new
    tables: A table with one level structs, holding different kinds of
    primitive types as members and another table with 2 and 3 level
    nested structs.
  - struct-in-select-list.test and nested-struct-in-select-list.test
    use these new tables to query structs directly or through an
    inline view.

Change-Id: I0fbe56bdcd372b72e99c0195d87a818e7fa4bc3a
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exprs/anyval-util.cc
M be/src/exprs/expr-value.h
M be/src/exprs/scalar-expr-evaluator.cc
M be/src/exprs/scalar-expr-evaluator.h
M be/src/exprs/scalar-expr.cc
M be/src/exprs/scalar-expr.h
M be/src/exprs/scalar-expr.inline.h
M be/src/exprs/slot-ref.cc
M be/src/exprs/slot-ref.h
M be/src/runtime/buffered-tuple-stream-test.cc
M be/src/runtime/buffered-tuple-stream.cc
M be/src/runtime/buffered-tuple-stream.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/raw-value.cc
M be/src/runtime/raw-value.h
M be/src/runtime/row-batch-serialize-test.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/query-result-set.cc
M be/src/udf/udf-internal.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M be/src/util/debug-util.cc
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/analysis/SlotDescriptor.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/analysis/SortInfo.java
M fe/src/main/java/org/apache/impala/analysis/Subquery.java
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/catalog/StructType.java
M fe/src/main/java/org/apache/impala/common/TreeNode.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeUpsertStmtTest.java
A testdata/ComplexTypesTbl/structs.orc
A testdata/ComplexTypesTbl/structs.parq
A testdata/ComplexTypesTbl/structs_nested.orc
A testdata/ComplexTypesTbl/structs_nested.parq
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/compute-stats-with-structs.test
A testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test
A testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking_struct_in_select_list.test
A testdata/workloads/functional-query/queries/QueryTest/struct-in-select-list.test
M tests/authorization/test_ranger.py
M tests/common/test_dimensions.py
M tests/query_test/test_nested_types.py
63 files changed, 2,080 insertions(+), 351 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/38/17638/13
--
To view, visit http://gerrit.cloudera.org:8080/17638
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0fbe56bdcd372b72e99c0195d87a818e7fa4bc3a
Gerrit-Change-Number: 17638
Gerrit-PatchSet: 13
Gerrit-Owner: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>