[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 15: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/2153/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 15 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 18 Feb 2019 14:48:50 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Hello Zoltan Borok-Nagy, Tim Armstrong, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12168 to look at the new patch set (#15). Change subject: IMPALA-6503: Support reading complex types from ORC format files .. IMPALA-6503: Support reading complex types from ORC format files We've supported reading primitive types from ORC files (IMPALA-5717). In this patch we add support for complex types (struct/array/map). In IMPALA-5717, we leverage the ORC lib to parse ORC binaries (data in io buffer read from DiskIoMgr). The ORC lib can materialize ORC column binaries into its representation (orc::ColumnVectorBatch). Then we transform values in orc::ColumnVectorBatch into impala::Tuples in hdfs-orc-scanner. We don't need to do anything about decoding/decompression since they are handled by the ORC lib. Fortunately, the ORC lib already supports complex types, we can still leverage it to support complex types. What we need to add in IMPALA-6503 are two things: 1. Specify which nested columns we need in the form required by the ORC lib (Get list of ORC type ids from tuple descriptors) 2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into Impala's representation (Slots/Tuples/RowBatches) To format the materialization, we implement several ORC column readers in hdfs-orc-scanner. Each kind of reader treats a column type and transforms outputs of the ORC lib into tuple/slot values. Tests: * Enable existing tests for complex types (test_nested_types.py, test_tpch_nested_queries.py) for ORC. Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 --- M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/hdfs-orc-scanner.h A be/src/exec/orc-column-readers.cc A be/src/exec/orc-column-readers.h A be/src/exec/orc-metadata-utils.cc A be/src/exec/orc-metadata-utils.h M be/src/runtime/descriptors.cc M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java A testdata/ComplexTypesTbl/README A testdata/ComplexTypesTbl/nonnullable.orc A testdata/ComplexTypesTbl/nullable.orc M testdata/bin/create-load-data.sh M testdata/bin/load_nested.py M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test M testdata/workloads/tpch_nested/tpch_nested_core.csv M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv M tests/query_test/test_nested_types.py M tests/query_test/test_tpch_nested_queries.py 29 files changed, 1,864 insertions(+), 461 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/15 -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 15 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 14: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 14 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 13 Feb 2019 05:06:49 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 14: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3768/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 14 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 13 Feb 2019 01:00:38 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 13: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/2091/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 13 Feb 2019 00:33:11 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 13: (15 comments) Thanks for your new feedbacks! Let's keep reviewing it so things could be simpler:) http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@363 PS9, Line 363: // We only deal with collection columns (ARRAY/MAP) and primitive columns here. > Yep that's the rule - use braces if the whole thing doesn't fit on a single Sure http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@198 PS12, Line 198: niqu > nit:std:: not needed since we import it into the namespace with common/name Done http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@316 PS12, Line 316: VLOG(3) << "Add ORC column " << node->getMaximumColumnId() << " for empty tuple " > Can we remove these VLOG_FILE statements here and below? It could get quite Yes, they're mainly used for debugging. VLOG(3) is equal to VLOG_ROW but this function will only be called once for an orc-scanner, not for each row or RowBatch. Isn't it a VLOG(2) level info? However, I still change it to VLOG(3) to respect your opinions :) http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@380 PS12, Line 380: const list& selected_type_ids) { > Can this argument be const? Yes. Should be const. http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@593 PS12, Line 593: > How do we know that we're finished processing the previous batch? What happ If we come here, the previous batch should have been drained. The relative codes are in GetNextInternal: // Transfer remaining values in current orc batches tracked by readers. if (!orc_root_reader_->EndOfBatch()) { assemble_rows_timer_.Start(); RETURN_IF_ERROR(TransferTuples(orc_root_reader_, row_batch)); assemble_rows_timer_.Stop(); if (row_batch->AtCapacity()) return Status::OK(); DCHECK(orc_root_reader_->EndOfBatch()); } If we exited the while() loop because row_batch() was at capacity, the remaining rows will be processed in the above codes. Since the batch size equals to the capacity of RowBatch, the remaining rows should no more than the capacity of RowBatch. Thus calling TransferTuples once can drain the original batch. http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@638 PS12, Line 638: > typo: materialization Done http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@638 PS12, Line 638: alizeTuple()) { > grammar: "that are not materializing tuples" Done http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@688 PS12, Line 688: > Pass in complex_col_reader by pointer since it's mutable. It can be const. We don't change any states of it in this function. I just add 'const' for it and add const for functions we called here. http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc File be/src/exec/orc-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc@133 PS12, Line 133: } > VLOG_FILE is also probably too verbose here - VLOG(3) would be more appropr Done http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc@219 PS12, Line 219: materiali > same comment about logging Done http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/runtime/descriptors.cc File be/src/runtime/descriptors.cc: http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/runtime/descriptors.cc@65 PS12, Line 65: const int SchemaPathConstants::ARRAY_ITEM; > I think it's better to define the constant value in the header and just ins Done http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/create-load-data.sh File testdata/bin/create-load-data.sh: http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/create-load-data.sh@581 PS12, Line 581: run-step "Loading nested parquet data" load-nested.log \ : ${IMPALA_HOME}/testdata/bin/load_nested.py \ : -t tpch_nested_parquet -f parquet/none ${LOAD_NESTED_ARGS:-} : run-step "Loading nested orc data" load-nested.log \ : ${IMPALA_HOME}/testdata/bin/load_nested.py \ > I think it would make sense to specify the arguments for the call to run th Sure. It's clearer. http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/load_nested.py File testdata/bin/load_nested.py: http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/load_nested.py@286 PS12, Line 286: hive.execute(""" :
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Hello Zoltan Borok-Nagy, Tim Armstrong, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12168 to look at the new patch set (#13). Change subject: IMPALA-6503: Support reading complex types from ORC format files .. IMPALA-6503: Support reading complex types from ORC format files We've supported reading primitive types from ORC files (IMPALA-5717). In this patch we add support for complex types (struct/array/map). In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC lib can materialize ORC column binaries into its representation (orc::ColumnVectorBatch), so we don't need to do anything about decoding/decompression in hdfs-orc-scanner. Since it already supports complex types, we'll still depend on it. What we need to add in IMPALA-6503 are two things: 1. Specify which nested columns we need to the ORC lib 2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into Impala's representation To format the materialization, we implement several ORC column readers in hdfs-orc-scanner. Each kind of reader treats a column type. The ORC column readers differ from the Parquet readers (used in hdfs-parquet-scanner) which materializes Parquet column binaries into tuple values directly. They just need to transform outputs of the ORC lib into tuple/slot values. Tests: * Enable existing tests for complex types (test_nested_types.py, test_tpch_nested_queries.py) for ORC. Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 --- M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/hdfs-orc-scanner.h A be/src/exec/orc-column-readers.cc A be/src/exec/orc-column-readers.h A be/src/exec/orc-metadata-utils.cc A be/src/exec/orc-metadata-utils.h M be/src/runtime/descriptors.cc M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java A testdata/ComplexTypesTbl/README A testdata/ComplexTypesTbl/nonnullable.orc A testdata/ComplexTypesTbl/nullable.orc M testdata/bin/create-load-data.sh M testdata/bin/load_nested.py M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test M testdata/workloads/tpch_nested/tpch_nested_core.csv M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv M tests/query_test/test_nested_types.py M tests/query_test/test_tpch_nested_queries.py 29 files changed, 1,864 insertions(+), 461 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/13 -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 12: (12 comments) It's a big change so it's taking a while to get my head around! http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@363 PS9, Line 363: // We only deal with collection columns (ARRAY/MAP) and primitive columns here. > Tim will correct me if I'm wrong, but I guess he was also thinking about th Yep that's the rule - use braces if the whole thing doesn't fit on a single line. http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@198 PS12, Line 198: td:: nit:std:: not needed since we import it into the namespace with common/names.h http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@316 PS12, Line 316: VLOG_FILE << "Add ORC column " << node->getMaximumColumnId() << " for empty tuple " Can we remove these VLOG_FILE statements here and below? It could get quite noisy. I think we could either remove them or make them VLOG(3) since they're mainly for debugging I think? http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@380 PS12, Line 380: list& selected_type_ids) { Can this argument be const? http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@593 PS12, Line 593: // We're going to free the previous batch. Clear the reference first. How do we know that we're finished processing the previous batch? What happens if we exited the while() loop because row_batch() was at capacity? Don't we lose some rows? http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@638 PS12, Line 638: hat not materializing tuples grammar: "that are not materializing tuples" http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@638 PS12, Line 638: materializtion typo: materialization http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@688 PS12, Line 688: OrcComplexColumnReader& complex_col_reade Pass in complex_col_reader by pointer since it's mutable. http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc File be/src/exec/orc-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc@133 PS12, Line 133: VLOG_FILE << "Created reader for " << PrintNode(orc_type) << ": slot_desc_=" VLOG_FILE is also probably too verbose here - VLOG(3) would be more appropriate. http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc@170 PS12, Line 170: TryAllocate This is minor, but TryAllocateUnaligned() is better for strings - it is marginally more efficient and avoids wasting space. http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc@219 PS12, Line 219: VLOG_FILE same comment about logging http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/runtime/descriptors.cc File be/src/runtime/descriptors.cc: http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/runtime/descriptors.cc@65 PS12, Line 65: const int SchemaPathConstants::ARRAY_ITEM = 0; I think it's better to define the constant value in the header and just instantiate it here, i.e. revert the change in the header and make these like: const int SchemaPathConstants::ARRAY_ITEM; That just means the constant value can be inlined in callers. -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 12 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 12 Feb 2019 00:08:33 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 12: (4 comments) A couple comments about the data loading and tests. http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/create-load-data.sh File testdata/bin/create-load-data.sh: http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/create-load-data.sh@581 PS12, Line 581: run-step "Loading nested parquet data" load-nested.log \ : ${IMPALA_HOME}/testdata/bin/load_nested.py ${LOAD_NESTED_ARGS:-} : run-step "Loading nested orc data" load-nested.log \ : ${IMPALA_HOME}/testdata/bin/load_nested.py \ : -t tpch_nested_orc_def -f orc/def ${LOAD_NESTED_ARGS:-} I think it would make sense to specify the arguments for the call to run the load_nested.py for parquet. i.e. run-step "Loading nested parquet data" load-nested.log \ ${IMPALA_HOME}/testdata/bin/load_nested.py \ -t tpch_nested_parquet -f parquet/none ${LOAD_NESTED_ARGS:-} http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/load_nested.py File testdata/bin/load_nested.py: http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/load_nested.py@286 PS12, Line 286: # For ORC format, we create the 'part' table by Hive : if file_format == "orc": : hive.execute(""" : CREATE TABLE part : STORED AS ORC : AS SELECT * FROM {source_db}.part""".format(**sql_params)) I would prefer this to be right next to the create statement for parquet, even if you need to open a separate hive cursor. http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/load_nested.py@340 PS12, Line 340: file_format, compression_value = args.table_format.split("/") : if file_format == "parquet" and compression_value == "none": : compression_key = "parquet.compression" : compression_value = "SNAPPY" : elif file_format == "orc" and compression_value == "def": : compression_key = "orc.compress" : compression_value = "ZLIB" Please throw an error if the file_format/compression_value is not one of the supported combinations. i.e. add an else and raise an error. http://gerrit.cloudera.org:8080/#/c/12168/12/tests/query_test/test_nested_types.py File tests/query_test/test_nested_types.py: http://gerrit.cloudera.org:8080/#/c/12168/12/tests/query_test/test_nested_types.py@611 PS12, Line 611: if file_format == 'parquet': : self.__create_parquet_tables(unique_database, True) : elif file_format == 'orc': : # Creating ORC tables from ORC files (IMPALA-8046) has not been supported. : # We create the Parquet tables first and then transform them into ORC tables. : self.__create_parquet_tables(unique_database, False) : self.__create_orc_tables(unique_database) It might be cleaner to have __create_orc_table() call __create_parquet_tables(), so that this code becomes: if file_format == 'parquet': self.__create_parquet_tables(unique_database) elif file_format == 'orc': self.__create_orc_tables(unique_database) The comment here about "Creating ORC tables from..." becomes part of __create_orc_tables(), and the as_target parameter is only needed from __create_orc_tables(). -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 12 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 06 Feb 2019 18:44:58 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 12: Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/3726/ -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 12 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 06 Feb 2019 04:35:25 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 12: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3726/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 12 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 06 Feb 2019 00:37:47 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 12: The jenkins job still checkout the cdh5-trunk branch of Impala-lzo, which cause the build failures. -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 12 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 04 Feb 2019 01:31:15 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 12: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/3702/ -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 12 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 04 Feb 2019 00:55:08 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 12: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3702/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 12 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Sun, 03 Feb 2019 23:50:49 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 11: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/1974/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 11 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Sun, 03 Feb 2019 13:50:49 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Hello Zoltan Borok-Nagy, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12168 to look at the new patch set (#11). Change subject: IMPALA-6503: Support reading complex types from ORC format files .. IMPALA-6503: Support reading complex types from ORC format files We've supported reading primitive types from ORC files (IMPALA-5717). In this patch we add support for complex types (struct/array/map). In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC lib can materialize ORC column binaries into its representation (orc::ColumnVectorBatch), so we don't need to do anything about decoding/decompression in hdfs-orc-scanner. Since it already supports complex types, we'll still depend on it. What we need to add in IMPALA-6503 are two things: 1. Specify which nested columns we need to the ORC lib 2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into Impala's representation To format the materialization, we implement several ORC column readers in hdfs-orc-scanner. Each kind of reader treats a column type. The ORC column readers differ from the Parquet readers (used in hdfs-parquet-scanner) which materializes Parquet column binaries into tuple values directly. They just need to transform outputs of the ORC lib into tuple/slot values. Tests: * Enable existing tests for complex types (test_nested_types.py, test_tpch_nested_queries.py) for ORC. Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 --- M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/hdfs-orc-scanner.h A be/src/exec/orc-column-readers.cc A be/src/exec/orc-column-readers.h A be/src/exec/orc-metadata-utils.cc A be/src/exec/orc-metadata-utils.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java A testdata/ComplexTypesTbl/README A testdata/ComplexTypesTbl/nonnullable.orc A testdata/ComplexTypesTbl/nullable.orc M testdata/bin/create-load-data.sh M testdata/bin/load_nested.py M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test M testdata/workloads/tpch_nested/tpch_nested_core.csv M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv M tests/query_test/test_nested_types.py M tests/query_test/test_tpch_nested_queries.py 30 files changed, 1,828 insertions(+), 462 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/11 -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 11 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 10: (1 comment) > Patch Set 9: > > (2 comments) http://gerrit.cloudera.org:8080/#/c/12168/10/testdata/bin/create-load-data.sh File testdata/bin/create-load-data.sh: http://gerrit.cloudera.org:8080/#/c/12168/10/testdata/bin/create-load-data.sh@34 PS10, Line 34: set -x > I guess you left it here accidentally This is introduced by a recent commit: 236b9194d345bbfdfb177dd7ef4908170eaff259 I just rebase and it appears :) -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 10 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 28 Jan 2019 22:54:25 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 9: (2 comments) http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@363 PS9, Line 363: for (uint64_t id : selected_type_ids) > I'm quite confused about the definition of "multi-line". The if statement j Tim will correct me if I'm wrong, but I guess he was also thinking about the braces of the for-loop. I should've pasted the url of Impala's style guide since we diverged a little from the Google style guide: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65868536 My experience is that we don't like any control statements without braces, unless the whole statement fits into a single line. http://gerrit.cloudera.org:8080/#/c/12168/10/testdata/bin/create-load-data.sh File testdata/bin/create-load-data.sh: http://gerrit.cloudera.org:8080/#/c/12168/10/testdata/bin/create-load-data.sh@34 PS10, Line 34: set -x I guess you left it here accidentally -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 9 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 28 Jan 2019 16:46:20 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 10: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/1904/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 10 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 28 Jan 2019 06:03:01 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Hello Zoltan Borok-Nagy, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12168 to look at the new patch set (#10). Change subject: IMPALA-6503: Support reading complex types from ORC format files .. IMPALA-6503: Support reading complex types from ORC format files We've supported reading primitive types from ORC files (IMPALA-5717). In this patch we add support for complex types (struct/array/map). In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC lib can materialize ORC column binaries into its representation (orc::ColumnVectorBatch), so we don't need to do anything about decoding/decompression in hdfs-orc-scanner. Since it already supports complex types, we'll still depend on it. What we need to add in IMPALA-6503 are two things: 1. Specify which nested columns we need to the ORC lib 2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into Impala's representation To format the materialization, we implement several ORC column readers in hdfs-orc-scanner. Each kind of reader treats a column type. The ORC column readers differ from the Parquet readers (used in hdfs-parquet-scanner) which materializes Parquet column binaries into tuple values directly. They just need to transform outputs of the ORC lib into tuple/slot values. Tests: * Enable existing tests for complex types (test_nested_types.py, test_tpch_nested_queries.py) for ORC. Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 --- M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/hdfs-orc-scanner.h A be/src/exec/orc-column-readers.cc A be/src/exec/orc-column-readers.h A be/src/exec/orc-metadata-utils.cc A be/src/exec/orc-metadata-utils.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java A testdata/ComplexTypesTbl/README A testdata/ComplexTypesTbl/nonnullable.orc A testdata/ComplexTypesTbl/nullable.orc M testdata/bin/create-load-data.sh M testdata/bin/load_nested.py M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test M testdata/workloads/tpch_nested/tpch_nested_core.csv M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv M tests/query_test/test_nested_types.py M tests/query_test/test_tpch_nested_queries.py 30 files changed, 1,789 insertions(+), 454 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/10 -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 10 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 9: (22 comments) Thanks for your feedback! I've added more comments and refactor some codes as required. http://gerrit.cloudera.org:8080/#/c/12168/8//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12168/8//COMMIT_MSG@25 PS8, Line 25: Don’t like > nit: 'They are not like', or 'They differ from' Done http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.h File be/src/exec/hdfs-orc-scanner.h: http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.h@175 PS9, Line 175: std::list selected_type_ids_; > Mention that RowReaderOptions.includeTypes() expects a list (otherwise it's Yeah, I have no ideas about this too. http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.h File be/src/exec/hdfs-orc-scanner.h: http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.h@160 PS8, Line 160: orc::RowReaderOptions row_reader_options; > nit: put '_' at the end of the variable name. Oops! Missed this in the previous patch. Done. http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@181 PS8, Line 181: base > nit: based Done http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@329 PS8, Line 329: *template_tuple = : Tuple::Create(tuple_desc.byte_size(), template_tuple_pool_.get()); > nit: put curly braces around it Done http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@364 PS8, Line 364: if (id >= node.getColumnId() && id <= node.getMaximumColumnId()) return true; > nit: put curly braces around it Done http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@681 PS8, Line 681: RETURN_IF_ERROR(AssembleCollection(*child_reader, child_batch_offset + i, : coll_value_builder)); > nit: add curly braces around it Done http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@301 PS9, Line 301: VLOG_QUERY << "Add ORC column " << node->getMaximumColumnId() << " for empty tuple " > This logging will be too verbose for most purposes - I'd suggest making it This is quite helpful in debug. I think VLOG_FILE is more suiteable since this function will only be called once for each file. http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@363 PS9, Line 363: for (uint64_t id : selected_type_ids) > nit: braces for multi-line if I'm quite confused about the definition of "multi-line". The if statement just occupies one line. Isn't it a single-line statement? Zoltan also comments that the for loop needs a curly brace. Aren't braces optional for single-line statement loops? http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@395 PS9, Line 395: selected_type_ids_.size() > It doesn't really matter here, but with gcc4.9.2 list::size() is actually O Sure http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@680 PS9, Line 680: for (int i = 0; i < total_tuples; ++i) > Use braces for multi-line if Done http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h File be/src/exec/orc-column-readers.h: http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h@19 PS9, Line 19: #ifndef IMPALA_EXEC_ORC_COLUMN_READERS_H > We started using "#pragma once" instead of the traditional include guards i Sure. Done http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h@61 PS9, Line 61: virtual void UpdateInputBatch(orc::ColumnVectorBatch* orc_batch) = 0; > Maybe mention the relationship with ReadValue() - do you need to call it be Sure. Added class comments. Feel free to polish them since they're not written by a native speaker :) http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h File be/src/exec/orc-column-readers.h: http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h@101 PS8, Line 101: batch_ = dynamic_cast(orc_batch); > We avoid doing dynamic casts in release mode. Done http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h@126 PS8, Line 126: int64_t val = batch_->data.data()[row_idx]; > Maybe you could add a DCHECK that batch_ is not null. Sure. Add DCHECK_NOTNULL at the first use of batch_. Done http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc File be/src/exec/orc-column-readers.cc:
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 9: (10 comments) I looked through the headers and test files initially. I think this makes sense at a high level. I haven't looked at the .cc files in parallel - I think Zoltan will probably get to that before me. http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.h File be/src/exec/hdfs-orc-scanner.h: http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.h@175 PS9, Line 175: std::list selected_type_ids_; Mention that RowReaderOptions.includeTypes() expects a list (otherwise it's confusing why this is not a vector). Actually it's still confusing why it's not a vector but at least we can blame the ORC library for that decision :) http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@301 PS9, Line 301: VLOG_QUERY << "Add ORC column " << node->getMaximumColumnId() << " for empty tuple " This logging will be too verbose for most purposes - I'd suggest making it VLOG(3) or removing it. http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@363 PS9, Line 363: for (uint64_t id : selected_type_ids) nit: braces for multi-line if http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@395 PS9, Line 395: selected_type_ids_.size() It doesn't really matter here, but with gcc4.9.2 list::size() is actually O(n). We've been burned by that before. http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@680 PS9, Line 680: for (int i = 0; i < total_tuples; ++i) Use braces for multi-line if http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h File be/src/exec/orc-column-readers.h: http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h@19 PS9, Line 19: #ifndef IMPALA_EXEC_ORC_COLUMN_READERS_H We started using "#pragma once" instead of the traditional include guards in many places - it would be nice to switch. http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h@61 PS9, Line 61: virtual void UpdateInputBatch(orc::ColumnVectorBatch* orc_batch) = 0; Maybe mention the relationship with ReadValue() - do you need to call it before calling ReadValue()? Maybe it would help to document how callers should use this in a class comment for OrcColumnReader? http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.cc File be/src/exec/orc-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.cc@330 PS9, Line 330: for (OrcColumnReader* child : children_) Need braces around multi-line for loop - here and elsewhere http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.cc@331 PS9, Line 331: RETURN_IF_ERROR(child->ReadValue(row_idx_, tuple, pool)); This probably performs similarly to the previous code with the switch, but ultimately to make this performance we'd want to do batched calls that read multiple values in one try. We changed the Parquet reader to do things that way a while back and it makes it faster automatically and unlocks more optimisation opportunities. Nothing to fix here, just sharing http://gerrit.cloudera.org:8080/#/c/12168/9/tests/query_test/test_nested_types.py File tests/query_test/test_nested_types.py: http://gerrit.cloudera.org:8080/#/c/12168/9/tests/query_test/test_nested_types.py@128 PS9, Line 128: # TODO(IMPALA-6505): support predicates push down on ORC stripe statistics Maybe move this to a string argument to skip() -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 9 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Sat, 26 Jan 2019 01:19:46 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 8: (12 comments) Started to look at it. Currently mostly have comments about style. I plan to look at it deeper. http://gerrit.cloudera.org:8080/#/c/12168/8//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12168/8//COMMIT_MSG@25 PS8, Line 25: Don’t like nit: 'They are not like', or 'They differ from' http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.h File be/src/exec/hdfs-orc-scanner.h: http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.h@160 PS8, Line 160: orc::RowReaderOptions row_reader_options; nit: put '_' at the end of the variable name. http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@181 PS8, Line 181: base nit: based http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@329 PS8, Line 329: *template_tuple = : Tuple::Create(tuple_desc.byte_size(), template_tuple_pool_.get()); nit: put curly braces around it http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@364 PS8, Line 364: if (id >= node.getColumnId() && id <= node.getMaximumColumnId()) return true; nit: put curly braces around it http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@681 PS8, Line 681: RETURN_IF_ERROR(AssembleCollection(*child_reader, child_batch_offset + i, : coll_value_builder)); nit: add curly braces around it http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h File be/src/exec/orc-column-readers.h: http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h@101 PS8, Line 101: batch_ = dynamic_cast(orc_batch); We avoid doing dynamic casts in release mode. You can do a dynamic_cast inside a DCHECK, then use static_cast here. http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h@126 PS8, Line 126: int64_t val = batch_->data.data()[row_idx]; Maybe you could add a DCHECK that batch_ is not null. http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc File be/src/exec/orc-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@190 PS8, Line 190: *slot = TimestampValue::FromUnixTimeNanos(secs, nanos, Do we know that orc timestamps are timezone-aware or not? http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@326 PS8, Line 326: children_[c]->UpdateInputBatch(batch_->fields[children_fields_[c]]); nit: add curly brackets around it since it is a multi-line for stmt. http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@331 PS8, Line 331: RETURN_IF_ERROR(child->ReadValue(row_idx_, tuple, pool)); nit: add curly brackets around it http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@395 PS8, Line 395: } else CreateChildForSlot(node, slot_desc); nit: it's against the Google style guide to have braces on only one branch of an if-stmt: https://google.github.io/styleguide/cppguide.html#Conditionals -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 8 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 25 Jan 2019 17:25:23 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 9: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/1854/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 9 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 23 Jan 2019 16:15:31 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Hello Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12168 to look at the new patch set (#9). Change subject: IMPALA-6503: Support reading complex types from ORC format files .. IMPALA-6503: Support reading complex types from ORC format files We’ve supported reading primitive types from ORC files (IMPALA-5717). In this patch we add support for complex types (struct/array/map). In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC lib can materialize ORC column binaries into its representation (orc::ColumnVectorBatch), so we don’t need to do anything about decoding/decompression in hdfs-orc-scanner. Since it already supports complex types, we’ll still depend on it. What we need to add in IMPALA-6503 are two things: 1. Specify which nested columns we need to the ORC lib 2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into Impala’s representation To format the materialization, we implement several ORC column readers used in hdfs-orc-scanner. Each kind of reader treats a column type. Don’t like the Parquet readers (used in hdfs-parquet-scanner) which materializes Parquet column binaries into tuple values directly, the ORC readers (in hdfs-orc-scanner) just need to transform outputs of the ORC lib into tuple/slot values. Tests: * Enable existing tests for complex types (test_nested_types.py, test_tpch_nested_queries.py) for ORC. Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 --- M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/hdfs-orc-scanner.h A be/src/exec/orc-column-readers.cc A be/src/exec/orc-column-readers.h A be/src/exec/orc-metadata-utils.cc A be/src/exec/orc-metadata-utils.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java A testdata/ComplexTypesTbl/README A testdata/ComplexTypesTbl/nonnullable.orc A testdata/ComplexTypesTbl/nullable.orc M testdata/bin/create-load-data.sh M testdata/bin/load_nested.py M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test M testdata/workloads/tpch_nested/tpch_nested_core.csv M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv M tests/query_test/test_nested_types.py M tests/query_test/test_tpch_nested_queries.py 30 files changed, 1,698 insertions(+), 453 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/9 -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 9 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 8: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/1813/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 8 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 18 Jan 2019 14:41:56 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Hello Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12168 to look at the new patch set (#8). Change subject: IMPALA-6503: Support reading complex types from ORC format files .. IMPALA-6503: Support reading complex types from ORC format files We’ve supported reading primitive types from ORC files (IMPALA-5717). In this patch we add support for complex types (struct/array/map). In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC lib can materialize ORC column binaries into its representation (orc::ColumnVectorBatch), so we don’t need to do anything about decoding/decompression in hdfs-orc-scanner. Since it already supports complex types, we’ll still depend on it. What we need to add in IMPALA-6503 are two things: 1. Specify which nested columns we need to the ORC lib 2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into Impala’s representation To format the materialization, we implement several ORC column readers used in hdfs-orc-scanner. Each kind of reader treats a column type. Don’t like the Parquet readers (used in hdfs-parquet-scanner) which materializes Parquet column binaries into tuple values directly, the ORC readers (in hdfs-orc-scanner) just need to transform outputs of the ORC lib into tuple/slot values. Tests: * Enable existing tests for complex types (test_nested_types.py, test_tpch_nested_queries.py) for ORC. Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 --- M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/hdfs-orc-scanner.h A be/src/exec/orc-column-readers.cc A be/src/exec/orc-column-readers.h A be/src/exec/orc-metadata-utils.cc A be/src/exec/orc-metadata-utils.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java A testdata/ComplexTypesTbl/README A testdata/ComplexTypesTbl/nonnullable.orc A testdata/ComplexTypesTbl/nullable.orc M testdata/bin/create-load-data.sh M testdata/bin/load_nested.py M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test M testdata/workloads/tpch_nested/tpch_nested_core.csv M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv M tests/query_test/test_nested_types.py M tests/query_test/test_tpch_nested_queries.py 29 files changed, 1,695 insertions(+), 449 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/8 -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 8 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 7: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/1728/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 7 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 08 Jan 2019 00:43:54 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Hello Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12168 to look at the new patch set (#7). Change subject: IMPALA-6503: Support reading complex types from ORC format files .. IMPALA-6503: Support reading complex types from ORC format files We’ve supported reading primitive types from ORC files (IMPALA-5717). In this patch we add support for complex types (struct/array/map). In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC lib can materialize ORC column binaries into its representation (orc::ColumnVectorBatch), so we don’t need to do anything about decoding/decompression in hdfs-orc-scanner. Since it already supports complex types, we’ll still depend on it. What we need to add in IMPALA-6503 are two things: 1. Specify which nested columns we need to the ORC lib 2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into Impala’s representation To format the materialization, we implement several ORC column readers used in hdfs-orc-scanner. Each kind of reader treats a column type. Don’t like the Parquet readers (used in hdfs-parquet-scanner) which materializes Parquet column binaries into tuple values directly, the ORC readers (in hdfs-orc-scanner) just need to transform outputs of the ORC lib into tuple/slot values. Tests: * Enable existing tests for complex types (test_nested_types.py, test_tpch_nested_queries.py) for ORC. Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 --- M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/hdfs-orc-scanner.h A be/src/exec/orc-column-readers.cc A be/src/exec/orc-column-readers.h A be/src/exec/orc-metadata-utils.cc A be/src/exec/orc-metadata-utils.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java A testdata/ComplexTypesTbl/README A testdata/ComplexTypesTbl/nonnullable.orc A testdata/ComplexTypesTbl/nullable.orc M testdata/bin/create-load-data.sh M testdata/bin/load_nested.py M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test M testdata/workloads/tpch_nested/tpch_nested_core.csv M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv M tests/query_test/test_nested_types.py M tests/query_test/test_tpch_nested_queries.py 29 files changed, 1,685 insertions(+), 444 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/7 -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 7 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 5: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/1723/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 5 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Mon, 07 Jan 2019 15:28:52 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Quanlong Huang has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12168 Change subject: IMPALA-6503: Support reading complex types from ORC format files .. IMPALA-6503: Support reading complex types from ORC format files We’ve supported reading primitive types from ORC files (IMPALA-5717). In this patch we add support for complex types (struct/array/map). In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC lib can materialize ORC column binaries into its representation (orc::ColumnVectorBatch), so we don’t need to do anything about decoding/decompression in hdfs-orc-scanner. Since it already supports complex types, we’ll still depend on it. What we need to add in IMPALA-6503 are two things: 1. Specify which nested columns we need to the ORC lib 2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into Impala’s representation To format the materialization, we implement several ORC column readers used in hdfs-orc-scanner. Each kind of reader treats a column type. Don’t like the Parquet readers (used in hdfs-parquet-scanner) which materializes Parquet column binaries into tuple values directly, the ORC readers (in hdfs-orc-scanner) just need to transform outputs of the ORC lib into tuple/slot values. Tests: * Enable existing tests for complex types (test_nested_types.py, test_tpch_nested_queries.py) for ORC. Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 --- M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/hdfs-orc-scanner.h A be/src/exec/orc-column-readers.cc A be/src/exec/orc-column-readers.h A be/src/exec/orc-metadata-utils.cc A be/src/exec/orc-metadata-utils.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java A testdata/ComplexTypesTbl/README A testdata/ComplexTypesTbl/nonnullable.orc A testdata/ComplexTypesTbl/nullable.orc M testdata/bin/create-load-data.sh M testdata/bin/load_nested.py M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test M testdata/workloads/tpch_nested/tpch_nested_core.csv M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv M tests/query_test/test_nested_types.py M tests/query_test/test_tpch_nested_queries.py 29 files changed, 1,656 insertions(+), 442 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/5 -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 5 Gerrit-Owner: Quanlong Huang
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/12168/5/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12168/5/be/src/exec/hdfs-orc-scanner.cc@291 PS5, Line 291: RETURN_IF_ERROR(schema_resolver_->ResolveColumn(tuple_desc.tuple_path(), , _field, line too long (93 > 90) -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 5 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Mon, 07 Jan 2019 14:55:52 + Gerrit-HasComments: Yes