[Impala-ASF-CR] IMPALA-7540. Intern most repetitive strings and network addresses in catalog
Bharath Vissapragada has posted comments on this change. ( http://gerrit.cloudera.org:8080/11158 ) Change subject: IMPALA-7540. Intern most repetitive strings and network addresses in catalog .. Patch Set 3: Code-Review+1 (4 comments) lgtm (pending nits). I'll let Todd take a pass in case he has some comments. http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java File fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java: http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@62 PS2, Line 62:*/ > Documentation of this kind tends to get out of date quickly. Looking at the Fair point. http://gerrit.cloudera.org:8080/#/c/11158/3/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java File fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java: http://gerrit.cloudera.org:8080/#/c/11158/3/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@22 PS3, Line 22: import org.apache.curator.shaded.com.google.common.base.Preconditions; Switch to guava jar. http://gerrit.cloudera.org:8080/#/c/11158/3/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@115 PS3, Line 115: Preconditions.checkNotNull(sd); I think we should be conservative here (and other places) and do something like "if (sd == null) return;". Otherwise this unchecked exception can cause table loading failures and it not recoverable unless we patch it again. http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/HBaseColumn.java File fe/src/main/java/org/apache/impala/catalog/HBaseColumn.java: http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/HBaseColumn.java@35 PS2, Line 35: columnQualifier > According to this link: https://www.dummies.com/programming/big-data/hadoop Thanks for the info. -- To view, visit http://gerrit.cloudera.org:8080/11158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib3121aefa4391bcb1477d9dba0a49440d7000d26 Gerrit-Change-Number: 11158 Gerrit-PatchSet: 3 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Paul Rogers Gerrit-Comment-Date: Mon, 28 Jan 2019 06:37:22 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 10: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/1904/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 10 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 28 Jan 2019 06:03:01 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Hello Zoltan Borok-Nagy, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12168 to look at the new patch set (#10). Change subject: IMPALA-6503: Support reading complex types from ORC format files .. IMPALA-6503: Support reading complex types from ORC format files We've supported reading primitive types from ORC files (IMPALA-5717). In this patch we add support for complex types (struct/array/map). In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC lib can materialize ORC column binaries into its representation (orc::ColumnVectorBatch), so we don't need to do anything about decoding/decompression in hdfs-orc-scanner. Since it already supports complex types, we'll still depend on it. What we need to add in IMPALA-6503 are two things: 1. Specify which nested columns we need to the ORC lib 2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into Impala's representation To format the materialization, we implement several ORC column readers in hdfs-orc-scanner. Each kind of reader treats a column type. The ORC column readers differ from the Parquet readers (used in hdfs-parquet-scanner) which materializes Parquet column binaries into tuple values directly. They just need to transform outputs of the ORC lib into tuple/slot values. Tests: * Enable existing tests for complex types (test_nested_types.py, test_tpch_nested_queries.py) for ORC. Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 --- M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/hdfs-orc-scanner.h A be/src/exec/orc-column-readers.cc A be/src/exec/orc-column-readers.h A be/src/exec/orc-metadata-utils.cc A be/src/exec/orc-metadata-utils.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java A testdata/ComplexTypesTbl/README A testdata/ComplexTypesTbl/nonnullable.orc A testdata/ComplexTypesTbl/nullable.orc M testdata/bin/create-load-data.sh M testdata/bin/load_nested.py M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test M testdata/workloads/tpch_nested/tpch_nested_core.csv M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv M tests/query_test/test_nested_types.py M tests/query_test/test_tpch_nested_queries.py 30 files changed, 1,789 insertions(+), 454 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/10 -- To view, visit http://gerrit.cloudera.org:8080/12168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790 Gerrit-Change-Number: 12168 Gerrit-PatchSet: 10 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 ) Change subject: IMPALA-6503: Support reading complex types from ORC format files .. Patch Set 9: (22 comments) Thanks for your feedback! I've added more comments and refactor some codes as required. http://gerrit.cloudera.org:8080/#/c/12168/8//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12168/8//COMMIT_MSG@25 PS8, Line 25: Don’t like > nit: 'They are not like', or 'They differ from' Done http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.h File be/src/exec/hdfs-orc-scanner.h: http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.h@175 PS9, Line 175: std::list selected_type_ids_; > Mention that RowReaderOptions.includeTypes() expects a list (otherwise it's Yeah, I have no ideas about this too. http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.h File be/src/exec/hdfs-orc-scanner.h: http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.h@160 PS8, Line 160: orc::RowReaderOptions row_reader_options; > nit: put '_' at the end of the variable name. Oops! Missed this in the previous patch. Done. http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@181 PS8, Line 181: base > nit: based Done http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@329 PS8, Line 329: *template_tuple = : Tuple::Create(tuple_desc.byte_size(), template_tuple_pool_.get()); > nit: put curly braces around it Done http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@364 PS8, Line 364: if (id >= node.getColumnId() && id <= node.getMaximumColumnId()) return true; > nit: put curly braces around it Done http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@681 PS8, Line 681: RETURN_IF_ERROR(AssembleCollection(*child_reader, child_batch_offset + i, : coll_value_builder)); > nit: add curly braces around it Done http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@301 PS9, Line 301: VLOG_QUERY << "Add ORC column " << node->getMaximumColumnId() << " for empty tuple " > This logging will be too verbose for most purposes - I'd suggest making it This is quite helpful in debug. I think VLOG_FILE is more suiteable since this function will only be called once for each file. http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@363 PS9, Line 363: for (uint64_t id : selected_type_ids) > nit: braces for multi-line if I'm quite confused about the definition of "multi-line". The if statement just occupies one line. Isn't it a single-line statement? Zoltan also comments that the for loop needs a curly brace. Aren't braces optional for single-line statement loops? http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@395 PS9, Line 395: selected_type_ids_.size() > It doesn't really matter here, but with gcc4.9.2 list::size() is actually O Sure http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@680 PS9, Line 680: for (int i = 0; i < total_tuples; ++i) > Use braces for multi-line if Done http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h File be/src/exec/orc-column-readers.h: http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h@19 PS9, Line 19: #ifndef IMPALA_EXEC_ORC_COLUMN_READERS_H > We started using "#pragma once" instead of the traditional include guards i Sure. Done http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h@61 PS9, Line 61: virtual void UpdateInputBatch(orc::ColumnVectorBatch* orc_batch) = 0; > Maybe mention the relationship with ReadValue() - do you need to call it be Sure. Added class comments. Feel free to polish them since they're not written by a native speaker :) http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h File be/src/exec/orc-column-readers.h: http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h@101 PS8, Line 101: batch_ = dynamic_cast(orc_batch); > We avoid doing dynamic casts in release mode. Done http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h@126 PS8, Line 126: int64_t val = batch_->data.data()[row_idx]; > Maybe you could add a DCHECK that batch_ is not null. Sure. Add DCHECK_NOTNULL at the first use of batch_. Done http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc File be/src/exec/orc-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12168/8/be/sr