[Impala-ASF-CR] IMPALA-7540. Intern most repetitive strings and network addresses in catalog

2019-01-27 Thread Bharath Vissapragada (Code Review)
Bharath Vissapragada has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11158 )

Change subject: IMPALA-7540. Intern most repetitive strings and network 
addresses in catalog
..


Patch Set 3: Code-Review+1

(4 comments)

lgtm (pending nits). I'll let Todd take a pass in case he has some comments.

http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java
File fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java:

http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@62
PS2, Line 62:*/
> Documentation of this kind tends to get out of date quickly. Looking at the
Fair point.


http://gerrit.cloudera.org:8080/#/c/11158/3/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java
File fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java:

http://gerrit.cloudera.org:8080/#/c/11158/3/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@22
PS3, Line 22: import 
org.apache.curator.shaded.com.google.common.base.Preconditions;
Switch to guava jar.


http://gerrit.cloudera.org:8080/#/c/11158/3/fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java@115
PS3, Line 115: Preconditions.checkNotNull(sd);
I think we should be conservative here (and other places) and do something like 
"if (sd == null) return;". Otherwise this unchecked exception can cause table 
loading failures and it not recoverable unless we patch it again.


http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/HBaseColumn.java
File fe/src/main/java/org/apache/impala/catalog/HBaseColumn.java:

http://gerrit.cloudera.org:8080/#/c/11158/2/fe/src/main/java/org/apache/impala/catalog/HBaseColumn.java@35
PS2, Line 35: columnQualifier
> According to this link: https://www.dummies.com/programming/big-data/hadoop
Thanks for the info.



--
To view, visit http://gerrit.cloudera.org:8080/11158
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3121aefa4391bcb1477d9dba0a49440d7000d26
Gerrit-Change-Number: 11158
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Paul Rogers 
Gerrit-Comment-Date: Mon, 28 Jan 2019 06:37:22 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-01-27 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 10:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1904/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 28 Jan 2019 06:03:01 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-01-27 Thread Quanlong Huang (Code Review)
Hello Zoltan Borok-Nagy, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12168

to look at the new patch set (#10).

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..

IMPALA-6503: Support reading complex types from ORC format files

We've supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC
lib can materialize ORC column binaries into its representation
(orc::ColumnVectorBatch), so we don't need to do anything about
decoding/decompression in hdfs-orc-scanner. Since it already supports
complex types, we'll still depend on it.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need to the ORC lib
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala's representation

To format the materialization, we implement several ORC column readers
in hdfs-orc-scanner. Each kind of reader treats a column type. The ORC
column readers differ from the Parquet readers (used in
hdfs-parquet-scanner) which materializes Parquet column binaries into
tuple values directly. They just need to transform outputs of the ORC
lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
30 files changed, 1,789 insertions(+), 454 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/10
--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-01-27 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 9:

(22 comments)

Thanks for your feedback! I've added more comments and refactor some codes as 
required.

http://gerrit.cloudera.org:8080/#/c/12168/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12168/8//COMMIT_MSG@25
PS8, Line 25: Don’t like
> nit: 'They are not like', or 'They differ from'
Done


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.h@175
PS9, Line 175:   std::list selected_type_ids_;
> Mention that RowReaderOptions.includeTypes() expects a list (otherwise it's
Yeah, I have no ideas about this too.


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.h@160
PS8, Line 160:   orc::RowReaderOptions row_reader_options;
> nit: put '_' at the end of the variable name.
Oops! Missed this in the previous patch. Done.


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@181
PS8, Line 181: base
> nit: based
Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@329
PS8, Line 329: *template_tuple =
 : Tuple::Create(tuple_desc.byte_size(), 
template_tuple_pool_.get());
> nit: put curly braces around it
Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@364
PS8, Line 364: if (id >= node.getColumnId() && id <= 
node.getMaximumColumnId()) return true;
> nit: put curly braces around it
Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@681
PS8, Line 681:   RETURN_IF_ERROR(AssembleCollection(*child_reader, 
child_batch_offset + i,
 :   coll_value_builder));
> nit: add curly braces around it
Done


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@301
PS9, Line 301: VLOG_QUERY << "Add ORC column " << 
node->getMaximumColumnId() << " for empty tuple "
> This logging will be too verbose for most purposes - I'd suggest making it
This is quite helpful in debug. I think VLOG_FILE is more suiteable since this 
function will only be called once for each file.


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@363
PS9, Line 363:   for (uint64_t id : selected_type_ids)
> nit: braces for multi-line if
I'm quite confused about the definition of "multi-line". The if statement just 
occupies one line. Isn't it a single-line statement? Zoltan also comments that 
the for loop needs a curly brace. Aren't braces optional for single-line 
statement loops?


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@395
PS9, Line 395: selected_type_ids_.size()
> It doesn't really matter here, but with gcc4.9.2 list::size() is actually O
Sure


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@680
PS9, Line 680: for (int i = 0; i < total_tuples; ++i)
> Use braces for multi-line if
Done


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h@19
PS9, Line 19: #ifndef IMPALA_EXEC_ORC_COLUMN_READERS_H
> We started using "#pragma once" instead of the traditional include guards i
Sure. Done


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h@61
PS9, Line 61:   virtual void UpdateInputBatch(orc::ColumnVectorBatch* 
orc_batch) = 0;
> Maybe mention the relationship with ReadValue() - do you need to call it be
Sure. Added class comments. Feel free to polish them since they're not written 
by a native speaker :)


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h@101
PS8, Line 101: batch_ = dynamic_cast(orc_batch);
> We avoid doing dynamic casts in release mode.
Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h@126
PS8, Line 126: int64_t val = batch_->data.data()[row_idx];
> Maybe you could add a DCHECK that batch_ is not null.
Sure. Add DCHECK_NOTNULL at the first use of batch_. Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/sr