[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-02-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 15:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/2153/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 15
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 18 Feb 2019 14:48:50 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-02-18 Thread Quanlong Huang (Code Review)
Hello Zoltan Borok-Nagy, Tim Armstrong, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12168

to look at the new patch set (#15).

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..

IMPALA-6503: Support reading complex types from ORC format files

We've supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we leverage the ORC lib to parse ORC binaries (data in
io buffer read from DiskIoMgr). The ORC lib can materialize ORC column
binaries into its representation (orc::ColumnVectorBatch). Then we
transform values in orc::ColumnVectorBatch into impala::Tuples in
hdfs-orc-scanner. We don't need to do anything about decoding/decompression
since they are handled by the ORC lib. Fortunately, the ORC lib already
supports complex types, we can still leverage it to support complex types.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need in the form required by the ORC
  lib (Get list of ORC type ids from tuple descriptors)
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala's representation (Slots/Tuples/RowBatches)

To format the materialization, we implement several ORC column readers
in hdfs-orc-scanner. Each kind of reader treats a column type and
transforms outputs of the ORC lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
29 files changed, 1,864 insertions(+), 461 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/15
--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 15
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-02-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 14: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 14
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 13 Feb 2019 05:06:49 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-02-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 14:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3768/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 14
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 13 Feb 2019 01:00:38 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-02-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 13:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/2091/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 13 Feb 2019 00:33:11 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-02-12 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 13:

(15 comments)

Thanks for your new feedbacks! Let's keep reviewing it so things could be 
simpler:)

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@363
PS9, Line 363: // We only deal with collection columns (ARRAY/MAP) and 
primitive columns here.
> Yep that's the rule - use braces if the whole thing doesn't fit on a single
Sure


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@198
PS12, Line 198: niqu
> nit:std:: not needed since we import it into the namespace with common/name
Done


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@316
PS12, Line 316: VLOG(3) << "Add ORC column " << node->getMaximumColumnId() 
<< " for empty tuple "
> Can we remove these VLOG_FILE statements here and below? It could get quite
Yes, they're mainly used for debugging. VLOG(3) is equal to VLOG_ROW but this 
function will only be called once for an orc-scanner, not for each row or 
RowBatch. Isn't it a VLOG(2) level info?

However, I still change it to VLOG(3) to respect your opinions :)


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@380
PS12, Line 380: const list& selected_type_ids) {
> Can this argument be const?
Yes. Should be const.


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@593
PS12, Line 593:
> How do we know that we're finished processing the previous batch? What happ
If we come here, the previous batch should have been drained. The relative 
codes are in GetNextInternal:

  // Transfer remaining values in current orc batches tracked by readers.
  if (!orc_root_reader_->EndOfBatch()) {
assemble_rows_timer_.Start();
RETURN_IF_ERROR(TransferTuples(orc_root_reader_, row_batch));
assemble_rows_timer_.Stop();
if (row_batch->AtCapacity()) return Status::OK();
DCHECK(orc_root_reader_->EndOfBatch());
  }

If we exited the while() loop because row_batch() was at capacity, the 
remaining rows will be processed in the above codes. Since the batch size 
equals to the capacity of RowBatch, the remaining rows should no more than the 
capacity of RowBatch. Thus calling TransferTuples once can drain the original 
batch.


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@638
PS12, Line 638:
> typo: materialization
Done


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@638
PS12, Line 638: alizeTuple()) {
> grammar: "that are not materializing tuples"
Done


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@688
PS12, Line 688:
> Pass in complex_col_reader by pointer since it's mutable.
It can be const. We don't change any states of it in this function. I just add 
'const' for it and add const for functions we called here.


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc@133
PS12, Line 133:   }
> VLOG_FILE is also probably too verbose here - VLOG(3) would be more appropr
Done


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc@219
PS12, Line 219: materiali
> same comment about logging
Done


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/runtime/descriptors.cc
File be/src/runtime/descriptors.cc:

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/runtime/descriptors.cc@65
PS12, Line 65: const int SchemaPathConstants::ARRAY_ITEM;
> I think it's better to define the constant value in the header and just ins
Done


http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/create-load-data.sh@581
PS12, Line 581:   run-step "Loading nested parquet data" load-nested.log \
  : ${IMPALA_HOME}/testdata/bin/load_nested.py \
  : -t tpch_nested_parquet -f parquet/none ${LOAD_NESTED_ARGS:-}
  :   run-step "Loading nested orc data" load-nested.log \
  : ${IMPALA_HOME}/testdata/bin/load_nested.py \
> I think it would make sense to specify the arguments for the call to run th
Sure. It's clearer.


http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/load_nested.py
File testdata/bin/load_nested.py:

http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/load_nested.py@286
PS12, Line 286:   hive.execute("""
  : 

[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-02-12 Thread Quanlong Huang (Code Review)
Hello Zoltan Borok-Nagy, Tim Armstrong, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12168

to look at the new patch set (#13).

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..

IMPALA-6503: Support reading complex types from ORC format files

We've supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC
lib can materialize ORC column binaries into its representation
(orc::ColumnVectorBatch), so we don't need to do anything about
decoding/decompression in hdfs-orc-scanner. Since it already supports
complex types, we'll still depend on it.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need to the ORC lib
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala's representation

To format the materialization, we implement several ORC column readers
in hdfs-orc-scanner. Each kind of reader treats a column type. The ORC
column readers differ from the Parquet readers (used in
hdfs-parquet-scanner) which materializes Parquet column binaries into
tuple values directly. They just need to transform outputs of the ORC
lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
29 files changed, 1,864 insertions(+), 461 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/13
--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-02-11 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 12:

(12 comments)

It's a big change so it's taking a while to get my head around!

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@363
PS9, Line 363: // We only deal with collection columns (ARRAY/MAP) and 
primitive columns here.
> Tim will correct me if I'm wrong, but I guess he was also thinking about th
Yep that's the rule - use braces if the whole thing doesn't fit on a single 
line.


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@198
PS12, Line 198: td::
nit:std:: not needed since we import it into the namespace with common/names.h


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@316
PS12, Line 316: VLOG_FILE << "Add ORC column " << 
node->getMaximumColumnId() << " for empty tuple "
Can we remove these VLOG_FILE statements here and below? It could get quite 
noisy. I think we could either remove them or make them VLOG(3) since they're 
mainly for debugging I think?


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@380
PS12, Line 380: list& selected_type_ids) {
Can this argument be const?


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@593
PS12, Line 593:   // We're going to free the previous batch. Clear the 
reference first.
How do we know that we're finished processing the previous batch? What happens 
if we exited the while() loop because row_batch() was at capacity? Don't we 
lose some rows?


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@638
PS12, Line 638: hat not materializing tuples
grammar: "that are not materializing tuples"


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@638
PS12, Line 638: materializtion
typo: materialization


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@688
PS12, Line 688: OrcComplexColumnReader& complex_col_reade
Pass in complex_col_reader by pointer since it's mutable.


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc@133
PS12, Line 133:   VLOG_FILE << "Created reader for " << PrintNode(orc_type) << 
": slot_desc_="
VLOG_FILE is also probably too verbose here - VLOG(3) would be more appropriate.


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc@170
PS12, Line 170: TryAllocate
This is minor, but TryAllocateUnaligned() is better for strings - it is 
marginally more efficient and avoids wasting space.


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc@219
PS12, Line 219: VLOG_FILE
same comment about logging


http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/runtime/descriptors.cc
File be/src/runtime/descriptors.cc:

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/runtime/descriptors.cc@65
PS12, Line 65: const int SchemaPathConstants::ARRAY_ITEM = 0;
I think it's better to define the constant value in the header and just 
instantiate it here, i.e. revert the change in the header and make these like:

  const int SchemaPathConstants::ARRAY_ITEM;

That just means the constant value can be inlined in callers.



--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 12 Feb 2019 00:08:33 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-02-06 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 12:

(4 comments)

A couple comments about the data loading and tests.

http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/create-load-data.sh@581
PS12, Line 581:   run-step "Loading nested parquet data" load-nested.log \
  : ${IMPALA_HOME}/testdata/bin/load_nested.py 
${LOAD_NESTED_ARGS:-}
  :   run-step "Loading nested orc data" load-nested.log \
  : ${IMPALA_HOME}/testdata/bin/load_nested.py \
  : -t tpch_nested_orc_def -f orc/def ${LOAD_NESTED_ARGS:-}
I think it would make sense to specify the arguments for the call to run the 
load_nested.py for parquet. i.e.
  run-step "Loading nested parquet data" load-nested.log \
${IMPALA_HOME}/testdata/bin/load_nested.py \
-t tpch_nested_parquet -f parquet/none ${LOAD_NESTED_ARGS:-}


http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/load_nested.py
File testdata/bin/load_nested.py:

http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/load_nested.py@286
PS12, Line 286: # For ORC format, we create the 'part' table by Hive
  : if file_format == "orc":
  :   hive.execute("""
  : CREATE TABLE part
  : STORED AS ORC
  : AS SELECT * FROM 
{source_db}.part""".format(**sql_params))
I would prefer this to be right next to the create statement for parquet, even 
if you need to open a separate hive cursor.


http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/load_nested.py@340
PS12, Line 340:   file_format, compression_value = args.table_format.split("/")
  :   if file_format == "parquet" and compression_value == "none":
  : compression_key = "parquet.compression"
  : compression_value = "SNAPPY"
  :   elif file_format == "orc" and compression_value == "def":
  : compression_key = "orc.compress"
  : compression_value = "ZLIB"
Please throw an error if the file_format/compression_value is not one of the 
supported combinations. i.e. add an else and raise an error.


http://gerrit.cloudera.org:8080/#/c/12168/12/tests/query_test/test_nested_types.py
File tests/query_test/test_nested_types.py:

http://gerrit.cloudera.org:8080/#/c/12168/12/tests/query_test/test_nested_types.py@611
PS12, Line 611: if file_format == 'parquet':
  :   self.__create_parquet_tables(unique_database, True)
  : elif file_format == 'orc':
  :   # Creating ORC tables from ORC files (IMPALA-8046) has 
not been supported.
  :   # We create the Parquet tables first and then transform 
them into ORC tables.
  :   self.__create_parquet_tables(unique_database, False)
  :   self.__create_orc_tables(unique_database)
It might be cleaner to have __create_orc_table() call 
__create_parquet_tables(), so that this code becomes:

if file_format == 'parquet':
  self.__create_parquet_tables(unique_database)
elif file_format == 'orc':
  self.__create_orc_tables(unique_database)

The comment here about "Creating ORC tables from..." becomes part of 
__create_orc_tables(), and the as_target parameter is only needed from 
__create_orc_tables().



--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 06 Feb 2019 18:44:58 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-02-05 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 12:

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/3726/


--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 06 Feb 2019 04:35:25 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-02-05 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 12:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3726/ 
DRY_RUN=true


--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 06 Feb 2019 00:37:47 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-02-03 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 12:

The jenkins job still checkout the cdh5-trunk branch of Impala-lzo, which cause 
the build failures.


--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 04 Feb 2019 01:31:15 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-02-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 12: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/3702/


--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 04 Feb 2019 00:55:08 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-02-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 12:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3702/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Sun, 03 Feb 2019 23:50:49 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-02-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 11:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/1974/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 11
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Sun, 03 Feb 2019 13:50:49 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-02-03 Thread Quanlong Huang (Code Review)
Hello Zoltan Borok-Nagy, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12168

to look at the new patch set (#11).

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..

IMPALA-6503: Support reading complex types from ORC format files

We've supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC
lib can materialize ORC column binaries into its representation
(orc::ColumnVectorBatch), so we don't need to do anything about
decoding/decompression in hdfs-orc-scanner. Since it already supports
complex types, we'll still depend on it.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need to the ORC lib
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala's representation

To format the materialization, we implement several ORC column readers
in hdfs-orc-scanner. Each kind of reader treats a column type. The ORC
column readers differ from the Parquet readers (used in
hdfs-parquet-scanner) which materializes Parquet column binaries into
tuple values directly. They just need to transform outputs of the ORC
lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
30 files changed, 1,828 insertions(+), 462 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/11
--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 11
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-01-28 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 10:

(1 comment)

> Patch Set 9:
>
> (2 comments)

http://gerrit.cloudera.org:8080/#/c/12168/10/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

http://gerrit.cloudera.org:8080/#/c/12168/10/testdata/bin/create-load-data.sh@34
PS10, Line 34: set -x
> I guess you left it here accidentally
This is introduced by a recent commit: 236b9194d345bbfdfb177dd7ef4908170eaff259
I just rebase and it appears :)



--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 28 Jan 2019 22:54:25 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-01-28 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 9:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@363
PS9, Line 363:   for (uint64_t id : selected_type_ids)
> I'm quite confused about the definition of "multi-line". The if statement j
Tim will correct me if I'm wrong, but I guess he was also thinking about the 
braces of the for-loop.

I should've pasted the url of Impala's style guide since we diverged a little 
from the Google style guide: 
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65868536

My experience is that we don't like any control statements without braces, 
unless the whole statement fits into a single line.


http://gerrit.cloudera.org:8080/#/c/12168/10/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

http://gerrit.cloudera.org:8080/#/c/12168/10/testdata/bin/create-load-data.sh@34
PS10, Line 34: set -x
I guess you left it here accidentally



--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 28 Jan 2019 16:46:20 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-01-27 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 10:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1904/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 28 Jan 2019 06:03:01 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-01-27 Thread Quanlong Huang (Code Review)
Hello Zoltan Borok-Nagy, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12168

to look at the new patch set (#10).

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..

IMPALA-6503: Support reading complex types from ORC format files

We've supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC
lib can materialize ORC column binaries into its representation
(orc::ColumnVectorBatch), so we don't need to do anything about
decoding/decompression in hdfs-orc-scanner. Since it already supports
complex types, we'll still depend on it.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need to the ORC lib
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala's representation

To format the materialization, we implement several ORC column readers
in hdfs-orc-scanner. Each kind of reader treats a column type. The ORC
column readers differ from the Parquet readers (used in
hdfs-parquet-scanner) which materializes Parquet column binaries into
tuple values directly. They just need to transform outputs of the ORC
lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
30 files changed, 1,789 insertions(+), 454 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/10
--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-01-27 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 9:

(22 comments)

Thanks for your feedback! I've added more comments and refactor some codes as 
required.

http://gerrit.cloudera.org:8080/#/c/12168/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12168/8//COMMIT_MSG@25
PS8, Line 25: Don’t like
> nit: 'They are not like', or 'They differ from'
Done


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.h@175
PS9, Line 175:   std::list selected_type_ids_;
> Mention that RowReaderOptions.includeTypes() expects a list (otherwise it's
Yeah, I have no ideas about this too.


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.h@160
PS8, Line 160:   orc::RowReaderOptions row_reader_options;
> nit: put '_' at the end of the variable name.
Oops! Missed this in the previous patch. Done.


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@181
PS8, Line 181: base
> nit: based
Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@329
PS8, Line 329: *template_tuple =
 : Tuple::Create(tuple_desc.byte_size(), 
template_tuple_pool_.get());
> nit: put curly braces around it
Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@364
PS8, Line 364: if (id >= node.getColumnId() && id <= 
node.getMaximumColumnId()) return true;
> nit: put curly braces around it
Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@681
PS8, Line 681:   RETURN_IF_ERROR(AssembleCollection(*child_reader, 
child_batch_offset + i,
 :   coll_value_builder));
> nit: add curly braces around it
Done


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@301
PS9, Line 301: VLOG_QUERY << "Add ORC column " << 
node->getMaximumColumnId() << " for empty tuple "
> This logging will be too verbose for most purposes - I'd suggest making it
This is quite helpful in debug. I think VLOG_FILE is more suiteable since this 
function will only be called once for each file.


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@363
PS9, Line 363:   for (uint64_t id : selected_type_ids)
> nit: braces for multi-line if
I'm quite confused about the definition of "multi-line". The if statement just 
occupies one line. Isn't it a single-line statement? Zoltan also comments that 
the for loop needs a curly brace. Aren't braces optional for single-line 
statement loops?


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@395
PS9, Line 395: selected_type_ids_.size()
> It doesn't really matter here, but with gcc4.9.2 list::size() is actually O
Sure


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@680
PS9, Line 680: for (int i = 0; i < total_tuples; ++i)
> Use braces for multi-line if
Done


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h@19
PS9, Line 19: #ifndef IMPALA_EXEC_ORC_COLUMN_READERS_H
> We started using "#pragma once" instead of the traditional include guards i
Sure. Done


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h@61
PS9, Line 61:   virtual void UpdateInputBatch(orc::ColumnVectorBatch* 
orc_batch) = 0;
> Maybe mention the relationship with ReadValue() - do you need to call it be
Sure. Added class comments. Feel free to polish them since they're not written 
by a native speaker :)


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h@101
PS8, Line 101: batch_ = dynamic_cast(orc_batch);
> We avoid doing dynamic casts in release mode.
Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h@126
PS8, Line 126: int64_t val = batch_->data.data()[row_idx];
> Maybe you could add a DCHECK that batch_ is not null.
Sure. Add DCHECK_NOTNULL at the first use of batch_. Done


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-01-25 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 9:

(10 comments)

I looked through the headers and test files initially. I think this makes sense 
at a high level. I haven't looked at the .cc files in parallel - I think Zoltan 
will probably get to that before me.

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.h@175
PS9, Line 175:   std::list selected_type_ids_;
Mention that RowReaderOptions.includeTypes() expects a list (otherwise it's 
confusing why this is not a vector). Actually it's still confusing why it's not 
a vector but at least we can blame the ORC library for that decision :)


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@301
PS9, Line 301: VLOG_QUERY << "Add ORC column " << 
node->getMaximumColumnId() << " for empty tuple "
This logging will be too verbose for most purposes - I'd suggest making it 
VLOG(3) or removing it.


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@363
PS9, Line 363:   for (uint64_t id : selected_type_ids)
nit: braces for multi-line if


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@395
PS9, Line 395: selected_type_ids_.size()
It doesn't really matter here, but with gcc4.9.2 list::size() is actually O(n). 
We've been burned by that before.


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@680
PS9, Line 680: for (int i = 0; i < total_tuples; ++i)
Use braces for multi-line if


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h@19
PS9, Line 19: #ifndef IMPALA_EXEC_ORC_COLUMN_READERS_H
We started using "#pragma once" instead of the traditional include guards in 
many places - it would be nice to switch.


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.h@61
PS9, Line 61:   virtual void UpdateInputBatch(orc::ColumnVectorBatch* 
orc_batch) = 0;
Maybe mention the relationship with ReadValue() - do you need to call it before 
calling ReadValue()? Maybe it would help to document how callers should use 
this in a class comment for OrcColumnReader?


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.cc@330
PS9, Line 330:   for (OrcColumnReader* child : children_)
Need braces around multi-line for loop - here and elsewhere


http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/orc-column-readers.cc@331
PS9, Line 331: RETURN_IF_ERROR(child->ReadValue(row_idx_, tuple, pool));
This probably performs similarly to the previous code with the switch, but 
ultimately to make this performance we'd want to do batched calls that read 
multiple values in one try.

We changed the Parquet reader to do things that way a while back and it makes 
it faster automatically and unlocks more optimisation opportunities.

Nothing to fix here, just sharing


http://gerrit.cloudera.org:8080/#/c/12168/9/tests/query_test/test_nested_types.py
File tests/query_test/test_nested_types.py:

http://gerrit.cloudera.org:8080/#/c/12168/9/tests/query_test/test_nested_types.py@128
PS9, Line 128: # TODO(IMPALA-6505): support predicates push down on ORC 
stripe statistics
Maybe move this to a string argument to skip()



--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Sat, 26 Jan 2019 01:19:46 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-01-25 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 8:

(12 comments)

Started to look at it. Currently mostly have comments about style.
I plan to look at it deeper.

http://gerrit.cloudera.org:8080/#/c/12168/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12168/8//COMMIT_MSG@25
PS8, Line 25: Don’t like
nit: 'They are not like', or 'They differ from'


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.h@160
PS8, Line 160:   orc::RowReaderOptions row_reader_options;
nit: put '_' at the end of the variable name.


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@181
PS8, Line 181: base
nit: based


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@329
PS8, Line 329: *template_tuple =
 : Tuple::Create(tuple_desc.byte_size(), 
template_tuple_pool_.get());
nit: put curly braces around it


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@364
PS8, Line 364: if (id >= node.getColumnId() && id <= 
node.getMaximumColumnId()) return true;
nit: put curly braces around it


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/hdfs-orc-scanner.cc@681
PS8, Line 681:   RETURN_IF_ERROR(AssembleCollection(*child_reader, 
child_batch_offset + i,
 :   coll_value_builder));
nit: add curly braces around it


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h@101
PS8, Line 101: batch_ = dynamic_cast(orc_batch);
We avoid doing dynamic casts in release mode.
You can do a dynamic_cast inside a DCHECK, then use static_cast here.


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.h@126
PS8, Line 126: int64_t val = batch_->data.data()[row_idx];
Maybe you could add a DCHECK that batch_ is not null.


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@190
PS8, Line 190:   *slot = TimestampValue::FromUnixTimeNanos(secs, nanos,
Do we know that orc timestamps are timezone-aware or not?


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@326
PS8, Line 326: 
children_[c]->UpdateInputBatch(batch_->fields[children_fields_[c]]);
nit: add curly brackets around it since it is a multi-line for stmt.


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@331
PS8, Line 331: RETURN_IF_ERROR(child->ReadValue(row_idx_, tuple, pool));
nit: add curly brackets around it


http://gerrit.cloudera.org:8080/#/c/12168/8/be/src/exec/orc-column-readers.cc@395
PS8, Line 395:   } else CreateChildForSlot(node, slot_desc);
nit: it's against the Google style guide to have braces on only one branch of 
an if-stmt:
https://google.github.io/styleguide/cppguide.html#Conditionals



--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 25 Jan 2019 17:25:23 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-01-23 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 9:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1854/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 23 Jan 2019 16:15:31 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-01-23 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12168

to look at the new patch set (#9).

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..

IMPALA-6503: Support reading complex types from ORC format files

We’ve supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC
lib can materialize ORC column binaries into its representation
(orc::ColumnVectorBatch), so we don’t need to do anything about
decoding/decompression in hdfs-orc-scanner. Since it already supports
complex types, we’ll still depend on it.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need to the ORC lib
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala’s representation

To format the materialization, we implement several ORC column readers
used in hdfs-orc-scanner. Each kind of reader treats a column type.
Don’t like the Parquet readers (used in hdfs-parquet-scanner) which
materializes Parquet column binaries into tuple values directly, the ORC
readers (in hdfs-orc-scanner) just need to transform outputs of the ORC
lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit-single-node.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
30 files changed, 1,698 insertions(+), 453 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/9
--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-01-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 8:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1813/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 18 Jan 2019 14:41:56 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-01-18 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12168

to look at the new patch set (#8).

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..

IMPALA-6503: Support reading complex types from ORC format files

We’ve supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC
lib can materialize ORC column binaries into its representation
(orc::ColumnVectorBatch), so we don’t need to do anything about
decoding/decompression in hdfs-orc-scanner. Since it already supports
complex types, we’ll still depend on it.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need to the ORC lib
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala’s representation

To format the materialization, we implement several ORC column readers
used in hdfs-orc-scanner. Each kind of reader treats a column type.
Don’t like the Parquet readers (used in hdfs-parquet-scanner) which
materializes Parquet column binaries into tuple values directly, the ORC
readers (in hdfs-orc-scanner) just need to transform outputs of the ORC
lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
29 files changed, 1,695 insertions(+), 449 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/8
--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-01-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 7:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1728/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 08 Jan 2019 00:43:54 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-01-07 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12168

to look at the new patch set (#7).

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..

IMPALA-6503: Support reading complex types from ORC format files

We’ve supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC
lib can materialize ORC column binaries into its representation
(orc::ColumnVectorBatch), so we don’t need to do anything about
decoding/decompression in hdfs-orc-scanner. Since it already supports
complex types, we’ll still depend on it.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need to the ORC lib
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala’s representation

To format the materialization, we implement several ORC column readers
used in hdfs-orc-scanner. Each kind of reader treats a column type.
Don’t like the Parquet readers (used in hdfs-parquet-scanner) which
materializes Parquet column binaries into tuple values directly, the ORC
readers (in hdfs-orc-scanner) just need to transform outputs of the ORC
lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
29 files changed, 1,685 insertions(+), 444 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/7
--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-01-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 5:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/1723/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 07 Jan 2019 15:28:52 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-01-07 Thread Quanlong Huang (Code Review)
Quanlong Huang has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/12168


Change subject: IMPALA-6503: Support reading complex types from ORC format files
..

IMPALA-6503: Support reading complex types from ORC format files

We’ve supported reading primitive types from ORC files (IMPALA-5717).
In this patch we add support for complex types (struct/array/map).

In IMPALA-5717, we depend on the ORC lib to read ORC binaries. The ORC
lib can materialize ORC column binaries into its representation
(orc::ColumnVectorBatch), so we don’t need to do anything about
decoding/decompression in hdfs-orc-scanner. Since it already supports
complex types, we’ll still depend on it.

What we need to add in IMPALA-6503 are two things:
1. Specify which nested columns we need to the ORC lib
2. Transform outputs of ORC lib (nested orc::ColumnVectorBatch) into
  Impala’s representation

To format the materialization, we implement several ORC column readers
used in hdfs-orc-scanner. Each kind of reader treats a column type.
Don’t like the Parquet readers (used in hdfs-parquet-scanner) which
materializes Parquet column binaries into tuple values directly, the ORC
readers (in hdfs-orc-scanner) just need to transform outputs of the ORC
lib into tuple/slot values.

Tests:
* Enable existing tests for complex types (test_nested_types.py,
test_tpch_nested_queries.py) for ORC.

Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
A be/src/exec/orc-column-readers.cc
A be/src/exec/orc-column-readers.h
A be/src/exec/orc-metadata-utils.cc
A be/src/exec/orc-metadata-utils.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
A testdata/ComplexTypesTbl/README
A testdata/ComplexTypesTbl/nonnullable.orc
A testdata/ComplexTypesTbl/nullable.orc
M testdata/bin/create-load-data.sh
M testdata/bin/load_nested.py
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/queries/QueryTest/max-nesting-depth.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-subplan.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-limit.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch-mem-limit.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
M testdata/workloads/tpch_nested/tpch_nested_core.csv
M testdata/workloads/tpch_nested/tpch_nested_dimensions.csv
M testdata/workloads/tpch_nested/tpch_nested_exhaustive.csv
M testdata/workloads/tpch_nested/tpch_nested_pairwise.csv
M tests/query_test/test_nested_types.py
M tests/query_test/test_tpch_nested_queries.py
29 files changed, 1,656 insertions(+), 442 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12168/5
--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang 


[Impala-ASF-CR] IMPALA-6503: Support reading complex types from ORC format files

2019-01-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12168 )

Change subject: IMPALA-6503: Support reading complex types from ORC format files
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12168/5/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/5/be/src/exec/hdfs-orc-scanner.cc@291
PS5, Line 291:   
RETURN_IF_ERROR(schema_resolver_->ResolveColumn(tuple_desc.tuple_path(), , 
_field,
line too long (93 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 07 Jan 2019 14:55:52 +
Gerrit-HasComments: Yes