Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9134 )
Change subject: IMPALA-5717: Support for ORC data files
......................................................................

Patch Set 3:

(44 comments)

This is really impressive - I was able to build the patch, load TPC-H ORC data and run a bunch of queries. I think there are still meta-questions about the approach of importing the ORC library that I want input from other community members on, but I spent some time understanding your code and commenting on specific things.

http://gerrit.cloudera.org:8080/#/c/9134/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9134/3//COMMIT_MSG@15
PS3, Line 15: Instead of linking the orc-reader as a third party library, it's
Which version or commit hash of ORC did you import? I think we need to think about this carefully. It might be best to use it as a third-party library, at least to start off with, so that we can pick up any improvements made to the Apache ORC implementation by upgrading our library version. Otherwise it's a lot more code for the Impala project to maintain. For now we can also contribute improvements like predicate pushdown support back to the ORC library. At some point, if we have multiple people committed to maintaining it and we want to make larger changes to the ORC code, we could revisit the decision.

http://gerrit.cloudera.org:8080/#/c/9134/3//COMMIT_MSG@25
PS3, Line 25: tests.
We should also add ORC versions of TPC-H and TPC-DS so that we have some larger data sets to test on.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner-test.cc
File be/src/exec/hdfs-orc-scanner-test.cc:

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner-test.cc@32
PS3, Line 32: uint8_t empty_data_one_col_orc[] = {
Can you comment on how this data was generated, so that it could be reproduced if needed?

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner-test.cc@292
PS3, Line 292: // TODO Codec::CreateDecompressor cannot create a LZO decompressor, how to support LZO?
We weren't able to include LZO integration in Apache Impala since the original implementation was GPL-licensed. There's a Cloudera-developed plugin with a different license to read LZO text files. It looks like ORC actually has an Apache-licensed LZO implementation, so that might be an alternative.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.h@28
PS3, Line 28: class CollectionValueBuilder;
Not needed?

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.h@259
PS3, Line 259: ProcessFileTail
It might be helpful to define what the "footer", "file tail" and "postscript" are, since the names are similar - e.g. with a doc comment like the sketch below.
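A sketch of what that doc comment might look like (this paraphrases the ORC file format spec; the wording is just a suggestion):

  /// The ORC "file tail" is everything after the last stripe that has to be
  /// read to plan a scan: the (optional) metadata section, the footer, the
  /// postscript, and a trailing one-byte postscript length.
  /// - Postscript: never compressed; records the compression kind and the
  ///   lengths of the footer and metadata sections.
  /// - Footer: the schema, the stripe locations and file-level statistics;
  ///   may be compressed with the codec named in the postscript.
  Status ProcessFileTail();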
http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@1
PS3, Line 1: // Copyright 2012 Cloudera Inc.
Don't need Cloudera copyrights!

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@24
PS3, Line 24: #include "common/object-pool.h"
Many of these headers look unused.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@53
PS3, Line 53: using boost::algorithm::split;
Some of these boost "using" declarations don't seem to be needed.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@59
PS3, Line 59: DEFINE_double(orc_min_filter_reject_ratio, 0.1, "(Advanced) If the percentage of "
I don't know why we made this flag parquet-specific. Having an option per file format will get out of control. Can we just define a generic flag like min_filter_reject_ratio, e.g. in global-flags.cc, and use it in ORC? We could keep the old Parquet flag around for compatibility.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@73
PS3, Line 73: for (int i = 0; i < files.size(); ++i) {
Can we convert this to a range-based for? We generally prefer that in new code.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@130
PS3, Line 130: ScanRange* HdfsOrcScanner::FindFooterSplit(HdfsFileDesc* file) {
We could move this to HdfsScanner and share the code - the logic is generic.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@142
PS3, Line 142: mem_tracker_.reset(new MemTracker(-1, "OrcReader", mem_tracker));
I think we should track this against the HdfsScanNode's MemTracker instead of creating a MemTracker per scanner thread.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@151
PS3, Line 151: mem_tracker_->CloseAndUnregisterFromParent();
We only want to use CloseAndUnregisterFromParent() for the query-level MemTracker for now (see the comment there). I refactored some of that code to make the lifetime of MemTrackers easier to reason about.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@157
PS3, Line 157: std::
Don't need the std:: prefix. We generally prefer avoiding it when it isn't needed and adding "using" declarations in the .cc file.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@162
PS3, Line 162: if (ImpaladMetrics::MEM_POOL_TOTAL_BYTES != nullptr) {
You should be able to assume it's non-NULL. A lot of older code checks if metrics are NULL because the metrics weren't initialised in backend tests. We switched to fixing the backend tests to always initialise metrics, so this check should be avoided. TBH updating the metric isn't even necessary - I've never found it to be useful, so we don't need to expand the definition.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@187
PS3, Line 187: std::f
Same here, don't need the std:: prefix.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@196
PS3, Line 196: void HdfsOrcScanner::ScanRangeInputStream::read(void* buf, uint64_t length,
It's unfortunate that the ORC code was designed to issue only synchronous reads. It doesn't really align that well with Impala's I/O management, but this is probably good for an initial version. Unfortunately we won't get the same performance as Parquet right now.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@223
PS3, Line 223: memset(buf, 0, length);
Is the memset needed? If so, we should document why in a comment.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@224
PS3, Line 224: throw std::runtime_error("Cannot read from file: " + status.GetDetail());
The fact that we need to use exceptions is unfortunate - there's a mismatch here, since Impala's C++ code avoids exceptions like the plague but the ORC C++ code uses them as its main mode of error handling. I need to think a bit more about whether this is the best way to do things (the ORC-reader code is very exception-oriented, so it may be unavoidable). One option might be to confine exceptions to the library boundary, e.g. something like the sketch below.
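A minimal sketch of what I mean - the helper name and error message are hypothetical, and I haven't tried this:

  // Run a callable that may throw ORC exceptions and convert any exception
  // into an Impala Status at the library boundary.
  template <typename Fn>
  Status CallOrcFn(Fn&& fn) {
    try {
      fn();
    } catch (const std::exception& e) {
      return Status(strings::Substitute("Error in ORC reader: $0", e.what()));
    }
    return Status::OK();
  }

  // Usage would look something like:
  //   RETURN_IF_ERROR(CallOrcFn([&]() { row_reader->next(*orc_batch); }));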
http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@230
PS3, Line 230: stripe_idx_(-1),
For the constants here, we should initialise them in the class definition (we've been switching code over to that syntax), e.g. int64_t stripe_rows_read_ = 0;

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@257
PS3, Line 257: num_cols_counter_ =
What do you think about adding Orc to some of the counter names - at least the ones that reference ORC-specific things like columns, stripes, footers, etc.? It would be a little confusing otherwise, particularly on tables with mixed formats. I don't think the Parquet scanner does the right thing here either - it should probably include Parquet in some of its counter names. E.g. NumOrcColumns, NumStatsFilteredOrcStripes, NumOrcStripes, OrcFooterProcessingTime.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@259
PS3, Line 259: num_stats_filtered_stripes_counter_ =
This isn't used - let's remove it until we add the stats-filtering feature.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@333
PS3, Line 333: if (col_type.type == TYPE_ARRAY) {
Are there any tests to make sure the ORC reader won't crash when scanning complex types? It might also be helpful to create separate JIRAs for features that won't be in this patch, so we can understand the scope.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@484
PS3, Line 484: if (start_with_first_stripe && misaligned_stripe_skipped) {
I think we had a specific Parquet test for this code path, so we're probably missing coverage for ORC at the moment. In general it would be helpful to look at the Parquet-specific scanner tests in test_scanners.py and test_parquet.py and see if there should be corresponding ORC tests.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@498
PS3, Line 498: // TODO ValidateColumnOffsets
Is this still needed?

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@500
PS3, Line 500: google::uint64 stripe_offset = stripe.offset();
Why not uint64_t? Are they not the same underlying type?

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@513
PS3, Line 513: // TODO: check if this stripe can be skipped by stats.
File a follow-on JIRA for this enhancement?

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@517
PS3, Line 517: unique_ptr<orc::InputStream> input_stream(new ScanRangeInputStream(this));
I haven't fully thought through this idea, but I wonder if we should have an alternative InputStream implementation that wraps ScannerContext::Stream, so that we can make use of the DiskIoMgr's prefetching - very roughly, something like the sketch below.
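Untested sketch only - the ScannerContext::Stream calls are illustrative, and a real version would have to reconcile Stream's sequential reads with the random-access 'offset' that orc::InputStream::read() takes:

  // An orc::InputStream backed by ScannerContext::Stream so that reads go
  // through the DiskIoMgr and benefit from its prefetching.
  class StreamInputStream : public orc::InputStream {
   public:
    StreamInputStream(ScannerContext::Stream* stream, std::string name,
        uint64_t file_length)
      : stream_(stream), name_(std::move(name)), file_length_(file_length) {}

    uint64_t getLength() const override { return file_length_; }
    uint64_t getNaturalReadSize() const override { return 8 * 1024 * 1024; }
    const std::string& getName() const override { return name_; }

    void read(void* buf, uint64_t length, uint64_t offset) override {
      // Illustrative only: assumes the stream is already positioned at
      // 'offset'; a real implementation would need to handle seeking.
      uint8_t* data;
      Status status;
      if (!stream_->ReadBytes(length, &data, &status)) {
        throw std::runtime_error("Cannot read from file: " + status.GetDetail());
      }
      memcpy(buf, data, length);
    }

   private:
    ScannerContext::Stream* stream_;
    const std::string name_;
    const uint64_t file_length_;
  };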
http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@677
PS3, Line 677: vector<uint8_t> decompressed_footer_buffer;
I'm concerned that this buffer memory isn't tracked and could be arbitrarily large. Can we use ScopedBuffer instead of vector<uint8_t> for cases like this? I've seen bugs before where an invalid file can cause code like this to crash, e.g. 'length = -1' in a file is converted to a huge positive unsigned number, then vector.resize() throws std::bad_alloc and the Impala process dies. Something along the lines of the sketch below.
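Sketch of the ScopedBuffer version, based on what the Parquet footer code does (double-check the exact signatures in be/src/runtime/mem-tracker.h; 'decompressed_len' stands in for whatever length variable is in scope here):

  // Charge the decompressed footer against the scan node's MemTracker and
  // return a Status instead of dying on std::bad_alloc if the length is bogus.
  ScopedBuffer decompressed_footer_buffer(scan_node_->mem_tracker());
  if (!decompressed_footer_buffer.TryAllocate(decompressed_len)) {
    return Status(strings::Substitute(
        "Could not allocate $0 bytes for the decompressed ORC footer of $1",
        decompressed_len, filename()));
  }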
http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@679
PS3, Line 679: isCompressed
is_compressed

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@751
PS3, Line 751: // TODO
We should do this in the initial patch if it's possible to crash Impala with a bad ORC file.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@820
PS3, Line 820: while (row_id < capacity && ScratchBatchNotEmpty()) {
We should file a follow-on JIRA to codegen the runtime filter + conjunct evaluation loop.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@859
PS3, Line 859: bool HdfsOrcScanner::EvalRuntimeFilters(TupleRow* row) {
Can you add a TODO to combine this with the Parquet implementation?

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@881
PS3, Line 881: inline void HdfsOrcScanner::ReadRow(const orc::ColumnVectorBatch &batch, int row_idx,
We don't need to do this in this patch, but it would probably be faster to copy column-by-column for a batch and then do a second pass evaluating the predicates, adjusting the tuple pointers accordingly.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@884
PS3, Line 884: dynamic_cast
Can this be a static_cast? AFAICT dynamic_cast is unnecessary in most places in this function, since the batch is always the right type. I ran a release build of this code scanning the TPC-H data and it looks like it spends a few % of time in dynamic_cast for some queries.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@913
PS3, Line 913: case TYPE_TINYINT:*(reinterpret_cast<int8_t*>(slot_val_ptr)) = val;
It might be more readable to put the case body on a separate line.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@954
PS3, Line 954: dst_batch->tuple_data_pool()->Allocate(val_ptr->len));
Should use TryAllocate() here and return an error on failure to allocate. Allocate() is a legacy interface we're trying to remove eventually.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@975
PS3, Line 975: // TODO warn if slot_desc->type().GetByteSize() != 16
You could DCHECK in that case - the only valid values are 4, 8 and 16.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@1038
PS3, Line 1038: NULL
Prefer nullptr in new code (I know NULL is all over the existing code :)).

http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/bin/run-hive-server.sh
File testdata/bin/run-hive-server.sh:

http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/bin/run-hive-server.sh@75
PS3, Line 75: HADOOP_HEAPSIZE="2048" hive --service hiveserver2 > ${LOGDIR}/hive-server2.out 2>&1 &
Can you comment on how 2048 was chosen, for future reference?

http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/workloads/functional-query/functional-query_exhaustive.csv
File testdata/workloads/functional-query/functional-query_exhaustive.csv:

http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/workloads/functional-query/functional-query_exhaustive.csv@25
PS3, Line 25: file_format: orc, dataset: functional, compression_codec: none, compression_type: none
Does this actually load uncompressed ORC, or does it use the default ORC codec? We should make sure that compression_codec matches what is actually used. parquet/none is actually snappy-compressed because that was the default, but it was a mistake and is confusing.

http://gerrit.cloudera.org:8080/#/c/9134/3/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/9134/3/tests/query_test/test_scanners.py@a108
PS3, Line 108:
oops!

--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Fri, 02 Feb 2018 01:54:17 +0000
Gerrit-HasComments: Yes