Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/9134 )
Change subject: IMPALA-5717: Support for ORC data files
......................................................................


Patch Set 3:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@130
PS3, Line 130: ScanRange* HdfsOrcScanner::FindFooterSplit(HdfsFileDesc* file) {
> We could move this to HdfsScanner and share the code - the logic is generic
Done

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@142
PS3, Line 142: mem_tracker_.reset(new MemTracker(-1, "OrcReader", mem_tracker));
> I think we should track it against the HdfsScanNode's MemTracker instead of
Done

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@151
PS3, Line 151: mem_tracker_->CloseAndUnregisterFromParent();
> Just added the comment since it crashed when I used Close() at first...
Removed these since we use HdfsScanNode's MemTracker.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@162
PS3, Line 162: if (ImpaladMetrics::MEM_POOL_TOTAL_BYTES != nullptr) {
> So the non-NULL checks in mem-pool.cc are redundant too? I learn this from
To be consistent with the logic in MemPool, let's update the metric.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@333
PS3, Line 333: if (col_type.type == TYPE_ARRAY) {
> Complex types will be skipped here. Their column ids won't be set into the
Added tests for this. Replaced these with DCHECKs.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@484
PS3, Line 484: if (start_with_first_stripe && misaligned_stripe_skipped) {
> I think we had a specific Parquet test for this code path, so we're probabl
Done

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@498
PS3, Line 498: // TODO ValidateColumnOffsets
> Is this still needed?
Removed this.
The orc library will handle this, so we just need to catch exceptions from the orc lib.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@500
PS3, Line 500: google::uint64 stripe_offset = stripe.offset();
> Why not uint64_t? Are they not the same underlying type?
Done

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@513
PS3, Line 513: // TODO: check if this stripe can be skipped by stats.
> File a follow-on JIRA for this enhancement?
Done

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@517
PS3, Line 517: unique_ptr<orc::InputStream> input_stream(new ScanRangeInputStream(this));
> I haven't fully thought through this idea but I wonder if we should have an
Though ORC-262 has seen no progress, I think we can still prefetch data and let the ORC lib read from an in-memory InputStream. Created a JIRA for this: IMPALA-6636.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@677
PS3, Line 677: vector<uint8_t> decompressed_footer_buffer;
> You're right! Will review this logic more deeply.
The logic for parsing the file tail has now been replaced by using the orc lib.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@751
PS3, Line 751: // TODO
> We should do this in the initial patch if it's possible to crash Impala wit
Handled by the ORC lib.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@820
PS3, Line 820: while (row_id < capacity && ScratchBatchNotEmpty()) {
> We should file a follow-on JIRA to codegen the runtime filter + conjunct ev
Done

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@859
PS3, Line 859: bool HdfsOrcScanner::EvalRuntimeFilters(TupleRow* row) {
> Can you add a TODO to combine this with the Parquet implementation?
Done

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@881
PS3, Line 881: inline void HdfsOrcScanner::ReadRow(const orc::ColumnVectorBatch &batch, int row_idx,
> We don't need to do this in this patch, but it would probably be faster to
Added a TODO for this.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@884
PS3, Line 884: dynamic_cast
> Sure! Excited to know that you've started testing the scanner's performance!
Done

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@975
PS3, Line 975: // TODO warn if slot_desc->type().GetByteSize() != 16
> You could DCHECK in that case, the only valid values are 4, 8 and 16
Added DCHECKs in the default case.

http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/workloads/functional-query/functional-query_exhaustive.csv
File testdata/workloads/functional-query/functional-query_exhaustive.csv:

http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/workloads/functional-query/functional-query_exhaustive.csv@25
PS3, Line 25: file_format: orc, dataset: functional, compression_codec: none, compression_type: none
> Does this actually load uncompressed ORC, or does it use the default ORC co
Yeah, the default ORC codec is zlib (deflate in Impala).


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Sat, 10 Mar 2018 02:53:04 +0000
Gerrit-HasComments: Yes
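[Editor's note on the replies at PS3, Line 498 and Line 677: the approach described there is to stop re-validating the file tail by hand, let the ORC library do the validation, and catch any exception it throws. A rough sketch of that pattern follows. This is a simplified stand-in, not Impala's actual code: the `Status` struct and `ParseFileTail` function below are hypothetical placeholders; the real scanner calls into the ORC reader and would catch the library's `orc::ParseError` (a subclass of `std::runtime_error`), converting it into an `impala::Status` instead of crashing.]

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Minimal stand-in for Impala's Status class (illustration only).
struct Status {
  bool ok;
  std::string msg;
  static Status OK() { return {true, ""}; }
  static Status Error(const std::string& m) { return {false, m}; }
};

// Hypothetical placeholder for a call into the ORC library
// (e.g. creating a reader), which throws on a malformed file tail.
void ParseFileTail(bool corrupt) {
  if (corrupt) throw std::runtime_error("malformed ORC postscript");
}

// The pattern from the review: let the ORC library validate the file
// and translate any exception it throws into a Status.
Status ProcessFileTail(bool corrupt) {
  try {
    ParseFileTail(corrupt);
  } catch (std::exception& e) {  // the real code would catch orc::ParseError
    return Status::Error(std::string("Encountered parse error: ") + e.what());
  }
  return Status::OK();
}
```

A corrupt file then surfaces as a query error carrying the library's message rather than an aborted impalad process.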
