Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/9134 )
Change subject: IMPALA-5717: Support for ORC data files
......................................................................


Patch Set 3:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@130
PS3, Line 130: ScanRange* HdfsOrcScanner::FindFooterSplit(HdfsFileDesc* file) {
> We could move this to HdfsScanner and share the code - the logic is generic
Done

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@142
PS3, Line 142: mem_tracker_.reset(new MemTracker(-1, "OrcReader", mem_tracker));
> I think we should track it against the HdfsScanNode's MemTracker instead of
Done

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@151
PS3, Line 151: mem_tracker_->CloseAndUnregisterFromParent();
> Just added the comment since it crashed when I used Close() at first...
Removed these since we use HdfsScanNode's MemTracker.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@162
PS3, Line 162: if (ImpaladMetrics::MEM_POOL_TOTAL_BYTES != nullptr) {
> So the non-NULL checks in mem-pool.cc are redundant too? I learn this from
To be consistent with the logic in MemPool, let's update the metric.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@333
PS3, Line 333: if (col_type.type == TYPE_ARRAY) {
> Complex types will be skipped here. Their column ids won't be set into the
Added tests for this. Replaced these with DCHECKs.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@484
PS3, Line 484: if (start_with_first_stripe && misaligned_stripe_skipped) {
> I think we had a specific Parquet test for this code path, so we're probabl
Done

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@498
PS3, Line 498: // TODO ValidateColumnOffsets
> Is this still needed?
Removed this.
The orc library will handle this, so we just need to catch exceptions from the orc lib.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@500
PS3, Line 500: google::uint64 stripe_offset = stripe.offset();
> Why not uint64_t? Are they not the same underlying type?
Done

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@513
PS3, Line 513: // TODO: check if this stripe can be skipped by stats.
> File a follow-on JIRA for this enhancement?
Done

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@517
PS3, Line 517: unique_ptr<orc::InputStream> input_stream(new ScanRangeInputStream(this));
> I haven't fully thought through this idea but I wonder if we should have an
Though ORC-262 has seen no progress, I think we can still prefetch data and let the ORC lib read from an in-memory InputStream. Created a JIRA for this: IMPALA-6636.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@677
PS3, Line 677: vector<uint8_t> decompressed_footer_buffer;
> You're right! Will review this logic more deeply.
The logic for parsing the file tail has now been replaced by using the orc lib.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@751
PS3, Line 751: // TODO
> We should do this in the initial patch if it's possible to crash Impala wit
Handled by the ORC lib.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@820
PS3, Line 820: while (row_id < capacity && ScratchBatchNotEmpty()) {
> We should file a follow-on JIRA to codegen the runtime filter + conjunct ev
Done

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@859
PS3, Line 859: bool HdfsOrcScanner::EvalRuntimeFilters(TupleRow* row) {
> Can you add a TODO to combine this with the Parquet implementation?
Done

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@881
PS3, Line 881: inline void HdfsOrcScanner::ReadRow(const orc::ColumnVectorBatch &batch, int row_idx,
> We don't need to do this in this patch, but it would probably be faster to
Added a TODO for this.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@884
PS3, Line 884: dynamic_cast
> Sure! Excited to know that you've started testing the scanner's performance!
Done

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@975
PS3, Line 975: // TODO warn if slot_desc->type().GetByteSize() != 16
> You could DCHECK in that case, the only valid values are 4, 8 and 16
Added DCHECKs in the default case.

http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/workloads/functional-query/functional-query_exhaustive.csv
File testdata/workloads/functional-query/functional-query_exhaustive.csv:

http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/workloads/functional-query/functional-query_exhaustive.csv@25
PS3, Line 25: file_format: orc, dataset: functional, compression_codec: none, compression_type: none
> Does this actually load uncompressed ORC, or does it use the default ORC co
Yeah, the default ORC codec is zlib (deflate in Impala).


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Sat, 10 Mar 2018 02:53:04 +0000
Gerrit-HasComments: Yes
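[Editor's note on the replies at PS3, Line 498 and Line 677: the approach described there is to stop re-validating the file tail by hand, let the ORC library do the validation, and catch any exception it throws. A rough sketch of that pattern follows. This is a simplified stand-in, not Impala's actual code: the `Status` struct and `ParseFileTail` function below are hypothetical placeholders; the real scanner calls into the ORC reader and would catch the library's `orc::ParseError` (a subclass of `std::runtime_error`), converting it into an `impala::Status` instead of crashing.]

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Minimal stand-in for Impala's Status class (illustration only).
struct Status {
  bool ok;
  std::string msg;
  static Status OK() { return {true, ""}; }
  static Status Error(const std::string& m) { return {false, m}; }
};

// Hypothetical placeholder for a call into the ORC library
// (e.g. creating a reader), which throws on a malformed file tail.
void ParseFileTail(bool corrupt) {
  if (corrupt) throw std::runtime_error("malformed ORC postscript");
}

// The pattern from the review: let the ORC library validate the file
// and translate any exception it throws into a Status.
Status ProcessFileTail(bool corrupt) {
  try {
    ParseFileTail(corrupt);
  } catch (std::exception& e) {  // the real code would catch orc::ParseError
    return Status::Error(std::string("Encountered parse error: ") + e.what());
  }
  return Status::OK();
}
```

A corrupt file then surfaces as a query error carrying the library's message rather than an aborted impalad process.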
