Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9134 )
Change subject: IMPALA-5717: Support for ORC data files
......................................................................

Patch Set 3:

(44 comments)

This is really impressive - I was able to build the patch, load TPC-H ORC data and run a bunch of queries. I think there are still meta-questions about the approach of importing the ORC library that I want input from other community members on, but I spent some time understanding your code and commenting on specific things.

http://gerrit.cloudera.org:8080/#/c/9134/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9134/3//COMMIT_MSG@15
PS3, Line 15: Instead of linking the orc-reader as a third party library, it's
Which version or commit hash of ORC did you import? I think we need to think about this carefully. It might be best to use it as a third-party library, at least to start off with, so that we can pick up any improvements made to the Apache ORC implementation by upgrading our library version. Otherwise it's a lot more code for the Impala project to maintain. For now we can also contribute improvements like predicate pushdown support back to the ORC library. At some point, if we have multiple people committed to maintaining it and we want to make larger changes to the ORC code, we could revisit the decision.

http://gerrit.cloudera.org:8080/#/c/9134/3//COMMIT_MSG@25
PS3, Line 25: tests.
We should also add ORC versions of TPC-H and TPC-DS so that we have some larger data sets to test on.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner-test.cc
File be/src/exec/hdfs-orc-scanner-test.cc:

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner-test.cc@32
PS3, Line 32: uint8_t empty_data_one_col_orc[] = {
Can you comment on how this data was generated, so that it could be reproduced if needed?

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner-test.cc@292
PS3, Line 292: // TODO Codec::CreateDecompressor cannot create a LZO decompressor, how to support LZO?
We weren't able to include LZO integration in Apache Impala since the original implementation was GPL-licensed. There's a Cloudera-developed plugin with a different license to read LZO text files. It looks like ORC actually has an Apache-licensed LZO implementation, so that might be an alternative.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.h@28
PS3, Line 28: class CollectionValueBuilder;
Not needed?

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.h@259
PS3, Line 259: ProcessFileTail
It might be helpful to define what the "footer", "file tail" and "postscript" are, since the names are similar - e.g. with a doc comment like the sketch below.
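A sketch of what that doc comment might look like (this paraphrases the ORC file format spec; the wording is just a suggestion):

  /// The ORC "file tail" is everything after the last stripe that has to be
  /// read to plan a scan: the (optional) metadata section, the footer, the
  /// postscript, and a trailing one-byte postscript length.
  /// - Postscript: never compressed; records the compression kind and the
  ///   lengths of the footer and metadata sections.
  /// - Footer: the schema, the stripe locations and file-level statistics;
  ///   may be compressed with the codec named in the postscript.
  Status ProcessFileTail();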
http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@1
PS3, Line 1: // Copyright 2012 Cloudera Inc.
Don't need Cloudera copyrights!

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@24
PS3, Line 24: #include "common/object-pool.h"
Many of these headers look unused.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@53
PS3, Line 53: using boost::algorithm::split;
Some of these boost "using" declarations don't seem to be needed.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@59
PS3, Line 59: DEFINE_double(orc_min_filter_reject_ratio, 0.1, "(Advanced) If the percentage of "
I don't know why we made this flag parquet-specific. Having an option per file format will get out of control. Can we just define a generic flag like min_filter_reject_ratio, e.g. in global-flags.cc, and use it in ORC? We could keep the old Parquet flag around for compatibility.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@73
PS3, Line 73: for (int i = 0; i < files.size(); ++i) {
Can we convert this to a range-based for? We generally prefer that in new code.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@130
PS3, Line 130: ScanRange* HdfsOrcScanner::FindFooterSplit(HdfsFileDesc* file) {
We could move this to HdfsScanner and share the code - the logic is generic.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@142
PS3, Line 142: mem_tracker_.reset(new MemTracker(-1, "OrcReader", mem_tracker));
I think we should track this against the HdfsScanNode's MemTracker instead of creating a MemTracker per scanner thread.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@151
PS3, Line 151: mem_tracker_->CloseAndUnregisterFromParent();
We only want to use CloseAndUnregisterFromParent() for the query-level MemTracker for now (see the comment there). I refactored some of that code to make the lifetime of MemTrackers easier to reason about.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@157
PS3, Line 157: std::
Don't need the std:: prefix. We generally prefer avoiding it when it isn't needed and adding "using" declarations in the .cc file.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@162
PS3, Line 162: if (ImpaladMetrics::MEM_POOL_TOTAL_BYTES != nullptr) {
You should be able to assume it's non-NULL. A lot of older code checks if metrics are NULL because the metrics weren't initialised in backend tests. We switched to fixing the backend tests to always initialise metrics, so this check should be avoided. TBH updating the metric isn't even necessary - I've never found it to be useful, so we don't need to expand the definition.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@187
PS3, Line 187: std::f
Same here, don't need the std:: prefix.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@196
PS3, Line 196: void HdfsOrcScanner::ScanRangeInputStream::read(void* buf, uint64_t length,
It's unfortunate that the ORC code was designed to issue only synchronous reads. It doesn't really align that well with Impala's I/O management, but this is probably good for an initial version. Unfortunately we won't get the same performance as Parquet right now.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@223
PS3, Line 223: memset(buf, 0, length);
Is the memset needed? If so, we should document why in a comment.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@224
PS3, Line 224: throw std::runtime_error("Cannot read from file: " + status.GetDetail());
The fact that we need to use exceptions is unfortunate - there's a mismatch here, since Impala's C++ code avoids exceptions like the plague but the ORC C++ code uses them as its main mode of error handling. I need to think a bit more about whether this is the best way to do things (the ORC-reader code is very exception-oriented, so it may be unavoidable). One option might be to confine exceptions to the library boundary, e.g. something like the sketch below.
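A minimal sketch of what I mean - the helper name and error message are hypothetical, and I haven't tried this:

  // Run a callable that may throw ORC exceptions and convert any exception
  // into an Impala Status at the library boundary.
  template <typename Fn>
  Status CallOrcFn(Fn&& fn) {
    try {
      fn();
    } catch (const std::exception& e) {
      return Status(strings::Substitute("Error in ORC reader: $0", e.what()));
    }
    return Status::OK();
  }

  // Usage would look something like:
  //   RETURN_IF_ERROR(CallOrcFn([&]() { row_reader->next(*orc_batch); }));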
http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@230
PS3, Line 230: stripe_idx_(-1),
For the constants here, we should initialise them in the class definition (we've been switching code over to that syntax), e.g. int64_t stripe_rows_read_ = 0;

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@257
PS3, Line 257: num_cols_counter_ =
What do you think about adding Orc to some of the counter names - at least the ones that reference ORC-specific things like columns, stripes, footers, etc.? It would be a little confusing otherwise, particularly on tables with mixed formats. I don't think the Parquet scanner does the right thing here either - it should probably include Parquet in some of its counter names. E.g. NumOrcColumns, NumStatsFilteredOrcStripes, NumOrcStripes, OrcFooterProcessingTime.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@259
PS3, Line 259: num_stats_filtered_stripes_counter_ =
This isn't used - let's remove it until we add the stats-filtering feature.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@333
PS3, Line 333: if (col_type.type == TYPE_ARRAY) {
Are there any tests to make sure the ORC reader won't crash when scanning complex types? It might also be helpful to create separate JIRAs for features that won't be in this patch, so we can understand the scope.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@484
PS3, Line 484: if (start_with_first_stripe && misaligned_stripe_skipped) {
I think we had a specific Parquet test for this code path, so we're probably missing coverage for ORC at the moment. In general it would be helpful to look at the Parquet-specific scanner tests in test_scanners.py and test_parquet.py and see if there should be corresponding ORC tests.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@498
PS3, Line 498: // TODO ValidateColumnOffsets
Is this still needed?

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@500
PS3, Line 500: google::uint64 stripe_offset = stripe.offset();
Why not uint64_t? Are they not the same underlying type?

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@513
PS3, Line 513: // TODO: check if this stripe can be skipped by stats.
File a follow-on JIRA for this enhancement?

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@517
PS3, Line 517: unique_ptr<orc::InputStream> input_stream(new ScanRangeInputStream(this));
I haven't fully thought through this idea, but I wonder if we should have an alternative InputStream implementation that wraps ScannerContext::Stream, so that we can make use of the DiskIoMgr's prefetching - very roughly, something like the sketch below.
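Untested sketch only - the ScannerContext::Stream calls are illustrative, and a real version would have to reconcile Stream's sequential reads with the random-access 'offset' that orc::InputStream::read() takes:

  // An orc::InputStream backed by ScannerContext::Stream so that reads go
  // through the DiskIoMgr and benefit from its prefetching.
  class StreamInputStream : public orc::InputStream {
   public:
    StreamInputStream(ScannerContext::Stream* stream, std::string name,
        uint64_t file_length)
      : stream_(stream), name_(std::move(name)), file_length_(file_length) {}

    uint64_t getLength() const override { return file_length_; }
    uint64_t getNaturalReadSize() const override { return 8 * 1024 * 1024; }
    const std::string& getName() const override { return name_; }

    void read(void* buf, uint64_t length, uint64_t offset) override {
      // Illustrative only: assumes the stream is already positioned at
      // 'offset'; a real implementation would need to handle seeking.
      uint8_t* data;
      Status status;
      if (!stream_->ReadBytes(length, &data, &status)) {
        throw std::runtime_error("Cannot read from file: " + status.GetDetail());
      }
      memcpy(buf, data, length);
    }

   private:
    ScannerContext::Stream* stream_;
    const std::string name_;
    const uint64_t file_length_;
  };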
http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@677
PS3, Line 677: vector<uint8_t> decompressed_footer_buffer;
I'm concerned that this buffer memory isn't tracked and could be arbitrarily large. Can we use ScopedBuffer instead of vector<uint8_t> for cases like this? I've seen bugs before where an invalid file can cause code like this to crash, e.g. 'length = -1' in a file is converted to a huge positive unsigned number, then vector.resize() throws std::bad_alloc and the Impala process dies. Something along the lines of the sketch below.
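Sketch of the ScopedBuffer version, based on what the Parquet footer code does (double-check the exact signatures in be/src/runtime/mem-tracker.h; 'decompressed_len' stands in for whatever length variable is in scope here):

  // Charge the decompressed footer against the scan node's MemTracker and
  // return a Status instead of dying on std::bad_alloc if the length is bogus.
  ScopedBuffer decompressed_footer_buffer(scan_node_->mem_tracker());
  if (!decompressed_footer_buffer.TryAllocate(decompressed_len)) {
    return Status(strings::Substitute(
        "Could not allocate $0 bytes for the decompressed ORC footer of $1",
        decompressed_len, filename()));
  }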
http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@679
PS3, Line 679: isCompressed
is_compressed

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@751
PS3, Line 751: // TODO
We should do this in the initial patch if it's possible to crash Impala with a bad ORC file.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@820
PS3, Line 820: while (row_id < capacity && ScratchBatchNotEmpty()) {
We should file a follow-on JIRA to codegen the runtime filter + conjunct evaluation loop.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@859
PS3, Line 859: bool HdfsOrcScanner::EvalRuntimeFilters(TupleRow* row) {
Can you add a TODO to combine this with the Parquet implementation?

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@881
PS3, Line 881: inline void HdfsOrcScanner::ReadRow(const orc::ColumnVectorBatch &batch, int row_idx,
We don't need to do this in this patch, but it would probably be faster to copy column-by-column for a batch and then do a second pass evaluating the predicates, adjusting the tuple pointers accordingly.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@884
PS3, Line 884: dynamic_cast
Can this be a static_cast? AFAICT dynamic_cast is unnecessary in most places in this function, since the batch is always the right type. I ran a release build of this code scanning the TPC-H data and it looks like it spends a few % of time in dynamic_cast for some queries.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@913
PS3, Line 913: case TYPE_TINYINT:*(reinterpret_cast<int8_t*>(slot_val_ptr)) = val;
It might be more readable to put the case body on a separate line.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@954
PS3, Line 954: dst_batch->tuple_data_pool()->Allocate(val_ptr->len));
Should use TryAllocate() here and return an error on failure to allocate. Allocate() is a legacy interface we're trying to remove eventually.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@975
PS3, Line 975: // TODO warn if slot_desc->type().GetByteSize() != 16
You could DCHECK in that case - the only valid values are 4, 8 and 16.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@1038
PS3, Line 1038: NULL
Prefer nullptr in new code (I know NULL is all over the existing code :)).

http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/bin/run-hive-server.sh
File testdata/bin/run-hive-server.sh:

http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/bin/run-hive-server.sh@75
PS3, Line 75: HADOOP_HEAPSIZE="2048" hive --service hiveserver2 > ${LOGDIR}/hive-server2.out 2>&1 &
Can you comment on how 2048 was chosen, for future reference?

http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/workloads/functional-query/functional-query_exhaustive.csv
File testdata/workloads/functional-query/functional-query_exhaustive.csv:

http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/workloads/functional-query/functional-query_exhaustive.csv@25
PS3, Line 25: file_format: orc, dataset: functional, compression_codec: none, compression_type: none
Does this actually load uncompressed ORC, or does it use the default ORC codec? We should make sure that compression_codec matches what is actually used. parquet/none is actually snappy-compressed because that was the default, but it was a mistake and is confusing.

http://gerrit.cloudera.org:8080/#/c/9134/3/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/9134/3/tests/query_test/test_scanners.py@a108
PS3, Line 108:
oops!

--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Fri, 02 Feb 2018 01:54:17 +0000
Gerrit-HasComments: Yes