Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )
Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................

Patch Set 13:

(15 comments)

Thanks for the new feedback! Let's keep reviewing so things get simpler :)

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@363
PS9, Line 363: // We only deal with collection columns (ARRAY/MAP) and primitive columns here.
> Yep that's the rule - use braces if the whole thing doesn't fit on a single
Sure

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@198
PS12, Line 198: niqu
> nit:std:: not needed since we import it into the namespace with common/name
Done

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@316
PS12, Line 316: VLOG(3) << "Add ORC column " << node->getMaximumColumnId() << " for empty tuple "
> Can we remove these VLOG_FILE statements here and below? It could get quite
Yes, they're mainly used for debugging. VLOG(3) is equal to VLOG_ROW, but this function is only called once per orc-scanner, not once per row or RowBatch. Isn't this VLOG(2)-level info? However, I've still changed it to VLOG(3) to respect your opinion :)

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@380
PS12, Line 380: const list<uint64_t>& selected_type_ids) {
> Can this argument be const?
Yes, it should be const.

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@593
PS12, Line 593:
> How do we know that we're finished processing the previous batch? What happ
If we get here, the previous batch has already been drained. The relevant code is in GetNextInternal:

  // Transfer remaining values in current orc batches tracked by readers.
  if (!orc_root_reader_->EndOfBatch()) {
    assemble_rows_timer_.Start();
    RETURN_IF_ERROR(TransferTuples(orc_root_reader_, row_batch));
    assemble_rows_timer_.Stop();
    if (row_batch->AtCapacity()) return Status::OK();
    DCHECK(orc_root_reader_->EndOfBatch());
  }

If we exited the while() loop because row_batch was at capacity, the remaining rows are processed by the code above. Since the ORC batch size equals the capacity of a RowBatch, the remaining rows can be no more than a RowBatch's capacity, so a single call to TransferTuples drains the original batch.

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@638
PS12, Line 638:
> typo: materialization
Done

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@638
PS12, Line 638: alizeTuple()) {
> grammar: "that are not materializing tuples"
Done

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@688
PS12, Line 688:
> Pass in complex_col_reader by pointer since it's mutable.
It can be const: we don't change any of its state in this function. I've added 'const' to it and to the functions we call here.
http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc@133
PS12, Line 133: }
> VLOG_FILE is also probably too verbose here - VLOG(3) would be more appropr
Done

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc@219
PS12, Line 219: materiali
> same comment about logging
Done

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/runtime/descriptors.cc
File be/src/runtime/descriptors.cc:

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/runtime/descriptors.cc@65
PS12, Line 65: const int SchemaPathConstants::ARRAY_ITEM;
> I think it's better to define the constant value in the header and just ins
Done

http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/create-load-data.sh@581
PS12, Line 581: run-step "Loading nested parquet data" load-nested.log \
             :   ${IMPALA_HOME}/testdata/bin/load_nested.py \
             :   -t tpch_nested_parquet -f parquet/none ${LOAD_NESTED_ARGS:-}
             : run-step "Loading nested orc data" load-nested.log \
             :   ${IMPALA_HOME}/testdata/bin/load_nested.py \
> I think it would make sense to specify the arguments for the call to run th
Sure, it's clearer.
http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/load_nested.py
File testdata/bin/load_nested.py:

http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/load_nested.py@286
PS12, Line 286: hive.execute("""
             :     CREATE TABLE part
             :     STORED AS ORC
             :     TBLPROPERTIES('{compression_key}'='{compression_value}')
             :     AS SELECT * FROM {source_db}.part""".format(**sql_params))
             :
> I would prefer this to be right next to the create statement for parquet, e
Done

http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/load_nested.py@340
PS12, Line 340: impala.invalidate_metadata(table_name="customer")
             : impala.invalidate_metadata(table_name="part")
             : impala.invalidate_metadata(table_name="region")
             : impala.invalidate_metadata(table_name="supplier")
             : impala.compute_stats()
             :
             : LOG.info("Done loading neste
> Please throw an error if the file_format/compression_value is not one of th
My original intention was that only these two cases need transformation (i.e. parquet/none means parquet/snappy, and orc/def means using zlib). Actually, "snap" also needs to be transformed to "SNAPPY", and some other codecs are not supported by Hive (e.g. bzip). Based on this, I've refactored this part so it can load more table formats (e.g. parquet/gzip, orc/snap, orc/none, etc.).
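The codec transformation described above could be sketched roughly like this. Note this is an illustrative Python sketch, not the actual load_nested.py code: the table and function names (CODEC_MAP, hive_codec) are hypothetical, and only the mappings stated above (parquet/none -> snappy, orc/def -> zlib, "snap" -> "SNAPPY", bzip unsupported) are taken from the discussion:

```python
# Hypothetical sketch: normalize test-suite codec labels to the names Hive
# accepts, and reject combinations Hive cannot load. Not the real script.
CODEC_MAP = {
    ("parquet", "none"): "SNAPPY",  # parquet/none actually means snappy here
    ("orc", "def"): "ZLIB",         # "def" (deflate) maps to Hive's ZLIB
    ("orc", "snap"): "SNAPPY",      # "snap" must be spelled out for Hive
    ("orc", "none"): "NONE",
}

def hive_codec(file_format, compression):
    """Translate a file_format/compression label into a Hive codec name,
    raising an error for unsupported combinations (e.g. bzip)."""
    try:
        return CODEC_MAP[(file_format, compression)]
    except KeyError:
        raise ValueError(
            "Unsupported format/codec: %s/%s" % (file_format, compression))
```

The point of the lookup-table shape is exactly the reviewer's request: any format/codec pair outside the supported set fails loudly instead of being silently passed through to Hive.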
http://gerrit.cloudera.org:8080/#/c/12168/12/tests/query_test/test_nested_types.py
File tests/query_test/test_nested_types.py:

http://gerrit.cloudera.org:8080/#/c/12168/12/tests/query_test/test_nested_types.py@611
PS12, Line 611: if file_format == 'parquet':
             :   self.__create_parquet_tables(unique_database)
             : elif file_format == 'orc':
             :   self.__create_orc_tables(unique_database)
             : self.run_test_case('QueryTest/max-nesting-depth', vector, unique_database)
             :
             : def __create_parquet_tables(self, unique_data
> It might be cleaner to have __create_orc_table() call __create_parquet_tabl
Done

--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Tue, 12 Feb 2019 23:56:07 +0000
Gerrit-HasComments: Yes