Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/12168 )
Change subject: IMPALA-6503: Support reading complex types from ORC format files
......................................................................

Patch Set 13:

(15 comments)

Thanks for the new feedback! Let's keep reviewing so things get simpler :)

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/9/be/src/exec/hdfs-orc-scanner.cc@363
PS9, Line 363: // We only deal with collection columns (ARRAY/MAP) and primitive columns here.
> Yep that's the rule - use braces if the whole thing doesn't fit on a single
Sure

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@198
PS12, Line 198: niqu
> nit:std:: not needed since we import it into the namespace with common/name
Done

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@316
PS12, Line 316: VLOG(3) << "Add ORC column " << node->getMaximumColumnId() << " for empty tuple "
> Can we remove these VLOG_FILE statements here and below? It could get quite
Yes, they're mainly used for debugging. VLOG(3) is equal to VLOG_ROW, but this function is only called once per orc-scanner, not once per row or RowBatch. Isn't this VLOG(2)-level info? However, I've still changed it to VLOG(3) to respect your opinion :)

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@380
PS12, Line 380: const list<uint64_t>& selected_type_ids) {
> Can this argument be const?
Yes, it should be const.

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@593
PS12, Line 593:
> How do we know that we're finished processing the previous batch? What happ
If we get here, the previous batch has already been drained. The relevant code is in GetNextInternal:

  // Transfer remaining values in current orc batches tracked by readers.
  if (!orc_root_reader_->EndOfBatch()) {
    assemble_rows_timer_.Start();
    RETURN_IF_ERROR(TransferTuples(orc_root_reader_, row_batch));
    assemble_rows_timer_.Stop();
    if (row_batch->AtCapacity()) return Status::OK();
    DCHECK(orc_root_reader_->EndOfBatch());
  }

If we exited the while() loop because row_batch was at capacity, the remaining rows are processed by the code above. Since the ORC batch size equals the capacity of a RowBatch, the remaining rows can be no more than a RowBatch's capacity, so a single call to TransferTuples drains the original batch.

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@638
PS12, Line 638:
> typo: materialization
Done

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@638
PS12, Line 638: alizeTuple()) {
> grammar: "that are not materializing tuples"
Done

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/hdfs-orc-scanner.cc@688
PS12, Line 688:
> Pass in complex_col_reader by pointer since it's mutable.
It can be const: we don't change any of its state in this function. I've added 'const' to it and to the functions we call here.
http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc@133
PS12, Line 133: }
> VLOG_FILE is also probably too verbose here - VLOG(3) would be more appropr
Done

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/exec/orc-column-readers.cc@219
PS12, Line 219: materiali
> same comment about logging
Done

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/runtime/descriptors.cc
File be/src/runtime/descriptors.cc:

http://gerrit.cloudera.org:8080/#/c/12168/12/be/src/runtime/descriptors.cc@65
PS12, Line 65: const int SchemaPathConstants::ARRAY_ITEM;
> I think it's better to define the constant value in the header and just ins
Done

http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/create-load-data.sh@581
PS12, Line 581: run-step "Loading nested parquet data" load-nested.log \
             :   ${IMPALA_HOME}/testdata/bin/load_nested.py \
             :   -t tpch_nested_parquet -f parquet/none ${LOAD_NESTED_ARGS:-}
             : run-step "Loading nested orc data" load-nested.log \
             :   ${IMPALA_HOME}/testdata/bin/load_nested.py \
> I think it would make sense to specify the arguments for the call to run th
Sure, it's clearer.
http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/load_nested.py
File testdata/bin/load_nested.py:

http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/load_nested.py@286
PS12, Line 286: hive.execute("""
             :     CREATE TABLE part
             :     STORED AS ORC
             :     TBLPROPERTIES('{compression_key}'='{compression_value}')
             :     AS SELECT * FROM {source_db}.part""".format(**sql_params))
             :
> I would prefer this to be right next to the create statement for parquet, e
Done

http://gerrit.cloudera.org:8080/#/c/12168/12/testdata/bin/load_nested.py@340
PS12, Line 340: impala.invalidate_metadata(table_name="customer")
             : impala.invalidate_metadata(table_name="part")
             : impala.invalidate_metadata(table_name="region")
             : impala.invalidate_metadata(table_name="supplier")
             : impala.compute_stats()
             :
             : LOG.info("Done loading neste
> Please throw an error if the file_format/compression_value is not one of th
My original intention was that only these two cases need transformation (i.e. parquet/none means parquet/snappy, and orc/def means using zlib). Actually, "snap" also needs to be transformed to "SNAPPY", and some other codecs are not supported by Hive (e.g. bzip). Based on this, I've refactored this part so it can load more table formats (e.g. parquet/gzip, orc/snap, orc/none, etc.).
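The codec transformation described above could be sketched roughly like this. Note this is an illustrative Python sketch, not the actual load_nested.py code: the table and function names (CODEC_MAP, hive_codec) are hypothetical, and only the mappings stated above (parquet/none -> snappy, orc/def -> zlib, "snap" -> "SNAPPY", bzip unsupported) are taken from the discussion:

```python
# Hypothetical sketch: normalize test-suite codec labels to the names Hive
# accepts, and reject combinations Hive cannot load. Not the real script.
CODEC_MAP = {
    ("parquet", "none"): "SNAPPY",  # parquet/none actually means snappy here
    ("orc", "def"): "ZLIB",         # "def" (deflate) maps to Hive's ZLIB
    ("orc", "snap"): "SNAPPY",      # "snap" must be spelled out for Hive
    ("orc", "none"): "NONE",
}

def hive_codec(file_format, compression):
    """Translate a file_format/compression label into a Hive codec name,
    raising an error for unsupported combinations (e.g. bzip)."""
    try:
        return CODEC_MAP[(file_format, compression)]
    except KeyError:
        raise ValueError(
            "Unsupported format/codec: %s/%s" % (file_format, compression))
```

The point of the lookup-table shape is exactly the reviewer's request: any format/codec pair outside the supported set fails loudly instead of being silently passed through to Hive.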
http://gerrit.cloudera.org:8080/#/c/12168/12/tests/query_test/test_nested_types.py
File tests/query_test/test_nested_types.py:

http://gerrit.cloudera.org:8080/#/c/12168/12/tests/query_test/test_nested_types.py@611
PS12, Line 611: if file_format == 'parquet':
             :   self.__create_parquet_tables(unique_database)
             : elif file_format == 'orc':
             :   self.__create_orc_tables(unique_database)
             : self.run_test_case('QueryTest/max-nesting-depth', vector, unique_database)
             :
             : def __create_parquet_tables(self, unique_data
> It might be cleaner to have __create_orc_table() call __create_parquet_tabl
Done

--
To view, visit http://gerrit.cloudera.org:8080/12168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I244dc9d2b3e425393f90e45632cb8cdbea6cf790
Gerrit-Change-Number: 12168
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Tue, 12 Feb 2019 23:56:07 +0000
Gerrit-HasComments: Yes