Lars Volker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
......................................................................
Patch Set 13:

(10 comments)

Thanks for the review. Please see PS14.

http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 66: tsink.table_sink.hdfs_table_sink.skip_header_line_count :
> if you break this up, then
This has been formatted by clang-format since I touched these lines. While I agree that the formatting could be nicer, I'd like to keep the automatic format here for simplicity.

Line 212: GetHashTblKey(NULL, dynamic_partition_key_value_ctxs, &key);
> we're standardizing on nullptr now
I changed all occurrences in lines I touched. Once the change has a +2 I will rebase it and then replace all other occurrences in the file to keep it consistent (and keep the risk of merge conflicts low until then).

Line 266: // The rows of this batch may span across multiple files. We repeatedly pass the row
> remove "across"
Done

Line 273: RETURN_IF_ERROR(output_partition->writer->AppendRowBatch(
> rename this function, it's not appending the entire row batch (AppendRows?)
This is part of the HdfsTableWriter virtual interface. Should I change it here and in all table writer classes?

Line 283: Status HdfsTableSink::WriteClusteredRowBatch(RuntimeState* state, RowBatch* batch) {
> to speed this up, you should:
Done. AppendRowBatch() already appends the whole batch if the row index vector is empty, so no change was necessary.

http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 170: const HdfsPartitionDescriptor& partition_descriptor, const TupleRow* current_row,
> mention that current_row provides the partition key values.
Done

Line 181: /// GetHashTblKey(). Maps to an OutputPartition, which are owned by the object pool and
> "pool,"
Done

Line 189: void GetHashTblKey(const TupleRow* current_row, const std::vector<ExprContext*>& ctxs,
> current_row -> row (it doesn't matter to this function that the row is 'cur
Done

Line 213: /// Appends all rows in batch to the temporary Hdfs files corresponding to partitions.
> what "corresponding to partitions" refers to is unclear.
Rephrased it.

Line 214: /// The input must be ordered by the partition key expressions.
> the rows are expected to be ordered...?
Done. Though I think "must be" makes it clearer that this is a prerequisite to be guaranteed by the caller.

--
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 13
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <z...@cloudera.com>
Gerrit-HasComments: Yes
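For context on the GetHashTblKey() comments above: the function maps a row to its output partition by concatenating the evaluated partition key values into a string key. The following is a minimal, hypothetical sketch of that idea only; the function name `MakePartitionKey`, the string inputs, and the `/` separator are assumptions for illustration, not Impala's actual implementation.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: build a hash-table lookup key from a row's
// partition key values. Impala's GetHashTblKey() evaluates partition
// key exprs against the row; here the evaluated values are passed in
// directly as strings to keep the example self-contained.
std::string MakePartitionKey(const std::vector<std::string>& values) {
  std::string key;
  for (const std::string& v : values) {
    key += v;
    key += '/';  // separator so that ("ab","c") and ("a","bc") differ
  }
  return key;
}
```

For example, partition values ("2016", "10") would yield the key "2016/10/", which can then index a hash map of open OutputPartitions.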
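The WriteClusteredRowBatch() discussion above relies on the prerequisite noted for Line 214: the input rows are ordered by the partition key expressions, so each partition's rows form one contiguous run in the batch, and the sink only needs one open partition at a time. A simplified sketch of that boundary-scanning loop, assuming rows are represented by their precomputed key strings and writers by a count map (both assumptions for illustration, not Impala's actual code):

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: scan a clustered batch for key-change
// boundaries and hand each contiguous run of rows to its partition
// in a single append. Real code would open/close OutputPartitions
// and call the writer; here we just tally rows per partition key.
void WriteClusteredBatch(const std::vector<std::string>& row_keys,
                         std::map<std::string, int>* rows_per_partition) {
  size_t run_start = 0;
  for (size_t i = 1; i <= row_keys.size(); ++i) {
    if (i == row_keys.size() || row_keys[i] != row_keys[run_start]) {
      // End of a run: rows [run_start, i) all belong to one partition.
      (*rows_per_partition)[row_keys[run_start]] +=
          static_cast<int>(i - run_start);
      run_start = i;
    }
  }
}
```

This also illustrates why the reviewer's speed-up applies: when the whole batch is a single run, the append can cover the entire batch at once, which matches the author's note that AppendRowBatch() already appends the whole batch when the row index vector is empty.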