Lars Volker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
......................................................................
Patch Set 13:

(10 comments)

Thanks for the review. Please see PS14.

http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 66: tsink.table_sink.hdfs_table_sink.skip_header_line_count :
> if you break this up, then
This has been formatted by clang-format since I touched these lines. While I agree that the formatting could be nicer, I'd like to keep the automatic format here for simplicity.

Line 212: GetHashTblKey(NULL, dynamic_partition_key_value_ctxs, &key);
> we're standardizing on nullptr now
I changed all occurrences in lines I touched. Once the change has a +2 I will rebase it and then replace all other occurrences in the file to keep it consistent (and keep the risk of merge conflicts low until then).

Line 266: // The rows of this batch may span across multiple files. We repeatedly pass the row
> remove "across"
Done

Line 273: RETURN_IF_ERROR(output_partition->writer->AppendRowBatch(
> rename this function, it's not appending the entire row batch (AppendRows?)
This is part of the HdfsTableWriter virtual interface. Should I change it here and in all table writer classes?

Line 283: Status HdfsTableSink::WriteClusteredRowBatch(RuntimeState* state, RowBatch* batch) {
> to speed this up, you should:
Done. AppendRowBatch() already appends the whole batch if the row index vector is empty, so no change was necessary.

http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 170: const HdfsPartitionDescriptor& partition_descriptor, const TupleRow* current_row,
> mention that current_row provides the partition key values.
Done

Line 181: /// GetHashTblKey(). Maps to an OutputPartition, which are owned by the object pool and
> "pool,"
Done

Line 189: void GetHashTblKey(const TupleRow* current_row, const std::vector<ExprContext*>& ctxs,
> current_row -> row (it doesn't matter to this function that the row is 'cur
Done

Line 213: /// Appends all rows in batch to the temporary Hdfs files corresponding to partitions.
> what "corresponding to partitions" refers to is unclear.
Rephrased it.

Line 214: /// The input must be ordered by the partition key expressions.
> the rows are expected to be ordered...?
Done. Though I think "must be" makes it clearer that this is a prerequisite to be guaranteed by the caller.

--
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 13
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <z...@cloudera.com>
Gerrit-HasComments: Yes
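For context on the GetHashTblKey() comments above: the function maps a row to its output partition by concatenating the evaluated partition key values into a string key. The following is a minimal, hypothetical sketch of that idea only; the function name `MakePartitionKey`, the string inputs, and the `/` separator are assumptions for illustration, not Impala's actual implementation.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: build a hash-table lookup key from a row's
// partition key values. Impala's GetHashTblKey() evaluates partition
// key exprs against the row; here the evaluated values are passed in
// directly as strings to keep the example self-contained.
std::string MakePartitionKey(const std::vector<std::string>& values) {
  std::string key;
  for (const std::string& v : values) {
    key += v;
    key += '/';  // separator so that ("ab","c") and ("a","bc") differ
  }
  return key;
}
```

For example, partition values ("2016", "10") would yield the key "2016/10/", which can then index a hash map of open OutputPartitions.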
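The WriteClusteredRowBatch() discussion above relies on the prerequisite noted for Line 214: the input rows are ordered by the partition key expressions, so each partition's rows form one contiguous run in the batch, and the sink only needs one open partition at a time. A simplified sketch of that boundary-scanning loop, assuming rows are represented by their precomputed key strings and writers by a count map (both assumptions for illustration, not Impala's actual code):

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: scan a clustered batch for key-change
// boundaries and hand each contiguous run of rows to its partition
// in a single append. Real code would open/close OutputPartitions
// and call the writer; here we just tally rows per partition key.
void WriteClusteredBatch(const std::vector<std::string>& row_keys,
                         std::map<std::string, int>* rows_per_partition) {
  size_t run_start = 0;
  for (size_t i = 1; i <= row_keys.size(); ++i) {
    if (i == row_keys.size() || row_keys[i] != row_keys[run_start]) {
      // End of a run: rows [run_start, i) all belong to one partition.
      (*rows_per_partition)[row_keys[run_start]] +=
          static_cast<int>(i - run_start);
      run_start = i;
    }
  }
}
```

This also illustrates why the reviewer's speed-up applies: when the whole batch is a single run, the append can cover the entire batch at once, which matches the author's note that AppendRowBatch() already appends the whole batch when the row index vector is empty.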