Tim Armstrong has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
......................................................................


Patch Set 4:

(17 comments)

Thanks for adding the tests, this is looking pretty good - I think it could be 
refined and polished a bit more but no major concerns here.

http://gerrit.cloudera.org:8080/#/c/4863/4/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 268:   // Pass the row batch to the writer. If new_file is returned true 
then the current
I find this comment a little confusing. Maybe explain more the "why", i.e. that 
the rows may be written to multiple files.


Line 272:   do {
I feel like this might be clearer if you write it as while(true) and then break 
out if 'new_file' is false.
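Roughly the shape I have in mind (a minimal, self-contained sketch; FakeWriter, 
AppendRows, OpenNewFile, and WriteBatch are hypothetical stand-ins for the real 
HdfsTableWriter API, not the actual signatures):

```cpp
#include <cassert>

// Hypothetical stand-in for the writer: appends as many rows as fit into the
// current file and signals via 'new_file' when the file is full and a new one
// must be opened.
struct FakeWriter {
  int rows_per_file;
  int written_in_file = 0;
  int files_opened = 1;

  // Writes up to 'num_rows' rows; returns the number consumed and sets
  // *new_file when the current file filled up.
  int AppendRows(int num_rows, bool* new_file) {
    int space = rows_per_file - written_in_file;
    int n = num_rows < space ? num_rows : space;
    written_in_file += n;
    *new_file = (written_in_file == rows_per_file);
    return n;
  }

  void OpenNewFile() {
    written_in_file = 0;
    ++files_opened;
  }
};

// Suggested control flow: loop forever, break as soon as no new file is
// needed, instead of a do/while on the flag.
int WriteBatch(FakeWriter* writer, int num_rows) {
  int written = 0;
  while (true) {
    bool new_file = false;
    written += writer->AppendRows(num_rows - written, &new_file);
    if (!new_file) break;
    writer->OpenNewFile();
  }
  return written;
}
```

The early break makes the common case (all rows fit in the current file) 
obvious at a glance.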


Line 288:   // TODO: Should we adapt hdfs writer to accept start,num pair 
instead of a vector?
My guess is that it would only help a little bit. It's nice to use the same 
code path for this and dynamic partitioning at least for now.


Line 296:     PartitionPair* next_partition_pair = NULL;
It would be more efficient to keep track of the previous key and compare it 
directly instead of doing a lookup in the partition hash table for every row. 
It's a little more complicated but I think it would help perf.
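Something along these lines (a sketch only; Sink, PartitionState, and 
ProcessBatch are simplified hypothetical stand-ins for the sink's partition 
map, using a string key for illustration):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

struct PartitionState { int rows = 0; };

struct Sink {
  std::unordered_map<std::string, PartitionState> partitions;
  int hash_lookups = 0;  // Counts hash table probes, for illustration.

  // With clustered input, consecutive rows usually share a partition key,
  // so compare against the previous key before touching the hash table.
  void ProcessBatch(const std::vector<std::string>& keys) {
    const std::string* prev_key = nullptr;
    PartitionState* current = nullptr;
    for (const std::string& key : keys) {
      if (prev_key == nullptr || key != *prev_key) {
        // Only hit the hash table when the key actually changes.
        ++hash_lookups;
        current = &partitions[key];
        prev_key = &key;
      }
      ++current->rows;
    }
  }
};
```

For fully clustered input this drops the per-row hash lookup to one lookup per 
partition run.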


PS4, Line 301: Flush
I think it might be clearer to just say "Write rows" instead of "flush". Flush 
makes me think it's writing to the filesystem, but it's really just pushing the 
rows into the HdfsTableWriter, which does its own buffering.


Line 301:         // Flush previous partition
I think the comments in here are a little too low level. Here maybe a single 
comment along the lines of "Done with previous partition - write rows and 
close." might be more helpful.


Line 310:     // Collect row index
This comment isn't too helpful.


PS4, Line 313: Flush
"Write rows to" is probably clearer than "flush".


PS4, Line 607: WriteRowsOfPartition
WriteRowsToPartition?


http://gerrit.cloudera.org:8080/#/c/4863/4/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

PS4, Line 183: rows
Maybe reword to be clearer, took me a while to figure out:
"a vector of indices of the rows in the current batch to insert into the 
partition"


Line 209:   /// Writes all rows of a partition to the partitions writer and 
clears the row index
This could be clearer that it's only writing the rows with indices in 
'partition_pair'; it kind of sounds like it's writing all the rows in 'batch' 
into the partition.


PS4, Line 209: partitions
partition's (missing apostrophe).


PS4, Line 215:  is expected to
This could be more declarative - i.e. "must" instead of "is expected to"


http://gerrit.cloudera.org:8080/#/c/4863/4/tests/query_test/test_insert.py
File tests/query_test/test_insert.py:

PS4, Line 71: text
parquet, right?


http://gerrit.cloudera.org:8080/#/c/4863/4/tests/query_test/test_insert_behaviour.py
File tests/query_test/test_insert_behaviour.py:

Line 476: 
Hmm, I think we could also do with a test that writes some large partitions 
split across files. E.g. do a CTAS of lineitem partitioned by some low-NDV 
column. We could also maybe adjust PARQUET_FILE_SIZE to force it to split it 
into more files.

Depending on how long it takes, might make sense to only run it in exhaustive.


Line 477: 
Extra blank line


Line 501:       assert len(files) == 1
Comment why we only expect one file - it might not be that obvious to people 
looking at this code later.


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-HasComments: Yes
