Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/13582 )
Change subject: IMPALA-8617: Add support for lz4 in parquet ...................................................................... Patch Set 6: (8 comments) http://gerrit.cloudera.org:8080/#/c/13582/6/be/src/exec/parquet/hdfs-parquet-table-writer.cc File be/src/exec/parquet/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/13582/6/be/src/exec/parquet/hdfs-parquet-table-writer.cc@97 PS6, Line 97: const string& column_name style nit: take this by value, and then initialize the member variable as column_name_(std::move(column_name)) http://gerrit.cloudera.org:8080/#/c/13582/6/be/src/exec/parquet/hdfs-parquet-table-writer.cc@400 PS6, Line 400: const string& column_name_; nit: it's dangerous (and anti-style-guide) to store a reference as a member variable like this -- whoever passed in this string might well mutate or destroy the string leaving a dangling reference here. Typically we lean towards using pointers for cases like this, and documenting clearly in the constructor that there's a lifetime requirement for a passed-in raw pointer. In this case, though, it's not a hot path at all so I think you should just make this a 'string' vs adding lifetime dependencies http://gerrit.cloudera.org:8080/#/c/13582/6/be/src/exec/parquet/hdfs-parquet-table-writer.cc@773 PS6, Line 773: Parquet file '$0' column '$1' hit an error nit: change this to something like "Error writing partquet file '$0' column '$1': $2" (so it's more obvious this is on the _write_ side rather than the read side http://gerrit.cloudera.org:8080/#/c/13582/6/be/src/exec/parquet/hdfs-parquet-table-writer.cc@903 PS6, Line 903: return Status(Substitute("Parquet file '$0' column '$1' hit an error. $2", same http://gerrit.cloudera.org:8080/#/c/13582/6/be/src/exec/parquet/hdfs-parquet-table-writer.cc@997 PS6, Line 997: // Map parquet codecs to Impala codecs. Its important to do the above check before : // we do any mapping. : parquet::CompressionCodec::type parquet_codec = ConvertImpalaToParquetCodec(codec); : codec = ConvertParquetToImpalaCodec(parquet_codec); this code isn't very intuitive to me. What's going on here? We're mapping the impala codec to parquet, and then back to impala? Is this serving some kind of canonicalization? If so, can you add some comment explaining why we do this back-and-forth round-trip? http://gerrit.cloudera.org:8080/#/c/13582/6/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/13582/6/be/src/exec/parquet/parquet-column-readers.cc@1346 PS6, Line 1346: Parquet file '$0' column '$1' hit an error nit: think this reads better as "Error decompressing parquet file '$0' column '$1': $2" Is the offset of the page readily available? If not, no need to plumb it through a bunch of code, but if we can get it easily, I think that would aid in debugging if we ever hit this error http://gerrit.cloudera.org:8080/#/c/13582/6/be/src/util/compress.cc File be/src/util/compress.cc: http://gerrit.cloudera.org:8080/#/c/13582/6/be/src/util/compress.cc@349 PS6, Line 349: int64_t length = -1; nit: we tend towards the "early return" style instead of the "one return" style. You can get rid of this local variable and just change the assignments below on L355 and L360 to return statements, and then get rid of the 'else' on line 356 and 361, for a net reduction of several lines of code http://gerrit.cloudera.org:8080/#/c/13582/6/be/src/util/compress.cc@383 PS6, Line 383: CHECK(size > 0) You can do: CHECK_GT(size, 0) << "LZ4_compress_default() failed" which will end up including the 'size' value in the failure. That said, maybe we're better off with an if statement and return a bad Status here indicating a runtime error, since it seems you do have error checking at the call site, etc? -- To view, visit http://gerrit.cloudera.org:8080/13582 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia6850a39ef3f1e0e7ba48e08eef1d4f7cbb74d0c Gerrit-Change-Number: 13582 Gerrit-PatchSet: 6 Gerrit-Owner: Abhishek Rawat <ara...@cloudera.com> Gerrit-Reviewer: Abhishek Rawat <ara...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-Comment-Date: Thu, 13 Jun 2019 17:36:52 +0000 Gerrit-HasComments: Yes