Marcel Kornacker has posted comments on this change. Change subject: IMPALA-4815, IMPALA-4817, IMPALA-4819: Populate Parquet Statistics for remaining types ......................................................................
Patch Set 4: (21 comments) http://gerrit.cloudera.org:8080/#/c/6563/4/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: Line 375: scoped_ptr<ColumnStats<T>> page_stats_; why this change? http://gerrit.cloudera.org:8080/#/c/6563/4/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: Line 34: /// decode parquet::Statistics into slots. update Line 60: /// Enum to select whether to read minimum or maximum statistics. Values to not "do not" Line 73: const ColumnType& col_type, const StatsField& stats_field, void* slot); const& isn't necessary here Line 82: /// provide the functionality. is this useful for anything other than strings/var-len data? if not, make name more specific. Line 83: virtual void MaterializeValuesToInternalBuffer(){}; ) {} Line 105: static bool ShouldUseDeprecatedStats(const ColumnType& col_type); CanUse-? or should as in 'would be preferable'/'there's a moral imperative'? Line 108: /// This class contains the type-specific behavior to track statistics per column. i had to remind myself again what T was. would be useful to point out in the class comment that this is for the types of our in-memory format. Line 127: ColumnStats(MemPool* mem_pool, int plain_encoded_value_size) describe params Line 135: /// statistics. you should point out that it may keep a reference to 'v' until Materialize..() got called. Line 144: static bool ReadFromThrift(const string& thrift_stats, void* slot); this doesn't need to be part of the public interface, columstatsbase can be a friend. Line 152: void EncodeValueToString(const T& v, std::string* out) const; static Line 154: /// Decodes a statistics values from 'buffer' into 'result'. you don't mention here that this is plain-encoded. since they both convert between our format and plain, why don't you name them uniformly and make clear that that's what happens. unless that isn't true, but then you should still call it something like '..To/FromThrift' for the sake of uniformity. Line 158: int64_t BytesNeededInternal(const T& v) const; why -internal? Line 161: static void CopyToBufferInternal(StringBuffer* buffer, StringValue* value); why -internal? http://gerrit.cloudera.org:8080/#/c/6563/4/be/src/exec/parquet-column-stats.inline.h File be/src/exec/parquet-column-stats.inline.h: Line 28: inline bool ColumnStats<T>::ReadFromThrift(const string& thrift_stats, void* slot) { again, you changed the meaning of a variable but not the name. please pay attention to that yourself. Line 30: return DecodeValueFromThrift(thrift_stats, out); this whole function does almost nothing. is it worth having it (extra level of indirection = more obscurity) instead of just plugging that code into the call sites? it would still be a single statement. Line 189: template <typename T> why do you need T here? http://gerrit.cloudera.org:8080/#/c/6563/4/testdata/workloads/functional-query/queries/QueryTest/parquet-deprecated-stats.test File testdata/workloads/functional-query/queries/QueryTest/parquet-deprecated-stats.test: Line 138: # widerow table here. stray line? http://gerrit.cloudera.org:8080/#/c/6563/4/testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test File testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test: Line 290: select count(*) from functional_parquet.alltypessmall where string_col < "1" use a predicate that will result in filter? (so we can see it also works with <) http://gerrit.cloudera.org:8080/#/c/6563/4/tests/query_test/test_insert_parquet.py File tests/query_test/test_insert_parquet.py: Line 60: # 'timetuple', the datetime __eq__ function will forward an unknown equality check to 'timetuple' will the datetime __eq__ function forward... -- To view, visit http://gerrit.cloudera.org:8080/6563 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-HasComments: Yes