Hello Bruno,

Thanks for bringing this up. Although it is not apparent from the commit comments, this limitation was mentioned during the code review: 'min/max are only set when there are non-null values, so we don't consider statistics for "is null".' (see https://gerrit.cloudera.org/#/c/6147/). It looks to me like this was intended, but I'll let others confirm. It is definitely a point where we can improve.
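As a rough illustration of that review comment, here is a minimal Python sketch of how a reader could decide whether to skip a row group for an equality predicate. It is not Impala's actual code: the `RowGroupStats` class and both functions are made up for this example, and the second function is only one possible improvement, assuming the writer records a null count alongside min/max.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for one column's row group statistics."""
    num_values: int
    null_count: int
    # Per the review comment, min/max are only set when non-null values exist.
    min_value: Optional[int] = None
    max_value: Optional[int] = None


def can_skip_for_equality(stats: RowGroupStats, key: int) -> bool:
    """Return True if no row in the group can satisfy `col = key`."""
    if stats.min_value is None or stats.max_value is None:
        # No min/max available (e.g. every value is null): read the group.
        return False
    return key < stats.min_value or key > stats.max_value


def can_skip_with_null_count(stats: RowGroupStats, key: int) -> bool:
    """Assumed improvement: an all-null group never matches `col = key`."""
    if stats.null_count == stats.num_values:
        return True
    return can_skip_for_equality(stats, key)


# Three row groups of a table sorted on the key, with the nulls in the last one.
groups = [
    RowGroupStats(num_values=100, null_count=0, min_value=1, max_value=50),
    RowGroupStats(num_values=100, null_count=0, min_value=51, max_value=99),
    RowGroupStats(num_values=100, null_count=100),  # all null, no min/max
]

skipped_today = [can_skip_for_equality(g, 42) for g in groups]
skipped_improved = [can_skip_with_null_count(g, 42) for g in groups]
```

In this sketch the all-null group is always read by the first function, which would be consistent with the number of extra row groups growing with the number of null values.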
Thanks!

On 26 October 2017 at 08:02, Bruno Quinart <bquin...@icloud.com> wrote:

> Hi all
>
> With IMPALA-2328, Parquet row group statistics are now being used to skip
> the row group completely if the min/max range is excluded from the
> predicate.
> We have a use case in which we make sure the data is sorted on a 'key' and
> have then many selective queries on that 'key' field. We notice a
> significant performance increase.
> So thanks a lot for all the work on that!
>
> One thing we notice is an unexpected behavior for records where that 'key'
> has null values. It seems that as soon as null values are present in a row
> group, the test on the min/max fails and the row group is read.
>
> We work with Impala 2.9. The data is put in parquet files by Impala itself.
> We have noticed this effect for both bigint and decimal fields. Note that
> it's difficult for me to extract the min/max statistics from the parquet
> files. The parquet-tools included in our distribution (5.12) is not the
> latest. And I was told PARQUET-327 would anyway not print those row
> group stats because of the way Impala stores them.
> We do confirm the expected behavior (exactly one row group read for properly
> sorted data) when we create a similar table but explicitly filter out all
> null values for that 'key' field. We also notice that the number of row
> groups read (but zero records retained) is proportional to the number of
> null values.
>
> Is this behavior expected?
> Is there a fundamental reason those row groups can not be skipped?
>
> Thanks!
> Bruno