Hi Bruno,

Could you provide an example of the specific predicates that aren't being used to successfully skip the row group?
- Tim

On Thu, Oct 26, 2017 at 7:21 AM, Jeszy <jes...@gmail.com> wrote:
> Hello Bruno,
>
> Thanks for bringing this up. While not apparent from the commit
> comments, this limitation was mentioned during the code review:
> 'min/max are only set when there are non-null values, so we don't
> consider statistics for "is null".' (see
> https://gerrit.cloudera.org/#/c/6147/).
> It looks to me that this was intended, but I'll let others confirm.
> Definitely a point where we can improve.
>
> Thanks!
>
> On 26 October 2017 at 08:02, Bruno Quinart <bquin...@icloud.com> wrote:
> > Hi all
> >
> > With IMPALA-2328, Parquet row group statistics are now being used to
> > skip a row group completely if the predicate excludes its min/max
> > range.
> > We have a use case in which we make sure the data is sorted on a 'key'
> > field and then run many selective queries on that 'key' field. We
> > notice a significant performance increase.
> > So thanks a lot for all the work on that!
> >
> > One thing we notice is unexpected behavior for records where that
> > 'key' has null values. It seems that as soon as null values are
> > present in a row group, the test on the min/max fails and the row
> > group is read.
> >
> > We work with Impala 2.9. The data is written to the Parquet files by
> > Impala itself. We have noticed this effect for both bigint and decimal
> > fields. Note that it's difficult for me to extract the min/max
> > statistics from the Parquet files. The parquet-tools included in our
> > distribution (5.12) is not the latest. And I was told PARQUET-327
> > would anyway not print those row group stats because of the way
> > Impala stores them.
> > We do confirm the expected behavior (exactly one row group read for
> > properly sorted data) when we create a similar table but explicitly
> > filter out all null values for that 'key' field.
> > We also notice that the number of row groups read (but with zero
> > records retained) is proportional to the number of null values.
> >
> > Is this behavior expected?
> > Is there a fundamental reason those row groups cannot be skipped?
> >
> > Thanks!
> > Bruno
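
For readers following the thread, the min/max test under discussion can be sketched roughly as below. This is an illustrative Python sketch only, not Impala's implementation: `ColumnStats` and `can_skip_for_equality` are hypothetical names, and the sketch assumes an equality predicate on an integer 'key' column. It takes the conservative route suggested by the code review comment quoted above: when min/max are absent (for example, because the row group holds no non-null values), it does not try to skip the row group.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnStats:
    """Hypothetical per-row-group column statistics (not Impala's structures)."""
    min_value: Optional[int]  # None when no min was recorded
    max_value: Optional[int]  # None when no max was recorded
    null_count: int


def can_skip_for_equality(stats: ColumnStats, key: int) -> bool:
    """Return True if min/max alone prove that no row satisfies `col = key`."""
    if stats.min_value is None or stats.max_value is None:
        # Assumption: without min/max we cannot decide, so the row group is read.
        return False
    # Nulls never satisfy an equality comparison in SQL, so null_count does not
    # change the outcome here; only the non-null value range matters.
    return key < stats.min_value or key > stats.max_value


# A row group covering keys 10..20 that also contains three nulls:
print(can_skip_for_equality(ColumnStats(10, 20, 3), 5))   # key outside the range
print(can_skip_for_equality(ColumnStats(10, 20, 3), 15))  # key inside the range
```

In this sketch the presence of nulls does not by itself prevent skipping; whether Impala 2.9 behaves the same way is exactly the question raised in the thread.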