Hello Bruno,

Thanks for bringing this up. While not apparent from the commit
comments, this limitation was mentioned during the code review:
'min/max are only set when there are non-null values, so we don't
consider statistics for "is null".' (see
https://gerrit.cloudera.org/#/c/6147/).
It looks to me that this was intended, but I'll let others confirm.
Definitely a point where we can improve.

Thanks!

On 26 October 2017 at 08:02, Bruno Quinart <bquin...@icloud.com> wrote:
> Hi all
>
> With IMPALA-2328, Parquet row group statistics are now being used to skip
> the row group completely if the min/max range is excluded from the
> predicate.
> We have a use case in which we make sure the data is sorted on a 'key' and
> have then many selective queries on that 'key' field. We notice a
> significant performance increase.
> So thanks a lot for all the work on that!
>
> One thing we notice is an unexpected behavior for records where that 'key'
> has null values. It seems that as soon as null values are present in a row
> group, the test on the min/max fails and the row group is read.
>
> We work with Impala 2.9. The data is put in parquet files by Impala itself.
> We have noticed this effect for both bigint as decimal fields. Note that
> it's difficult for me to extract the min/max statistics from the parquet
> files. The parquet-tools included in our distribution (5.12) is not the
> latest. And I was told PARQUET-327 would anyway not print the those row
> group stats because of the way Impala stores them.
> We do confirm the expected behavior (exactly one row group read for properly
> sorted data) when we create a similar table but explicitly filter out all
> null values for that 'key' field. We also notice that the the number of row
> groups read (but zero records retained) is proportional to the number of
> null values.
>
> Is this behavior expected?
> Is there a fundamental reason those row groups can not be skipped?
>
> Thanks!
> Bruno
>

Reply via email to