Re: Parquet min/max statistics & null values

Bruno Quinart Sat, 28 Oct 2017 13:20:06 -0700

Hi

Thanks for your replies.


An example query:
SELECT * FROM reasonably_sized_partitioned_table
WHERE partition_key_dtnr = 20171028
AND sorted_column = 56789123456
(same effect with an IN instead of the last equality)

I checked the behavior in our cluster again and can now confirm Lars 
explanation:
- Filtering on stats works well as from 2.9, except when there is only null 
values for the column in a parquet file
- And that would be addressed by IMPALA-6113
I was wrongly assuming that in some of my tables there would not have been 
enough null values to fill a parquet file for each partition.

I wonder how you verify those stats in Parquet metadata. Parquet-tools 
currently does not print them.

Thanks for the help
Bruno

On 27 Oct, 2017, at 12:54 AM, Lars Volker wrote:

Hi Bruno,

To clarify on your original observation: Column chunks that only have null
values in them will not have their min_value and max_value fields populated
and thus won't be skipped based on stats. I filed IMPALA-6113
<https://issues.apache.org/jira/browse/IMPALA-6113> to track this.
IMPALA-5061 <https://issues.apache.org/jira/browse/IMPALA-5061> added
support to populate the null_count in statistics, allowing us to detect
column chunks that only contain NULLs. We should use that information to
skip row groups if the predicate allows us to.

Row groups with column chunks that have at least one non-null value should
get filtered correctly.

Cheers, Lars

On Thu, Oct 26, 2017 at 10:02 AM, Tim Armstrong wrote:

Hi Bruno,
Could you provide an example of the specific predicates that aren't being
used to successfully skip the row group?

- Tim

On Thu, Oct 26, 2017 at 7:21 AM, Jeszy wrote:

Hello Bruno,

Thanks for bringing this up. While not apparent from the commit
comments, this limitation was mentioned during the code review:
'min/max are only set when there are non-null values, so we don't
consider statistics for "is null".' (see
https://gerrit.cloudera.org/#/c/6147/).
It looks to me that this was intended, but I'll let others confirm.
Definitely a point where we can improve.

Thanks!

On 26 October 2017 at 08:02, Bruno Quinart wrote:

Hi all

With IMPALA-2328, Parquet row group statistics are now being used to

skip

the row group completely if the min/max range is excluded from the
predicate.
We have a use case in which we make sure the data is sorted on a 'key'

and

have then many selective queries on that 'key' field. We notice a
significant performance increase.
So thanks a lot for all the work on that!

One thing we notice is an unexpected behavior for records where that

'key'

has null values. It seems that as soon as null values are present in a

row

group, the test on the min/max fails and the row group is read.

We work with Impala 2.9. The data is put in parquet files by Impala

itself.

We have noticed this effect for both bigint as decimal fields. Note that
it's difficult for me to extract the min/max statistics from the parquet
files. The parquet-tools included in our distribution (5.12) is not the
latest. And I was told PARQUET-327 would anyway not print the those row
group stats because of the way Impala stores them.
We do confirm the expected behavior (exactly one row group read for

properly

sorted data) when we create a similar table but explicitly filter out

all

null values for that 'key' field. We also notice that the the number of

row

groups read (but zero records retained) is proportional to the number of
null values.

Is this behavior expected?
Is there a fundamental reason those row groups can not be skipped?

Thanks!
Bruno

Re: Parquet min/max statistics & null values

Reply via email to