Hi,

Thanks for your replies.
An example query:

SELECT * FROM reasonably_sized_partitioned_table WHERE partition_key_dtnr = 20171028 AND sorted_column = 56789123456

(same effect with an IN instead of the last equality)

I checked the behavior in our cluster again and can now confirm Lars's explanation:
- Filtering on stats works well as of 2.9, except when a column has only null values in a Parquet file
- And that would be addressed by IMPALA-6113

I was wrongly assuming that some of my tables would not have enough null values to fill a Parquet file for each partition.

I wonder how you verify those stats in the Parquet metadata. Parquet-tools currently does not print them.

Thanks for the help
Bruno

On 27 Oct, 2017, at 12:54 AM, Lars Volker wrote:

Hi Bruno,

To clarify on your original observation: Column chunks that only have null values in them will not have their min_value and max_value fields populated and thus won't be skipped based on stats. I filed IMPALA-6113 <https://issues.apache.org/jira/browse/IMPALA-6113> to track this.

IMPALA-5061 <https://issues.apache.org/jira/browse/IMPALA-5061> added support to populate the null_count in statistics, allowing us to detect column chunks that only contain NULLs. We should use that information to skip row groups if the predicate allows us to.

Row groups with column chunks that have at least one non-null value should get filtered correctly.

Cheers, Lars

On Thu, Oct 26, 2017 at 10:02 AM, Tim Armstrong wrote:

Hi Bruno,

Could you provide an example of the specific predicates that aren't being used to successfully skip the row group?

- Tim

On Thu, Oct 26, 2017 at 7:21 AM, Jeszy wrote:

Hello Bruno,

Thanks for bringing this up. While not apparent from the commit comments, this limitation was mentioned during the code review: 'min/max are only set when there are non-null values, so we don't consider statistics for "is null".' (see https://gerrit.cloudera.org/#/c/6147/). It looks to me that this was intended, but I'll let others confirm.
Definitely a point where we can improve.

Thanks!

On 26 October 2017 at 08:02, Bruno Quinart wrote:
Hi all

With IMPALA-2328, Parquet row group statistics are now being used to skip the row group completely if the predicate excludes the min/max range.

We have a use case in which we make sure the data is sorted on a 'key' and then run many selective queries on that 'key' field. We notice a significant performance increase. So thanks a lot for all the work on that!

One thing we notice is an unexpected behavior for records where that 'key' has null values. It seems that as soon as null values are present in a row group, the test on the min/max fails and the row group is read.

We work with Impala 2.9. The data is put in Parquet files by Impala itself. We have noticed this effect for both bigint and decimal fields.

Note that it's difficult for me to extract the min/max statistics from the Parquet files. The parquet-tools included in our distribution (5.12) is not the latest. And I was told PARQUET-327 would anyway not print those row group stats because of the way Impala stores them.

We do confirm the expected behavior (exactly one row group read for properly sorted data) when we create a similar table but explicitly filter out all null values for that 'key' field. We also notice that the number of row groups read (but zero records retained) is proportional to the number of null values.

Is this behavior expected? Is there a fundamental reason those row groups cannot be skipped?

Thanks!
Bruno
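
For readers following the thread, the skipping rule under discussion can be sketched roughly as below. This is only an illustration of the idea, not Impala's actual code: the ColumnChunkStats class and the can_skip_for_equality function are hypothetical names, and the all-NULL branch reflects what IMPALA-6113 proposes rather than what 2.9 does.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnChunkStats:
    """Hypothetical stand-in for the statistics of one Parquet column chunk."""
    num_values: int                    # total values in the chunk, NULLs included
    null_count: Optional[int] = None   # may be absent in files from older writers
    min_value: Optional[int] = None    # unset when the chunk holds only NULLs
    max_value: Optional[int] = None


def can_skip_for_equality(stats: ColumnChunkStats, target: int) -> bool:
    """Return True if no row in the chunk can satisfy `column = target`."""
    # All-NULL chunk: an equality predicate never matches NULL, so the
    # chunk can be skipped even though min/max are missing (assumed
    # behavior, along the lines proposed in IMPALA-6113).
    if stats.null_count is not None and stats.null_count == stats.num_values:
        return True
    # No usable min/max: stay conservative and read the row group.
    if stats.min_value is None or stats.max_value is None:
        return False
    # Usual min/max test: skip when the target lies outside the range.
    return target < stats.min_value or target > stats.max_value


if __name__ == "__main__":
    mixed = ColumnChunkStats(num_values=100, null_count=10,
                             min_value=1000, max_value=2000)
    all_null = ColumnChunkStats(num_values=100, null_count=100)
    print(can_skip_for_equality(mixed, 56789123456))
    print(can_skip_for_equality(all_null, 56789123456))
```

Without the first branch, an all-NULL chunk falls through to the conservative case and is read, which matches the behavior reported for 2.9 in this thread.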