[ https://issues.apache.org/jira/browse/IMPALA-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612104#comment-16612104 ]
Csaba Ringhofer edited comment on IMPALA-7559 at 9/12/18 1:05 PM:
------------------------------------------------------------------

Yet another update on this: the issue only occurs if all values in the row group are equal.

The reason is that parquet-mr normally does not write statistics for int96 timestamps, because it considers their ordering undefined. The case when min==max is an exception, because ordering doesn't matter in this case. This logic is at https://github.com/apache/parquet-mr/blob/b4198be200e7e2df82bc9a18d54c8cd16aa156ac/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L571.

Currently Impala only does the UTC->local conversion if the Parquet file was written by parquet-mr (and convert_legacy_hive_parquet_utc_timestamps is true), so the issue only occurs in this specific case.

parquet-mr has written statistics that are actually used by Impala only since PARQUET-1025, so the issue occurs only with a relatively new parquet-mr and any Impala that uses Parquet stats.
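
For illustration only, the mechanism described in the comment can be sketched in a few lines of Python. This is not Impala or parquet-mr code: the function names are hypothetical, and a fixed UTC+2 offset is assumed in place of real time zone rules (that is, it assumes the writer resolved the ambiguous local time to UTC+2). The sketch shows a row group whose values are all equal, so min==max statistics exist, being pruned when a local-time predicate is compared against the raw UTC statistics.

```python
from datetime import datetime, timedelta

# Assumption: a fixed UTC+2 offset stands in for the real CET/CEST rules.
UTC_TO_LOCAL = timedelta(hours=2)


def may_contain(stat_min, stat_max, value):
    """Min/max pruning: a row group can match only if min <= value <= max."""
    return stat_min <= value <= stat_max


def keep_ignoring_conversion(utc_min, utc_max, local_value):
    # Compares the local-time predicate against unconverted UTC statistics.
    return may_contain(utc_min, utc_max, local_value)


def keep_with_conversion(utc_min, utc_max, local_value):
    # Shifts the statistics to local time first, like the row values themselves.
    return may_contain(utc_min + UTC_TO_LOCAL, utc_max + UTC_TO_LOCAL, local_value)


predicate = datetime(2017, 10, 29, 2, 30)  # literal in the WHERE clause (local time)
stored = predicate - UTC_TO_LOCAL          # value as written by a UTC-normalizing writer

# All values in the row group are equal, so min == max == stored.
keep_buggy = keep_ignoring_conversion(stored, stored, predicate)
keep_fixed = keep_with_conversion(stored, stored, predicate)
print(keep_buggy, keep_fixed)  # False True
```

With real time zone rules the UTC->local mapping is not monotonic around DST changes, so converting min and max independently is only safe in simple cases such as min==max; the sketch is meant to show why the row group is dropped, not how a fix should look.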

> Parquet stat filtering ignores convert_legacy_hive_parquet_utc_timestamps
> -------------------------------------------------------------------------
>
>                 Key: IMPALA-7559
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7559
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Priority: Blocker
>              Labels: correctness, parquet, wrongresults
>
> UPDATE: the issue turned out to be different than I first thought, see my
> last comment. I will update the description with more details later.
>
> If the min/max value of a timestamp column chunk is during the hour of the
> Summer->Winter DST change (UTC+2 -> UTC+1 in CET) then stat filtering can
> drop row groups that contain rows that would be "ok" for the predicate
> otherwise.
>
> To reproduce (on current master branch):
> {code}
> 1. it is assumed that the timezone is CET and that flag
>    convert_legacy_hive_parquet_utc_timestamps is enabled
>    ( export TZ=CET; bin/start-impala-cluster.py --impalad_args="-convert_legacy_hive_parquet_utc_timestamps=true" )
> 2. create a table in Hive and fill data in 3 inserts to create 3 files:
> create table t (i int, d timestamp) stored as parquet;
> insert into t values (1, "2017-10-29 02:30:00"), (2, "2018-10-28 02:30:00");
> insert into t values (3, "2018-10-28 02:30:00");
> insert into t values (4, "2017-10-29 02:30:00");
> 3. query from Impala:
> set num_nodes=1;
> select * from t; -- returns all 4 values (same as Hive)
> select * from t where d = "2017-10-29 02:30:00"; -- returns 1 in Impala (Hive returns 1,4)
> select * from t where d = "2018-10-28 02:30:00"; -- returns 2 in Impala (Hive returns 2,3)
> profile; -- NumStatsFilteredRowGroups: 2 (only one row group should have been stat filtered)
> select * from t where d = "2018-10-28 02:30:00" or i = 5; -- returns 2 and 3 in Impala (same as Hive), because the "or" part disabled stat filtering
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org