Hi folks! I am working on the Parquet writer for new timestamp formats (IMPALA-5051), and I have a dilemma about how to reduce a timestamp's precision from nanoseconds to milliseconds or microseconds. I have to choose between consistency with Hive and consistency with Impala itself:
- Impala currently rounds timestamps to microseconds when writing Kudu tables (with some extra hacking near year 10000 to avoid rounding to an invalid timestamp). This was implemented in IMPALA-5137.
- Hive seems to truncate timestamps towards negative infinity when it has to reduce precision.

I lean towards truncating. In theory rounding introduces a smaller error, but it can move the timestamp to a different day / DST rule / year, which can cause much bigger differences in some queries.

Truncating towards negative infinity also seems simpler and faster, as it only needs an integer division on the time_ part of Impala's TimestampValue and doesn't need special handling for edge values like "9999-12-31 23:59:59.999999999".

My proposal is to go with truncation in the Parquet writer, and to consider switching the Kudu writer too, maybe in the next major release.

Csaba
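
P.S. To make the edge case concrete, here is a minimal sketch of the two options. It assumes the time of day is modeled as a plain int64_t count of nanoseconds since midnight, which is a simplification and not how TimestampValue actually stores time_. The helper names TruncateToMicros and RoundToMicros are hypothetical and are not taken from the Impala code base.

```cpp
#include <cstdint>

constexpr int64_t kNanosPerMicro = 1000;
constexpr int64_t kNanosPerDay = 86400LL * 1000 * 1000 * 1000;

// Hypothetical helper: drops the sub-microsecond digits.
// The time of day is never negative, so plain integer division
// already truncates towards negative infinity here.
constexpr int64_t TruncateToMicros(int64_t nanos_of_day) {
  return nanos_of_day / kNanosPerMicro * kNanosPerMicro;
}

// Hypothetical helper: rounds half up to the nearest microsecond.
// The result can be equal to kNanosPerDay, which means the value has
// carried over into the next day and the date part needs adjusting.
constexpr int64_t RoundToMicros(int64_t nanos_of_day) {
  return (nanos_of_day + kNanosPerMicro / 2) / kNanosPerMicro * kNanosPerMicro;
}
```

For 23:59:59.999999999 (kNanosPerDay - 1), truncation stays within the same day at 23:59:59.999999, while rounding yields kNanosPerDay, i.e. midnight of the following day. On 9999-12-31 that following day is not a valid timestamp, which is why rounding needs the extra handling.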
