[ https://issues.apache.org/jira/browse/HIVE-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Simhadri Govindappa updated HIVE-27772: --------------------------------------- Description: For invalid dates such as 2001-02-31, 2023-04-31 etc, UNIX_TIMESTAMP() is giving out the timestamp value as the last valid date, rather than NULL. (e.g. UNIX_TIMESTAMP('2001-02-31', 'yyyy-MM-dd') gives 983354400, which converts to '2001-02-28'. However, for calendar days larger than 31, e.g. 2001-02-32, or 2023-04-32, UNIX_TIMESTAMP() would give NULL as a result. In Spark and mysql, UNIX_TIMESTMAP for these invalid dates are all NULL (or 0). {noformat} 6: jdbc:hive2://localhost:10001/> select month, datetimestamp, unix_timestamp(datetimestamp, 'yyyy-MM-dd') as timestampCol from datetimetable; INFO : Compiling command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62): select month, datetimestamp, unix_timestamp(datetimestamp, 'yyyy-MM-dd') as timestampCol from datetimetable INFO : No Stats for default@datetimetable, Columns: month, datetimestamp INFO : Semantic Analysis Completed (retrial = false) INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:month, type:string, comment:null), FieldSchema(name:datetimestamp, type:string, comment:null), FieldSchema(name:timestampcol, type:bigint, comment:null)], properties:null) INFO : Completed compiling command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62); Time taken: 0.102 seconds INFO : Operation QUERY obtained 0 locks INFO : Executing command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62): select month, datetimestamp, unix_timestamp(datetimestamp, 'yyyy-MM-dd') as timestampCol from datetimetable INFO : Completed executing command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62); Time taken: 0.0 seconds +--------+----------------+---------------+ | month | datetimestamp | timestampcol | +--------+----------------+---------------+ | Feb | 2001-02-28 | 983318400 | | Feb | 2001-02-29 | 983318400 | | Feb | 2001-02-30 | 983318400 | | Feb | 2001-02-31 | 983318400 | | Feb | 2001-02-32 | NULL | +--------+----------------+---------------+ 5 rows selected (0.131 seconds){noformat} It looks like [InstantDateTimeFormatter.java#L52|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/InstantDateTimeFormatter.java#L52] by default, the formatter has the SMART resolver style. According to java jdk : https://github.com/frohoff/jdk8u-dev-jdk/blob/master/src/share/classes/java/time/format/ResolverStyle.java#L103 {noformat} /** * Style to resolve dates and times strictly. * <p> * Using strict resolution will ensure that all parsed values are within * the outer range of valid values for the field. Individual fields may * be further processed for strictness. * <p> * For example, resolving year-month and day-of-month in the ISO calendar * system using strict mode will ensure that the day-of-month is valid * for the year-month, rejecting invalid values. */ STRICT, /** * Style to resolve dates and times in a smart, or intelligent, manner. * <p> * Using smart resolution will perform the sensible default for each * field, which may be the same as strict, the same as lenient, or a third * behavior. Individual fields will interpret this differently. * <p> * For example, resolving year-month and day-of-month in the ISO calendar * system using smart mode will ensure that the day-of-month is from * 1 to 31, converting any value beyond the last valid day-of-month to be * the last valid day-of-month. */ SMART,{noformat} Therefore, we should set the resolverStyle to STRICT to reject invalid date values. was: For invalid dates such as 2001-02-31, 2023-04-31 etc, UNIX_TIMESTAMP() is giving out the timestamp value as the last valid date, rather than NULL. (e.g. UNIX_TIMESTAMP('2001-02-31', 'yyyy-MM-dd') gives 983354400, which converts to '2001-02-28'. However, for calendar days larger than 31, e.g. 2001-02-32, or 2023-04-32, UNIX_TIMESTAMP() would give NULL as a result. In Spark and mysql, UNIX_TIMESTMAP for these invalid dates are all NULL (or 0). {noformat} 6: jdbc:hive2://localhost:10001/> select month, datetimestamp, unix_timestamp(datetimestamp, 'yyyy-MM-dd') as timestampCol from datetimetable; INFO : Compiling command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62): select month, datetimestamp, unix_timestamp(datetimestamp, 'yyyy-MM-dd') as timestampCol from datetimetable INFO : No Stats for default@datetimetable, Columns: month, datetimestamp INFO : Semantic Analysis Completed (retrial = false) INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:month, type:string, comment:null), FieldSchema(name:datetimestamp, type:string, comment:null), FieldSchema(name:timestampcol, type:bigint, comment:null)], properties:null) INFO : Completed compiling command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62); Time taken: 0.102 seconds INFO : Operation QUERY obtained 0 locks INFO : Executing command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62): select month, datetimestamp, unix_timestamp(datetimestamp, 'yyyy-MM-dd') as timestampCol from datetimetable INFO : Completed executing command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62); Time taken: 0.0 seconds +--------+----------------+---------------+ | month | datetimestamp | timestampcol | +--------+----------------+---------------+ | Feb | 2001-02-28 | 983318400 | | Feb | 2001-02-29 | 983318400 | | Feb | 2001-02-30 | 983318400 | | Feb | 2001-02-31 | 983318400 | | Feb | 2001-02-32 | NULL | +--------+----------------+---------------+ 5 rows selected (0.131 seconds){noformat} > Hive UNIX_TIMESTAMP() not returning null for invalid dates > ---------------------------------------------------------- > > Key: HIVE-27772 > URL: https://issues.apache.org/jira/browse/HIVE-27772 > Project: Hive > Issue Type: Bug > Reporter: Simhadri Govindappa > Assignee: Simhadri Govindappa > Priority: Major > > For invalid dates such as 2001-02-31, 2023-04-31 etc, UNIX_TIMESTAMP() is > giving out the timestamp value as the last valid date, rather than NULL. > (e.g. UNIX_TIMESTAMP('2001-02-31', 'yyyy-MM-dd') gives 983354400, which > converts to '2001-02-28'. However, for calendar days larger than 31, e.g. > 2001-02-32, or 2023-04-32, UNIX_TIMESTAMP() would give NULL as a result. > In Spark and mysql, UNIX_TIMESTMAP for these invalid dates are all NULL (or > 0). > > {noformat} > 6: jdbc:hive2://localhost:10001/> select month, datetimestamp, > unix_timestamp(datetimestamp, 'yyyy-MM-dd') as timestampCol from > datetimetable; > INFO : Compiling > command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62): > select month, datetimestamp, unix_timestamp(datetimestamp, 'yyyy-MM-dd') as > timestampCol from datetimetable > INFO : No Stats for default@datetimetable, Columns: month, datetimestamp > INFO : Semantic Analysis Completed (retrial = false) > INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:month, > type:string, comment:null), FieldSchema(name:datetimestamp, type:string, > comment:null), FieldSchema(name:timestampcol, type:bigint, comment:null)], > properties:null) > INFO : Completed compiling > command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62); > Time taken: 0.102 seconds > INFO : Operation QUERY obtained 0 locks > INFO : Executing > command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62): > select month, datetimestamp, unix_timestamp(datetimestamp, 'yyyy-MM-dd') as > timestampCol from datetimetable > INFO : Completed executing > command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62); > Time taken: 0.0 seconds > +--------+----------------+---------------+ > | month | datetimestamp | timestampcol | > +--------+----------------+---------------+ > | Feb | 2001-02-28 | 983318400 | > | Feb | 2001-02-29 | 983318400 | > | Feb | 2001-02-30 | 983318400 | > | Feb | 2001-02-31 | 983318400 | > | Feb | 2001-02-32 | NULL | > +--------+----------------+---------------+ > 5 rows selected (0.131 seconds){noformat} > > > It looks like > [InstantDateTimeFormatter.java#L52|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/InstantDateTimeFormatter.java#L52] > by default, the formatter has the SMART resolver style. > According to java jdk : > https://github.com/frohoff/jdk8u-dev-jdk/blob/master/src/share/classes/java/time/format/ResolverStyle.java#L103 > > {noformat} > /** > * Style to resolve dates and times strictly. > * <p> > * Using strict resolution will ensure that all parsed values are within > * the outer range of valid values for the field. Individual fields may > * be further processed for strictness. > * <p> > * For example, resolving year-month and day-of-month in the ISO calendar > * system using strict mode will ensure that the day-of-month is valid > * for the year-month, rejecting invalid values. > */ > STRICT, > /** > * Style to resolve dates and times in a smart, or intelligent, manner. > * <p> > * Using smart resolution will perform the sensible default for each > * field, which may be the same as strict, the same as lenient, or a third > * behavior. Individual fields will interpret this differently. > * <p> > * For example, resolving year-month and day-of-month in the ISO calendar > * system using smart mode will ensure that the day-of-month is from > * 1 to 31, converting any value beyond the last valid day-of-month to be > * the last valid day-of-month. > */ > SMART,{noformat} > > > Therefore, we should set the resolverStyle to STRICT to reject invalid date > values. > -- This message was sent by Atlassian Jira (v8.20.10#820010)