[ 
https://issues.apache.org/jira/browse/HIVE-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simhadri Govindappa updated HIVE-27772:
---------------------------------------
    Description: 
For invalid dates such as 2001-02-31, 2023-04-31 etc, UNIX_TIMESTAMP() is 
giving out the timestamp value as the last valid date, rather than NULL. (e.g. 
UNIX_TIMESTAMP('2001-02-31', 'yyyy-MM-dd') gives 983354400, which converts to 
'2001-02-28'. However, for calendar days larger than 31, e.g. 2001-02-32, or 
2023-04-32, UNIX_TIMESTAMP() would give NULL as a result.

In Spark and mysql, UNIX_TIMESTMAP for these invalid dates are all NULL (or 0).

 
{noformat}
6: jdbc:hive2://localhost:10001/> select month, datetimestamp, 
unix_timestamp(datetimestamp, 'yyyy-MM-dd') as timestampCol from datetimetable;
INFO  : Compiling 
command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62): 
select month, datetimestamp, unix_timestamp(datetimestamp, 'yyyy-MM-dd') as 
timestampCol from datetimetable
INFO  : No Stats for default@datetimetable, Columns: month, datetimestamp
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:month, 
type:string, comment:null), FieldSchema(name:datetimestamp, type:string, 
comment:null), FieldSchema(name:timestampcol, type:bigint, comment:null)], 
properties:null)
INFO  : Completed compiling 
command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62); Time 
taken: 0.102 seconds
INFO  : Operation QUERY obtained 0 locks
INFO  : Executing 
command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62): 
select month, datetimestamp, unix_timestamp(datetimestamp, 'yyyy-MM-dd') as 
timestampCol from datetimetable
INFO  : Completed executing 
command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62); Time 
taken: 0.0 seconds
+--------+----------------+---------------+
| month  | datetimestamp  | timestampcol  |
+--------+----------------+---------------+
| Feb    | 2001-02-28     | 983318400     |
| Feb    | 2001-02-29     | 983318400     |
| Feb    | 2001-02-30     | 983318400     |
| Feb    | 2001-02-31     | 983318400     |
| Feb    | 2001-02-32     | NULL          |
+--------+----------------+---------------+
5 rows selected (0.131 seconds){noformat}
 

 

 

According to java jdk : 
[https://github.com/frohoff/jdk8u-dev-jdk/blob/master/src/share/classes/java/time/format/ResolverStyle.java#L103]

 
{noformat}
  /**
     * Style to resolve dates and times strictly.
     * <p>
     * Using strict resolution will ensure that all parsed values are within
     * the outer range of valid values for the field. Individual fields may
     * be further processed for strictness.
     * <p>
     * For example, resolving year-month and day-of-month in the ISO calendar
     * system using strict mode will ensure that the day-of-month is valid
     * for the year-month, rejecting invalid values.
     */
    STRICT,
    /**
     * Style to resolve dates and times in a smart, or intelligent, manner.
     * <p>
     * Using smart resolution will perform the sensible default for each
     * field, which may be the same as strict, the same as lenient, or a third
     * behavior. Individual fields will interpret this differently.
     * <p>
     * For example, resolving year-month and day-of-month in the ISO calendar
     * system using smart mode will ensure that the day-of-month is from
     * 1 to 31, converting any value beyond the last valid day-of-month to be
     * the last valid day-of-month.
     */
    SMART,{noformat}
 

 

By default, the DATETIME formatter uses the SMART resolution style and the 
SIMPLE formatter the LENIENT. Both of these styles are able to resolve 
"invalid" bounds to valid dates. In order to prevent seemingly "invalid" dates 
to be parsed correctly we have to use the STRICT resolution style. However, we 
cannot simply switch the formatters to always use the STRICT resolution cause 
that would break existing applications relying on the existing resolution 
rules. To address the problem reported here and retain the previous behaviour 
we opted to make the resolution style configurable by adding a new property. 
The new property only affects the DATETIME formatter; the SIMPLE formatter is 
almost deprecated so we don't add new features to it.





 

  was:
For invalid dates such as 2001-02-31, 2023-04-31 etc, UNIX_TIMESTAMP() is 
giving out the timestamp value as the last valid date, rather than NULL. (e.g. 
UNIX_TIMESTAMP('2001-02-31', 'yyyy-MM-dd') gives 983354400, which converts to 
'2001-02-28'. However, for calendar days larger than 31, e.g. 2001-02-32, or 
2023-04-32, UNIX_TIMESTAMP() would give NULL as a result.

In Spark and mysql, UNIX_TIMESTMAP for these invalid dates are all NULL (or 0).

 
{noformat}
6: jdbc:hive2://localhost:10001/> select month, datetimestamp, 
unix_timestamp(datetimestamp, 'yyyy-MM-dd') as timestampCol from datetimetable;
INFO  : Compiling 
command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62): 
select month, datetimestamp, unix_timestamp(datetimestamp, 'yyyy-MM-dd') as 
timestampCol from datetimetable
INFO  : No Stats for default@datetimetable, Columns: month, datetimestamp
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:month, 
type:string, comment:null), FieldSchema(name:datetimestamp, type:string, 
comment:null), FieldSchema(name:timestampcol, type:bigint, comment:null)], 
properties:null)
INFO  : Completed compiling 
command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62); Time 
taken: 0.102 seconds
INFO  : Operation QUERY obtained 0 locks
INFO  : Executing 
command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62): 
select month, datetimestamp, unix_timestamp(datetimestamp, 'yyyy-MM-dd') as 
timestampCol from datetimetable
INFO  : Completed executing 
command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62); Time 
taken: 0.0 seconds
+--------+----------------+---------------+
| month  | datetimestamp  | timestampcol  |
+--------+----------------+---------------+
| Feb    | 2001-02-28     | 983318400     |
| Feb    | 2001-02-29     | 983318400     |
| Feb    | 2001-02-30     | 983318400     |
| Feb    | 2001-02-31     | 983318400     |
| Feb    | 2001-02-32     | NULL          |
+--------+----------------+---------------+
5 rows selected (0.131 seconds){noformat}
 

 

It looks like 
[InstantDateTimeFormatter.java#L52|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/InstantDateTimeFormatter.java#L52]
  by default, the formatter has the SMART resolver style.

According to java jdk : 
https://github.com/frohoff/jdk8u-dev-jdk/blob/master/src/share/classes/java/time/format/ResolverStyle.java#L103

 
{noformat}
  /**
     * Style to resolve dates and times strictly.
     * <p>
     * Using strict resolution will ensure that all parsed values are within
     * the outer range of valid values for the field. Individual fields may
     * be further processed for strictness.
     * <p>
     * For example, resolving year-month and day-of-month in the ISO calendar
     * system using strict mode will ensure that the day-of-month is valid
     * for the year-month, rejecting invalid values.
     */
    STRICT,
    /**
     * Style to resolve dates and times in a smart, or intelligent, manner.
     * <p>
     * Using smart resolution will perform the sensible default for each
     * field, which may be the same as strict, the same as lenient, or a third
     * behavior. Individual fields will interpret this differently.
     * <p>
     * For example, resolving year-month and day-of-month in the ISO calendar
     * system using smart mode will ensure that the day-of-month is from
     * 1 to 31, converting any value beyond the last valid day-of-month to be
     * the last valid day-of-month.
     */
    SMART,{noformat}
 

 

Therefore, we should set the resolverStyle to STRICT to reject invalid date 
values.

 


> UNIX_TIMESTAMP should return NULL when date fields are out of bounds
> --------------------------------------------------------------------
>
>                 Key: HIVE-27772
>                 URL: https://issues.apache.org/jira/browse/HIVE-27772
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Simhadri Govindappa
>            Assignee: Simhadri Govindappa
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>
> For invalid dates such as 2001-02-31, 2023-04-31 etc, UNIX_TIMESTAMP() is 
> giving out the timestamp value as the last valid date, rather than NULL. 
> (e.g. UNIX_TIMESTAMP('2001-02-31', 'yyyy-MM-dd') gives 983354400, which 
> converts to '2001-02-28'. However, for calendar days larger than 31, e.g. 
> 2001-02-32, or 2023-04-32, UNIX_TIMESTAMP() would give NULL as a result.
> In Spark and mysql, UNIX_TIMESTMAP for these invalid dates are all NULL (or 
> 0).
>  
> {noformat}
> 6: jdbc:hive2://localhost:10001/> select month, datetimestamp, 
> unix_timestamp(datetimestamp, 'yyyy-MM-dd') as timestampCol from 
> datetimetable;
> INFO  : Compiling 
> command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62): 
> select month, datetimestamp, unix_timestamp(datetimestamp, 'yyyy-MM-dd') as 
> timestampCol from datetimetable
> INFO  : No Stats for default@datetimetable, Columns: month, datetimestamp
> INFO  : Semantic Analysis Completed (retrial = false)
> INFO  : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:month, 
> type:string, comment:null), FieldSchema(name:datetimestamp, type:string, 
> comment:null), FieldSchema(name:timestampcol, type:bigint, comment:null)], 
> properties:null)
> INFO  : Completed compiling 
> command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62); 
> Time taken: 0.102 seconds
> INFO  : Operation QUERY obtained 0 locks
> INFO  : Executing 
> command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62): 
> select month, datetimestamp, unix_timestamp(datetimestamp, 'yyyy-MM-dd') as 
> timestampCol from datetimetable
> INFO  : Completed executing 
> command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62); 
> Time taken: 0.0 seconds
> +--------+----------------+---------------+
> | month  | datetimestamp  | timestampcol  |
> +--------+----------------+---------------+
> | Feb    | 2001-02-28     | 983318400     |
> | Feb    | 2001-02-29     | 983318400     |
> | Feb    | 2001-02-30     | 983318400     |
> | Feb    | 2001-02-31     | 983318400     |
> | Feb    | 2001-02-32     | NULL          |
> +--------+----------------+---------------+
> 5 rows selected (0.131 seconds){noformat}
>  
>  
>  
> According to java jdk : 
> [https://github.com/frohoff/jdk8u-dev-jdk/blob/master/src/share/classes/java/time/format/ResolverStyle.java#L103]
>  
> {noformat}
>   /**
>      * Style to resolve dates and times strictly.
>      * <p>
>      * Using strict resolution will ensure that all parsed values are within
>      * the outer range of valid values for the field. Individual fields may
>      * be further processed for strictness.
>      * <p>
>      * For example, resolving year-month and day-of-month in the ISO calendar
>      * system using strict mode will ensure that the day-of-month is valid
>      * for the year-month, rejecting invalid values.
>      */
>     STRICT,
>     /**
>      * Style to resolve dates and times in a smart, or intelligent, manner.
>      * <p>
>      * Using smart resolution will perform the sensible default for each
>      * field, which may be the same as strict, the same as lenient, or a third
>      * behavior. Individual fields will interpret this differently.
>      * <p>
>      * For example, resolving year-month and day-of-month in the ISO calendar
>      * system using smart mode will ensure that the day-of-month is from
>      * 1 to 31, converting any value beyond the last valid day-of-month to be
>      * the last valid day-of-month.
>      */
>     SMART,{noformat}
>  
>  
> By default, the DATETIME formatter uses the SMART resolution style and the 
> SIMPLE formatter the LENIENT. Both of these styles are able to resolve 
> "invalid" bounds to valid dates. In order to prevent seemingly "invalid" 
> dates to be parsed correctly we have to use the STRICT resolution style. 
> However, we cannot simply switch the formatters to always use the STRICT 
> resolution cause that would break existing applications relying on the 
> existing resolution rules. To address the problem reported here and retain 
> the previous behaviour we opted to make the resolution style configurable by 
> adding a new property. The new property only affects the DATETIME formatter; 
> the SIMPLE formatter is almost deprecated so we don't add new features to it.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to