[ 
https://issues.apache.org/jira/browse/HIVE-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simhadri Govindappa updated HIVE-27772:
---------------------------------------
    Description: 
For invalid dates such as 2001-02-31, 2023-04-31 etc, UNIX_TIMESTAMP() is 
giving out the timestamp value as the last valid date, rather than NULL. (e.g. 
UNIX_TIMESTAMP('2001-02-31', 'yyyy-MM-dd') gives 983354400, which converts to 
'2001-02-28'. However, for calendar days larger than 31, e.g. 2001-02-32, or 
2023-04-32, UNIX_TIMESTAMP() would give NULL as a result.

In Spark and mysql, UNIX_TIMESTMAP for these invalid dates are all NULL (or 0).

 
{noformat}
6: jdbc:hive2://localhost:10001/> select month, datetimestamp, 
unix_timestamp(datetimestamp, 'yyyy-MM-dd') as timestampCol from datetimetable;
INFO  : Compiling 
command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62): 
select month, datetimestamp, unix_timestamp(datetimestamp, 'yyyy-MM-dd') as 
timestampCol from datetimetable
INFO  : No Stats for default@datetimetable, Columns: month, datetimestamp
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:month, 
type:string, comment:null), FieldSchema(name:datetimestamp, type:string, 
comment:null), FieldSchema(name:timestampcol, type:bigint, comment:null)], 
properties:null)
INFO  : Completed compiling 
command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62); Time 
taken: 0.102 seconds
INFO  : Operation QUERY obtained 0 locks
INFO  : Executing 
command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62): 
select month, datetimestamp, unix_timestamp(datetimestamp, 'yyyy-MM-dd') as 
timestampCol from datetimetable
INFO  : Completed executing 
command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62); Time 
taken: 0.0 seconds
+--------+----------------+---------------+
| month  | datetimestamp  | timestampcol  |
+--------+----------------+---------------+
| Feb    | 2001-02-28     | 983318400     |
| Feb    | 2001-02-29     | 983318400     |
| Feb    | 2001-02-30     | 983318400     |
| Feb    | 2001-02-31     | 983318400     |
| Feb    | 2001-02-32     | NULL          |
+--------+----------------+---------------+
5 rows selected (0.131 seconds){noformat}
 

 

It looks like 
[InstantDateTimeFormatter.java#L52|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/InstantDateTimeFormatter.java#L52]
  by default, the formatter has the SMART resolver style.

According to java jdk : 
https://github.com/frohoff/jdk8u-dev-jdk/blob/master/src/share/classes/java/time/format/ResolverStyle.java#L103

 
{noformat}
  /**
     * Style to resolve dates and times strictly.
     * <p>
     * Using strict resolution will ensure that all parsed values are within
     * the outer range of valid values for the field. Individual fields may
     * be further processed for strictness.
     * <p>
     * For example, resolving year-month and day-of-month in the ISO calendar
     * system using strict mode will ensure that the day-of-month is valid
     * for the year-month, rejecting invalid values.
     */
    STRICT,
    /**
     * Style to resolve dates and times in a smart, or intelligent, manner.
     * <p>
     * Using smart resolution will perform the sensible default for each
     * field, which may be the same as strict, the same as lenient, or a third
     * behavior. Individual fields will interpret this differently.
     * <p>
     * For example, resolving year-month and day-of-month in the ISO calendar
     * system using smart mode will ensure that the day-of-month is from
     * 1 to 31, converting any value beyond the last valid day-of-month to be
     * the last valid day-of-month.
     */
    SMART,{noformat}
 

 

Therefore, we should set the resolverStyle to STRICT to reject invalid date 
values.

 

  was:
For invalid dates such as 2001-02-31, 2023-04-31 etc, UNIX_TIMESTAMP() is 
giving out the timestamp value as the last valid date, rather than NULL. (e.g. 
UNIX_TIMESTAMP('2001-02-31', 'yyyy-MM-dd') gives 983354400, which converts to 
'2001-02-28'. However, for calendar days larger than 31, e.g. 2001-02-32, or 
2023-04-32, UNIX_TIMESTAMP() would give NULL as a result.

In Spark and mysql, UNIX_TIMESTMAP for these invalid dates are all NULL (or 0).

 

 

 
{noformat}
6: jdbc:hive2://localhost:10001/> select month, datetimestamp, 
unix_timestamp(datetimestamp, 'yyyy-MM-dd') as timestampCol from datetimetable;
INFO  : Compiling 
command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62): 
select month, datetimestamp, unix_timestamp(datetimestamp, 'yyyy-MM-dd') as 
timestampCol from datetimetable
INFO  : No Stats for default@datetimetable, Columns: month, datetimestamp
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:month, 
type:string, comment:null), FieldSchema(name:datetimestamp, type:string, 
comment:null), FieldSchema(name:timestampcol, type:bigint, comment:null)], 
properties:null)
INFO  : Completed compiling 
command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62); Time 
taken: 0.102 seconds
INFO  : Operation QUERY obtained 0 locks
INFO  : Executing 
command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62): 
select month, datetimestamp, unix_timestamp(datetimestamp, 'yyyy-MM-dd') as 
timestampCol from datetimetable
INFO  : Completed executing 
command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62); Time 
taken: 0.0 seconds
+--------+----------------+---------------+
| month  | datetimestamp  | timestampcol  |
+--------+----------------+---------------+
| Feb    | 2001-02-28     | 983318400     |
| Feb    | 2001-02-29     | 983318400     |
| Feb    | 2001-02-30     | 983318400     |
| Feb    | 2001-02-31     | 983318400     |
| Feb    | 2001-02-32     | NULL          |
+--------+----------------+---------------+
5 rows selected (0.131 seconds){noformat}


> Hive UNIX_TIMESTAMP() not returning null for invalid dates
> ----------------------------------------------------------
>
>                 Key: HIVE-27772
>                 URL: https://issues.apache.org/jira/browse/HIVE-27772
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Simhadri Govindappa
>            Assignee: Simhadri Govindappa
>            Priority: Major
>
> For invalid dates such as 2001-02-31, 2023-04-31 etc, UNIX_TIMESTAMP() is 
> giving out the timestamp value as the last valid date, rather than NULL. 
> (e.g. UNIX_TIMESTAMP('2001-02-31', 'yyyy-MM-dd') gives 983354400, which 
> converts to '2001-02-28'. However, for calendar days larger than 31, e.g. 
> 2001-02-32, or 2023-04-32, UNIX_TIMESTAMP() would give NULL as a result.
> In Spark and mysql, UNIX_TIMESTMAP for these invalid dates are all NULL (or 
> 0).
>  
> {noformat}
> 6: jdbc:hive2://localhost:10001/> select month, datetimestamp, 
> unix_timestamp(datetimestamp, 'yyyy-MM-dd') as timestampCol from 
> datetimetable;
> INFO  : Compiling 
> command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62): 
> select month, datetimestamp, unix_timestamp(datetimestamp, 'yyyy-MM-dd') as 
> timestampCol from datetimetable
> INFO  : No Stats for default@datetimetable, Columns: month, datetimestamp
> INFO  : Semantic Analysis Completed (retrial = false)
> INFO  : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:month, 
> type:string, comment:null), FieldSchema(name:datetimestamp, type:string, 
> comment:null), FieldSchema(name:timestampcol, type:bigint, comment:null)], 
> properties:null)
> INFO  : Completed compiling 
> command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62); 
> Time taken: 0.102 seconds
> INFO  : Operation QUERY obtained 0 locks
> INFO  : Executing 
> command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62): 
> select month, datetimestamp, unix_timestamp(datetimestamp, 'yyyy-MM-dd') as 
> timestampCol from datetimetable
> INFO  : Completed executing 
> command(queryId=root_20231005104216_8520e3e9-d03b-4e34-93ec-ddc280845c62); 
> Time taken: 0.0 seconds
> +--------+----------------+---------------+
> | month  | datetimestamp  | timestampcol  |
> +--------+----------------+---------------+
> | Feb    | 2001-02-28     | 983318400     |
> | Feb    | 2001-02-29     | 983318400     |
> | Feb    | 2001-02-30     | 983318400     |
> | Feb    | 2001-02-31     | 983318400     |
> | Feb    | 2001-02-32     | NULL          |
> +--------+----------------+---------------+
> 5 rows selected (0.131 seconds){noformat}
>  
>  
> It looks like 
> [InstantDateTimeFormatter.java#L52|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/InstantDateTimeFormatter.java#L52]
>   by default, the formatter has the SMART resolver style.
> According to java jdk : 
> https://github.com/frohoff/jdk8u-dev-jdk/blob/master/src/share/classes/java/time/format/ResolverStyle.java#L103
>  
> {noformat}
>   /**
>      * Style to resolve dates and times strictly.
>      * <p>
>      * Using strict resolution will ensure that all parsed values are within
>      * the outer range of valid values for the field. Individual fields may
>      * be further processed for strictness.
>      * <p>
>      * For example, resolving year-month and day-of-month in the ISO calendar
>      * system using strict mode will ensure that the day-of-month is valid
>      * for the year-month, rejecting invalid values.
>      */
>     STRICT,
>     /**
>      * Style to resolve dates and times in a smart, or intelligent, manner.
>      * <p>
>      * Using smart resolution will perform the sensible default for each
>      * field, which may be the same as strict, the same as lenient, or a third
>      * behavior. Individual fields will interpret this differently.
>      * <p>
>      * For example, resolving year-month and day-of-month in the ISO calendar
>      * system using smart mode will ensure that the day-of-month is from
>      * 1 to 31, converting any value beyond the last valid day-of-month to be
>      * the last valid day-of-month.
>      */
>     SMART,{noformat}
>  
>  
> Therefore, we should set the resolverStyle to STRICT to reject invalid date 
> values.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to