[ 
https://issues.apache.org/jira/browse/HIVE-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shubham Chaurasia updated HIVE-25449:
-------------------------------------
    Description: 
Repro (thanks Qiaosong Dong) - 

Add -Duser.timezone=GMT+8 to {{tez.task.launch.cmd-opts}}
{code}
create external table test_dt(id string, dt date);
insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07');

select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt on 
dt1.id = dt.id;
+------+
| _c0  |
+------+
| 6    |
| 7    |
+------+
{code}

Expected output - 
{code}
+------+
| _c0  |
+------+
| 5    |
| 6    |
+------+
{code}

*Cause*

This happens because in {{VectorUDFDateDiffColScalar}} class  
1. For second argument(scalar) , we use {{java.text.SimpleDateFormat}} to parse 
the date strings which interprets it to be in local timezone.

2. For first column we get a column vector which represents the date as epoch 
day. This is always in UTC.

*Solution*

We need to check other variants of datediff UDFs as well and change the parsing 
mechanism to always interpret date strings in UTC. 
 
I did a quick change in {{VectorUDFDateDiffColScalar}} which fixes the issue.
{code}
-          date.setTime(formatter.parse(new String(bytesValue, 
"UTF-8")).getTime());
-          baseDate = DateWritableV2.dateToDays(date);
+          org.apache.hadoop.hive.common.type.Date hiveDate
+              = org.apache.hadoop.hive.common.type.Date.valueOf(new 
String(bytesValue, "UTF-8"));
+          date.setTime(hiveDate.toEpochMilli());
+          baseDate = hiveDate.toEpochDay();
{code}








  was:
Repro (thanks Qiaosong Dong) - 

Add -Duser.timezone=GMT+8 to {{tez.task.launch.cmd-opts}}
{code}
create external table test_dt(id string, dt date);
insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07');

select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt on 
dt1.id = dt.id;
+------+
| _c0  |
+------+
| 6    |
| 7    |
+------+
{code}

Expected output - 
{code}
+------+
| _c0  |
+------+
| 5    |
| 6    |
+------+
{code}

*Cause*

This happens because in {{VectorUDFDateDiffColScalar}} class  
1. For second argument(scalar) , we use {{java.text.SimpleDateFormat}} to parse 
the date strings which interprets it to be local timezone.

2. For first column we get a column vector which represents the date as epoch 
day. This is always in UTC.

*Solution*

We need to check other variants of datediff UDFs as well and change the parsing 
mechanism to always interpret date strings in UTC. 
 
I did a quick change in {{VectorUDFDateDiffColScalar}} which fixes the issue.
{code}
-          date.setTime(formatter.parse(new String(bytesValue, 
"UTF-8")).getTime());
-          baseDate = DateWritableV2.dateToDays(date);
+          org.apache.hadoop.hive.common.type.Date hiveDate
+              = org.apache.hadoop.hive.common.type.Date.valueOf(new 
String(bytesValue, "UTF-8"));
+          date.setTime(hiveDate.toEpochMilli());
+          baseDate = hiveDate.toEpochDay();
{code}









> datediff() gives wrong output when we set tez.task.launch.cmd-opts to some 
> non UTC timezone
> -------------------------------------------------------------------------------------------
>
>                 Key: HIVE-25449
>                 URL: https://issues.apache.org/jira/browse/HIVE-25449
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>            Reporter: Shubham Chaurasia
>            Assignee: Shubham Chaurasia
>            Priority: Major
>
> Repro (thanks Qiaosong Dong) - 
> Add -Duser.timezone=GMT+8 to {{tez.task.launch.cmd-opts}}
> {code}
> create external table test_dt(id string, dt date);
> insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07');
> select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt 
> on dt1.id = dt.id;
> +------+
> | _c0  |
> +------+
> | 6    |
> | 7    |
> +------+
> {code}
> Expected output - 
> {code}
> +------+
> | _c0  |
> +------+
> | 5    |
> | 6    |
> +------+
> {code}
> *Cause*
> This happens because in {{VectorUDFDateDiffColScalar}} class  
> 1. For second argument(scalar) , we use {{java.text.SimpleDateFormat}} to 
> parse the date strings which interprets it to be in local timezone.
> 2. For first column we get a column vector which represents the date as epoch 
> day. This is always in UTC.
> *Solution*
> We need to check other variants of datediff UDFs as well and change the 
> parsing mechanism to always interpret date strings in UTC. 
>  
> I did a quick change in {{VectorUDFDateDiffColScalar}} which fixes the issue.
> {code}
> -          date.setTime(formatter.parse(new String(bytesValue, 
> "UTF-8")).getTime());
> -          baseDate = DateWritableV2.dateToDays(date);
> +          org.apache.hadoop.hive.common.type.Date hiveDate
> +              = org.apache.hadoop.hive.common.type.Date.valueOf(new 
> String(bytesValue, "UTF-8"));
> +          date.setTime(hiveDate.toEpochMilli());
> +          baseDate = hiveDate.toEpochDay();
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to