[jira] [Updated] (HIVE-25575) Add support for JWT authentication
[ https://issues.apache.org/jira/browse/HIVE-25575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shubham Chaurasia updated HIVE-25575:
--
Description:
It would be good to support JWT auth mechanism in hive. In order to implement it, we would need the following -

On HS2 side -
1. Accept JWT in Authorization: Bearer header.
2. Fetch JWKS from a public endpoint to verify JWT signature, to start with we can fetch on HS2 start up.
3. Verify JWT Signature.

On JDBC Client side -
1. Hive jdbc client should be able to accept jwt in JDBC url. (will add more details)
2. Client should also be able to pick up JWT from an env var if it's defined.

was:
It would be good to support JWT auth mechanism in hive. In order to implement it, we would need the following -

On HS2 side -
1. Accept JWT in Authorization: Bearer header.
2. Fetch JWKS from a public endpoint to verify JWT signature, to start with we can fetch on HS2 start up.
3. Verify JWT Signature.

On JDBC Client side -
1. Hive jdbc client should be able to accept jwt in JDBC url. (will add more details)
2. Client should also be able to pick up JWT from a env var if it's defined.

> Add support for JWT authentication
> --
>
> Key: HIVE-25575
> URL: https://issues.apache.org/jira/browse/HIVE-25575
> Project: Hive
> Issue Type: New Feature
> Components: HiveServer2, JDBC
> Affects Versions: 4.0.0
> Reporter: Shubham Chaurasia
> Assignee: Shubham Chaurasia
> Priority: Major
-- This message was sent by Atlassian Jira (v8.3.4#803005)
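The HS2-side flow described in the ticket — pull the token out of the Authorization header, then use the JOSE header's key id to pick the verification key from the fetched JWKS — can be sketched roughly as below. This is an illustrative sketch only, not Hive's actual implementation: the class and method names are made up, and real signature verification would use the JWKS key material rather than stop at the header decode shown here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Optional;

// Illustrative sketch of the HS2-side steps (hypothetical names).
public class BearerJwtSketch {

    // Step 1: accept the JWT from an "Authorization: Bearer <token>" header value.
    static Optional<String> extractBearerToken(String authorizationHeader) {
        String prefix = "Bearer ";
        if (authorizationHeader == null
                || !authorizationHeader.regionMatches(true, 0, prefix, 0, prefix.length())) {
            return Optional.empty();
        }
        return Optional.of(authorizationHeader.substring(prefix.length()).trim());
    }

    // Steps 2-3 begin here: decode the JOSE header (the first base64url segment
    // of the compact JWT) so the "kid" it carries can select the verification
    // key in the JWKS fetched at HS2 startup.
    static String decodeJoseHeader(String jwt) {
        String[] parts = jwt.split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("not a signed compact JWT");
        }
        return new String(Base64.getUrlDecoder().decode(parts[0]), StandardCharsets.UTF_8);
    }
}
```

On the client side, the same compact token could be read from the JDBC URL or from an environment variable (e.g. via System.getenv) before being placed in the header.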
[jira] [Assigned] (HIVE-25575) Add support for JWT authentication
[ https://issues.apache.org/jira/browse/HIVE-25575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shubham Chaurasia reassigned HIVE-25575:
--

> Add support for JWT authentication
> --
>
> Key: HIVE-25575
> URL: https://issues.apache.org/jira/browse/HIVE-25575
> Project: Hive
> Issue Type: New Feature
> Components: HiveServer2, JDBC
> Affects Versions: 4.0.0
> Reporter: Shubham Chaurasia
> Assignee: Shubham Chaurasia
> Priority: Major

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-25449) datediff() gives wrong output when run in a tez task with some non-UTC timezone
[ https://issues.apache.org/jira/browse/HIVE-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17401456#comment-17401456 ]

Shubham Chaurasia commented on HIVE-25449:
--
[~abstractdog] [~klcopp] Can you please review?

> datediff() gives wrong output when run in a tez task with some non-UTC timezone
> --
>
> Key: HIVE-25449
> URL: https://issues.apache.org/jira/browse/HIVE-25449
> Project: Hive
> Issue Type: Bug
> Components: UDF
> Reporter: Shubham Chaurasia
> Assignee: Shubham Chaurasia
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Repro (thanks Qiaosong Dong) -
> Add -Duser.timezone=GMT+8 to {{tez.task.launch.cmd-opts}}
> {code}
> create external table test_dt(id string, dt date);
> insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07');
> select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt on dt1.id = dt.id;
> +------+
> | _c0  |
> +------+
> | 6    |
> | 7    |
> +------+
> {code}
> Expected output -
> {code}
> +------+
> | _c0  |
> +------+
> | 5    |
> | 6    |
> +------+
> {code}
> *Cause*
> This happens because, in the {{VectorUDFDateDiffColScalar}} class:
> 1. For the second argument (scalar), we use {{java.text.SimpleDateFormat}} to parse the date string, which interprets it in the local timezone.
> 2. For the first column, we get a column vector which represents the date as an epoch day. This is always in UTC.
> *Solution*
> We need to check the other variants of the datediff UDFs as well and change the parsing mechanism to always interpret date strings in UTC.
>
> I did a quick change in {{VectorUDFDateDiffColScalar}} which fixes the issue.
> {code}
> -      date.setTime(formatter.parse(new String(bytesValue, "UTF-8")).getTime());
> -      baseDate = DateWritableV2.dateToDays(date);
> +      org.apache.hadoop.hive.common.type.Date hiveDate
> +          = org.apache.hadoop.hive.common.type.Date.valueOf(new String(bytesValue, "UTF-8"));
> +      date.setTime(hiveDate.toEpochMilli());
> +      baseDate = hiveDate.toEpochDay();
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
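The off-by-one described above can be reproduced outside Hive: parsing a date string with SimpleDateFormat under a GMT+8 default timezone and then truncating epoch milliseconds to whole days lands one day behind the UTC epoch day that the date column vector carries. A minimal standalone sketch (not Hive code; method names are illustrative):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.util.TimeZone;

public class DateDiffSketch {

    // The buggy path: parse in the JVM default timezone, then truncate
    // epoch milliseconds to whole days (mirrors SimpleDateFormat followed
    // by a millis-to-days conversion). Under GMT+8, local midnight falls
    // on the *previous* UTC day, so the division loses a day.
    static long localParsedEpochDay(String date) throws ParseException {
        SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd");
        return formatter.parse(date).getTime() / 86_400_000L;
    }

    // The timezone-safe path: interpret the date string as a plain date in
    // UTC, matching the epoch-day encoding of the date column vector.
    static long utcEpochDay(String date) {
        return LocalDate.parse(date).toEpochDay();
    }

    public static void main(String[] args) throws ParseException {
        // Simulates -Duser.timezone=GMT+8 in tez.task.launch.cmd-opts.
        TimeZone.setDefault(TimeZone.getTimeZone("GMT+8"));
        System.out.println(localParsedEpochDay("2021-07-01")); // one day behind
        System.out.println(utcEpochDay("2021-07-01"));
    }
}
```

With the base date one day too small, every datediff result comes out one day too large, which is exactly the 6/7-instead-of-5/6 output in the repro.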
[jira] [Updated] (HIVE-25449) datediff() gives wrong output when run in a tez task with some non-UTC timezone
[ https://issues.apache.org/jira/browse/HIVE-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-25449: - Summary: datediff() gives wrong output when run in a tez task with some non-UTC timezone (was: datediff() gives wrong output when we add some non UTC timezone to tez.task.launch.cmd-opts) > datediff() gives wrong output when run in a tez task with some non-UTC > timezone > --- > > Key: HIVE-25449 > URL: https://issues.apache.org/jira/browse/HIVE-25449 > Project: Hive > Issue Type: Bug > Components: UDF >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > Repro (thanks Qiaosong Dong) - > Add -Duser.timezone=GMT+8 to {{tez.task.launch.cmd-opts}} > {code} > create external table test_dt(id string, dt date); > insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07'); > select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt > on dt1.id = dt.id; > +--+ > | _c0 | > +--+ > | 6| > | 7| > +--+ > {code} > Expected output - > {code} > +--+ > | _c0 | > +--+ > | 5| > | 6| > +--+ > {code} > *Cause* > This happens because in {{VectorUDFDateDiffColScalar}} class > 1. For second argument(scalar) , we use {{java.text.SimpleDateFormat}} to > parse the date strings which interprets it to be in local timezone. > 2. For first column we get a column vector which represents the date as epoch > day. This is always in UTC. > *Solution* > We need to check other variants of datediff UDFs as well and change the > parsing mechanism to always interpret date strings in UTC. > > I did a quick change in {{VectorUDFDateDiffColScalar}} which fixes the issue. 
> {code} > - date.setTime(formatter.parse(new String(bytesValue, > "UTF-8")).getTime()); > - baseDate = DateWritableV2.dateToDays(date); > + org.apache.hadoop.hive.common.type.Date hiveDate > + = org.apache.hadoop.hive.common.type.Date.valueOf(new > String(bytesValue, "UTF-8")); > + date.setTime(hiveDate.toEpochMilli()); > + baseDate = hiveDate.toEpochDay(); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-25449) datediff() gives wrong output when we add some non UTC timezone to tez.task.launch.cmd-opts
[ https://issues.apache.org/jira/browse/HIVE-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-25449: - Summary: datediff() gives wrong output when we add some non UTC timezone to tez.task.launch.cmd-opts (was: datediff() gives wrong output when we set tez.task.launch.cmd-opts to some non UTC timezone) > datediff() gives wrong output when we add some non UTC timezone to > tez.task.launch.cmd-opts > --- > > Key: HIVE-25449 > URL: https://issues.apache.org/jira/browse/HIVE-25449 > Project: Hive > Issue Type: Bug > Components: UDF >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > Repro (thanks Qiaosong Dong) - > Add -Duser.timezone=GMT+8 to {{tez.task.launch.cmd-opts}} > {code} > create external table test_dt(id string, dt date); > insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07'); > select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt > on dt1.id = dt.id; > +--+ > | _c0 | > +--+ > | 6| > | 7| > +--+ > {code} > Expected output - > {code} > +--+ > | _c0 | > +--+ > | 5| > | 6| > +--+ > {code} > *Cause* > This happens because in {{VectorUDFDateDiffColScalar}} class > 1. For second argument(scalar) , we use {{java.text.SimpleDateFormat}} to > parse the date strings which interprets it to be in local timezone. > 2. For first column we get a column vector which represents the date as epoch > day. This is always in UTC. > *Solution* > We need to check other variants of datediff UDFs as well and change the > parsing mechanism to always interpret date strings in UTC. > > I did a quick change in {{VectorUDFDateDiffColScalar}} which fixes the issue. 
> {code} > - date.setTime(formatter.parse(new String(bytesValue, > "UTF-8")).getTime()); > - baseDate = DateWritableV2.dateToDays(date); > + org.apache.hadoop.hive.common.type.Date hiveDate > + = org.apache.hadoop.hive.common.type.Date.valueOf(new > String(bytesValue, "UTF-8")); > + date.setTime(hiveDate.toEpochMilli()); > + baseDate = hiveDate.toEpochDay(); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-25449) datediff() gives wrong output when we set tez.task.launch.cmd-opts to some non UTC timezone
[ https://issues.apache.org/jira/browse/HIVE-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-25449: - Description: Repro (thanks Qiaosong Dong) - Add -Duser.timezone=GMT+8 to {{tez.task.launch.cmd-opts}} {code} create external table test_dt(id string, dt date); insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07'); select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt on dt1.id = dt.id; +--+ | _c0 | +--+ | 6| | 7| +--+ {code} Expected output - {code} +--+ | _c0 | +--+ | 5| | 6| +--+ {code} *Cause* This happens because in {{VectorUDFDateDiffColScalar}} class 1. For second argument(scalar) , we use {{java.text.SimpleDateFormat}} to parse the date strings which interprets it to be in local timezone. 2. For first column we get a column vector which represents the date as epoch day. This is always in UTC. *Solution* We need to check other variants of datediff UDFs as well and change the parsing mechanism to always interpret date strings in UTC. I did a quick change in {{VectorUDFDateDiffColScalar}} which fixes the issue. {code} - date.setTime(formatter.parse(new String(bytesValue, "UTF-8")).getTime()); - baseDate = DateWritableV2.dateToDays(date); + org.apache.hadoop.hive.common.type.Date hiveDate + = org.apache.hadoop.hive.common.type.Date.valueOf(new String(bytesValue, "UTF-8")); + date.setTime(hiveDate.toEpochMilli()); + baseDate = hiveDate.toEpochDay(); {code} was: Repro (thanks Qiaosong Dong) - Add -Duser.timezone=GMT+8 to {{tez.task.launch.cmd-opts}} {code} create external table test_dt(id string, dt date); insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07'); select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt on dt1.id = dt.id; +--+ | _c0 | +--+ | 6| | 7| +--+ {code} Expected output - {code} +--+ | _c0 | +--+ | 5| | 6| +--+ {code} *Cause* This happens because in {{VectorUDFDateDiffColScalar}} class 1. 
For second argument(scalar) , we use {{java.text.SimpleDateFormat}} to parse the date strings which interprets it to be local timezone. 2. For first column we get a column vector which represents the date as epoch day. This is always in UTC. *Solution* We need to check other variants of datediff UDFs as well and change the parsing mechanism to always interpret date strings in UTC. I did a quick change in {{VectorUDFDateDiffColScalar}} which fixes the issue. {code} - date.setTime(formatter.parse(new String(bytesValue, "UTF-8")).getTime()); - baseDate = DateWritableV2.dateToDays(date); + org.apache.hadoop.hive.common.type.Date hiveDate + = org.apache.hadoop.hive.common.type.Date.valueOf(new String(bytesValue, "UTF-8")); + date.setTime(hiveDate.toEpochMilli()); + baseDate = hiveDate.toEpochDay(); {code} > datediff() gives wrong output when we set tez.task.launch.cmd-opts to some > non UTC timezone > --- > > Key: HIVE-25449 > URL: https://issues.apache.org/jira/browse/HIVE-25449 > Project: Hive > Issue Type: Bug > Components: UDF >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > Repro (thanks Qiaosong Dong) - > Add -Duser.timezone=GMT+8 to {{tez.task.launch.cmd-opts}} > {code} > create external table test_dt(id string, dt date); > insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07'); > select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt > on dt1.id = dt.id; > +--+ > | _c0 | > +--+ > | 6| > | 7| > +--+ > {code} > Expected output - > {code} > +--+ > | _c0 | > +--+ > | 5| > | 6| > +--+ > {code} > *Cause* > This happens because in {{VectorUDFDateDiffColScalar}} class > 1. For second argument(scalar) , we use {{java.text.SimpleDateFormat}} to > parse the date strings which interprets it to be in local timezone. > 2. For first column we get a column vector which represents the date as epoch > day. This is always in UTC. 
> *Solution*
> We need to check other variants of datediff UDFs as well and change the parsing mechanism to always interpret date strings in UTC.
>
> I did a quick change in {{VectorUDFDateDiffColScalar}} which fixes the issue.
> {code}
> -      date.setTime(formatter.parse(new String(bytesValue, "UTF-8")).getTime());
> -      baseDate = DateWritableV2.dateToDays(date);
> +      org.apache.hadoop.hive.common.type.Date hiveDate
> +          = org.apache.hadoop.hive.common.type.Date.valueOf(new String(bytesValue, "UTF-8"));
> +      date.setTime(hiveDate.toEpochMilli());
> +      baseDate = hiveDate.toEpochDay();
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-25449) datediff() gives wrong output when we set tez.task.launch.cmd-opts to some non UTC timezone
[ https://issues.apache.org/jira/browse/HIVE-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-25449: - Description: Repro (thanks Qiaosong Dong) - Add -Duser.timezone=GMT+8 to {{tez.task.launch.cmd-opts}} {code} create external table test_dt(id string, dt date); insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07'); select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt on dt1.id = dt.id; +--+ | _c0 | +--+ | 6| | 7| +--+ {code} Expected output - {code} +--+ | _c0 | +--+ | 5| | 6| +--+ {code} *Cause* This happens because in {{VectorUDFDateDiffColScalar}} class 1. For second argument(scalar) , we use {{java.text.SimpleDateFormat}} to parse the date strings which interprets it to be local timezone. 2. For first column we get a column vector which represents the date as epoch day. This is always in UTC. *Solution* We need to check other variants of datediff UDFs as well and change the parsing mechanism to always interpret date strings in UTC. I did a quick change in {{VectorUDFDateDiffColScalar}} which fixes the issue. {code} - date.setTime(formatter.parse(new String(bytesValue, "UTF-8")).getTime()); - baseDate = DateWritableV2.dateToDays(date); + org.apache.hadoop.hive.common.type.Date hiveDate + = org.apache.hadoop.hive.common.type.Date.valueOf(new String(bytesValue, "UTF-8")); + date.setTime(hiveDate.toEpochMilli()); + baseDate = hiveDate.toEpochDay(); {code} was: Repro (thanks Qiaosong Dong) - 1. Add -Duser.timezone=GMT+8 to {{tez.task.launch.cmd-opts}} {code} create external table test_dt(id string, dt date); insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07'); select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt on dt1.id = dt.id; +--+ | _c0 | +--+ | 6| | 7| +--+ {code} Expected output - {code} +--+ | _c0 | +--+ | 5| | 6| +--+ {code} *Cause* This happens because in {{VectorUDFDateDiffColScalar}} class 1. 
For second argument(scalar) , we use {{java.text.SimpleDateFormat}} to parse the date strings which interprets it to be local timezone. 2. For first column we get a column vector which represents the date as epoch day. This is always in UTC. *Solution* We need to check other variants of datediff UDFs as well and change the parsing mechanism to always interpret date strings in UTC. I did a quick change in {{VectorUDFDateDiffColScalar}} which fixes the issue. {code} - date.setTime(formatter.parse(new String(bytesValue, "UTF-8")).getTime()); - baseDate = DateWritableV2.dateToDays(date); + org.apache.hadoop.hive.common.type.Date hiveDate + = org.apache.hadoop.hive.common.type.Date.valueOf(new String(bytesValue, "UTF-8")); + date.setTime(hiveDate.toEpochMilli()); + baseDate = hiveDate.toEpochDay(); {code} > datediff() gives wrong output when we set tez.task.launch.cmd-opts to some > non UTC timezone > --- > > Key: HIVE-25449 > URL: https://issues.apache.org/jira/browse/HIVE-25449 > Project: Hive > Issue Type: Bug > Components: UDF >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > Repro (thanks Qiaosong Dong) - > Add -Duser.timezone=GMT+8 to {{tez.task.launch.cmd-opts}} > {code} > create external table test_dt(id string, dt date); > insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07'); > select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt > on dt1.id = dt.id; > +--+ > | _c0 | > +--+ > | 6| > | 7| > +--+ > {code} > Expected output - > {code} > +--+ > | _c0 | > +--+ > | 5| > | 6| > +--+ > {code} > *Cause* > This happens because in {{VectorUDFDateDiffColScalar}} class > 1. For second argument(scalar) , we use {{java.text.SimpleDateFormat}} to > parse the date strings which interprets it to be local timezone. > 2. For first column we get a column vector which represents the date as epoch > day. This is always in UTC. 
> *Solution*
> We need to check other variants of datediff UDFs as well and change the parsing mechanism to always interpret date strings in UTC.
>
> I did a quick change in {{VectorUDFDateDiffColScalar}} which fixes the issue.
> {code}
> -      date.setTime(formatter.parse(new String(bytesValue, "UTF-8")).getTime());
> -      baseDate = DateWritableV2.dateToDays(date);
> +      org.apache.hadoop.hive.common.type.Date hiveDate
> +          = org.apache.hadoop.hive.common.type.Date.valueOf(new String(bytesValue, "UTF-8"));
> +      date.setTime(hiveDate.toEpochMilli());
> +      baseDate = hiveDate.toEpochDay();
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-25449) datediff() gives wrong output when we set tez.task.launch.cmd-opts to some non UTC timezone
[ https://issues.apache.org/jira/browse/HIVE-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-25449: - Description: Repro (thanks Qiaosong Dong) - 1. Add -Duser.timezone=GMT+8 to {code} create external table test_dt(id string, dt date); insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07'); select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt on dt1.id = dt.id; +--+ | _c0 | +--+ | 6| | 7| +--+ {code} Expected output - {code} +--+ | _c0 | +--+ | 5| | 6| +--+ {code} *Cause* This happens because in {{VectorUDFDateDiffColScalar}} class 1. For second argument(scalar) , we use {{java.text.SimpleDateFormat}} to parse the date strings which interprets it to be local timezone. 2. For first column we get a column vector which represents the date as epoch day. This is always in UTC. *Solution* We need to check other variants of datediff UDFs as well and change the parsing mechanism to always interpret date strings in UTC. I did a quick change in {{VectorUDFDateDiffColScalar}} which fixes the issue. {code} - date.setTime(formatter.parse(new String(bytesValue, "UTF-8")).getTime()); - baseDate = DateWritableV2.dateToDays(date); + org.apache.hadoop.hive.common.type.Date hiveDate + = org.apache.hadoop.hive.common.type.Date.valueOf(new String(bytesValue, "UTF-8")); + date.setTime(hiveDate.toEpochMilli()); + baseDate = hiveDate.toEpochDay(); {code} was: Repro (thanks Qiaosong Dong) - {code} create external table test_dt(id string, dt date); insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07'); select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt on dt1.id = dt.id; +--+ | _c0 | +--+ | 6| | 7| +--+ {code} Expected output - {code} +--+ | _c0 | +--+ | 5| | 6| +--+ {code} *Cause* This happens because in {{VectorUDFDateDiffColScalar}} class 1. 
For second argument(scalar) , we use {{java.text.SimpleDateFormat}} to parse the date strings which interprets it to be local timezone. 2. For first column we get a column vector which represents the date as epoch day. This is always in UTC. *Solution* We need to check other variants of datediff UDFs as well and change the parsing mechanism to always interpret date strings in UTC. I did a quick change in {{VectorUDFDateDiffColScalar}} which fixes the issue. {code} - date.setTime(formatter.parse(new String(bytesValue, "UTF-8")).getTime()); - baseDate = DateWritableV2.dateToDays(date); + org.apache.hadoop.hive.common.type.Date hiveDate + = org.apache.hadoop.hive.common.type.Date.valueOf(new String(bytesValue, "UTF-8")); + date.setTime(hiveDate.toEpochMilli()); + baseDate = hiveDate.toEpochDay(); {code} > datediff() gives wrong output when we set tez.task.launch.cmd-opts to some > non UTC timezone > --- > > Key: HIVE-25449 > URL: https://issues.apache.org/jira/browse/HIVE-25449 > Project: Hive > Issue Type: Bug > Components: UDF >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > Repro (thanks Qiaosong Dong) - > 1. Add -Duser.timezone=GMT+8 to > {code} > create external table test_dt(id string, dt date); > insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07'); > select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt > on dt1.id = dt.id; > +--+ > | _c0 | > +--+ > | 6| > | 7| > +--+ > {code} > Expected output - > {code} > +--+ > | _c0 | > +--+ > | 5| > | 6| > +--+ > {code} > *Cause* > This happens because in {{VectorUDFDateDiffColScalar}} class > 1. For second argument(scalar) , we use {{java.text.SimpleDateFormat}} to > parse the date strings which interprets it to be local timezone. > 2. For first column we get a column vector which represents the date as epoch > day. This is always in UTC. 
> *Solution*
> We need to check other variants of datediff UDFs as well and change the parsing mechanism to always interpret date strings in UTC.
>
> I did a quick change in {{VectorUDFDateDiffColScalar}} which fixes the issue.
> {code}
> -      date.setTime(formatter.parse(new String(bytesValue, "UTF-8")).getTime());
> -      baseDate = DateWritableV2.dateToDays(date);
> +      org.apache.hadoop.hive.common.type.Date hiveDate
> +          = org.apache.hadoop.hive.common.type.Date.valueOf(new String(bytesValue, "UTF-8"));
> +      date.setTime(hiveDate.toEpochMilli());
> +      baseDate = hiveDate.toEpochDay();
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-25449) datediff() gives wrong output when we set tez.task.launch.cmd-opts to some non UTC timezone
[ https://issues.apache.org/jira/browse/HIVE-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-25449: - Description: Repro (thanks Qiaosong Dong) - 1. Add -Duser.timezone=GMT+8 to {{tez.task.launch.cmd-opts}} {code} create external table test_dt(id string, dt date); insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07'); select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt on dt1.id = dt.id; +--+ | _c0 | +--+ | 6| | 7| +--+ {code} Expected output - {code} +--+ | _c0 | +--+ | 5| | 6| +--+ {code} *Cause* This happens because in {{VectorUDFDateDiffColScalar}} class 1. For second argument(scalar) , we use {{java.text.SimpleDateFormat}} to parse the date strings which interprets it to be local timezone. 2. For first column we get a column vector which represents the date as epoch day. This is always in UTC. *Solution* We need to check other variants of datediff UDFs as well and change the parsing mechanism to always interpret date strings in UTC. I did a quick change in {{VectorUDFDateDiffColScalar}} which fixes the issue. {code} - date.setTime(formatter.parse(new String(bytesValue, "UTF-8")).getTime()); - baseDate = DateWritableV2.dateToDays(date); + org.apache.hadoop.hive.common.type.Date hiveDate + = org.apache.hadoop.hive.common.type.Date.valueOf(new String(bytesValue, "UTF-8")); + date.setTime(hiveDate.toEpochMilli()); + baseDate = hiveDate.toEpochDay(); {code} was: Repro (thanks Qiaosong Dong) - 1. Add -Duser.timezone=GMT+8 to {code} create external table test_dt(id string, dt date); insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07'); select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt on dt1.id = dt.id; +--+ | _c0 | +--+ | 6| | 7| +--+ {code} Expected output - {code} +--+ | _c0 | +--+ | 5| | 6| +--+ {code} *Cause* This happens because in {{VectorUDFDateDiffColScalar}} class 1. 
For second argument(scalar) , we use {{java.text.SimpleDateFormat}} to parse the date strings which interprets it to be local timezone. 2. For first column we get a column vector which represents the date as epoch day. This is always in UTC. *Solution* We need to check other variants of datediff UDFs as well and change the parsing mechanism to always interpret date strings in UTC. I did a quick change in {{VectorUDFDateDiffColScalar}} which fixes the issue. {code} - date.setTime(formatter.parse(new String(bytesValue, "UTF-8")).getTime()); - baseDate = DateWritableV2.dateToDays(date); + org.apache.hadoop.hive.common.type.Date hiveDate + = org.apache.hadoop.hive.common.type.Date.valueOf(new String(bytesValue, "UTF-8")); + date.setTime(hiveDate.toEpochMilli()); + baseDate = hiveDate.toEpochDay(); {code} > datediff() gives wrong output when we set tez.task.launch.cmd-opts to some > non UTC timezone > --- > > Key: HIVE-25449 > URL: https://issues.apache.org/jira/browse/HIVE-25449 > Project: Hive > Issue Type: Bug > Components: UDF >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > Repro (thanks Qiaosong Dong) - > 1. Add -Duser.timezone=GMT+8 to {{tez.task.launch.cmd-opts}} > {code} > create external table test_dt(id string, dt date); > insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07'); > select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt > on dt1.id = dt.id; > +--+ > | _c0 | > +--+ > | 6| > | 7| > +--+ > {code} > Expected output - > {code} > +--+ > | _c0 | > +--+ > | 5| > | 6| > +--+ > {code} > *Cause* > This happens because in {{VectorUDFDateDiffColScalar}} class > 1. For second argument(scalar) , we use {{java.text.SimpleDateFormat}} to > parse the date strings which interprets it to be local timezone. > 2. For first column we get a column vector which represents the date as epoch > day. This is always in UTC. 
> *Solution*
> We need to check other variants of datediff UDFs as well and change the parsing mechanism to always interpret date strings in UTC.
>
> I did a quick change in {{VectorUDFDateDiffColScalar}} which fixes the issue.
> {code}
> -      date.setTime(formatter.parse(new String(bytesValue, "UTF-8")).getTime());
> -      baseDate = DateWritableV2.dateToDays(date);
> +      org.apache.hadoop.hive.common.type.Date hiveDate
> +          = org.apache.hadoop.hive.common.type.Date.valueOf(new String(bytesValue, "UTF-8"));
> +      date.setTime(hiveDate.toEpochMilli());
> +      baseDate = hiveDate.toEpochDay();
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HIVE-25449) datediff() gives wrong output when we set tez.task.launch.cmd-opts to some non UTC timezone
[ https://issues.apache.org/jira/browse/HIVE-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia reassigned HIVE-25449: > datediff() gives wrong output when we set tez.task.launch.cmd-opts to some > non UTC timezone > --- > > Key: HIVE-25449 > URL: https://issues.apache.org/jira/browse/HIVE-25449 > Project: Hive > Issue Type: Bug > Components: UDF >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > Repro (thanks Qiaosong Dong) - > {code} > create external table test_dt(id string, dt date); > insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07'); > select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt > on dt1.id = dt.id; > +--+ > | _c0 | > +--+ > | 6| > | 7| > +--+ > {code} > Expected output - > {code} > +--+ > | _c0 | > +--+ > | 5| > | 6| > +--+ > {code} > *Cause* > This happens because in {{VectorUDFDateDiffColScalar}} class > 1. For second argument(scalar) , we use {{java.text.SimpleDateFormat}} to > parse the date strings which interprets it to be local timezone. > 2. For first column we get a column vector which represents the date as epoch > day. This is always in UTC. > *Solution* > We need to check other variants of datediff UDFs as well and change the > parsing mechanism to always interpret date strings in UTC. > > I did a quick change in {{VectorUDFDateDiffColScalar}} which fixes the issue. > {code} > - date.setTime(formatter.parse(new String(bytesValue, > "UTF-8")).getTime()); > - baseDate = DateWritableV2.dateToDays(date); > + org.apache.hadoop.hive.common.type.Date hiveDate > + = org.apache.hadoop.hive.common.type.Date.valueOf(new > String(bytesValue, "UTF-8")); > + date.setTime(hiveDate.toEpochMilli()); > + baseDate = hiveDate.toEpochDay(); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-25243) Llap external client - Handle nested values when the parent struct is null
[ https://issues.apache.org/jira/browse/HIVE-25243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17369256#comment-17369256 ] Shubham Chaurasia commented on HIVE-25243: -- Thanks for the review and merge [~maheshk114] > Llap external client - Handle nested values when the parent struct is null > -- > > Key: HIVE-25243 > URL: https://issues.apache.org/jira/browse/HIVE-25243 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > Consider the following table in text format - > {code} > +---+ > | c8 | > +---+ > | NULL | > | {"r":null,"s":null,"t":null} | > | {"r":"a","s":9,"t":2.2} | > +---+ > {code} > When we query above table via llap external client, it throws following > exception - > {code:java} > Caused by: java.lang.NullPointerException: src > at io.netty.util.internal.ObjectUtil.checkNotNull(ObjectUtil.java:33) > at > io.netty.buffer.UnsafeByteBufUtil.setBytes(UnsafeByteBufUtil.java:537) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:199) > at io.netty.buffer.WrappedByteBuf.setBytes(WrappedByteBuf.java:486) > at > io.netty.buffer.UnsafeDirectLittleEndian.setBytes(UnsafeDirectLittleEndian.java:34) > at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:933) > at > org.apache.arrow.vector.BaseVariableWidthVector.setBytes(BaseVariableWidthVector.java:1191) > at > org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1026) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.lambda$static$15(Serializer.java:834) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.writeGeneric(Serializer.java:777) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.writePrimitive(Serializer.java:581) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.write(Serializer.java:290) > 
at > org.apache.hadoop.hive.ql.io.arrow.Serializer.writeStruct(Serializer.java:359) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.write(Serializer.java:296) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.serializeBatch(Serializer.java:213) > at > org.apache.hadoop.hive.ql.exec.vector.filesink.VectorFileSinkArrowOperator.process(VectorFileSinkArrowOperator.java:135) > {code} > Created a test to repro it - > {code:java} > /** > * TestMiniLlapVectorArrowWithLlapIODisabled - turns off llap io while > testing LLAP external client flow. > * The aim of turning off LLAP IO is - > * when we create table through this test, LLAP caches them and returns the > same > * when we do a read query, due to this we miss some code paths which may > have been hit otherwise. > */ > public class TestMiniLlapVectorArrowWithLlapIODisabled extends > BaseJdbcWithMiniLlap { > @BeforeClass > public static void beforeTest() throws Exception { > HiveConf conf = defaultConf(); > conf.setBoolVar(ConfVars.LLAP_OUTPUT_FORMAT_ARROW, true); > > conf.setBoolVar(ConfVars.HIVE_VECTORIZATION_FILESINK_ARROW_NATIVE_ENABLED, > true); > conf.set(ConfVars.LLAP_IO_ENABLED.varname, "false"); > BaseJdbcWithMiniLlap.beforeTest(conf); > } > @Override > protected InputFormat getInputFormat() { > //For unit testing, no harm in hard-coding allocator ceiling to > LONG.MAX_VALUE > return new LlapArrowRowInputFormat(Long.MAX_VALUE); > } > @Test > public void testNullsInStructFields() throws Exception { > createDataTypesTable("datatypes"); > RowCollector2 rowCollector = new RowCollector2(); > // c8 struct > String query = "select c8 from datatypes"; > int rowCount = processQuery(query, 1, rowCollector); > assertEquals(3, rowCount); > } > } > {code} > Cause - As we see in the table above, first row of the table is NULL, and > correspondingly we get {{structVector.isNull[i]=true}} in arrow serializer > but we don't get {{isNull[i]=true}} for the fields of struct. 
And later the code goes on to set such fields in the arrow vector, and we see the above exception. -- This message was sent by Atlassian Jira (v8.3.4#803005)
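The cause described above (a null parent struct whose child fields are not themselves marked null) can be illustrated with a minimal sketch. This is not the actual Hive patch; the class and method names are hypothetical, and plain boolean arrays stand in for Arrow validity buffers.

```java
// Hypothetical sketch: before writing a struct's child fields into Arrow
// vectors, a null parent row must force the child slot to null as well,
// even when the upstream vector did not flag it.
public class StructNullPropagation {
    // parentIsNull: per-row validity of the struct column
    // childIsNull:  per-row validity of one child field, as produced upstream
    static boolean[] effectiveChildNulls(boolean[] parentIsNull, boolean[] childIsNull) {
        boolean[] out = new boolean[parentIsNull.length];
        for (int i = 0; i < parentIsNull.length; i++) {
            // A null parent struct implies the child value is null too;
            // skipping the value write avoids the NPE seen in the stack trace.
            out[i] = parentIsNull[i] || childIsNull[i];
        }
        return out;
    }

    public static void main(String[] args) {
        boolean[] parent = {true, false, false};  // row 0 is the NULL struct row
        boolean[] child  = {false, true, false};  // upstream missed row 0
        boolean[] eff = effectiveChildNulls(parent, child);
        System.out.println(eff[0] + " " + eff[1] + " " + eff[2]); // true true false
    }
}
```

With this guard in place, the serializer would mark row 0's child slots null instead of attempting a `setSafe` on absent data.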
[jira] [Resolved] (HIVE-25243) Llap external client - Handle nested values when the parent struct is null
[ https://issues.apache.org/jira/browse/HIVE-25243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia resolved HIVE-25243. -- Fix Version/s: 4.0.0 Resolution: Fixed > Llap external client - Handle nested values when the parent struct is null > -- > > Key: HIVE-25243 > URL: https://issues.apache.org/jira/browse/HIVE-25243 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > Consider the following table in text format - > {code} > +---+ > | c8 | > +---+ > | NULL | > | {"r":null,"s":null,"t":null} | > | {"r":"a","s":9,"t":2.2} | > +---+ > {code} > When we query above table via llap external client, it throws following > exception - > {code:java} > Caused by: java.lang.NullPointerException: src > at io.netty.util.internal.ObjectUtil.checkNotNull(ObjectUtil.java:33) > at > io.netty.buffer.UnsafeByteBufUtil.setBytes(UnsafeByteBufUtil.java:537) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:199) > at io.netty.buffer.WrappedByteBuf.setBytes(WrappedByteBuf.java:486) > at > io.netty.buffer.UnsafeDirectLittleEndian.setBytes(UnsafeDirectLittleEndian.java:34) > at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:933) > at > org.apache.arrow.vector.BaseVariableWidthVector.setBytes(BaseVariableWidthVector.java:1191) > at > org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1026) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.lambda$static$15(Serializer.java:834) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.writeGeneric(Serializer.java:777) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.writePrimitive(Serializer.java:581) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.write(Serializer.java:290) > at > 
org.apache.hadoop.hive.ql.io.arrow.Serializer.writeStruct(Serializer.java:359) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.write(Serializer.java:296) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.serializeBatch(Serializer.java:213) > at > org.apache.hadoop.hive.ql.exec.vector.filesink.VectorFileSinkArrowOperator.process(VectorFileSinkArrowOperator.java:135) > {code} > Created a test to repro it - > {code:java} > /** > * TestMiniLlapVectorArrowWithLlapIODisabled - turns off llap io while > testing LLAP external client flow. > * The aim of turning off LLAP IO is - > * when we create table through this test, LLAP caches them and returns the > same > * when we do a read query, due to this we miss some code paths which may > have been hit otherwise. > */ > public class TestMiniLlapVectorArrowWithLlapIODisabled extends > BaseJdbcWithMiniLlap { > @BeforeClass > public static void beforeTest() throws Exception { > HiveConf conf = defaultConf(); > conf.setBoolVar(ConfVars.LLAP_OUTPUT_FORMAT_ARROW, true); > > conf.setBoolVar(ConfVars.HIVE_VECTORIZATION_FILESINK_ARROW_NATIVE_ENABLED, > true); > conf.set(ConfVars.LLAP_IO_ENABLED.varname, "false"); > BaseJdbcWithMiniLlap.beforeTest(conf); > } > @Override > protected InputFormat getInputFormat() { > //For unit testing, no harm in hard-coding allocator ceiling to > LONG.MAX_VALUE > return new LlapArrowRowInputFormat(Long.MAX_VALUE); > } > @Test > public void testNullsInStructFields() throws Exception { > createDataTypesTable("datatypes"); > RowCollector2 rowCollector = new RowCollector2(); > // c8 struct > String query = "select c8 from datatypes"; > int rowCount = processQuery(query, 1, rowCollector); > assertEquals(3, rowCount); > } > } > {code} > Cause - As we see in the table above, first row of the table is NULL, and > correspondingly we get {{structVector.isNull[i]=true}} in arrow serializer > but we don't get {{isNull[i]=true}} for the fields of struct. 
And later the code goes on to set such fields in the arrow vector, and we see the above exception.
[jira] [Work started] (HIVE-25243) Llap external client - Handle nested values when the parent struct is null
[ https://issues.apache.org/jira/browse/HIVE-25243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HIVE-25243 started by Shubham Chaurasia. > Llap external client - Handle nested values when the parent struct is null > -- > > Key: HIVE-25243 > URL: https://issues.apache.org/jira/browse/HIVE-25243 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Time Spent: 1.5h > Remaining Estimate: 0h > > Consider the following table in text format - > {code} > +---+ > | c8 | > +---+ > | NULL | > | {"r":null,"s":null,"t":null} | > | {"r":"a","s":9,"t":2.2} | > +---+ > {code} > When we query above table via llap external client, it throws following > exception - > {code:java} > Caused by: java.lang.NullPointerException: src > at io.netty.util.internal.ObjectUtil.checkNotNull(ObjectUtil.java:33) > at > io.netty.buffer.UnsafeByteBufUtil.setBytes(UnsafeByteBufUtil.java:537) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:199) > at io.netty.buffer.WrappedByteBuf.setBytes(WrappedByteBuf.java:486) > at > io.netty.buffer.UnsafeDirectLittleEndian.setBytes(UnsafeDirectLittleEndian.java:34) > at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:933) > at > org.apache.arrow.vector.BaseVariableWidthVector.setBytes(BaseVariableWidthVector.java:1191) > at > org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1026) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.lambda$static$15(Serializer.java:834) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.writeGeneric(Serializer.java:777) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.writePrimitive(Serializer.java:581) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.write(Serializer.java:290) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.writeStruct(Serializer.java:359) > at > 
org.apache.hadoop.hive.ql.io.arrow.Serializer.write(Serializer.java:296) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.serializeBatch(Serializer.java:213) > at > org.apache.hadoop.hive.ql.exec.vector.filesink.VectorFileSinkArrowOperator.process(VectorFileSinkArrowOperator.java:135) > {code} > Created a test to repro it - > {code:java} > /** > * TestMiniLlapVectorArrowWithLlapIODisabled - turns off llap io while > testing LLAP external client flow. > * The aim of turning off LLAP IO is - > * when we create table through this test, LLAP caches them and returns the > same > * when we do a read query, due to this we miss some code paths which may > have been hit otherwise. > */ > public class TestMiniLlapVectorArrowWithLlapIODisabled extends > BaseJdbcWithMiniLlap { > @BeforeClass > public static void beforeTest() throws Exception { > HiveConf conf = defaultConf(); > conf.setBoolVar(ConfVars.LLAP_OUTPUT_FORMAT_ARROW, true); > > conf.setBoolVar(ConfVars.HIVE_VECTORIZATION_FILESINK_ARROW_NATIVE_ENABLED, > true); > conf.set(ConfVars.LLAP_IO_ENABLED.varname, "false"); > BaseJdbcWithMiniLlap.beforeTest(conf); > } > @Override > protected InputFormat getInputFormat() { > //For unit testing, no harm in hard-coding allocator ceiling to > LONG.MAX_VALUE > return new LlapArrowRowInputFormat(Long.MAX_VALUE); > } > @Test > public void testNullsInStructFields() throws Exception { > createDataTypesTable("datatypes"); > RowCollector2 rowCollector = new RowCollector2(); > // c8 struct > String query = "select c8 from datatypes"; > int rowCount = processQuery(query, 1, rowCollector); > assertEquals(3, rowCount); > } > } > {code} > Cause - As we see in the table above, first row of the table is NULL, and > correspondingly we get {{structVector.isNull[i]=true}} in arrow serializer > but we don't get {{isNull[i]=true}} for the fields of struct. And later the > code goes for setting such fields in arrow vector and we see above exception. 
[jira] [Updated] (HIVE-25243) Llap external client - Handle nested values when the parent struct is null
[ https://issues.apache.org/jira/browse/HIVE-25243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-25243: - Summary: Llap external client - Handle nested values when the parent struct is null (was: Llap external client - Handle nested values when parent struct is null) > Llap external client - Handle nested values when the parent struct is null > -- > > Key: HIVE-25243 > URL: https://issues.apache.org/jira/browse/HIVE-25243 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > Consider the following table in text format - > {code} > +---+ > | c8 | > +---+ > | NULL | > | {"r":null,"s":null,"t":null} | > | {"r":"a","s":9,"t":2.2} | > +---+ > {code} > When we query above table via llap external client, it throws following > exception - > {code:java} > Caused by: java.lang.NullPointerException: src > at io.netty.util.internal.ObjectUtil.checkNotNull(ObjectUtil.java:33) > at > io.netty.buffer.UnsafeByteBufUtil.setBytes(UnsafeByteBufUtil.java:537) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:199) > at io.netty.buffer.WrappedByteBuf.setBytes(WrappedByteBuf.java:486) > at > io.netty.buffer.UnsafeDirectLittleEndian.setBytes(UnsafeDirectLittleEndian.java:34) > at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:933) > at > org.apache.arrow.vector.BaseVariableWidthVector.setBytes(BaseVariableWidthVector.java:1191) > at > org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1026) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.lambda$static$15(Serializer.java:834) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.writeGeneric(Serializer.java:777) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.writePrimitive(Serializer.java:581) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.write(Serializer.java:290) > at > 
org.apache.hadoop.hive.ql.io.arrow.Serializer.writeStruct(Serializer.java:359) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.write(Serializer.java:296) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.serializeBatch(Serializer.java:213) > at > org.apache.hadoop.hive.ql.exec.vector.filesink.VectorFileSinkArrowOperator.process(VectorFileSinkArrowOperator.java:135) > {code} > Created a test to repro it - > {code:java} > /** > * TestMiniLlapVectorArrowWithLlapIODisabled - turns off llap io while > testing LLAP external client flow. > * The aim of turning off LLAP IO is - > * when we create table through this test, LLAP caches them and returns the > same > * when we do a read query, due to this we miss some code paths which may > have been hit otherwise. > */ > public class TestMiniLlapVectorArrowWithLlapIODisabled extends > BaseJdbcWithMiniLlap { > @BeforeClass > public static void beforeTest() throws Exception { > HiveConf conf = defaultConf(); > conf.setBoolVar(ConfVars.LLAP_OUTPUT_FORMAT_ARROW, true); > > conf.setBoolVar(ConfVars.HIVE_VECTORIZATION_FILESINK_ARROW_NATIVE_ENABLED, > true); > conf.set(ConfVars.LLAP_IO_ENABLED.varname, "false"); > BaseJdbcWithMiniLlap.beforeTest(conf); > } > @Override > protected InputFormat getInputFormat() { > //For unit testing, no harm in hard-coding allocator ceiling to > LONG.MAX_VALUE > return new LlapArrowRowInputFormat(Long.MAX_VALUE); > } > @Test > public void testNullsInStructFields() throws Exception { > createDataTypesTable("datatypes"); > RowCollector2 rowCollector = new RowCollector2(); > // c8 struct > String query = "select c8 from datatypes"; > int rowCount = processQuery(query, 1, rowCollector); > assertEquals(3, rowCount); > } > } > {code} > Cause - As we see in the table above, first row of the table is NULL, and > correspondingly we get {{structVector.isNull[i]=true}} in arrow serializer > but we don't get {{isNull[i]=true}} for the fields of struct. 
And later the code goes on to set such fields in the arrow vector, and we see the above exception.
[jira] [Updated] (HIVE-25243) Llap external client - Handle nested values when parent struct is null
[ https://issues.apache.org/jira/browse/HIVE-25243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-25243: - Summary: Llap external client - Handle nested values when parent struct is null (was: Llap external client - Handle nested null values in struct vector in arrow serializer) > Llap external client - Handle nested values when parent struct is null > -- > > Key: HIVE-25243 > URL: https://issues.apache.org/jira/browse/HIVE-25243 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > Consider the following table in text format - > {code} > +---+ > | c8 | > +---+ > | NULL | > | {"r":null,"s":null,"t":null} | > | {"r":"a","s":9,"t":2.2} | > +---+ > {code} > When we query above table via llap external client, it throws following > exception - > {code:java} > Caused by: java.lang.NullPointerException: src > at io.netty.util.internal.ObjectUtil.checkNotNull(ObjectUtil.java:33) > at > io.netty.buffer.UnsafeByteBufUtil.setBytes(UnsafeByteBufUtil.java:537) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:199) > at io.netty.buffer.WrappedByteBuf.setBytes(WrappedByteBuf.java:486) > at > io.netty.buffer.UnsafeDirectLittleEndian.setBytes(UnsafeDirectLittleEndian.java:34) > at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:933) > at > org.apache.arrow.vector.BaseVariableWidthVector.setBytes(BaseVariableWidthVector.java:1191) > at > org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1026) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.lambda$static$15(Serializer.java:834) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.writeGeneric(Serializer.java:777) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.writePrimitive(Serializer.java:581) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.write(Serializer.java:290) > at > 
org.apache.hadoop.hive.ql.io.arrow.Serializer.writeStruct(Serializer.java:359) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.write(Serializer.java:296) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.serializeBatch(Serializer.java:213) > at > org.apache.hadoop.hive.ql.exec.vector.filesink.VectorFileSinkArrowOperator.process(VectorFileSinkArrowOperator.java:135) > {code} > Created a test to repro it - > {code:java} > /** > * TestMiniLlapVectorArrowWithLlapIODisabled - turns off llap io while > testing LLAP external client flow. > * The aim of turning off LLAP IO is - > * when we create table through this test, LLAP caches them and returns the > same > * when we do a read query, due to this we miss some code paths which may > have been hit otherwise. > */ > public class TestMiniLlapVectorArrowWithLlapIODisabled extends > BaseJdbcWithMiniLlap { > @BeforeClass > public static void beforeTest() throws Exception { > HiveConf conf = defaultConf(); > conf.setBoolVar(ConfVars.LLAP_OUTPUT_FORMAT_ARROW, true); > > conf.setBoolVar(ConfVars.HIVE_VECTORIZATION_FILESINK_ARROW_NATIVE_ENABLED, > true); > conf.set(ConfVars.LLAP_IO_ENABLED.varname, "false"); > BaseJdbcWithMiniLlap.beforeTest(conf); > } > @Override > protected InputFormat getInputFormat() { > //For unit testing, no harm in hard-coding allocator ceiling to > LONG.MAX_VALUE > return new LlapArrowRowInputFormat(Long.MAX_VALUE); > } > @Test > public void testNullsInStructFields() throws Exception { > createDataTypesTable("datatypes"); > RowCollector2 rowCollector = new RowCollector2(); > // c8 struct > String query = "select c8 from datatypes"; > int rowCount = processQuery(query, 1, rowCollector); > assertEquals(3, rowCount); > } > } > {code} > Cause - As we see in the table above, first row of the table is NULL, and > correspondingly we get {{structVector.isNull[i]=true}} in arrow serializer > but we don't get {{isNull[i]=true}} for the fields of struct. 
And later the code goes on to set such fields in the arrow vector, and we see the above exception.
[jira] [Assigned] (HIVE-25243) Llap external client - Handle nested null values in struct vector in arrow serializer
[ https://issues.apache.org/jira/browse/HIVE-25243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia reassigned HIVE-25243: > Llap external client - Handle nested null values in struct vector in arrow > serializer > - > > Key: HIVE-25243 > URL: https://issues.apache.org/jira/browse/HIVE-25243 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > Consider the following table in text format - > {code} > +---+ > | c8 | > +---+ > | NULL | > | {"r":null,"s":null,"t":null} | > | {"r":"a","s":9,"t":2.2} | > +---+ > {code} > When we query above table via llap external client, it throws following > exception - > {code:java} > Caused by: java.lang.NullPointerException: src > at io.netty.util.internal.ObjectUtil.checkNotNull(ObjectUtil.java:33) > at > io.netty.buffer.UnsafeByteBufUtil.setBytes(UnsafeByteBufUtil.java:537) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:199) > at io.netty.buffer.WrappedByteBuf.setBytes(WrappedByteBuf.java:486) > at > io.netty.buffer.UnsafeDirectLittleEndian.setBytes(UnsafeDirectLittleEndian.java:34) > at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:933) > at > org.apache.arrow.vector.BaseVariableWidthVector.setBytes(BaseVariableWidthVector.java:1191) > at > org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1026) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.lambda$static$15(Serializer.java:834) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.writeGeneric(Serializer.java:777) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.writePrimitive(Serializer.java:581) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.write(Serializer.java:290) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.writeStruct(Serializer.java:359) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.write(Serializer.java:296) > at > 
org.apache.hadoop.hive.ql.io.arrow.Serializer.serializeBatch(Serializer.java:213) > at > org.apache.hadoop.hive.ql.exec.vector.filesink.VectorFileSinkArrowOperator.process(VectorFileSinkArrowOperator.java:135) > {code} > Created a test to repro it - > {code:java} > /** > * TestMiniLlapVectorArrowWithLlapIODisabled - turns off llap io while > testing LLAP external client flow. > * The aim of turning off LLAP IO is - > * when we create table through this test, LLAP caches them and returns the > same > * when we do a read query, due to this we miss some code paths which may > have been hit otherwise. > */ > public class TestMiniLlapVectorArrowWithLlapIODisabled extends > BaseJdbcWithMiniLlap { > @BeforeClass > public static void beforeTest() throws Exception { > HiveConf conf = defaultConf(); > conf.setBoolVar(ConfVars.LLAP_OUTPUT_FORMAT_ARROW, true); > > conf.setBoolVar(ConfVars.HIVE_VECTORIZATION_FILESINK_ARROW_NATIVE_ENABLED, > true); > conf.set(ConfVars.LLAP_IO_ENABLED.varname, "false"); > BaseJdbcWithMiniLlap.beforeTest(conf); > } > @Override > protected InputFormat getInputFormat() { > //For unit testing, no harm in hard-coding allocator ceiling to > LONG.MAX_VALUE > return new LlapArrowRowInputFormat(Long.MAX_VALUE); > } > @Test > public void testNullsInStructFields() throws Exception { > createDataTypesTable("datatypes"); > RowCollector2 rowCollector = new RowCollector2(); > // c8 struct > String query = "select c8 from datatypes"; > int rowCount = processQuery(query, 1, rowCollector); > assertEquals(3, rowCount); > } > } > {code} > Cause - As we see in the table above, first row of the table is NULL, and > correspondingly we get {{structVector.isNull[i]=true}} in arrow serializer > but we don't get {{isNull[i]=true}} for the fields of struct. And later the > code goes for setting such fields in arrow vector and we see above exception. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-25159) Remove support for ordered results in llap external client library
[ https://issues.apache.org/jira/browse/HIVE-25159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-25159: - Attachment: HIVE-25159.01.patch > Remove support for ordered results in llap external client library > -- > > Key: HIVE-25159 > URL: https://issues.apache.org/jira/browse/HIVE-25159 > Project: Hive > Issue Type: Bug > Components: Clients, Hive >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-25159.01.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Currently when querying via llap external client framework, in case of order > by queries - > 1. Due to the fact that spark-llap used to wrap actual query in a subquery as > mentioned in [HIVE-19794|https://issues.apache.org/jira/browse/HIVE-19794] > a) We had to detect order by like - > {code} > orderByQuery = plan.getQueryProperties().hasOrderBy() || > plan.getQueryProperties().hasOuterOrderBy(); > {code} > Due to this we recently saw an exception like below for one of the queries > that did not have an outer order by (It was having an order by in a subquery) > {code} > org.apache.hive.service.cli.HiveSQLException: java.io.IOException: > org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: > java.lang.IllegalStateException: Requested to generate single split. Paths > and fileStatuses are expected to be 1. Got paths: 1 fileStatuses: 7 > {code} > b) Also we had to disable following optimization - > {code} > HiveConf.setBoolVar(conf, ConfVars.HIVE_REMOVE_ORDERBY_IN_SUBQUERY, false); > {code} > 2. By default we have > {{hive.llap.external.splits.order.by.force.single.split=true}} which forces > us to generate single split leading to performance bottleneck. > We should remove ordering support altogether from llap external client repo > and let clients handle it at their end. -- This message was sent by Atlassian Jira (v8.3.4#803005)
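If ordering support is removed from the llap external client library as proposed, callers would re-sort fetched rows themselves. A minimal sketch of that client-side responsibility, using plain string rows (the class and method names are hypothetical, not part of any Hive API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: with ORDER BY no longer honored by the llap external
// client splits, the caller collects all fetched rows and sorts them locally.
public class ClientSideOrdering {
    static List<String[]> sortRows(List<String[]> rows, int keyCol) {
        List<String[]> copy = new ArrayList<>(rows); // leave fetched batch untouched
        copy.sort(Comparator.comparing((String[] r) -> r[keyCol]));
        return copy;
    }

    public static void main(String[] args) {
        List<String[]> fetched = Arrays.asList(
            new String[]{"b", "2"},
            new String[]{"a", "1"});
        System.out.println(sortRows(fetched, 0).get(0)[0]); // a
    }
}
```

This also sidesteps the single-split bottleneck: splits can be read in parallel and merged or sorted at the client.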
[jira] [Assigned] (HIVE-25159) Remove support for ordered results in llap external client library
[ https://issues.apache.org/jira/browse/HIVE-25159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia reassigned HIVE-25159: > Remove support for ordered results in llap external client library > -- > > Key: HIVE-25159 > URL: https://issues.apache.org/jira/browse/HIVE-25159 > Project: Hive > Issue Type: Bug > Components: Clients, Hive >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > Currently when querying via llap external client framework, in case of order > by queries - > 1. Due to the fact that spark-llap used to wrap actual query in a subquery as > mentioned in [HIVE-19794|https://issues.apache.org/jira/browse/HIVE-19794] > a) We had to detect order by like - > {code} > orderByQuery = plan.getQueryProperties().hasOrderBy() || > plan.getQueryProperties().hasOuterOrderBy(); > {code} > Due to this we recently saw an exception like below for one of the queries > that did not have an outer order by (It was having an order by in a subquery) > {code} > org.apache.hive.service.cli.HiveSQLException: java.io.IOException: > org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: > java.lang.IllegalStateException: Requested to generate single split. Paths > and fileStatuses are expected to be 1. Got paths: 1 fileStatuses: 7 > {code} > b) Also we had to disable following optimization - > {code} > HiveConf.setBoolVar(conf, ConfVars.HIVE_REMOVE_ORDERBY_IN_SUBQUERY, false); > {code} > 2. By default we have > {{hive.llap.external.splits.order.by.force.single.split=true}} which forces > us to generate single split leading to performance bottleneck. > We should remove ordering support altogether from llap external client repo > and let clients handle it at their end. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-24563) Check if we can interchange client and server sides for umbilical for external client flow
[ https://issues.apache.org/jira/browse/HIVE-24563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-24563: - Description: Currently we open three tcp connections when llap external client communicates to llap. {noformat} llap-ext-client ... llap connection1: client ...>>... server (RPC for submitting fragments - say t1, t2, t3. llap-ext-client initiates connection) connection2: client ...>>... server (for reading the output of t1, t2, t3. llap-ext-client initiates connection) connection3: umbilical server ...<<... client (RPC for status updates/heartbeat of t1, t2, t3. llap Daemon initiates connection) {noformat} connection3 starts a umbilical(RPC) server at the client side to which llap daemon keeps sending the task statuses / heartbeats and node heartbeats. *The Problem* In cloud based deployment, we need to open tcp traffic. 1. For connection1 and connection2, we need to open incoming tcp traffic on the machines running llap from client. 2. For connection3, we need to open incoming tcp traffic on the machines where llap-ext-client is running, from llap daemon. Here clients also need to worry about opening traffic(from llap) at their end. *Possible Solution* This jira is to evaluate the possibility of interchanging Umbilical server and client sides i.e. umbilical server will run in llap only and llap-ext-client will act as client and initiate the connection. We can have umbilical address in llap splits (when get_splits is called by external client) which the client can later connect to. cc [~prasanth_j] [~harishjp] was: Currently we open three tcp connections when llap external client communicates to llap. {noformat} llap-ext-client ... llap connection1: client ...>>... server (RPC for submitting fragments - say t1, t2, t3. llap-ext-client initiates connection) connection2: client ...>>... server (for reading the output of t1, t2, t3. llap-ext-client initiates connection) connection3: umbilical server ...<<... 
client (RPC for status updates/heartbeat of t1, t2, t3. llap Daemon initiates connection) {noformat} connection3 starts a umbilical(RPC) server at the client side to which llap daemon keeps sending the task statuses / heartbeats and node heartbeats. *The Problem* In cloud based deployment, we need to open tcp traffic. 1. For connection1 and connection2, we need to open incoming tcp traffic on the machines running llap from client. 2. For connection3, we need to open incoming tcp traffic on the machines where llap-ext-client is running, from llap daemon. Here clients also need to worry about opening traffic(from llap) at their end. *Possible Solution* This jira is to evaluate the possibility of interchanging Umbilical server and client sides i.e. umbilical server will run in llap only and llap-ext-client will act as client and initiate the connection. We can have umbilical address in llap splits (when get_splits is called by external client) which the client can later connect to. cc [~prasanth_j] > Check if we can interchange client and server sides for umbilical for > external client flow > -- > > Key: HIVE-24563 > URL: https://issues.apache.org/jira/browse/HIVE-24563 > Project: Hive > Issue Type: Sub-task > Components: Hive, llap >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > Currently we open three tcp connections when llap external client > communicates to llap. > {noformat} >llap-ext-client ... llap > connection1: client ...>>... server > (RPC for submitting fragments - say t1, t2, t3. llap-ext-client initiates > connection) > connection2: client ...>>... server > (for reading the output of t1, t2, t3. llap-ext-client initiates connection) > connection3: umbilical server ...<<... client > (RPC for status updates/heartbeat of t1, t2, t3. 
llap Daemon initiates > connection) > {noformat} > connection3 starts a umbilical(RPC) server at the client side to which llap > daemon keeps sending the task statuses / heartbeats and node heartbeats. > *The Problem* > In cloud based deployment, we need to open tcp traffic. > 1. For connection1 and connection2, we need to open incoming tcp traffic on > the machines running llap from client. > 2. For connection3, we need to open incoming tcp traffic on the machines > where llap-ext-client is running, from llap daemon. > Here clients also need to worry about opening traffic(from llap) at their > end. > *Possible Solution* >
[jira] [Updated] (HIVE-24563) Check if we can interchange client and server sides for umbilical for external client flow
[ https://issues.apache.org/jira/browse/HIVE-24563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-24563: - Description: Currently we open three TCP connections when the LLAP external client communicates with LLAP.
{noformat}
             llap-ext-client ... llap
connection1: client ...>>... server    (RPC for submitting fragments - say t1, t2, t3; llap-ext-client initiates the connection)
connection2: client ...>>... server    (for reading the output of t1, t2, t3; llap-ext-client initiates the connection)
connection3: umbilical server ...<<... client    (RPC for status updates/heartbeats of t1, t2, t3; the LLAP daemon initiates the connection)
{noformat}
connection3 starts an umbilical (RPC) server on the client side, to which the LLAP daemon keeps sending task statuses/heartbeats and node heartbeats.
*The Problem*
In a cloud-based deployment, we need to open up TCP traffic:
1. For connection1 and connection2, we need to allow incoming TCP traffic from the client on the machines running LLAP.
2. For connection3, we need to allow incoming TCP traffic from the LLAP daemon on the machines where llap-ext-client is running. So clients also have to worry about opening traffic (from LLAP) at their end.
*Possible Solution*
This jira is to evaluate the possibility of interchanging the umbilical server and client sides, i.e. the umbilical server would run only inside LLAP, and llap-ext-client would act as the client and initiate the connection. We can include the umbilical address in the llap splits (when get_splits is called by the external client), which the client can later connect to. cc [~prasanth_j]
> Check if we can interchange client and server sides for umbilical for external client flow
> --
> Key: HIVE-24563
> URL: https://issues.apache.org/jira/browse/HIVE-24563
> Project: Hive
> Issue Type: Sub-task
> Components: Hive, llap
> Reporter: Shubham Chaurasia
> Assignee: Shubham Chaurasia
> Priority: Major
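The proposed reversal can be sketched with plain Java sockets. This is only an illustrative stand-in (the class and method names below, such as {{UmbilicalSketch}} and {{heartbeatOnce}}, are hypothetical, not Hive's actual umbilical/RPC classes): the "daemon" side owns the listening socket, the address that would travel inside the llap splits is captured, and the external client makes the only outbound connection, over which the daemon then pushes a task status.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class UmbilicalSketch {
    public static String heartbeatOnce() throws IOException {
        // "LLAP daemon" side: the umbilical server now lives inside LLAP.
        try (ServerSocket umbilical = new ServerSocket(0)) {
            // The address that would be embedded in the llap splits returned by
            // get_splits, so the external client knows where to connect.
            InetSocketAddress addrInSplit =
                    new InetSocketAddress("127.0.0.1", umbilical.getLocalPort());

            // "llap-ext-client" side: initiates the connection (outbound only,
            // so no inbound firewall rule is needed on the client machine).
            try (Socket client = new Socket()) {
                client.connect(addrInSplit, 1000);
                try (Socket daemonSide = umbilical.accept()) {
                    // Daemon pushes a task status over the client-initiated channel.
                    new PrintWriter(daemonSide.getOutputStream(), true)
                            .println("heartbeat t1");
                    return new BufferedReader(
                            new InputStreamReader(client.getInputStream())).readLine();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(heartbeatOnce());
    }
}
```

The point the sketch illustrates: only the connection direction changes; the daemon still originates the status/heartbeat traffic once the channel exists.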
[jira] [Assigned] (HIVE-24563) Check if we can interchange client and server sides for umbilical for external client flow
[ https://issues.apache.org/jira/browse/HIVE-24563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia reassigned HIVE-24563:
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HIVE-24138) Llap external client flow is broken due to netty shading
[ https://issues.apache.org/jira/browse/HIVE-24138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia reassigned HIVE-24138: Assignee: Ayush Saxena (was: Shubham Chaurasia)
> Llap external client flow is broken due to netty shading
> --
> Key: HIVE-24138
> URL: https://issues.apache.org/jira/browse/HIVE-24138
> Project: Hive
> Issue Type: Bug
> Components: llap
> Reporter: Shubham Chaurasia
> Assignee: Ayush Saxena
> Priority: Critical
>
> We shaded netty in hive-exec in https://issues.apache.org/jira/browse/HIVE-23073
> This breaks the LLAP external client flow on the LLAP daemon side.
> LLAP daemon stacktrace:
> {code}
> 2020-09-09T18:22:13,413 INFO [TezTR-222977_4_0_0_0_0 (497418324441977_0004_0_00_00_0)] llap.LlapOutputFormat: Returning writer for: attempt_497418324441977_0004_0_00_00_0
> 2020-09-09T18:22:13,419 ERROR [TezTR-222977_4_0_0_0_0 (497418324441977_0004_0_00_00_0)] tez.MapRecordSource: java.lang.NoSuchMethodError: org.apache.arrow.memory.BufferAllocator.buffer(I)Lorg/apache/hive/io/netty/buffer/ArrowBuf;
> 	at org.apache.hadoop.hive.llap.WritableByteChannelAdapter.write(WritableByteChannelAdapter.java:96)
> 	at org.apache.arrow.vector.ipc.WriteChannel.write(WriteChannel.java:74)
> 	at org.apache.arrow.vector.ipc.WriteChannel.write(WriteChannel.java:57)
> 	at org.apache.arrow.vector.ipc.WriteChannel.writeIntLittleEndian(WriteChannel.java:89)
> 	at org.apache.arrow.vector.ipc.message.MessageSerializer.serialize(MessageSerializer.java:88)
> 	at org.apache.arrow.vector.ipc.ArrowWriter.ensureStarted(ArrowWriter.java:130)
> 	at org.apache.arrow.vector.ipc.ArrowWriter.writeBatch(ArrowWriter.java:102)
> 	at org.apache.hadoop.hive.llap.LlapArrowRecordWriter.write(LlapArrowRecordWriter.java:85)
> 	at org.apache.hadoop.hive.llap.LlapArrowRecordWriter.write(LlapArrowRecordWriter.java:46)
> 	at org.apache.hadoop.hive.ql.exec.vector.filesink.VectorFileSinkArrowOperator.process(VectorFileSinkArrowOperator.java:137)
> 	at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:969)
> 	at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:158)
> 	at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:969)
> 	at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:172)
> 	at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.deliverVectorizedRowBatch(VectorMapOperator.java:809)
> 	at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:842)
> 	at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:92)
> 	at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:76)
> 	at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:426)
> 	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267)
> 	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
> 	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
> 	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:75)
> 	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:62)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> 	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:62)
> 	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:38)
> 	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> 	at org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:118)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
> The Arrow method signature mismatch happens mainly because Arrow contains some classes which are packaged under {{io.netty.buffer.*}}:
> {code}
> io.netty.buffer.ArrowBuf
> io.netty.buffer.ExpandableByteBuf
> io.netty.buffer.LargeBuffer
> io.netty.buffer.MutableWrappedByteBuf
> io.netty.buffer.PooledByteBufAllocatorL
> io.netty.buffer.UnsafeDirectLittleEndian
> {code}
> Since we have relocated netty, these classes have also been relocated to {{org.apache.hive.io.netty.buffer.*}}, causing the {{NoSuchMethodError}}.
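Why the relocation breaks Arrow: the JVM links a call by its full method descriptor, return type included, so relocating {{ArrowBuf}} turns {{buffer(I)Lio/netty/buffer/ArrowBuf;}} into {{buffer(I)Lorg/apache/hive/io/netty/buffer/ArrowBuf;}}, and Arrow bytecode compiled against the original descriptor fails to link. A minimal sketch with stand-in classes (the nested {{ArrowBuf}}/{{BufferAllocator}} here are illustrative, not Arrow's real ones):

```java
import java.lang.reflect.Method;

public class DescriptorDemo {
    // Stand-ins for Arrow's types; names are illustrative only.
    static class ArrowBuf {}
    static class BufferAllocator {
        ArrowBuf buffer(int size) { return new ArrowBuf(); }
    }

    // Returns the runtime name of buffer()'s return type. If a shading tool
    // rewrote ArrowBuf's package, this name - and hence the JVM-level method
    // descriptor - would change, so callers compiled against the original
    // descriptor would fail at link time with NoSuchMethodError.
    public static String bufferReturnTypeName() throws NoSuchMethodException {
        Method m = BufferAllocator.class.getDeclaredMethod("buffer", int.class);
        return m.getReturnType().getName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(bufferReturnTypeName());
    }
}
```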
[jira] [Commented] (HIVE-24138) Llap external client flow is broken due to netty shading
[ https://issues.apache.org/jira/browse/HIVE-24138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194217#comment-17194217 ] Shubham Chaurasia commented on HIVE-24138: [~abstractdog] [~thejas] [~ashutoshc] Should we try to upgrade to [hadoop-3.1.4, which is already on 4.1.48.Final|https://github.com/apache/hadoop/blob/rel/release-3.1.4/hadoop-project/pom.xml#L790], and remove the netty shading? cc [~anishek] [~ayushtkn]
[jira] [Commented] (HIVE-24138) Llap external client flow is broken due to netty shading
[ https://issues.apache.org/jira/browse/HIVE-24138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193571#comment-17193571 ] Shubham Chaurasia commented on HIVE-24138: [~abstractdog] [~thejas] [~ashutoshc] Any suggestions on how to proceed on this?
[jira] [Assigned] (HIVE-24138) Llap external client flow is broken due to netty shading
[ https://issues.apache.org/jira/browse/HIVE-24138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia reassigned HIVE-24138: Assignee: Shubham Chaurasia
[jira] [Updated] (HIVE-24138) Llap external client flow is broken due to netty shading
[ https://issues.apache.org/jira/browse/HIVE-24138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-24138: - Description: We shaded netty in hive-exec in https://issues.apache.org/jira/browse/HIVE-23073. This breaks the LLAP external client flow on the LLAP daemon side with {{java.lang.NoSuchMethodError: org.apache.arrow.memory.BufferAllocator.buffer(I)Lorg/apache/hive/io/netty/buffer/ArrowBuf;}} (full LLAP daemon stacktrace quoted earlier in this thread). The Arrow method signature mismatch happens mainly because Arrow contains some classes packaged under {{io.netty.buffer.*}} ({{ArrowBuf}}, {{ExpandableByteBuf}}, {{LargeBuffer}}, {{MutableWrappedByteBuf}}, {{PooledByteBufAllocatorL}}, {{UnsafeDirectLittleEndian}}); since we have relocated netty, these classes have also been relocated to {{org.apache.hive.io.netty.buffer.*}}, causing the {{NoSuchMethodError}}. cc [~anishek] [~thejas] [~abstractdog] [~irashid] [~bruce.robbins]
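For reference, a hypothetical shade-plugin fragment (a sketch assuming the Maven Shade plugin's relocation excludes; not necessarily the fix Hive adopted) that would leave Arrow's netty-packaged classes unrelocated:

```xml
<relocation>
  <pattern>io.netty</pattern>
  <shadedPattern>org.apache.hive.io.netty</shadedPattern>
  <excludes>
    <!-- Arrow ships these classes inside io.netty.buffer; excluding them keeps
         Arrow's method descriptors (e.g. buffer(I)Lio/netty/buffer/ArrowBuf;) intact. -->
    <exclude>io.netty.buffer.ArrowBuf</exclude>
    <exclude>io.netty.buffer.ExpandableByteBuf</exclude>
    <exclude>io.netty.buffer.LargeBuffer</exclude>
    <exclude>io.netty.buffer.MutableWrappedByteBuf</exclude>
    <exclude>io.netty.buffer.PooledByteBufAllocatorL</exclude>
    <exclude>io.netty.buffer.UnsafeDirectLittleEndian</exclude>
  </excludes>
</relocation>
```

Note that excluding only these classes may still fail at runtime, since they extend and reference other netty buffer classes that would remain relocated; removing the netty shading entirely after a hadoop upgrade, as suggested in the comments above, sidesteps that.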
[jira] [Updated] (HIVE-24138) Llap external client flow is broken due to netty shading
[ https://issues.apache.org/jira/browse/HIVE-24138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-24138: - Description: We shaded netty in hive-exec in - https://issues.apache.org/jira/browse/HIVE-23073 This breaks LLAP external client flow on LLAP daemon side - {code} 2020-09-09T18:22:13,413 INFO [TezTR-222977_4_0_0_0_0 (497418324441977_0004_0_00_00_0)] llap.LlapOutputFormat: Returning writer for: attempt_497418324441977_0004_0_00_00_0 2020-09-09T18:22:13,419 ERROR [TezTR-222977_4_0_0_0_0 (497418324441977_0004_0_00_00_0)] tez.MapRecordSource: java.lang.NoSuchMethodError: org.apache.arrow.memory.BufferAllocator.buffer(I)Lorg/apache/hive/io/netty/buffer/ArrowBuf; at org.apache.hadoop.hive.llap.WritableByteChannelAdapter.write(WritableByteChannelAdapter.java:96) at org.apache.arrow.vector.ipc.WriteChannel.write(WriteChannel.java:74) at org.apache.arrow.vector.ipc.WriteChannel.write(WriteChannel.java:57) at org.apache.arrow.vector.ipc.WriteChannel.writeIntLittleEndian(WriteChannel.java:89) at org.apache.arrow.vector.ipc.message.MessageSerializer.serialize(MessageSerializer.java:88) at org.apache.arrow.vector.ipc.ArrowWriter.ensureStarted(ArrowWriter.java:130) at org.apache.arrow.vector.ipc.ArrowWriter.writeBatch(ArrowWriter.java:102) at org.apache.hadoop.hive.llap.LlapArrowRecordWriter.write(LlapArrowRecordWriter.java:85) at org.apache.hadoop.hive.llap.LlapArrowRecordWriter.write(LlapArrowRecordWriter.java:46) at org.apache.hadoop.hive.ql.exec.vector.filesink.VectorFileSinkArrowOperator.process(VectorFileSinkArrowOperator.java:137) at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:969) at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:158) at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:969) at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:172) at 
org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.deliverVectorizedRowBatch(VectorMapOperator.java:809) at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:842) at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:92) at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:76) at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:426) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:75) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:62) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:62) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:38) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:118) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} Arrow method signature mismatch mainly happens due to the fact that arrow contains some classes which are packaged under {{io.netty.buffer.*}} - {code} io.netty.buffer.ArrowBuf io.netty.buffer.ExpandableByteBuf 
io.netty.buffer.LargeBuffer io.netty.buffer.MutableWrappedByteBuf io.netty.buffer.PooledByteBufAllocatorL io.netty.buffer.UnsafeDirectLittleEndian {code} Since we have relocated netty, these classes have also been relocated to {{org.apache.hive.io.netty.buffer.*}}, causing the {{NoSuchMethodError}}.
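One way to avoid the mismatch is to exclude Arrow's netty-packaged classes from the relocation, so their signatures keep referring to {{io.netty.buffer.ArrowBuf}}. A hypothetical maven-shade-plugin fragment sketching this (the class list comes from the description above; the surrounding plugin configuration is assumed, not taken from the HIVE-23073 patch):

```xml
<!-- Hypothetical sketch: relocate netty but leave Arrow's io.netty.buffer.* classes alone -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>io.netty</pattern>
        <shadedPattern>org.apache.hive.io.netty</shadedPattern>
        <excludes>
          <!-- Arrow ships these under io.netty.buffer; relocating them changes the
               return type of BufferAllocator.buffer(int) and breaks binary compatibility -->
          <exclude>io.netty.buffer.ArrowBuf</exclude>
          <exclude>io.netty.buffer.ExpandableByteBuf</exclude>
          <exclude>io.netty.buffer.LargeBuffer</exclude>
          <exclude>io.netty.buffer.MutableWrappedByteBuf</exclude>
          <exclude>io.netty.buffer.PooledByteBufAllocatorL</exclude>
          <exclude>io.netty.buffer.UnsafeDirectLittleEndian</exclude>
        </excludes>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```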
[jira] [Updated] (HIVE-24059) Llap external client - Initial changes for running in cloud environment
[ https://issues.apache.org/jira/browse/HIVE-24059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-24059: - Attachment: HIVE-24059.01.patch > Llap external client - Initial changes for running in cloud environment > --- > > Key: HIVE-24059 > URL: https://issues.apache.org/jira/browse/HIVE-24059 > Project: Hive > Issue Type: Sub-task > Components: llap >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-24059.01.patch > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Please see problem description in > https://issues.apache.org/jira/browse/HIVE-24058 > Initial changes include - > 1. Moving LLAP discovery logic from client side to server (HS2 / get_splits) > side. > 2. Opening additional RPC port in LLAP Daemon. > 3. JWT Based authentication on this port. > cc [~prasanth_j] [~jdere] [~anishek] [~thejas] -- This message was sent by Atlassian Jira (v8.3.4#803005)
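The JWT-based authentication mentioned in item 3 can be illustrated with a minimal, self-contained sketch. This is not the Hive implementation: it verifies a shared-secret HS256 token rather than the JWKS/RSA flow, and the class and method names are invented for illustration only.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;

// Hypothetical sketch: sign and verify the signature part of a compact-serialized
// HS256 JWT (header.payload.signature). A real deployment would verify RS256
// signatures against keys fetched from a JWKS endpoint.
public class JwtSketch {
  private static final Base64.Encoder B64 = Base64.getUrlEncoder().withoutPadding();

  public static String signHs256(String headerJson, String payloadJson, byte[] secret) {
    String signingInput = B64.encodeToString(headerJson.getBytes(StandardCharsets.UTF_8))
        + "." + B64.encodeToString(payloadJson.getBytes(StandardCharsets.UTF_8));
    return signingInput + "." + B64.encodeToString(hmac(signingInput, secret));
  }

  public static boolean verifyHs256(String jwt, byte[] secret) {
    String[] parts = jwt.split("\\.");
    if (parts.length != 3) {
      return false; // not a compact-serialized JWT
    }
    byte[] expected = hmac(parts[0] + "." + parts[1], secret);
    byte[] provided = Base64.getUrlDecoder().decode(parts[2]);
    // Constant-time comparison of the two signatures.
    return MessageDigest.isEqual(expected, provided);
  }

  private static byte[] hmac(String input, byte[] secret) {
    try {
      Mac mac = Mac.getInstance("HmacSHA256");
      mac.init(new SecretKeySpec(secret, "HmacSHA256"));
      return mac.doFinal(input.getBytes(StandardCharsets.US_ASCII));
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

On the daemon side, the token would be read from the first message on the new RPC port and rejected before any split is accepted.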
[jira] [Commented] (HIVE-24059) Llap external client - Initial changes for running in cloud environment
[ https://issues.apache.org/jira/browse/HIVE-24059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183964#comment-17183964 ] Shubham Chaurasia commented on HIVE-24059: -- Fixed tests, all green now - http://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-1418/3/pipeline
[jira] [Commented] (HIVE-24059) Llap external client - Initial changes for running in cloud environment
[ https://issues.apache.org/jira/browse/HIVE-24059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182746#comment-17182746 ] Shubham Chaurasia commented on HIVE-24059: -- [~prasanth_j] [~jdere] Can you please review?
[jira] [Updated] (HIVE-24059) Llap external client - Initial changes for running in cloud environment
[ https://issues.apache.org/jira/browse/HIVE-24059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-24059: - Description: Please see problem description in https://issues.apache.org/jira/browse/HIVE-24058 Initial changes include - 1. Moving LLAP discovery logic from client side to server (HS2 / get_splits) side. 2. Opening additional RPC port in LLAP Daemon. 3. JWT Based authentication on this port. cc [~prasanth_j] [~jdere] [~anishek] [~thejas] was: Please see problem description in https://issues.apache.org/jira/browse/HIVE-24058 Initial changes include - 1. Moving LLAP discovery logic from client side to server (HS2 / get_splits) side. 2. Opening additional RPC port in LLAP Daemon. 3. JWT Based authentication on this port.
[jira] [Commented] (HIVE-24059) Llap external client - Initial changes for running in cloud environment
[ https://issues.apache.org/jira/browse/HIVE-24059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182744#comment-17182744 ] Shubham Chaurasia commented on HIVE-24059: -- This patch uses two env variables - {{IS_CLOUD_DEPLOYMENT}} - whether HS2 and LLAP are running in a cloud env. {{PUBLIC_HOSTNAME}} - public hostname which can be reached from outside the cloud. Both these variables need to be set on HS2 and LLAP machines for this patch to work correctly.
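The host-resolution logic implied by these two variables can be sketched as below. The class and method names are hypothetical, and the fallback behavior is an assumption, not taken from the patch.

```java
import java.util.Map;

// Hypothetical sketch: pick the externally reachable hostname when running in a
// cloud deployment, based on the two environment variables described above.
public class CloudHostResolver {
  public static String resolveAdvertisedHost(Map<String, String> env, String internalHost) {
    boolean cloud = Boolean.parseBoolean(env.getOrDefault("IS_CLOUD_DEPLOYMENT", "false"));
    String publicHost = env.get("PUBLIC_HOSTNAME");
    if (cloud && publicHost != null && !publicHost.isEmpty()) {
      return publicHost; // advertise the address reachable from outside the VPC
    }
    return internalHost; // non-cloud (or misconfigured) deployments keep the internal name
  }
}
```

In production the map would simply be `System.getenv()`; it is a parameter here only so the logic is testable.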
[jira] [Assigned] (HIVE-24059) Llap external client - Initial changes for running in cloud environment
[ https://issues.apache.org/jira/browse/HIVE-24059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia reassigned HIVE-24059:
[jira] [Assigned] (HIVE-24058) Llap external client - Enhancements for running in cloud environment
[ https://issues.apache.org/jira/browse/HIVE-24058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia reassigned HIVE-24058: > Llap external client - Enhancements for running in cloud environment > > > Key: HIVE-24058 > URL: https://issues.apache.org/jira/browse/HIVE-24058 > Project: Hive > Issue Type: Task > Components: llap >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > When we query using the llap external client library, the following happens currently > - > 1. We first need to get splits using > [LlapBaseInputFormat#getSplits()|https://github.com/apache/hive/blob/rel/release-3.1.2/llap-ext-client/src/java/org/apache/hadoop/hive/llap/LlapBaseInputFormat.java#L226], > this just needs the Hive server JDBC url. > 2. We then submit those splits to llap and obtain a record reader to read data > using > [LlapBaseInputFormat#getRecordReader()|https://github.com/apache/hive/blob/rel/release-3.1.2/llap-ext-client/src/java/org/apache/hadoop/hive/llap/LlapBaseInputFormat.java#L140]. > In this step we need the following at client side - > - {{hive.zookeeper.quorum}} > - {{hive.llap.daemon.service.hosts}} > We need to connect to zk to discover llap daemons. > 3. The record reader so obtained needs to [initiate a TCP connection from client > to LLAP Daemon to submit the > split|https://github.com/apache/hive/blob/rel/release-3.1.2/llap-ext-client/src/java/org/apache/hadoop/hive/llap/LlapBaseInputFormat.java#L185]. > 4. It also needs to [initiate another TCP connection from client to the output > format port in LLAP Daemon to read the > data|https://github.com/apache/hive/blob/rel/release-3.1.2/llap-ext-client/src/java/org/apache/hadoop/hive/llap/LlapBaseInputFormat.java#L201]. > In cloud based deployments, we may not be able to make direct connections to > the zk registry and LLAP daemons from the client as it might run outside the VPC. > For 2, we can move the daemon discovery logic to the get_splits UDF itself, which will > run in HS2. > For scenarios like 3 and 4, we can expose additional ports on LLAP with > an additional auth mechanism.
[jira] [Commented] (HIVE-23339) SBA does not check permissions for DB location specified in Create or Alter database query
[ https://issues.apache.org/jira/browse/HIVE-23339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168663#comment-17168663 ] Shubham Chaurasia commented on HIVE-23339: -- Thanks for the review and commit [~mgergely]. Closing it. Note - It changes the API in {{HiveAuthorizationProvider}} from {code:java} public void authorize(Privilege[] readRequiredPriv, Privilege[] writeRequiredPriv) throws HiveException, AuthorizationException; {code} to {code:java} void authorizeDbLevelOperations(Privilege[] readRequiredPriv, Privilege[] writeRequiredPriv, Collection inputs, Collection outputs) throws HiveException, AuthorizationException; {code} > SBA does not check permissions for DB location specified in Create or Alter > database query > -- > > Key: HIVE-23339 > URL: https://issues.apache.org/jira/browse/HIVE-23339 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 3.1.0, 4.0.0 >Reporter: Riju Trivedi >Assignee: Shubham Chaurasia >Priority: Critical > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-23339.01.patch, HIVE-23339.02.patch, > HIVE-23339.03.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > With doAs=true and the StorageBasedAuthorization provider, create database with a > specific location succeeds even if the user doesn't have access to that path. > > {code:java} > hadoop fs -ls -d /tmp/cannot_write > drwx------ - hive hadoop 0 2020-04-01 22:53 /tmp/cannot_write > create a database under /tmp/cannot_write. We would expect it to fail, but it is > actually created successfully with "hive" as the owner: > rtrivedi@bdp01:~> beeline -e "create database rtrivedi_1 location > '/tmp/cannot_write/rtrivedi_1'" > INFO : OK > No rows affected (0.116 seconds) > hive@hpchdd2e:~> hadoop fs -ls /tmp/cannot_write > Found 1 items > drwx------ - hive hadoop 0 2020-04-01 23:05 /tmp/cannot_write/rtrivedi_1 > {code} >
[jira] [Updated] (HIVE-23339) SBA does not check permissions for DB location specified in Create or Alter database query
[ https://issues.apache.org/jira/browse/HIVE-23339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-23339: - Resolution: Fixed Status: Resolved (was: Patch Available)
[jira] [Updated] (HIVE-23339) SBA does not check permissions for DB location specified in Create or Alter database query
[ https://issues.apache.org/jira/browse/HIVE-23339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-23339: - Fix Version/s: 4.0.0
[jira] [Updated] (HIVE-23339) SBA does not check permissions for DB location specified in Create or Alter database query
[ https://issues.apache.org/jira/browse/HIVE-23339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-23339: - Affects Version/s: 4.0.0
[jira] [Updated] (HIVE-23339) SBA does not check permissions for DB location specified in Create or Alter database query
[ https://issues.apache.org/jira/browse/HIVE-23339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-23339: - Attachment: HIVE-23339.03.patch
[jira] [Updated] (HIVE-23339) SBA does not check permissions for DB location specified in Create or Alter database query
[ https://issues.apache.org/jira/browse/HIVE-23339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-23339: - Summary: SBA does not check permissions for DB location specified in Create or Alter database query (was: SBA does not check permissions for DB location specified in Create database query)
[jira] [Updated] (HIVE-23339) SBA does not check permissions for DB location specified in Create database query
[ https://issues.apache.org/jira/browse/HIVE-23339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-23339: - Attachment: HIVE-23339.02.patch
[jira] [Commented] (HIVE-23339) SBA does not check permissions for DB location specified in Create database query
[ https://issues.apache.org/jira/browse/HIVE-23339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17104347#comment-17104347 ] Shubham Chaurasia commented on HIVE-23339: -- Thanks for the pointers [~rtrivedi12]. Thanks for the review [~mgergely]. Based on our discussion, I agree that it would be cleaner to have an API with authorizer inputs and outputs rather than passing the properties in HiveConf as the current patch does. For context, currently we have the below API in {{HiveAuthorizationProvider}} {code:java} public void authorize(Privilege[] readRequiredPriv, Privilege[] writeRequiredPriv) throws HiveException, AuthorizationException; {code} Now in {{StorageBasedAuthorizationProvider}} we need some additional information, in this case the custom location of the database from the 'CREATE DATABASE' query. The current patch achieves this by passing the location via HiveConf. To be able to pass inputs and outputs explicitly we would need something like below - {code:java} public void authorize(Privilege[] readRequiredPriv, Privilege[] writeRequiredPriv, Set inputs, Set outputs) throws HiveException, AuthorizationException; {code} But since {{HiveAuthorizationProvider}} is a public/pluggable interface, I am not sure about modifying it. [~hashutosh] [~thejas] [~mgergely] Does the above API look correct? How do we usually modify authorizer APIs (or any public API) in hive? Do we have a doc/guideline for this?
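On evolving a public interface without breaking third-party implementations: one common Java pattern (an illustration of the general technique, not what this patch does) is to add the richer overload as a default method that delegates to the old one, so existing implementors keep compiling. The types below are simplified stand-ins for Hive's Privilege/Entity classes.

```java
import java.util.Collection;

// Hypothetical sketch of interface evolution via a default method.
interface AuthProvider {
  void authorize(String[] readPriv, String[] writePriv);

  // New, richer overload: old implementations inherit this default and keep working.
  default void authorize(String[] readPriv, String[] writePriv,
                         Collection<String> inputs, Collection<String> outputs) {
    authorize(readPriv, writePriv); // ignore the extra context by default
  }
}

class LegacyProvider implements AuthProvider {
  boolean called = false;

  @Override
  public void authorize(String[] readPriv, String[] writePriv) {
    called = true; // legacy logic, unaware of inputs/outputs
  }
}
```

Callers can move to the four-argument overload while legacy providers continue to satisfy the interface unchanged.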
[jira] [Updated] (HIVE-23339) SBA does not check permissions for DB location specified in Create database query
[ https://issues.apache.org/jira/browse/HIVE-23339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-23339: - Attachment: HIVE-23339.01.patch Status: Patch Available (was: Open)
[jira] [Commented] (HIVE-23230) "get_splits" udf ignores limit constraint while creating splits
[ https://issues.apache.org/jira/browse/HIVE-23230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091203#comment-17091203 ] Shubham Chaurasia commented on HIVE-23230: -- [~adeshrao] HIVE-23230.2.patch looks good to me for fixing the limit issue, however these test failures seem related; all of them use get_splits(). I cannot access the test report links above. Could you please check these locally? And also reattach the same patch again. cc [~sankarh] > "get_splits" udf ignores limit constraint while creating splits > --- > > Key: HIVE-23230 > URL: https://issues.apache.org/jira/browse/HIVE-23230 > Project: Hive > Issue Type: Bug > Components: HiveServer2 >Affects Versions: 3.1.0 >Reporter: Adesh Kumar Rao >Assignee: Adesh Kumar Rao >Priority: Major > Attachments: HIVE-23230.1.patch, HIVE-23230.2.patch, HIVE-23230.patch > > > Issue: Running the query {noformat}select * from <table> limit n{noformat} > from spark via the hive warehouse connector may return more rows than "n". > This happens because the "get_splits" udf creates splits ignoring the limit > constraint. These splits, when submitted to multiple llap daemons, will return > "n" rows each. > How to reproduce: Needs spark-shell, hive-warehouse-connector and hive on > llap with more than 1 llap daemon running. > run below commands via beeline to create and populate the table > > {noformat} > create table test (id int); > insert into table test values (1); > insert into table test values (2); > insert into table test values (3); > insert into table test values (4); > insert into table test values (5); > insert into table test values (6); > insert into table test values (7); > delete from test where id = 7;{noformat} > now running below query via spark-shell > {noformat} > import com.hortonworks.hwc.HiveWarehouseSession > val hive = HiveWarehouseSession.session(spark).build() > hive.executeQuery("select * from test limit 1").show() > {noformat} > will return more than 1 row.
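The fix direction implied by the bug - making split generation respect the limit - amounts to stopping once the rows covered by the selected splits reach the limit. A hypothetical sketch of that idea (class and method names invented; not the actual get_splits code, which works on real split objects rather than row counts):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: given per-split row counts, keep only as many splits as
// are needed to cover at least `limit` rows, instead of returning all of them.
public class SplitLimiter {
  public static List<Long> limitSplits(List<Long> splitRowCounts, long limit) {
    List<Long> chosen = new ArrayList<>();
    long covered = 0;
    for (long rows : splitRowCounts) {
      if (covered >= limit) {
        break; // enough rows already; further splits would only produce extra output
      }
      chosen.add(rows);
      covered += rows;
    }
    return chosen;
  }
}
```

With this shape, three daemons each holding 3 rows would serve a `limit 1` query from a single split instead of returning 3 rows apiece.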
[jira] [Updated] (HIVE-22842) Timestamp/date vectors in Arrow serializer should use correct calendar for value representation
[ https://issues.apache.org/jira/browse/HIVE-22842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22842: - Attachment: HIVE-22842.05.patch > Timestamp/date vectors in Arrow serializer should use correct calendar for > value representation > --- > > Key: HIVE-22842 > URL: https://issues.apache.org/jira/browse/HIVE-22842 > Project: Hive > Issue Type: Improvement >Reporter: Jesus Camacho Rodriguez >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-22842.01.patch, HIVE-22842.02.patch, > HIVE-22842.03.patch, HIVE-22842.04.patch, HIVE-22842.05.patch > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-22842) Timestamp/date vectors in Arrow serializer should use correct calendar for value representation
[ https://issues.apache.org/jira/browse/HIVE-22842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22842: - Attachment: HIVE-22842.04.patch
[jira] [Updated] (HIVE-23070) LLAP external client does not propagate orc confs to LLAP
[ https://issues.apache.org/jira/browse/HIVE-23070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-23070: - Summary: LLAP external client does not propagate orc confs to LLAP (was: Llap external client does not propagate orc confs to LLAP) > LLAP external client does not propagate orc confs to LLAP > - > > Key: HIVE-23070 > URL: https://issues.apache.org/jira/browse/HIVE-23070 > Project: Hive > Issue Type: Bug > Components: llap >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > When we query through the llap external client, orc confs are not propagated > to (or not respected by) LLAP. > I was trying to pass the conf {{orc.proleptic.gregorian.default}} while > reading data but it was not taking effect.
[jira] [Commented] (HIVE-22842) Timestamp/date vectors in Arrow serializer should use correct calendar for value representation
[ https://issues.apache.org/jira/browse/HIVE-22842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065507#comment-17065507 ] Shubham Chaurasia commented on HIVE-22842: -- [~jcamachorodriguez] Thanks for the review. Added tests with combinations of date and timestamp for both new and legacy files for orc, parquet and avro. Also opened - https://issues.apache.org/jira/browse/HIVE-23070 > Timestamp/date vectors in Arrow serializer should use correct calendar for > value representation > --- > > Key: HIVE-22842 > URL: https://issues.apache.org/jira/browse/HIVE-22842 > Project: Hive > Issue Type: Improvement >Reporter: Jesus Camacho Rodriguez >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-22842.01.patch, HIVE-22842.02.patch, > HIVE-22842.03.patch > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HIVE-23070) Llap external client does not propagate orc confs to LLAP
[ https://issues.apache.org/jira/browse/HIVE-23070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia reassigned HIVE-23070: > Llap external client does not propagate orc confs to LLAP > - > > Key: HIVE-23070 > URL: https://issues.apache.org/jira/browse/HIVE-23070 > Project: Hive > Issue Type: Bug > Components: llap >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > When we query through the LLAP external client, orc confs are not propagated > to (or not respected by) LLAP. > I was trying to pass the conf {{orc.proleptic.gregorian.default}} while > reading data, but it was not taking effect. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-22842) Timestamp/date vectors in Arrow serializer should use correct calendar for value representation
[ https://issues.apache.org/jira/browse/HIVE-22842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22842: - Attachment: HIVE-22842.03.patch > Timestamp/date vectors in Arrow serializer should use correct calendar for > value representation > --- > > Key: HIVE-22842 > URL: https://issues.apache.org/jira/browse/HIVE-22842 > Project: Hive > Issue Type: Improvement >Reporter: Jesus Camacho Rodriguez >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-22842.01.patch, HIVE-22842.02.patch, > HIVE-22842.03.patch > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-23034) Arrow serializer should not keep the reference of arrow offset and validity buffers
[ https://issues.apache.org/jira/browse/HIVE-23034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-23034: - Attachment: HIVE-23034.01.patch Status: Patch Available (was: Open) > Arrow serializer should not keep the reference of arrow offset and validity > buffers > --- > > Key: HIVE-23034 > URL: https://issues.apache.org/jira/browse/HIVE-23034 > Project: Hive > Issue Type: Bug > Components: llap, Serializers/Deserializers >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-23034.01.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Currently, a part of writeList() method in arrow serializer is implemented > like - > {code:java} > final ArrowBuf offsetBuffer = arrowVector.getOffsetBuffer(); > int nextOffset = 0; > for (int rowIndex = 0; rowIndex < size; rowIndex++) { > int selectedIndex = rowIndex; > if (vectorizedRowBatch.selectedInUse) { > selectedIndex = vectorizedRowBatch.selected[rowIndex]; > } > if (hiveVector.isNull[selectedIndex]) { > offsetBuffer.setInt(rowIndex * OFFSET_WIDTH, nextOffset); > } else { > offsetBuffer.setInt(rowIndex * OFFSET_WIDTH, nextOffset); > nextOffset += (int) hiveVector.lengths[selectedIndex]; > arrowVector.setNotNull(rowIndex); > } > } > offsetBuffer.setInt(size * OFFSET_WIDTH, nextOffset); > {code} > 1) Here we obtain a reference to {{final ArrowBuf offsetBuffer = > arrowVector.getOffsetBuffer();}} and keep updating the arrow vector and > offset vector. > Problem - > {{arrowVector.setNotNull(rowIndex)}} keeps checking the index and reallocates > the offset and validity buffers when a threshold is crossed, updates the > references internally and also releases the old buffers (which decrements the > buffer reference count). Now the reference which we obtained in 1) becomes > obsolete. 
Furthermore, if we try to read or write the old buffer, we see - > {code:java} > Caused by: io.netty.util.IllegalReferenceCountException: refCnt: 0 > at > io.netty.buffer.AbstractByteBuf.ensureAccessible(AbstractByteBuf.java:1413) > at io.netty.buffer.ArrowBuf.checkIndexD(ArrowBuf.java:131) > at io.netty.buffer.ArrowBuf.chk(ArrowBuf.java:162) > at io.netty.buffer.ArrowBuf.setInt(ArrowBuf.java:656) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.writeList(Serializer.java:432) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.write(Serializer.java:285) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.writeStruct(Serializer.java:352) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.write(Serializer.java:288) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.writeList(Serializer.java:419) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.write(Serializer.java:285) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.serializeBatch(Serializer.java:205) > {code} > > Solution - > This can be fixed by fetching the buffers ( > {{arrowVector.getOffsetBuffer()}} ) each time we want to update them. > In our internal tests this is seen very frequently on arrow 0.8.0 but not on > 0.10.0; it should be handled the same way for 0.10.0 too, as it does the same > thing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
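The stale-reference failure described in HIVE-23034 can be reproduced in miniature without Arrow. In the sketch below, {{GrowableVector}} and its methods are illustrative stand-ins (assumptions, not Hive or Arrow APIs) for an Arrow vector whose {{setNotNull()}} may reallocate its internal buffer; caching the buffer reference across such calls leaves writes in storage the vector has already replaced (Arrow additionally releases the old buffer, hence the refCnt: 0 error above).

```java
// Plain-Java model of the bug (all names here are illustrative assumptions,
// not Arrow/Hive APIs): holding one buffer reference across calls that may
// reallocate the underlying storage means later writes land in a dead copy.
import java.util.Arrays;

class GrowableVector {
    private int[] offsetBuffer = new int[4];

    // Like arrowVector.getOffsetBuffer(): returns the *current* buffer.
    int[] getOffsetBuffer() { return offsetBuffer; }

    // Like arrowVector.setNotNull(): reallocates once the index crosses
    // capacity, swapping the internal reference to a new, larger buffer.
    void setNotNull(int index) {
        int capacity = offsetBuffer.length;
        if (index >= capacity) {
            while (index >= capacity) capacity *= 2;
            offsetBuffer = Arrays.copyOf(offsetBuffer, capacity);
        }
    }
}

public class StaleBufferDemo {
    public static void main(String[] args) {
        // Buggy pattern: hold one reference for the whole loop.
        GrowableVector buggy = new GrowableVector();
        int[] cached = buggy.getOffsetBuffer();
        for (int i = 0; i < 8; i++) {
            buggy.setNotNull(i);             // may swap the internal buffer
            cached[i % cached.length] = i;   // writes land in the old one
        }

        // Fixed pattern (the HIVE-23034 solution): re-fetch on every write.
        GrowableVector fixed = new GrowableVector();
        for (int i = 0; i < 8; i++) {
            fixed.setNotNull(i);
            fixed.getOffsetBuffer()[i] = i;  // always the live buffer
        }

        System.out.println("buggy lost writes: " + (buggy.getOffsetBuffer()[7] != 7));
        System.out.println("fixed kept writes: " + (fixed.getOffsetBuffer()[7] == 7));
    }
}
```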
[jira] [Updated] (HIVE-22842) Timestamp/date vectors in Arrow serializer should use correct calendar for value representation
[ https://issues.apache.org/jira/browse/HIVE-22842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22842: - Attachment: HIVE-22842.02.patch > Timestamp/date vectors in Arrow serializer should use correct calendar for > value representation > --- > > Key: HIVE-22842 > URL: https://issues.apache.org/jira/browse/HIVE-22842 > Project: Hive > Issue Type: Improvement >Reporter: Jesus Camacho Rodriguez >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-22842.01.patch, HIVE-22842.02.patch > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HIVE-23034) Arrow serializer should not keep the reference of arrow offset and validity buffers
[ https://issues.apache.org/jira/browse/HIVE-23034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia reassigned HIVE-23034: > Arrow serializer should not keep the reference of arrow offset and validity > buffers > --- > > Key: HIVE-23034 > URL: https://issues.apache.org/jira/browse/HIVE-23034 > Project: Hive > Issue Type: Bug > Components: llap, Serializers/Deserializers >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > Currently, a part of writeList() method in arrow serializer is implemented > like - > {code:java} > final ArrowBuf offsetBuffer = arrowVector.getOffsetBuffer(); > int nextOffset = 0; > for (int rowIndex = 0; rowIndex < size; rowIndex++) { > int selectedIndex = rowIndex; > if (vectorizedRowBatch.selectedInUse) { > selectedIndex = vectorizedRowBatch.selected[rowIndex]; > } > if (hiveVector.isNull[selectedIndex]) { > offsetBuffer.setInt(rowIndex * OFFSET_WIDTH, nextOffset); > } else { > offsetBuffer.setInt(rowIndex * OFFSET_WIDTH, nextOffset); > nextOffset += (int) hiveVector.lengths[selectedIndex]; > arrowVector.setNotNull(rowIndex); > } > } > offsetBuffer.setInt(size * OFFSET_WIDTH, nextOffset); > {code} > 1) Here we obtain a reference to {{final ArrowBuf offsetBuffer = > arrowVector.getOffsetBuffer();}} and keep updating the arrow vector and > offset vector. > Problem - > {{arrowVector.setNotNull(rowIndex)}} keeps checking the index and reallocates > the offset and validity buffers when a threshold is crossed, updates the > references internally and also releases the old buffers (which decrements the > buffer reference count). Now the reference which we obtained in 1) becomes > obsolete. 
Furthermore, if we try to read or write the old buffer, we see - > {code:java} > Caused by: io.netty.util.IllegalReferenceCountException: refCnt: 0 > at > io.netty.buffer.AbstractByteBuf.ensureAccessible(AbstractByteBuf.java:1413) > at io.netty.buffer.ArrowBuf.checkIndexD(ArrowBuf.java:131) > at io.netty.buffer.ArrowBuf.chk(ArrowBuf.java:162) > at io.netty.buffer.ArrowBuf.setInt(ArrowBuf.java:656) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.writeList(Serializer.java:432) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.write(Serializer.java:285) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.writeStruct(Serializer.java:352) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.write(Serializer.java:288) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.writeList(Serializer.java:419) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.write(Serializer.java:285) > at > org.apache.hadoop.hive.ql.io.arrow.Serializer.serializeBatch(Serializer.java:205) > {code} > > Solution - > This can be fixed by fetching the buffers ( > {{arrowVector.getOffsetBuffer()}} ) each time we want to update them. > In our internal tests this is seen very frequently on arrow 0.8.0 but not on > 0.10.0; it should be handled the same way for 0.10.0 too, as it does the same > thing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-23022) Arrow deserializer should ensure size of hive vector equal to arrow vector
[ https://issues.apache.org/jira/browse/HIVE-23022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-23022: - Attachment: HIVE-23022.01.patch Status: Patch Available (was: Open) > Arrow deserializer should ensure size of hive vector equal to arrow vector > -- > > Key: HIVE-23022 > URL: https://issues.apache.org/jira/browse/HIVE-23022 > Project: Hive > Issue Type: Bug > Components: llap, Serializers/Deserializers >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-23022.01.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Arrow deserializer - {{org.apache.hadoop.hive.ql.io.arrow.Deserializer}} in > some cases does not set the size of hive vector correctly. Size of hive > vector should be set at least equal to arrow vector to be able to read > (accommodate) it fully. > Following exception can be seen when we try to read (using > {{LlapArrowRowInputFormat}} ) some table which contains complex types (struct > nested in array to be specific) and number of rows in table is more than > default (1024) batch/vector size. > {code:java} > Caused by: java.lang.ArrayIndexOutOfBoundsException: 1024 > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.readStruct(Deserializer.java:440) > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.read(Deserializer.java:143) > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.readList(Deserializer.java:394) > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.read(Deserializer.java:137) > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.deserialize(Deserializer.java:122) > at > org.apache.hadoop.hive.ql.io.arrow.ArrowColumnarBatchSerDe.deserialize(ArrowColumnarBatchSerDe.java:284) > at > org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:75) > ... 23 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
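The idea behind the HIVE-23022 fix — grow the hive-side vector to at least the arrow vector's size before copying values out — can be sketched with plain arrays. {{SimpleColumnVector}} below is an illustrative stand-in (an assumption, not Hive code); the real fix operates on Hive's ColumnVector inside {{org.apache.hadoop.hive.ql.io.arrow.Deserializer}}.

```java
// Hedged sketch of the fix's shape (class and method names are illustrative
// assumptions, not Hive APIs).
import java.util.Arrays;

class SimpleColumnVector {
    long[] vector;
    boolean[] isNull;

    SimpleColumnVector(int size) {
        vector = new long[size];
        isNull = new boolean[size];
    }

    // Grow the backing arrays so at least `size` values fit.
    void ensureSize(int size) {
        if (vector.length < size) {
            vector = Arrays.copyOf(vector, size);
            isNull = Arrays.copyOf(isNull, size);
        }
    }
}

public class DeserializeSketch {
    // Without the ensureSize call, this throws ArrayIndexOutOfBoundsException
    // as soon as the arrow batch holds more rows than the hive vector (1024
    // by default) -- the symptom in the stack trace above.
    static void read(long[] arrowValues, SimpleColumnVector hiveVector) {
        hiveVector.ensureSize(arrowValues.length); // the HIVE-23022 idea
        for (int i = 0; i < arrowValues.length; i++) {
            hiveVector.vector[i] = arrowValues[i];
        }
    }

    public static void main(String[] args) {
        long[] arrowBatch = new long[1500];          // more than 1024 rows
        SimpleColumnVector hive = new SimpleColumnVector(1024);
        read(arrowBatch, hive);                      // succeeds after resize
        System.out.println(hive.vector.length);
    }
}
```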
[jira] [Updated] (HIVE-23022) Arrow deserializer should ensure size of hive vector equal to arrow vector
[ https://issues.apache.org/jira/browse/HIVE-23022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-23022: - Description: Arrow deserializer - {{org.apache.hadoop.hive.ql.io.arrow.Deserializer}} in some cases does not set the size of hive vector correctly. Size of hive vector should be set at least equal to arrow vector to be able to read (accommodate) it fully. Following exception can be seen when we try to read (using {{LlapArrowRowInputFormat}} ) some table which contains complex types (struct nested in array to be specific) and number of rows in table is more than default (1024) batch/vector size. {code:java} Caused by: java.lang.ArrayIndexOutOfBoundsException: 1024 at org.apache.hadoop.hive.ql.io.arrow.Deserializer.readStruct(Deserializer.java:440) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.read(Deserializer.java:143) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.readList(Deserializer.java:394) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.read(Deserializer.java:137) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.deserialize(Deserializer.java:122) at org.apache.hadoop.hive.ql.io.arrow.ArrowColumnarBatchSerDe.deserialize(ArrowColumnarBatchSerDe.java:284) at org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:75) ... 23 more {code} was: Arrow deserializer - {{org.apache.hadoop.hive.ql.io.arrow.Deserializer}} in some cases does not set the size of hive vector correctly. Size of hive vector should be set at least equal to arrow vector to be able to read (accommodate) it fully. Following exception can be seen when we try to read some table which contains complex types (struct nested in array to be specific) and number of rows in table is more than default (1024) batch/vector size. 
{code:java} Caused by: java.lang.ArrayIndexOutOfBoundsException: 1024 at org.apache.hadoop.hive.ql.io.arrow.Deserializer.readStruct(Deserializer.java:440) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.read(Deserializer.java:143) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.readList(Deserializer.java:394) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.read(Deserializer.java:137) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.deserialize(Deserializer.java:122) at org.apache.hadoop.hive.ql.io.arrow.ArrowColumnarBatchSerDe.deserialize(ArrowColumnarBatchSerDe.java:284) at org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:75) ... 23 more {code} > Arrow deserializer should ensure size of hive vector equal to arrow vector > -- > > Key: HIVE-23022 > URL: https://issues.apache.org/jira/browse/HIVE-23022 > Project: Hive > Issue Type: Bug > Components: llap, Serializers/Deserializers >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > Arrow deserializer - {{org.apache.hadoop.hive.ql.io.arrow.Deserializer}} in > some cases does not set the size of hive vector correctly. Size of hive > vector should be set at least equal to arrow vector to be able to read > (accommodate) it fully. > Following exception can be seen when we try to read (using > {{LlapArrowRowInputFormat}} ) some table which contains complex types (struct > nested in array to be specific) and number of rows in table is more than > default (1024) batch/vector size. 
> {code:java} > Caused by: java.lang.ArrayIndexOutOfBoundsException: 1024 > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.readStruct(Deserializer.java:440) > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.read(Deserializer.java:143) > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.readList(Deserializer.java:394) > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.read(Deserializer.java:137) > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.deserialize(Deserializer.java:122) > at > org.apache.hadoop.hive.ql.io.arrow.ArrowColumnarBatchSerDe.deserialize(ArrowColumnarBatchSerDe.java:284) > at > org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:75) > ... 23 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-23022) Arrow deserializer should ensure size of hive vector equal to arrow vector
[ https://issues.apache.org/jira/browse/HIVE-23022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-23022: - Description: Arrow deserializer - {{org.apache.hadoop.hive.ql.io.arrow.Deserializer}} in some cases does not set the size of hive vector correctly. Size of hive vector should be set at least equal to arrow vector to be able to read (accommodate) it fully. Following exception can be seen when we try to read some table which contains complex types (struct nested in array to be specific) and number of rows in table is more than default (1024) batch/vector size. {code:java} Caused by: java.lang.ArrayIndexOutOfBoundsException: 1024 at org.apache.hadoop.hive.ql.io.arrow.Deserializer.readStruct(Deserializer.java:440) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.read(Deserializer.java:143) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.readList(Deserializer.java:394) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.read(Deserializer.java:137) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.deserialize(Deserializer.java:122) at org.apache.hadoop.hive.ql.io.arrow.ArrowColumnarBatchSerDe.deserialize(ArrowColumnarBatchSerDe.java:284) at org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:75) ... 23 more {code} was: Arrow deserializer - {{org.apache.hadoop.hive.ql.io.arrow.Deserializer}} in some cases does not set the size of hive vector correctly. Size of hive vector should be set at least equal to arrow vector to be able to read (accommodate) it fully. Following exception can be seen when we try to read some table which contains complex types (struct nested in list to be specific) and number of rows in table is more than default (1024) batch/vector size. 
{code:java} Caused by: java.lang.ArrayIndexOutOfBoundsException: 1024 at org.apache.hadoop.hive.ql.io.arrow.Deserializer.readStruct(Deserializer.java:440) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.read(Deserializer.java:143) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.readList(Deserializer.java:394) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.read(Deserializer.java:137) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.deserialize(Deserializer.java:122) at org.apache.hadoop.hive.ql.io.arrow.ArrowColumnarBatchSerDe.deserialize(ArrowColumnarBatchSerDe.java:284) at org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:75) ... 23 more {code} > Arrow deserializer should ensure size of hive vector equal to arrow vector > -- > > Key: HIVE-23022 > URL: https://issues.apache.org/jira/browse/HIVE-23022 > Project: Hive > Issue Type: Bug > Components: llap, Serializers/Deserializers >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > Arrow deserializer - {{org.apache.hadoop.hive.ql.io.arrow.Deserializer}} in > some cases does not set the size of hive vector correctly. Size of hive > vector should be set at least equal to arrow vector to be able to read > (accommodate) it fully. > Following exception can be seen when we try to read some table which contains > complex types (struct nested in array to be specific) and number of rows in > table is more than default (1024) batch/vector size. 
> {code:java} > Caused by: java.lang.ArrayIndexOutOfBoundsException: 1024 > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.readStruct(Deserializer.java:440) > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.read(Deserializer.java:143) > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.readList(Deserializer.java:394) > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.read(Deserializer.java:137) > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.deserialize(Deserializer.java:122) > at > org.apache.hadoop.hive.ql.io.arrow.ArrowColumnarBatchSerDe.deserialize(ArrowColumnarBatchSerDe.java:284) > at > org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:75) > ... 23 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-23022) Arrow deserializer should ensure size of hive vector equal to arrow vector
[ https://issues.apache.org/jira/browse/HIVE-23022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-23022: - Description: Arrow deserializer - {{org.apache.hadoop.hive.ql.io.arrow.Deserializer}} in some cases does not set the size of hive vector correctly. Size of hive vector should be set at least equal to arrow vector to be able to read (accommodate) it fully. Following exception can be seen when we try to read some table which contains complex types (struct nested in list to be specific) and number of rows in table is more than default (1024) batch/vector size. {code:java} Caused by: java.lang.ArrayIndexOutOfBoundsException: 1024 at org.apache.hadoop.hive.ql.io.arrow.Deserializer.readStruct(Deserializer.java:440) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.read(Deserializer.java:143) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.readList(Deserializer.java:394) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.read(Deserializer.java:137) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.deserialize(Deserializer.java:122) at org.apache.hadoop.hive.ql.io.arrow.ArrowColumnarBatchSerDe.deserialize(ArrowColumnarBatchSerDe.java:284) at org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:75) ... 23 more {code} was: Arrow deserializer - {{org.apache.hadoop.hive.ql.io.arrow.Deserializer}} in some cases does not set the size of hive vector correctly. Size of hive vector should be set at least equal to arrow vector to be able to read (accommodate) it fully. Following exception can be seen when we try to read some table which contains complex types (struct nested in list to be specific) and table size is more than default (1024) batch/vector size. 
{code:java} Caused by: java.lang.ArrayIndexOutOfBoundsException: 1024 at org.apache.hadoop.hive.ql.io.arrow.Deserializer.readStruct(Deserializer.java:440) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.read(Deserializer.java:143) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.readList(Deserializer.java:394) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.read(Deserializer.java:137) at org.apache.hadoop.hive.ql.io.arrow.Deserializer.deserialize(Deserializer.java:122) at org.apache.hadoop.hive.ql.io.arrow.ArrowColumnarBatchSerDe.deserialize(ArrowColumnarBatchSerDe.java:284) at org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:75) ... 23 more {code} > Arrow deserializer should ensure size of hive vector equal to arrow vector > -- > > Key: HIVE-23022 > URL: https://issues.apache.org/jira/browse/HIVE-23022 > Project: Hive > Issue Type: Bug > Components: llap, Serializers/Deserializers >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > Arrow deserializer - {{org.apache.hadoop.hive.ql.io.arrow.Deserializer}} in > some cases does not set the size of hive vector correctly. Size of hive > vector should be set at least equal to arrow vector to be able to read > (accommodate) it fully. > Following exception can be seen when we try to read some table which contains > complex types (struct nested in list to be specific) and number of rows in > table is more than default (1024) batch/vector size. 
> {code:java} > Caused by: java.lang.ArrayIndexOutOfBoundsException: 1024 > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.readStruct(Deserializer.java:440) > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.read(Deserializer.java:143) > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.readList(Deserializer.java:394) > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.read(Deserializer.java:137) > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.deserialize(Deserializer.java:122) > at > org.apache.hadoop.hive.ql.io.arrow.ArrowColumnarBatchSerDe.deserialize(ArrowColumnarBatchSerDe.java:284) > at > org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:75) > ... 23 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HIVE-23022) Arrow deserializer should ensure size of hive vector equal to arrow vector
[ https://issues.apache.org/jira/browse/HIVE-23022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia reassigned HIVE-23022: > Arrow deserializer should ensure size of hive vector equal to arrow vector > -- > > Key: HIVE-23022 > URL: https://issues.apache.org/jira/browse/HIVE-23022 > Project: Hive > Issue Type: Bug > Components: llap, Serializers/Deserializers >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > Arrow deserializer - {{org.apache.hadoop.hive.ql.io.arrow.Deserializer}} in > some cases does not set the size of hive vector correctly. Size of hive > vector should be set at least equal to arrow vector to be able to read > (accommodate) it fully. > Following exception can be seen when we try to read some table which contains > complex types (struct nested in list to be specific) and table size is more > than default (1024) batch/vector size. > {code:java} > Caused by: java.lang.ArrayIndexOutOfBoundsException: 1024 > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.readStruct(Deserializer.java:440) > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.read(Deserializer.java:143) > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.readList(Deserializer.java:394) > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.read(Deserializer.java:137) > at > org.apache.hadoop.hive.ql.io.arrow.Deserializer.deserialize(Deserializer.java:122) > at > org.apache.hadoop.hive.ql.io.arrow.ArrowColumnarBatchSerDe.deserialize(ArrowColumnarBatchSerDe.java:284) > at > org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:75) > ... 23 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-22842) Timestamp/date vectors in Arrow serializer should use correct calendar for value representation
[ https://issues.apache.org/jira/browse/HIVE-22842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22842: - Attachment: HIVE-22842.01.patch Status: Patch Available (was: Open) > Timestamp/date vectors in Arrow serializer should use correct calendar for > value representation > --- > > Key: HIVE-22842 > URL: https://issues.apache.org/jira/browse/HIVE-22842 > Project: Hive > Issue Type: Improvement >Reporter: Jesus Camacho Rodriguez >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-22842.01.patch > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-22973) Handle 0 length batches in LlapArrowRowRecordReader
[ https://issues.apache.org/jira/browse/HIVE-22973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053402#comment-17053402 ] Shubham Chaurasia commented on HIVE-22973: -- Thanks [~jdere] > Handle 0 length batches in LlapArrowRowRecordReader > --- > > Key: HIVE-22973 > URL: https://issues.apache.org/jira/browse/HIVE-22973 > Project: Hive > Issue Type: Bug > Components: llap >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-22973.01.patch, HIVE-22973.02.patch > > Time Spent: 10m > Remaining Estimate: 0h > > In https://issues.apache.org/jira/browse/HIVE-22856, we allowed > {{LlapArrowBatchRecordReader}} to permit 0-length arrow batches. > {{LlapArrowRowRecordReader}}, which is a wrapper over > {{LlapArrowBatchRecordReader}}, should also handle this. > On one of the systems (cannot be reproduced easily) where we were running > test {{TestJdbcWithMiniLlapVectorArrow}}, we saw the following exception - > {code:java} > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 30.173 s <<< > FAILURE! - in org.apache.hive.jdbc.TestJdbcWithMiniLlapVectorArrow > testLlapInputFormatEndToEnd(org.apache.hive.jdbc.TestJdbcWithMiniLlapVectorArrow) > Time elapsed: 6.476 s <<< ERROR! 
> java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException: 0 > at > org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:80) > at > org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:41) > at > org.apache.hive.jdbc.BaseJdbcWithMiniLlap.processQuery(BaseJdbcWithMiniLlap.java:540) > at > org.apache.hive.jdbc.BaseJdbcWithMiniLlap.processQuery(BaseJdbcWithMiniLlap.java:504) > at > org.apache.hive.jdbc.BaseJdbcWithMiniLlap.testLlapInputFormatEndToEnd(BaseJdbcWithMiniLlap.java:236) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74) > Caused by: java.lang.ArrayIndexOutOfBoundsException: 0 > at > org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:77) > ... 13 more > {code} > cc [~maheshk114] [~jdere] -- This message was sent by Atlassian Jira (v8.3.4#803005)
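The HIVE-22973 guard's shape can be sketched without the real classes. Below, {{BatchSkippingReader}} is an analogy (an assumption, not the actual LlapArrowRowRecordReader code): a row reader wrapping a batch reader must loop past batches of length 0 instead of indexing into them, since HIVE-22856 made empty batches legal.

```java
// Hedged sketch: all names are illustrative stand-ins, not Hive classes.
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class BatchSkippingReader {
    private final Iterator<int[]> batches; // each int[] models one row batch
    private int[] current = new int[0];
    private int rowIndex = 0;

    BatchSkippingReader(Iterator<int[]> batches) { this.batches = batches; }

    // Returns the next row value, or null at end of stream.
    Integer next() {
        // Keep fetching until a non-empty batch arrives; indexing into a
        // 0-length batch is what produced the
        // ArrayIndexOutOfBoundsException above.
        while (rowIndex >= current.length) {
            if (!batches.hasNext()) return null;
            current = batches.next();
            rowIndex = 0;
        }
        return current[rowIndex++];
    }
}

public class ZeroLengthBatchDemo {
    public static void main(String[] args) {
        List<int[]> stream = Arrays.asList(
                new int[] {1, 2}, new int[0], new int[] {3}); // empty batch mid-stream
        BatchSkippingReader reader = new BatchSkippingReader(stream.iterator());
        StringBuilder rows = new StringBuilder();
        Integer v;
        while ((v = reader.next()) != null) rows.append(v);
        System.out.println(rows); // all rows survive the empty batch
    }
}
```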
[jira] [Commented] (HIVE-22973) Handle 0 length batches in LlapArrowRowRecordReader
[ https://issues.apache.org/jira/browse/HIVE-22973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050962#comment-17050962 ] Shubham Chaurasia commented on HIVE-22973: -- [~maheshk114] [~jdere] Can you please review? > Handle 0 length batches in LlapArrowRowRecordReader > --- > > Key: HIVE-22973 > URL: https://issues.apache.org/jira/browse/HIVE-22973 > Project: Hive > Issue Type: Bug > Components: llap >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-22973.01.patch > > Time Spent: 10m > Remaining Estimate: 0h > > In https://issues.apache.org/jira/browse/HIVE-22856, we allowed > {{LlapArrowBatchRecordReader}} to permit 0-length arrow batches. > {{LlapArrowRowRecordReader}}, which is a wrapper over > {{LlapArrowBatchRecordReader}}, should also handle this. > On one of the systems (cannot be reproduced easily) where we were running > test {{TestJdbcWithMiniLlapVectorArrow}}, we saw the following exception - > {code:java} > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 30.173 s <<< > FAILURE! - in org.apache.hive.jdbc.TestJdbcWithMiniLlapVectorArrow > testLlapInputFormatEndToEnd(org.apache.hive.jdbc.TestJdbcWithMiniLlapVectorArrow) > Time elapsed: 6.476 s <<< ERROR! 
> java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException: 0 > at > org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:80) > at > org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:41) > at > org.apache.hive.jdbc.BaseJdbcWithMiniLlap.processQuery(BaseJdbcWithMiniLlap.java:540) > at > org.apache.hive.jdbc.BaseJdbcWithMiniLlap.processQuery(BaseJdbcWithMiniLlap.java:504) > at > org.apache.hive.jdbc.BaseJdbcWithMiniLlap.testLlapInputFormatEndToEnd(BaseJdbcWithMiniLlap.java:236) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74) > Caused by: java.lang.ArrayIndexOutOfBoundsException: 0 > at > org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:77) > ... 13 more > {code} > cc [~maheshk114] [~jdere] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-22973) Handle 0 length batches in LlapArrowRowRecordReader
[ https://issues.apache.org/jira/browse/HIVE-22973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22973: - Description: In https://issues.apache.org/jira/browse/HIVE-22856, we allowed {{LlapArrowBatchRecordReader}} to permit 0 length arrow batches. {{LlapArrowRowRecordReader}} which is a wrapper over {{LlapArrowBatchRecordReader}} should also handle this. On one of the systems (cannot be reproduced easily) where we were running test {{TestJdbcWithMiniLlapVectorArrow}}, we saw following exception - {code:java} Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 30.173 s <<< FAILURE! - in org.apache.hive.jdbc.TestJdbcWithMiniLlapVectorArrow testLlapInputFormatEndToEnd(org.apache.hive.jdbc.TestJdbcWithMiniLlapVectorArrow) Time elapsed: 6.476 s <<< ERROR! java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException: 0 at org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:80) at org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:41) at org.apache.hive.jdbc.BaseJdbcWithMiniLlap.processQuery(BaseJdbcWithMiniLlap.java:540) at org.apache.hive.jdbc.BaseJdbcWithMiniLlap.processQuery(BaseJdbcWithMiniLlap.java:504) at org.apache.hive.jdbc.BaseJdbcWithMiniLlap.testLlapInputFormatEndToEnd(BaseJdbcWithMiniLlap.java:236) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74) Caused by: java.lang.ArrayIndexOutOfBoundsException: 0 at org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:77) ... 13 more {code} cc [~maheshk114] [~jdere] was: In https://issues.apache.org/jira/browse/HIVE-22856, we allowed {{LlapArrowBatchRecordReader}} to permit 0 length arrow batches. {{LlapArrowRowRecordReader}} which is a wrapper over {{LlapArrowBatchRecordReader}} should also handle this. On one of the systems (cannot be reproduced easily) where we were running test {{TestJdbcWithMiniLlapVectorArrow}}, we saw following exception - {code:java} Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 30.173 s <<< FAILURE! - in org.apache.hive.jdbc.TestJdbcWithMiniLlapVectorArrow testLlapInputFormatEndToEnd(org.apache.hive.jdbc.TestJdbcWithMiniLlapVectorArrow) Time elapsed: 6.476 s <<< ERROR! 
java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException: 0 at org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:80) at org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:41) at org.apache.hive.jdbc.BaseJdbcWithMiniLlap.processQuery(BaseJdbcWithMiniLlap.java:540) at org.apache.hive.jdbc.BaseJdbcWithMiniLlap.processQuery(BaseJdbcWithMiniLlap.java:504) at org.apache.hive.jdbc.BaseJdbcWithMiniLlap.testLlapInputFormatEndToEnd(BaseJdbcWithMiniLlap.java:236) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74) Caused by: java.lang.ArrayIndexOutOfBoundsException: 0 at org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:77) ... 13 more {code} > Handle 0 length batches in LlapArrowRowRecordReader > --- > > Key: HIVE-22973 > URL: https://issues.apache.org/jira/browse/HIVE-22973 > Project: Hi
[jira] [Updated] (HIVE-22973) Handle 0 length batches in LlapArrowRowRecordReader
[ https://issues.apache.org/jira/browse/HIVE-22973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22973: - Attachment: HIVE-22973.01.patch Status: Patch Available (was: Open) > Handle 0 length batches in LlapArrowRowRecordReader > --- > > Key: HIVE-22973 > URL: https://issues.apache.org/jira/browse/HIVE-22973 > Project: Hive > Issue Type: Bug > Components: llap >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-22973.01.patch > > Time Spent: 10m > Remaining Estimate: 0h > > In https://issues.apache.org/jira/browse/HIVE-22856, we allowed > {{LlapArrowBatchRecordReader}} to permit 0 length arrow batches. > {{LlapArrowRowRecordReader}} which is a wrapper over > {{LlapArrowBatchRecordReader}} should also handle this. > On one of the systems (cannot be reproduced easily) where we were running > test {{TestJdbcWithMiniLlapVectorArrow}}, we saw following exception - > {code:java} > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 30.173 s <<< > FAILURE! - in org.apache.hive.jdbc.TestJdbcWithMiniLlapVectorArrow > testLlapInputFormatEndToEnd(org.apache.hive.jdbc.TestJdbcWithMiniLlapVectorArrow) > Time elapsed: 6.476 s <<< ERROR! 
> java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException: 0 > at > org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:80) > at > org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:41) > at > org.apache.hive.jdbc.BaseJdbcWithMiniLlap.processQuery(BaseJdbcWithMiniLlap.java:540) > at > org.apache.hive.jdbc.BaseJdbcWithMiniLlap.processQuery(BaseJdbcWithMiniLlap.java:504) > at > org.apache.hive.jdbc.BaseJdbcWithMiniLlap.testLlapInputFormatEndToEnd(BaseJdbcWithMiniLlap.java:236) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74) > Caused by: java.lang.ArrayIndexOutOfBoundsException: 0 > at > org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:77) > ... 13 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HIVE-22973) Handle 0 length batches in LlapArrowRowRecordReader
[ https://issues.apache.org/jira/browse/HIVE-22973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia reassigned HIVE-22973: > Handle 0 length batches in LlapArrowRowRecordReader > --- > > Key: HIVE-22973 > URL: https://issues.apache.org/jira/browse/HIVE-22973 > Project: Hive > Issue Type: Bug > Components: llap >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > In https://issues.apache.org/jira/browse/HIVE-22856, we allowed > {{LlapArrowBatchRecordReader}} to permit 0 length arrow batches. > {{LlapArrowRowRecordReader}} which is a wrapper over > {{LlapArrowBatchRecordReader}} should also handle this. > On one of the systems (cannot be reproduced easily) where we were running > test {{TestJdbcWithMiniLlapVectorArrow}}, we saw following exception - > {code:java} > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 30.173 s <<< > FAILURE! - in org.apache.hive.jdbc.TestJdbcWithMiniLlapVectorArrow > testLlapInputFormatEndToEnd(org.apache.hive.jdbc.TestJdbcWithMiniLlapVectorArrow) > Time elapsed: 6.476 s <<< ERROR! 
> java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException: 0 > at > org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:80) > at > org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:41) > at > org.apache.hive.jdbc.BaseJdbcWithMiniLlap.processQuery(BaseJdbcWithMiniLlap.java:540) > at > org.apache.hive.jdbc.BaseJdbcWithMiniLlap.processQuery(BaseJdbcWithMiniLlap.java:504) > at > org.apache.hive.jdbc.BaseJdbcWithMiniLlap.testLlapInputFormatEndToEnd(BaseJdbcWithMiniLlap.java:236) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74) > Caused by: java.lang.ArrayIndexOutOfBoundsException: 0 > at > org.apache.hadoop.hive.llap.LlapArrowRowRecordReader.next(LlapArrowRowRecordReader.java:77) > ... 13 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-22840) Race condition in formatters of TimestampColumnVector and DateColumnVector
[ https://issues.apache.org/jira/browse/HIVE-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048777#comment-17048777 ] Shubham Chaurasia commented on HIVE-22840: -- [~jcamachorodriguez] Oh, you already committed. Thanks! > Race condition in formatters of TimestampColumnVector and DateColumnVector > --- > > Key: HIVE-22840 > URL: https://issues.apache.org/jira/browse/HIVE-22840 > Project: Hive > Issue Type: Bug > Components: storage-api >Reporter: László Bodor >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-22840.03.patch, HIVE-22840.04.patch, > HIVE-22840.05.patch, HIVE-22840.1.patch, HIVE-22840.2.patch, HIVE-22840.patch > > Time Spent: 20m > Remaining Estimate: 0h > > HIVE-22405 added support for the proleptic calendar. It uses Java's > SimpleDateFormat/Calendar APIs, which are not thread-safe and cause races in > some scenarios. > As a result of those race conditions, we see some exceptions like > {code:java} > 1) java.lang.NumberFormatException: For input string: "" > OR > java.lang.NumberFormatException: For input string: ".821582E.821582E44" > OR > 2) Caused by: java.lang.ArrayIndexOutOfBoundsException: -5325980 > at > sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate(BaseCalendar.java:453) > at > java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2397) > {code} > This issue is to address those thread-safety issues/race conditions. > cc [~jcamachorodriguez] [~abstractdog] [~omalley]
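[Editor's note] For illustration of the underlying hazard: SimpleDateFormat and Calendar carry mutable internal state, so sharing one instance across threads races exactly as described. A common remedy, shown here as a sketch (not the actual HIVE-22840 patch, and with hypothetical class/method names), is one formatter instance per thread via ThreadLocal:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Sketch only: each thread lazily gets its own SimpleDateFormat, so
// concurrent format() calls never touch the same mutable formatter state.
class ThreadSafeFormatterSketch {
    private static final ThreadLocal<SimpleDateFormat> FORMATTER =
        ThreadLocal.withInitial(() -> {
            SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
            f.setTimeZone(TimeZone.getTimeZone("UTC"));
            return f;
        });

    static String format(long epochMillis) {
        // No synchronization needed: the formatter is thread-confined.
        return FORMATTER.get().format(new Date(epochMillis));
    }
}
```

java.time's DateTimeFormatter is immutable and thread-safe, so it is the other standard way out; ThreadLocal is the smaller change when the SimpleDateFormat-based code must stay.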
[jira] [Commented] (HIVE-22840) Race condition in formatters of TimestampColumnVector and DateColumnVector
[ https://issues.apache.org/jira/browse/HIVE-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048776#comment-17048776 ] Shubham Chaurasia commented on HIVE-22840: -- [~jcamachorodriguez] Tests are all green now. > Race condition in formatters of TimestampColumnVector and DateColumnVector > --- > > Key: HIVE-22840 > URL: https://issues.apache.org/jira/browse/HIVE-22840 > Project: Hive > Issue Type: Bug > Components: storage-api >Reporter: László Bodor >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-22840.03.patch, HIVE-22840.04.patch, > HIVE-22840.05.patch, HIVE-22840.1.patch, HIVE-22840.2.patch, HIVE-22840.patch > > Time Spent: 20m > Remaining Estimate: 0h > > HIVE-22405 added support for proleptic calendar. It uses java's > SimpleDateFormat/Calendar APIs which are not thread-safe and cause race in > some scenarios. > As a result of those race conditions, we see some exceptions like > {code:java} > 1) java.lang.NumberFormatException: For input string: "" > OR > java.lang.NumberFormatException: For input string: ".821582E.821582E44" > OR > 2) Caused by: java.lang.ArrayIndexOutOfBoundsException: -5325980 > at > sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate(BaseCalendar.java:453) > at > java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2397) > {code} > This issue is to address those thread-safety issues/race conditions. > cc [~jcamachorodriguez] [~abstractdog] [~omalley] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-22903) Vectorized row_number() resets the row number after one batch in case of constant expression in partition clause
[ https://issues.apache.org/jira/browse/HIVE-22903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048774#comment-17048774 ] Shubham Chaurasia commented on HIVE-22903: -- [~rameshkumar] Tests are all green now. Can you please have a look at the patch? > Vectorized row_number() resets the row number after one batch in case of > constant expression in partition clause > > > Key: HIVE-22903 > URL: https://issues.apache.org/jira/browse/HIVE-22903 > Project: Hive > Issue Type: Bug > Components: UDF, Vectorization >Affects Versions: 4.0.0 >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-22903.01.patch, HIVE-22903.02.patch, > HIVE-22903.03.patch, HIVE-22903.04.patch, HIVE-22903.patch > > Time Spent: 10m > Remaining Estimate: 0h > > The vectorized row_number() implementation resets the row number when a constant > expression is passed in the partition clause. > Repro query: > {code} > select row_number() over(partition by 1) r1, t from over10k_n8; > Or > select row_number() over() r1, t from over10k_n8; > {code} > where table over10k_n8 contains more than 1024 records. > This happens because currently in VectorPTFOperator, we reset evaluators when > only a partition clause is present. > {code:java} > // If we are only processing a PARTITION BY, reset our evaluators. > if (!isPartitionOrderBy) { > groupBatches.resetEvaluators(); > } > {code} > To resolve, we should also check whether the entire partition clause is a constant > expression; if so, we should not call > {{groupBatches.resetEvaluators()}}
[jira] [Updated] (HIVE-22903) Vectorized row_number() resets the row number after one batch in case of constant expression in partition clause
[ https://issues.apache.org/jira/browse/HIVE-22903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22903: - Attachment: HIVE-22903.04.patch > Vectorized row_number() resets the row number after one batch in case of > constant expression in partition clause > > > Key: HIVE-22903 > URL: https://issues.apache.org/jira/browse/HIVE-22903 > Project: Hive > Issue Type: Bug > Components: UDF, Vectorization >Affects Versions: 4.0.0 >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-22903.01.patch, HIVE-22903.02.patch, > HIVE-22903.03.patch, HIVE-22903.04.patch, HIVE-22903.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Vectorized row number implementation resets the row number when constant > expression is passed in partition clause. > Repro Query > {code} > select row_number() over(partition by 1) r1, t from over10k_n8; > Or > select row_number() over() r1, t from over10k_n8; > {code} > where table over10k_n8 contains more than 1024 records. > This happens because currently in VectorPTFOperator, we reset evaluators if > only partition clause is there. > {code:java} > // If we are only processing a PARTITION BY, reset our evaluators. > if (!isPartitionOrderBy) { > groupBatches.resetEvaluators(); > } > {code} > To resolve, we should also check if the entire partition clause is a constant > expression, if it is so then we should not do > {{groupBatches.resetEvaluators()}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-22840) Race condition in formatters of TimestampColumnVector and DateColumnVector
[ https://issues.apache.org/jira/browse/HIVE-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22840: - Attachment: HIVE-22840.05.patch > Race condition in formatters of TimestampColumnVector and DateColumnVector > --- > > Key: HIVE-22840 > URL: https://issues.apache.org/jira/browse/HIVE-22840 > Project: Hive > Issue Type: Bug > Components: storage-api >Reporter: László Bodor >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-22840.03.patch, HIVE-22840.04.patch, > HIVE-22840.05.patch, HIVE-22840.1.patch, HIVE-22840.2.patch, HIVE-22840.patch > > Time Spent: 10m > Remaining Estimate: 0h > > HIVE-22405 added support for proleptic calendar. It uses java's > SimpleDateFormat/Calendar APIs which are not thread-safe and cause race in > some scenarios. > As a result of those race conditions, we see some exceptions like > {code:java} > 1) java.lang.NumberFormatException: For input string: "" > OR > java.lang.NumberFormatException: For input string: ".821582E.821582E44" > OR > 2) Caused by: java.lang.ArrayIndexOutOfBoundsException: -5325980 > at > sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate(BaseCalendar.java:453) > at > java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2397) > {code} > This issue is to address those thread-safety issues/race conditions. > cc [~jcamachorodriguez] [~abstractdog] [~omalley] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-22903) Vectorized row_number() resets the row number after one batch in case of constant expression in partition clause
[ https://issues.apache.org/jira/browse/HIVE-22903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22903: - Attachment: HIVE-22903.03.patch > Vectorized row_number() resets the row number after one batch in case of > constant expression in partition clause > > > Key: HIVE-22903 > URL: https://issues.apache.org/jira/browse/HIVE-22903 > Project: Hive > Issue Type: Bug > Components: UDF, Vectorization >Affects Versions: 4.0.0 >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-22903.01.patch, HIVE-22903.02.patch, > HIVE-22903.03.patch, HIVE-22903.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Vectorized row number implementation resets the row number when constant > expression is passed in partition clause. > Repro Query > {code} > select row_number() over(partition by 1) r1, t from over10k_n8; > Or > select row_number() over() r1, t from over10k_n8; > {code} > where table over10k_n8 contains more than 1024 records. > This happens because currently in VectorPTFOperator, we reset evaluators if > only partition clause is there. > {code:java} > // If we are only processing a PARTITION BY, reset our evaluators. > if (!isPartitionOrderBy) { > groupBatches.resetEvaluators(); > } > {code} > To resolve, we should also check if the entire partition clause is a constant > expression, if it is so then we should not do > {{groupBatches.resetEvaluators()}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-22903) Vectorized row_number() resets the row number after one batch in case of constant expression in partition clause
[ https://issues.apache.org/jira/browse/HIVE-22903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047201#comment-17047201 ] Shubham Chaurasia commented on HIVE-22903: -- Attaching the patch again as the tests didn't trigger. > Vectorized row_number() resets the row number after one batch in case of > constant expression in partition clause > > > Key: HIVE-22903 > URL: https://issues.apache.org/jira/browse/HIVE-22903 > Project: Hive > Issue Type: Bug > Components: UDF, Vectorization >Affects Versions: 4.0.0 >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-22903.01.patch, HIVE-22903.02.patch, > HIVE-22903.03.patch, HIVE-22903.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Vectorized row number implementation resets the row number when constant > expression is passed in partition clause. > Repro Query > {code} > select row_number() over(partition by 1) r1, t from over10k_n8; > Or > select row_number() over() r1, t from over10k_n8; > {code} > where table over10k_n8 contains more than 1024 records. > This happens because currently in VectorPTFOperator, we reset evaluators if > only partition clause is there. > {code:java} > // If we are only processing a PARTITION BY, reset our evaluators. > if (!isPartitionOrderBy) { > groupBatches.resetEvaluators(); > } > {code} > To resolve, we should also check if the entire partition clause is a constant > expression, if it is so then we should not do > {{groupBatches.resetEvaluators()}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-22903) Vectorized row_number() resets the row number after one batch in case of constant expression in partition clause
[ https://issues.apache.org/jira/browse/HIVE-22903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046793#comment-17046793 ] Shubham Chaurasia commented on HIVE-22903: -- [~rameshkumar] Thanks for the suggestions. Yes, it was not related to constants. It's related to batch size. Resetting evaluator only when isLastGroupBatch=true fixed all the cases. Fixed https://issues.apache.org/jira/browse/HIVE-22909 as well. I uploaded a new patch with this approach. {code:java} if (!isPartitionOrderBy) { // To keep the row counting correct, don't reset for row_number evaluator if it's not a isLastGroupBatch if (!isLastGroupBatch && isRowNumberFunction()) { return; } groupBatches.resetEvaluators(); } {code} However I think this can be safely generalized for all the functions like - {code:java} if (!isPartitionOrderBy && isLastGroupBatch) { groupBatches.resetEvaluators(); } {code} Will give this a try tomorrow. > Vectorized row_number() resets the row number after one batch in case of > constant expression in partition clause > > > Key: HIVE-22903 > URL: https://issues.apache.org/jira/browse/HIVE-22903 > Project: Hive > Issue Type: Bug > Components: UDF, Vectorization >Affects Versions: 4.0.0 >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-22903.01.patch, HIVE-22903.02.patch, > HIVE-22903.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Vectorized row number implementation resets the row number when constant > expression is passed in partition clause. > Repro Query > {code} > select row_number() over(partition by 1) r1, t from over10k_n8; > Or > select row_number() over() r1, t from over10k_n8; > {code} > where table over10k_n8 contains more than 1024 records. > This happens because currently in VectorPTFOperator, we reset evaluators if > only partition clause is there. > {code:java} > // If we are only processing a PARTITION BY, reset our evaluators. 
> if (!isPartitionOrderBy) { > groupBatches.resetEvaluators(); > } > {code} > To resolve, we should also check if the entire partition clause is a constant > expression, if it is so then we should not do > {{groupBatches.resetEvaluators()}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
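[Editor's note] The isLastGroupBatch fix discussed in the comment above can be illustrated with a minimal counter. The names here are hypothetical stand-ins, not Hive's actual VectorPTF classes: the row counter must survive across the ~1024-row batches of one partition and reset only once the group's last batch has been processed.

```java
// Hypothetical sketch of a row_number() evaluator. Resetting after every
// batch (the old behavior) restarts numbering at 1025; resetting only on
// the group's last batch keeps numbering continuous within a partition.
class RowNumberEvaluatorSketch {
    private int rowNumber = 1;

    /** Assigns consecutive row numbers to one batch of the current group. */
    int[] evaluateBatch(int batchSize) {
        int[] out = new int[batchSize];
        for (int i = 0; i < batchSize; i++) {
            out[i] = rowNumber++;
        }
        return out;
    }

    /** Mirrors the proposed rule: reset only once the group has ended. */
    void maybeReset(boolean isLastGroupBatch) {
        if (isLastGroupBatch) {
            rowNumber = 1;
        }
    }
}
```

With this rule a second batch of the same partition continues from where the first left off, and numbering restarts only when a new partition begins, which is the behavior the repro query expects.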
[jira] [Updated] (HIVE-22903) Vectorized row_number() resets the row number after one batch in case of constant expression in partition clause
[ https://issues.apache.org/jira/browse/HIVE-22903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22903: - Attachment: HIVE-22903.02.patch > Vectorized row_number() resets the row number after one batch in case of > constant expression in partition clause > > > Key: HIVE-22903 > URL: https://issues.apache.org/jira/browse/HIVE-22903 > Project: Hive > Issue Type: Bug > Components: UDF, Vectorization >Affects Versions: 4.0.0 >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-22903.01.patch, HIVE-22903.02.patch, > HIVE-22903.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Vectorized row number implementation resets the row number when constant > expression is passed in partition clause. > Repro Query > {code} > select row_number() over(partition by 1) r1, t from over10k_n8; > Or > select row_number() over() r1, t from over10k_n8; > {code} > where table over10k_n8 contains more than 1024 records. > This happens because currently in VectorPTFOperator, we reset evaluators if > only partition clause is there. > {code:java} > // If we are only processing a PARTITION BY, reset our evaluators. > if (!isPartitionOrderBy) { > groupBatches.resetEvaluators(); > } > {code} > To resolve, we should also check if the entire partition clause is a constant > expression, if it is so then we should not do > {{groupBatches.resetEvaluators()}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-22840) Race condition in formatters of TimestampColumnVector and DateColumnVector
[ https://issues.apache.org/jira/browse/HIVE-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22840: - Attachment: HIVE-22840.04.patch > Race condition in formatters of TimestampColumnVector and DateColumnVector > --- > > Key: HIVE-22840 > URL: https://issues.apache.org/jira/browse/HIVE-22840 > Project: Hive > Issue Type: Bug > Components: storage-api >Reporter: László Bodor >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-22840.03.patch, HIVE-22840.04.patch, > HIVE-22840.1.patch, HIVE-22840.2.patch, HIVE-22840.patch > > Time Spent: 10m > Remaining Estimate: 0h > > HIVE-22405 added support for proleptic calendar. It uses java's > SimpleDateFormat/Calendar APIs which are not thread-safe and cause race in > some scenarios. > As a result of those race conditions, we see some exceptions like > {code:java} > 1) java.lang.NumberFormatException: For input string: "" > OR > java.lang.NumberFormatException: For input string: ".821582E.821582E44" > OR > 2) Caused by: java.lang.ArrayIndexOutOfBoundsException: -5325980 > at > sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate(BaseCalendar.java:453) > at > java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2397) > {code} > This issue is to address those thread-safety issues/race conditions. > cc [~jcamachorodriguez] [~abstractdog] [~omalley] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-22840) Race condition in formatters of TimestampColumnVector and DateColumnVector
[ https://issues.apache.org/jira/browse/HIVE-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22840: - Attachment: HIVE-22840.03.patch > Race condition in formatters of TimestampColumnVector and DateColumnVector > --- > > Key: HIVE-22840 > URL: https://issues.apache.org/jira/browse/HIVE-22840 > Project: Hive > Issue Type: Bug > Components: storage-api >Reporter: László Bodor >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-22840.03.patch, HIVE-22840.1.patch, > HIVE-22840.2.patch, HIVE-22840.patch > > Time Spent: 10m > Remaining Estimate: 0h > > HIVE-22405 added support for proleptic calendar. It uses java's > SimpleDateFormat/Calendar APIs which are not thread-safe and cause race in > some scenarios. > As a result of those race conditions, we see some exceptions like > {code:java} > 1) java.lang.NumberFormatException: For input string: "" > OR > java.lang.NumberFormatException: For input string: ".821582E.821582E44" > OR > 2) Caused by: java.lang.ArrayIndexOutOfBoundsException: -5325980 > at > sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate(BaseCalendar.java:453) > at > java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2397) > {code} > This issue is to address those thread-safety issues/race conditions. > cc [~jcamachorodriguez] [~abstractdog] [~omalley] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-22840) Race condition in formatters of TimestampColumnVector and DateColumnVector
[ https://issues.apache.org/jira/browse/HIVE-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17043154#comment-17043154 ] Shubham Chaurasia commented on HIVE-22840: -- [~jcamachorodriguez] Oh sorry, created a PR - https://github.com/apache/hive/pull/922 Latest patch was - https://issues.apache.org/jira/secure/attachment/12993999/HIVE-22840.patch > Race condition in formatters of TimestampColumnVector and DateColumnVector > --- > > Key: HIVE-22840 > URL: https://issues.apache.org/jira/browse/HIVE-22840 > Project: Hive > Issue Type: Bug > Components: storage-api >Reporter: László Bodor >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-22840.1.patch, HIVE-22840.2.patch, HIVE-22840.patch > > Time Spent: 10m > Remaining Estimate: 0h > > HIVE-22405 added support for proleptic calendar. It uses java's > SimpleDateFormat/Calendar APIs which are not thread-safe and cause race in > some scenarios. > As a result of those race conditions, we see some exceptions like > {code:java} > 1) java.lang.NumberFormatException: For input string: "" > OR > java.lang.NumberFormatException: For input string: ".821582E.821582E44" > OR > 2) Caused by: java.lang.ArrayIndexOutOfBoundsException: -5325980 > at > sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate(BaseCalendar.java:453) > at > java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2397) > {code} > This issue is to address those thread-safety issues/race conditions. > cc [~jcamachorodriguez] [~abstractdog] [~omalley] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-22840) Race condition in formatters of TimestampColumnVector and DateColumnVector
[ https://issues.apache.org/jira/browse/HIVE-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041011#comment-17041011 ] Shubham Chaurasia commented on HIVE-22840: -- [~abstractdog] [~jcamachorodriguez] Can you please review ? Moved {{CalendarUtils}} from hive-common to storage-api to prevent cyclic dependency (hive-common already depends on storage-api). > Race condition in formatters of TimestampColumnVector and DateColumnVector > --- > > Key: HIVE-22840 > URL: https://issues.apache.org/jira/browse/HIVE-22840 > Project: Hive > Issue Type: Bug > Components: storage-api >Reporter: László Bodor >Assignee: Shubham Chaurasia >Priority: Major > Attachments: HIVE-22840.1.patch, HIVE-22840.2.patch, HIVE-22840.patch > > > HIVE-22405 added support for proleptic calendar. It uses java's > SimpleDateFormat/Calendar APIs which are not thread-safe and cause race in > some scenarios. > As a result of those race conditions, we see some exceptions like > {code:java} > 1) java.lang.NumberFormatException: For input string: "" > OR > java.lang.NumberFormatException: For input string: ".821582E.821582E44" > OR > 2) Caused by: java.lang.ArrayIndexOutOfBoundsException: -5325980 > at > sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate(BaseCalendar.java:453) > at > java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2397) > {code} > This issue is to address those thread-safety issues/race conditions. > cc [~jcamachorodriguez] [~abstractdog] [~omalley] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-22840) Race condition in formatters of TimestampColumnVector and DateColumnVector
[ https://issues.apache.org/jira/browse/HIVE-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22840: - Attachment: HIVE-22840.patch > Race condition in formatters of TimestampColumnVector and DateColumnVector > --- > > Key: HIVE-22840 > URL: https://issues.apache.org/jira/browse/HIVE-22840 > Project: Hive > Issue Type: Bug > Components: storage-api >Reporter: László Bodor >Assignee: Shubham Chaurasia >Priority: Major > Attachments: HIVE-22840.1.patch, HIVE-22840.2.patch, HIVE-22840.patch > > > HIVE-22405 added support for proleptic calendar. It uses java's > SimpleDateFormat/Calendar APIs which are not thread-safe and cause race in > some scenarios. > As a result of those race conditions, we see some exceptions like > {code:java} > 1) java.lang.NumberFormatException: For input string: "" > OR > java.lang.NumberFormatException: For input string: ".821582E.821582E44" > OR > 2) Caused by: java.lang.ArrayIndexOutOfBoundsException: -5325980 > at > sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate(BaseCalendar.java:453) > at > java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2397) > {code} > This issue is to address those thread-safety issues/race conditions. > cc [~jcamachorodriguez] [~abstractdog] [~omalley] -- This message was sent by Atlassian Jira (v8.3.4#803005)
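The HIVE-22840 race arises because SimpleDateFormat (and its internal Calendar) is mutable and not thread-safe, so sharing one instance across formatter threads corrupts state and produces the NumberFormatException/ArrayIndexOutOfBoundsException shown above. A minimal standalone sketch of one common mitigation, giving each thread its own formatter via ThreadLocal. This is an illustration of the failure mode and a generic fix, not the actual storage-api patch; the class and pattern here are made up for the demo:

```java
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadSafeFormatting {
    // SimpleDateFormat is documented as not synchronized; a single shared
    // instance mutates internal Calendar fields during format/parse, which is
    // how concurrent callers end up with garbage like ".821582E.821582E44".
    // ThreadLocal gives every thread a private instance instead.
    private static final ThreadLocal<SimpleDateFormat> FMT =
        ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd"));

    public static String format(long epochMillis) {
        return FMT.get().format(new Date(epochMillis));
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<String>> results = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            results.add(pool.submit(() -> format(0L)));
        }
        String expected = format(0L);
        for (Future<String> f : results) {
            if (!f.get().equals(expected)) throw new AssertionError("race detected");
        }
        pool.shutdown();
        System.out.println("all consistent");
    }
}
```

An alternative with the same effect is switching to java.time.format.DateTimeFormatter, which is immutable and therefore safe to share.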
[jira] [Commented] (HIVE-22903) Vectorized row_number() resets the row number after one batch in case of constant expression in partition clause
[ https://issues.apache.org/jira/browse/HIVE-22903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17040666#comment-17040666 ] Shubham Chaurasia commented on HIVE-22903: -- [~rameshkumar] Thanks a lot for the review. {quote} We should probably loop through the groupBatches and skip reseting if it is a row_number and a constant(And probably this might fix https://issues.apache.org/jira/browse/HIVE-22909 too). {quote} Sorry, I could not understand this. Currently, it's like {code:java} if (!isPartitionOrderBy && !skipResetEvaluatorsForRowNumber) { groupBatches.resetEvaluators(); } {code} Does looping through groupBatches (evaluators?) mean something like {code:java} public void resetEvaluators() { for (VectorPTFEvaluatorBase evaluator : evaluators) { if (!isPartitionOrderBy && !skipResetEvaluatorsForRowNumber) { evaluator.resetEvaluator(); } } } {code} I was confused because these flags, isPartitionOrderBy and skipResetEvaluatorsForRowNumber, are common to all the evaluators and would not change for a particular evaluator. > Vectorized row_number() resets the row number after one batch in case of > constant expression in partition clause > > > Key: HIVE-22903 > URL: https://issues.apache.org/jira/browse/HIVE-22903 > Project: Hive > Issue Type: Bug > Components: UDF, Vectorization >Affects Versions: 4.0.0 >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-22903.01.patch, HIVE-22903.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Vectorized row number implementation resets the row number when constant > expression is passed in partition clause. > Repro Query > {code} > select row_number() over(partition by 1) r1, t from over10k_n8; > Or > select row_number() over() r1, t from over10k_n8; > {code} > where table over10k_n8 contains more than 1024 records. 
> This happens because currently in VectorPTFOperator, we reset evaluators if > only partition clause is there. > {code:java} > // If we are only processing a PARTITION BY, reset our evaluators. > if (!isPartitionOrderBy) { > groupBatches.resetEvaluators(); > } > {code} > To resolve, we should also check if the entire partition clause is a constant > expression, if it is so then we should not do > {{groupBatches.resetEvaluators()}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
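The effect debated above can be sketched with a simplified standalone model. These classes are hypothetical stand-ins, not Hive's actual VectorPTFOperator/VectorPTFGroupBatches: when the partition clause is a constant expression every row belongs to one partition, so resetting the evaluator at each batch boundary (the buggy path) restarts the numbering after every 1024-row batch:

```java
import java.util.ArrayList;
import java.util.List;

public class RowNumberReset {
    // Hypothetical minimal evaluator, mirroring the idea of a row_number()
    // evaluator that keeps a running counter.
    static class RowNumberEvaluator {
        private int rowNumber = 0;
        int next() { return ++rowNumber; }
        void reset() { rowNumber = 0; }
    }

    // Returns the last row number emitted in each batch. With
    // resetBetweenBatches=true (the HIVE-22903 bug for constant partition
    // clauses), numbering restarts every batch; with false, it is continuous.
    static List<Integer> lastOfEachBatch(int batches, int batchSize,
                                         boolean resetBetweenBatches) {
        RowNumberEvaluator eval = new RowNumberEvaluator();
        List<Integer> last = new ArrayList<>();
        for (int b = 0; b < batches; b++) {
            int n = 0;
            for (int r = 0; r < batchSize; r++) n = eval.next();
            last.add(n);
            if (resetBetweenBatches) eval.reset(); // wrong when partition is constant
        }
        return last;
    }

    public static void main(String[] args) {
        System.out.println(lastOfEachBatch(3, 1024, true));  // [1024, 1024, 1024]
        System.out.println(lastOfEachBatch(3, 1024, false)); // [1024, 2048, 3072]
    }
}
```

This is why the fix skips the reset when the whole partition clause is constant, rather than looping per evaluator with flags that are identical for all of them.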
[jira] [Updated] (HIVE-22903) Vectorized row_number() resets the row number after one batch in case of constant expression in partition clause
[ https://issues.apache.org/jira/browse/HIVE-22903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22903: - Attachment: HIVE-22903.patch > Vectorized row_number() resets the row number after one batch in case of > constant expression in partition clause > > > Key: HIVE-22903 > URL: https://issues.apache.org/jira/browse/HIVE-22903 > Project: Hive > Issue Type: Bug > Components: UDF, Vectorization >Affects Versions: 4.0.0 >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-22903.01.patch, HIVE-22903.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Vectorized row number implementation resets the row number when constant > expression is passed in partition clause. > Repro Query > {code} > select row_number() over(partition by 1) r1, t from over10k_n8; > Or > select row_number() over() r1, t from over10k_n8; > {code} > where table over10k_n8 contains more than 1024 records. > This happens because currently in VectorPTFOperator, we reset evaluators if > only partition clause is there. > {code:java} > // If we are only processing a PARTITION BY, reset our evaluators. > if (!isPartitionOrderBy) { > groupBatches.resetEvaluators(); > } > {code} > To resolve, we should also check if the entire partition clause is a constant > expression, if it is so then we should not do > {{groupBatches.resetEvaluators()}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-22903) Vectorized row_number() resets the row number after one batch in case of constant expression in partition clause
[ https://issues.apache.org/jira/browse/HIVE-22903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039686#comment-17039686 ] Shubham Chaurasia commented on HIVE-22903: -- Found one more bug with row_number() while testing this one. Keeping it separate: https://issues.apache.org/jira/browse/HIVE-22909 as it's an entirely different thing. > Vectorized row_number() resets the row number after one batch in case of > constant expression in partition clause > > > Key: HIVE-22903 > URL: https://issues.apache.org/jira/browse/HIVE-22903 > Project: Hive > Issue Type: Bug > Components: UDF, Vectorization >Affects Versions: 4.0.0 >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-22903.01.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Vectorized row number implementation resets the row number when constant > expression is passed in partition clause. > Repro Query > {code} > select row_number() over(partition by 1) r1, t from over10k_n8; > Or > select row_number() over() r1, t from over10k_n8; > {code} > where table over10k_n8 contains more than 1024 records. > This happens because currently in VectorPTFOperator, we reset evaluators if > only partition clause is there. > {code:java} > // If we are only processing a PARTITION BY, reset our evaluators. > if (!isPartitionOrderBy) { > groupBatches.resetEvaluators(); > } > {code} > To resolve, we should also check if the entire partition clause is a constant > expression, if it is so then we should not do > {{groupBatches.resetEvaluators()}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-22909) Vectorized row_number() returns incorrect results in case it is called multiple times with different constant expressions
[ https://issues.apache.org/jira/browse/HIVE-22909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22909: - Summary: Vectorized row_number() returns incorrect results in case it is called multiple times with different constant expressions (was: Vectorized row_number() returns incorrect results in case it is called multiple times with different constant expression) > Vectorized row_number() returns incorrect results in case it is called > multiple times with different constant expressions > - > > Key: HIVE-22909 > URL: https://issues.apache.org/jira/browse/HIVE-22909 > Project: Hive > Issue Type: Bug >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > Vectorized row_number() returns incorrect results in case it is called > multiple times in the same query with different constant expressions. > Example > {code} > select row_number() over(partition by 1) r1, row_number() over(partition by > 2) r2, t from over10k_n8 limit 1100; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HIVE-22909) Vectorized row_number() returns incorrect results in case it is called multiple times with different constant expression
[ https://issues.apache.org/jira/browse/HIVE-22909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia reassigned HIVE-22909: > Vectorized row_number() returns incorrect results in case it is called > multiple times with different constant expression > > > Key: HIVE-22909 > URL: https://issues.apache.org/jira/browse/HIVE-22909 > Project: Hive > Issue Type: Bug >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > Vectorized row_number() returns incorrect results in case it is called > multiple times in the same query with different constant expressions. > Example > {code} > select row_number() over(partition by 1) r1, row_number() over(partition by > 2) r2, t from over10k_n8 limit 1100; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-22903) Vectorized row_number() resets the row number after one batch in case of constant expression in partition clause
[ https://issues.apache.org/jira/browse/HIVE-22903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22903: - Attachment: HIVE-22903.01.patch Status: Patch Available (was: Open) > Vectorized row_number() resets the row number after one batch in case of > constant expression in partition clause > > > Key: HIVE-22903 > URL: https://issues.apache.org/jira/browse/HIVE-22903 > Project: Hive > Issue Type: Bug > Components: UDF, Vectorization >Affects Versions: 4.0.0 >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > Labels: pull-request-available > Attachments: HIVE-22903.01.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Vectorized row number implementation resets the row number when constant > expression is passed in partition clause. > Repro Query > {code} > select row_number() over(partition by 1) r1, t from over10k_n8; > Or > select row_number() over() r1, t from over10k_n8; > {code} > where table over10k_n8 contains more than 1024 records. > This happens because currently in VectorPTFOperator, we reset evaluators if > only partition clause is there. > {code:java} > // If we are only processing a PARTITION BY, reset our evaluators. > if (!isPartitionOrderBy) { > groupBatches.resetEvaluators(); > } > {code} > To resolve, we should also check if the entire partition clause is a constant > expression, if it is so then we should not do > {{groupBatches.resetEvaluators()}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HIVE-22903) Vectorized row_number() resets the row number after one batch in case of constant expression in partition clause
[ https://issues.apache.org/jira/browse/HIVE-22903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia reassigned HIVE-22903: > Vectorized row_number() resets the row number after one batch in case of > constant expression in partition clause > > > Key: HIVE-22903 > URL: https://issues.apache.org/jira/browse/HIVE-22903 > Project: Hive > Issue Type: Bug > Components: UDF, Vectorization >Affects Versions: 4.0.0 >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > Vectorized row number implementation resets the row number when constant > expression is passed in partition clause. > Repro Query > {code} > select row_number() over(partition by 1) r1, t from over10k_n8; > Or > select row_number() over() r1, t from over10k_n8; > {code} > where table over10k_n8 contains more than 1024 records. > This happens because currently in VectorPTFOperator, we reset evaluators if > only partition clause is there. > {code:java} > // If we are only processing a PARTITION BY, reset our evaluators. > if (!isPartitionOrderBy) { > groupBatches.resetEvaluators(); > } > {code} > To resolve, we should also check if the entire partition clause is a constant > expression, if it is so then we should not do > {{groupBatches.resetEvaluators()}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-20312) Allow arrow clients to use their own BufferAllocator with LlapOutputFormatService
[ https://issues.apache.org/jira/browse/HIVE-20312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034096#comment-17034096 ] Shubham Chaurasia commented on HIVE-20312: -- Thanks [~jdere] > Allow arrow clients to use their own BufferAllocator with > LlapOutputFormatService > - > > Key: HIVE-20312 > URL: https://issues.apache.org/jira/browse/HIVE-20312 > Project: Hive > Issue Type: Improvement >Reporter: Eric Wohlstadter >Assignee: Eric Wohlstadter >Priority: Major > Fix For: 4.0.0 > > Attachments: HIVE-20312.1.patch, HIVE-20312.2.patch, > HIVE-20312.3.patch > > > Clients should be able to provide their own BufferAllocator to > LlapBaseInputFormat if allocator operations depend on client-side logic. For > example, clients may want to manage the allocator hierarchy per client-side > task, thread, etc.. > Currently the client is forced to use one global RootAllocator per process. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-20312) Allow arrow clients to use their own BufferAllocator with LlapOutputFormatService
[ https://issues.apache.org/jira/browse/HIVE-20312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033757#comment-17033757 ] Shubham Chaurasia commented on HIVE-20312: -- [~jdere] [~maheshk114] Looks like this was not merged. Can you please have a look and merge ? cc [~anishek] [~thejas] > Allow arrow clients to use their own BufferAllocator with > LlapOutputFormatService > - > > Key: HIVE-20312 > URL: https://issues.apache.org/jira/browse/HIVE-20312 > Project: Hive > Issue Type: Improvement >Reporter: Eric Wohlstadter >Assignee: Eric Wohlstadter >Priority: Major > Attachments: HIVE-20312.1.patch, HIVE-20312.2.patch, > HIVE-20312.3.patch > > > Clients should be able to provide their own BufferAllocator to > LlapBaseInputFormat if allocator operations depend on client-side logic. For > example, clients may want to manage the allocator hierarchy per client-side > task, thread, etc.. > Currently the client is forced to use one global RootAllocator per process. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-22840) Race condition in formatters of TimestampColumnVector and DateColumnVector
[ https://issues.apache.org/jira/browse/HIVE-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22840: - Attachment: HIVE-22840.2.patch > Race condition in formatters of TimestampColumnVector and DateColumnVector > --- > > Key: HIVE-22840 > URL: https://issues.apache.org/jira/browse/HIVE-22840 > Project: Hive > Issue Type: Bug > Components: storage-api >Reporter: László Bodor >Assignee: Shubham Chaurasia >Priority: Major > Attachments: HIVE-22840.1.patch, HIVE-22840.2.patch > > > HIVE-22405 added support for proleptic calendar. It uses java's > SimpleDateFormat/Calendar APIs which are not thread-safe and cause race in > some scenarios. > As a result of those race conditions, we see some exceptions like > {code:java} > 1) java.lang.NumberFormatException: For input string: "" > OR > java.lang.NumberFormatException: For input string: ".821582E.821582E44" > OR > 2) Caused by: java.lang.ArrayIndexOutOfBoundsException: -5325980 > at > sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate(BaseCalendar.java:453) > at > java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2397) > {code} > This issue is to address those thread-safety issues/race conditions. > cc [~jcamachorodriguez] [~abstractdog] [~omalley] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HIVE-22840) Race condition in formatters of TimestampColumnVector and DateColumnVector
[ https://issues.apache.org/jira/browse/HIVE-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032114#comment-17032114 ] Shubham Chaurasia edited comment on HIVE-22840 at 2/7/20 6:16 AM: -- HIVE-22840.1.patch / HIVE-22840.2.patch depend on CalendarUtils class introduced in HIVE-22589. For now I have just added it. I will rebase the patch once HIVE-22589 is merged. cc [~jcamachorodriguez] was (Author: shubhamchaurasia): HIVE-22840.1.patch depends on CalendarUtils class introduced in HIVE-22589. For now I have just added it. I will rebase the patch once HIVE-22589 is merged. cc [~jcamachorodriguez] > Race condition in formatters of TimestampColumnVector and DateColumnVector > --- > > Key: HIVE-22840 > URL: https://issues.apache.org/jira/browse/HIVE-22840 > Project: Hive > Issue Type: Bug > Components: storage-api >Reporter: László Bodor >Assignee: Shubham Chaurasia >Priority: Major > Attachments: HIVE-22840.1.patch, HIVE-22840.2.patch > > > HIVE-22405 added support for proleptic calendar. It uses java's > SimpleDateFormat/Calendar APIs which are not thread-safe and cause race in > some scenarios. > As a result of those race conditions, we see some exceptions like > {code:java} > 1) java.lang.NumberFormatException: For input string: "" > OR > java.lang.NumberFormatException: For input string: ".821582E.821582E44" > OR > 2) Caused by: java.lang.ArrayIndexOutOfBoundsException: -5325980 > at > sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate(BaseCalendar.java:453) > at > java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2397) > {code} > This issue is to address those thread-safety issues/race conditions. > cc [~jcamachorodriguez] [~abstractdog] [~omalley] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-22840) Race condition in formatters of TimestampColumnVector and DateColumnVector
[ https://issues.apache.org/jira/browse/HIVE-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22840: - Description: HIVE-22405 added support for proleptic calendar. It uses java's SimpleDateFormat/Calendar APIs which are not thread-safe and cause race in some scenarios. As a result of those race conditions, we see some exceptions like {code:java} 1) java.lang.NumberFormatException: For input string: "" OR java.lang.NumberFormatException: For input string: ".821582E.821582E44" OR 2) Caused by: java.lang.ArrayIndexOutOfBoundsException: -5325980 at sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate(BaseCalendar.java:453) at java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2397) {code} This issue is to address those thread-safety issues/race conditions. cc [~jcamachorodriguez] [~abstractdog] [~omalley] was: HIVE-22405 added support for proleptic calendar. It uses java's SimpleDateFormat/Calendar APIs which are not thread-safe and cause race in some scenarios. As a result of those race conditions, we see some exceptions like {code:java} 1) java.lang.NumberFormatException: For input string: "" OR java.lang.NumberFormatException: For input string: ".821582E.821582E44" OR 2) Caused by: java.lang.ArrayIndexOutOfBoundsException: -5325980 at sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate(BaseCalendar.java:453) at java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2397) {code} This issue is to address those thread-safety issues/race conditions. cc [~jcamachorodriguez] [~abstractdog] [~omalley] > Race condition in formatters of TimestampColumnVector and DateColumnVector > --- > > Key: HIVE-22840 > URL: https://issues.apache.org/jira/browse/HIVE-22840 > Project: Hive > Issue Type: Bug > Components: storage-api >Reporter: László Bodor >Assignee: Shubham Chaurasia >Priority: Major > Attachments: HIVE-22840.1.patch > > > HIVE-22405 added support for proleptic calendar. 
It uses java's > SimpleDateFormat/Calendar APIs which are not thread-safe and cause race in > some scenarios. > As a result of those race conditions, we see some exceptions like > {code:java} > 1) java.lang.NumberFormatException: For input string: "" > OR > java.lang.NumberFormatException: For input string: ".821582E.821582E44" > OR > 2) Caused by: java.lang.ArrayIndexOutOfBoundsException: -5325980 > at > sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate(BaseCalendar.java:453) > at > java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2397) > {code} > This issue is to address those thread-safety issues/race conditions. > cc [~jcamachorodriguez] [~abstractdog] [~omalley] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-22840) Race condition in formatters of TimestampColumnVector and DateColumnVector
[ https://issues.apache.org/jira/browse/HIVE-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22840: - Description: HIVE-22405 added support for proleptic calendar. It uses java's SimpleDateFormat/Calendar APIs which are not thread-safe and cause race in some scenarios. As a result of those race conditions, we see some exceptions like {code:java} 1) java.lang.NumberFormatException: For input string: "" OR java.lang.NumberFormatException: For input string: ".821582E.821582E44" OR 2) Caused by: java.lang.ArrayIndexOutOfBoundsException: -5325980 at sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate(BaseCalendar.java:453) at java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2397) {code} This issue is to address those thread-safety issues/race conditions. cc [~jcamachorodriguez] [~abstractdog] [~omalley] was: HIVE-22405 added support for proleptic calendar. It uses java's SimpleDateFormat/Calendar APIs which are not thread-safe and cause race in some scenarios. This issue is to address those thread-safety issues/race conditions. > Race condition in formatters of TimestampColumnVector and DateColumnVector > --- > > Key: HIVE-22840 > URL: https://issues.apache.org/jira/browse/HIVE-22840 > Project: Hive > Issue Type: Bug > Components: storage-api >Reporter: László Bodor >Assignee: Shubham Chaurasia >Priority: Major > Attachments: HIVE-22840.1.patch > > > HIVE-22405 added support for proleptic calendar. It uses java's > SimpleDateFormat/Calendar APIs which are not thread-safe and cause race in > some scenarios. 
> As a result of those race conditions, we see some exceptions like > {code:java} > 1) java.lang.NumberFormatException: For input string: "" OR > java.lang.NumberFormatException: For input string: ".821582E.821582E44" > OR > 2) Caused by: java.lang.ArrayIndexOutOfBoundsException: -5325980 > at > sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate(BaseCalendar.java:453) > at > java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2397) > {code} > This issue is to address those thread-safety issues/race conditions. > cc [~jcamachorodriguez] [~abstractdog] [~omalley] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-22840) Race condition in formatters of TimestampColumnVector and DateColumnVector
[ https://issues.apache.org/jira/browse/HIVE-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22840: - Description: HIVE-22405 added support for proleptic calendar. It uses java's SimpleDateFormat/Calendar APIs which are not thread-safe and cause race in some scenarios. This issue is to address those thread-safety issues/race conditions. > Race condition in formatters of TimestampColumnVector and DateColumnVector > --- > > Key: HIVE-22840 > URL: https://issues.apache.org/jira/browse/HIVE-22840 > Project: Hive > Issue Type: Bug > Components: storage-api >Reporter: László Bodor >Assignee: Shubham Chaurasia >Priority: Major > Attachments: HIVE-22840.1.patch > > > HIVE-22405 added support for proleptic calendar. It uses java's > SimpleDateFormat/Calendar APIs which are not thread-safe and cause race in > some scenarios. > This issue is to address those thread-safety issues/race conditions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-22840) Race condition in formatters of TimestampColumnVector and DateColumnVector
[ https://issues.apache.org/jira/browse/HIVE-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22840: - Component/s: storage-api > Race condition in formatters of TimestampColumnVector and DateColumnVector > --- > > Key: HIVE-22840 > URL: https://issues.apache.org/jira/browse/HIVE-22840 > Project: Hive > Issue Type: Bug > Components: storage-api >Reporter: László Bodor >Assignee: Shubham Chaurasia >Priority: Major > Attachments: HIVE-22840.1.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-22840) Race condition in formatters of TimestampColumnVector and DateColumnVector
[ https://issues.apache.org/jira/browse/HIVE-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032114#comment-17032114 ] Shubham Chaurasia commented on HIVE-22840: -- HIVE-22840.1.patch depends on CalendarUtils class introduced in HIVE-22589. For now I have just added it. I will rebase the patch once HIVE-22589 is merged. cc [~jcamachorodriguez] > Race condition in formatters of TimestampColumnVector and DateColumnVector > --- > > Key: HIVE-22840 > URL: https://issues.apache.org/jira/browse/HIVE-22840 > Project: Hive > Issue Type: Bug >Reporter: László Bodor >Assignee: Shubham Chaurasia >Priority: Major > Attachments: HIVE-22840.1.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-22840) Race condition in formatters of TimestampColumnVector and DateColumnVector
[ https://issues.apache.org/jira/browse/HIVE-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-22840: - Attachment: HIVE-22840.1.patch Status: Patch Available (was: Open) > Race condition in formatters of TimestampColumnVector and DateColumnVector > --- > > Key: HIVE-22840 > URL: https://issues.apache.org/jira/browse/HIVE-22840 > Project: Hive > Issue Type: Bug >Reporter: László Bodor >Assignee: Shubham Chaurasia >Priority: Major > Attachments: HIVE-22840.1.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HIVE-22840) Race condition in formatters of TimestampColumnVector and DateColumnVector
[ https://issues.apache.org/jira/browse/HIVE-22840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia reassigned HIVE-22840: Assignee: Shubham Chaurasia > Race condition in formatters of TimestampColumnVector and DateColumnVector > --- > > Key: HIVE-22840 > URL: https://issues.apache.org/jira/browse/HIVE-22840 > Project: Hive > Issue Type: Bug >Reporter: László Bodor >Assignee: Shubham Chaurasia >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-21641) Llap external client returns decimal columns in different precision/scale as compared to beeline
[ https://issues.apache.org/jira/browse/HIVE-21641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16982168#comment-16982168 ] Shubham Chaurasia commented on HIVE-21641: -- Thanks [~kgyrtkirk] [~jcamachorodriguez] I have opened https://issues.apache.org/jira/browse/HIVE-22541 > Llap external client returns decimal columns in different precision/scale as > compared to beeline > > > Key: HIVE-21641 > URL: https://issues.apache.org/jira/browse/HIVE-21641 > Project: Hive > Issue Type: Bug > Components: llap >Affects Versions: 3.1.1 >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > Labels: Branch3Candidate, pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21641.1.patch, HIVE-21641.2.patch, > HIVE-21641.3.patch, HIVE-21641.4.patch, HIVE-21641.5.branch-3.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Llap external client gives different precision/scale as compared to when the > query is executed beeline. Consider the following results: > Query: > {code} > select avg(ss_ext_sales_price) my_avg from store_sales; > {code} > Result from Beeline > {code} > ++ > | my_avg | > ++ > | 37.8923531030581611189434 | > ++ > {code} > Result from Llap external client > {code} > +-+ > | my_avg| > +-+ > |37.892353| > +-+ > {code} > > This is due to Driver(beeline path) calls > [analyzeInternal()|https://github.com/apache/hive/blob/rel/release-3.1.1/ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java#L328] > for getting result set schema which initializes > [resultSchema|https://github.com/apache/hive/blob/rel/release-3.1.1/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L333] > after some more transformations as compared to llap-ext-client which calls > [genLogicalPlan()|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseUtils.java#L561] > Replacing {{genLogicalPlan()}} by {{analyze()}} resolves this. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
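The HIVE-21641 symptom above is a schema problem, not a computation problem: the llap-ext-client path obtained a result-set schema with a narrower decimal scale than the fully analyzed plan reports, so the client cut the value down. A standalone BigDecimal sketch of that effect (illustrative only; Hive uses its own HiveDecimal, and the scale-6 figure and truncating rounding mode here are assumptions chosen to reproduce the observed output):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class DecimalScaleDemo {
    public static void main(String[] args) {
        // The "true" average, at the precision/scale beeline reported.
        BigDecimal avg = new BigDecimal("37.8923531030581611189434");

        // If the schema handed to the client declares scale 6, the client
        // narrows the value, matching what the llap external client printed.
        BigDecimal narrowed = avg.setScale(6, RoundingMode.DOWN);
        System.out.println(narrowed); // 37.892353

        // With the scale from the fully analyzed plan, nothing is lost.
        System.out.println(avg.setScale(22, RoundingMode.DOWN));
    }
}
```

Hence the fix of driving schema derivation through analyze() (and, per HIVE-22541, eventually making genLogicalPlan() report the correct precision/scale itself).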
[jira] [Assigned] (HIVE-22541) Inconsistent decimal precision/scale of resultset schema in analyzer.genLogicalPlan() as compared to analyzer.analyze()
[ https://issues.apache.org/jira/browse/HIVE-22541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia reassigned HIVE-22541: > Inconsistent decimal precision/scale of resultset schema in > analyzer.genLogicalPlan() as compared to analyzer.analyze() > --- > > Key: HIVE-22541 > URL: https://issues.apache.org/jira/browse/HIVE-22541 > Project: Hive > Issue Type: Bug > Components: Query Planning >Affects Versions: 4.0.0 >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > > https://issues.apache.org/jira/browse/HIVE-21641 handles decimal > scale/precision inconsistencies when we query using llap external client. > [HIVE-21641 > changes|https://issues.apache.org/jira/secure/attachment/12968006/HIVE-21641.4.patch] > {{analyzer.genLogicalPlan(ast)}} to {{analyzer.analyze(ast, ctx)}} to handle > this. However we should fix {{analyzer.genLogicalPlan(ast)}} to return > correct decimal precision/scale. > Please see > [this|https://issues.apache.org/jira/browse/HIVE-21641?focusedCommentId=16981513&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16981513] > and > [this|https://issues.apache.org/jira/browse/HIVE-21641?focusedCommentId=16982053&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16982053] > comment for more. > cc [~jcamachorodriguez] [~kgyrtkirk] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-21641) Llap external client returns decimal columns in different precision/scale as compared to beeline
[ https://issues.apache.org/jira/browse/HIVE-21641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chaurasia updated HIVE-21641: - Resolution: Fixed Status: Resolved (was: Patch Available) > Llap external client returns decimal columns in different precision/scale as > compared to beeline > > > Key: HIVE-21641 > URL: https://issues.apache.org/jira/browse/HIVE-21641 > Project: Hive > Issue Type: Bug > Components: llap >Affects Versions: 3.1.1 >Reporter: Shubham Chaurasia >Assignee: Shubham Chaurasia >Priority: Major > Labels: Branch3Candidate, pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21641.1.patch, HIVE-21641.2.patch, > HIVE-21641.3.patch, HIVE-21641.4.patch, HIVE-21641.5.branch-3.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Llap external client gives different precision/scale as compared to when the > query is executed beeline. Consider the following results: > Query: > {code} > select avg(ss_ext_sales_price) my_avg from store_sales; > {code} > Result from Beeline > {code} > ++ > | my_avg | > ++ > | 37.8923531030581611189434 | > ++ > {code} > Result from Llap external client > {code} > +-+ > | my_avg| > +-+ > |37.892353| > +-+ > {code} > > This is due to Driver(beeline path) calls > [analyzeInternal()|https://github.com/apache/hive/blob/rel/release-3.1.1/ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java#L328] > for getting result set schema which initializes > [resultSchema|https://github.com/apache/hive/blob/rel/release-3.1.1/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L333] > after some more transformations as compared to llap-ext-client which calls > [genLogicalPlan()|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseUtils.java#L561] > Replacing {{genLogicalPlan()}} by {{analyze()}} resolves this. -- This message was sent by Atlassian Jira (v8.3.4#803005)