[jira] [Comment Edited] (HIVE-11652) Avoid expensive call to removeAll in DefaultGraphWalker

2016-12-21 Thread Jesus Camacho Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15766626#comment-15766626
 ] 

Jesus Camacho Rodriguez edited comment on HIVE-11652 at 12/21/16 9:58 AM:
--

[~dhiraj.kumar], thanks for the feedback.

FWIW, when we worked on this patch, we focused on the _removeAll_ call because 
it was causing slowness for many queries. We run tests on multiple workloads 
w/o the patch, and we did not observe performance regression wrt previous 
version. However, it is indeed possible that our queries were not covering all 
possible cases.

The performance of the algorithm could be improved by storing the position of 
the last added child to the stack. However, this will impact the memory 
footprint too, as we will need to store an integer per node in the stack. I 
think that is acceptable, but we just should be aware that there is a trade-off 
here.

I would suggest you open a new JIRA case pointing to this one, describe the 
problem, and provide some query where this issue is noticeable so we can 
reproduce the issue in a different environment. If you have a clear idea on the 
solution, you could contribute a patch. Otherwise, I will try to look into the 
issue asap.

--

Btw, wrt 3., have you set _hive.driver.parallel.compilation_ to _true_? That 
should at least prevent the locking issue.


was (Author: jcamachorodriguez):
[~dhiraj.kumar], thanks for the feedback.

FWIW, when we worked on this patch, we focused on the _removeAll_ call because 
it was causing slowness for many queries. We run tests on multiple workloads 
w/o the patch, and we did not observe performance regression wrt previous 
version. However, it is indeed possible that our queries were not covering all 
possible cases.

The performance of the algorithm could be improved by storing the position of 
the last added child to the stack. However, this will impact the memory 
footprint too, as we will need to store an integer per AST node in the stack. I 
think that is acceptable, but we just should be aware that there is a trade-off 
here.

I would suggest you open a new JIRA case pointing to this one, describe the 
problem, and provide some query where this issue is noticeable so we can 
reproduce the issue in a different environment. If you have a clear idea on the 
solution, you could contribute a patch. Otherwise, I will try to look into the 
issue asap.

--

Btw, wrt 3., have you set _hive.driver.parallel.compilation_ to _true_? That 
should at least prevent the locking issue.

> Avoid expensive call to removeAll in DefaultGraphWalker
> ---
>
> Key: HIVE-11652
> URL: https://issues.apache.org/jira/browse/HIVE-11652
> Project: Hive
>  Issue Type: Bug
>  Components: Logical Optimizer, Physical Optimizer
>Affects Versions: 1.3.0, 2.0.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
> Fix For: 1.3.0, 2.0.0
>
> Attachments: HIVE-11652.01.patch, HIVE-11652.02.patch, 
> HIVE-11652.patch
>
>
> When the plan is too large, the removeAll call in DefaultGraphWalker (line 
> 140) will take very long as it will have to go through the list looking for 
> each of the nodes. We try to get rid of this call by rewriting the logic in 
> the walker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HIVE-11652) Avoid expensive call to removeAll in DefaultGraphWalker

2016-12-21 Thread Dhiraj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15767191#comment-15767191
 ] 

Dhiraj Kumar edited comment on HIVE-11652 at 12/21/16 2:33 PM:
---

[~jcamachorodriguez] As suggested, I have create another bug at 
[HIVE-15486|https://issues.apache.org/jira/browse/HIVE-15486] 

I have attached a larger version of the same query that I had put here in the 
trail. It has 50K elements inside IN clause. Although it seems a convoluted 
query, we do have similar query running on our production system. 

We did not face problem until the upgrade to 2.1 from an older version. 
Although, this codepath is not the only problem for this query. But it accounts 
for 50% of the time consumed at hive processing. 

I did not get your suggestion of keeping last added child in the stack. Since 
the node peeked from stack would keep changing,  the list of child would change 
too for that node. Keeping the last added won't be able to help since you have 
lost list of children once you peeked another node from stack. I believe you 
need to keep all the children of node as well. 

In fact that finding position itself is not a problem, but 
ASTNode.getChildren() invocation is problem.  

I have to pick something in ASTNode.java as well,  why to add single child at 
at time to ret_vec. Why not return all the children() in one shot? 

{code}
public ArrayList getChildren() {
if (super.getChildCount() == 0) {
  return null;
}

ArrayList ret_vec = new ArrayList();
for (int i = 0; i < super.getChildCount(); ++i) {
  ret_vec.add((Node) super.getChild(i));
}

return ret_vec;
  }
{code}

I will add some profiler output to bug as well. 


was (Author: dhiraj.kumar):
[~jcamachorodriguez] As suggested, I have create another bug at 
[https://issues.apache.org/jira/browse/HIVE-15486] 

I have attached a larger version of the same query that I had put here in the 
trail. It has 50K elements inside IN clause. Although it seems a convoluted 
query, we do have similar query running on our production system. 

We did not face problem until the upgrade to 2.1 from an older version. 
Although, this codepath is not the only problem for this query. But it accounts 
for 50% of the time consumed at hive processing. 

I did not get your suggestion of keeping last added child in the stack. Since 
the node peeked from stack would keep changing,  the list of child would change 
too for that node. Keeping the last added won't be able to help since you have 
lost list of children once you peeked another node from stack. I believe you 
need to keep all the children of node as well. 

In fact that finding position itself is not a problem, but 
ASTNode.getChildren() invocation is problem.  

I have to pick something in ASTNode.java as well,  why to add single child at 
at time to ret_vec. Why not return all the children() in one shot? 

{code}
public ArrayList getChildren() {
if (super.getChildCount() == 0) {
  return null;
}

ArrayList ret_vec = new ArrayList();
for (int i = 0; i < super.getChildCount(); ++i) {
  ret_vec.add((Node) super.getChild(i));
}

return ret_vec;
  }
{code}

I will add some profiler output to bug as well. 

> Avoid expensive call to removeAll in DefaultGraphWalker
> ---
>
> Key: HIVE-11652
> URL: https://issues.apache.org/jira/browse/HIVE-11652
> Project: Hive
>  Issue Type: Bug
>  Components: Logical Optimizer, Physical Optimizer
>Affects Versions: 1.3.0, 2.0.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
> Fix For: 1.3.0, 2.0.0
>
> Attachments: HIVE-11652.01.patch, HIVE-11652.02.patch, 
> HIVE-11652.patch
>
>
> When the plan is too large, the removeAll call in DefaultGraphWalker (line 
> 140) will take very long as it will have to go through the list looking for 
> each of the nodes. We try to get rid of this call by rewriting the logic in 
> the walker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)