[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-01 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772404#action_12772404
 ] 

Ashutosh Chauhan commented on PIG-1038:
---

I think it's a useful optimization. I presume this will be implemented as a 
visitor in MapReduceLauncher that visits the compiled MR plan. The design looks 
good. I have a few questions:

bq. 1.1 Discover if we use sort/distinct in nested foreach plan.
How are you planning to discover this? By matching some pattern, such as LR in 
the map plan followed by POPackage, POForeach, POSort in the reduce plan?

This is somewhat orthogonal but related to this issue. We have a rule-based 
optimizer framework in the front-end; it seems to me that a similar framework 
is needed in the backend too, both to refactor all the optimizer visitors we 
currently have and to make it easy to add similar optimizations in the future. 
There are seven optimizations in the front-end expressed through rules. In the 
backend, by contrast, we will have nine optimization visitors once this one is 
added. Maybe we can think about this to avoid a lot of rework every time such 
an optimization is added.

> Optimize nested distinct/sort to use secondary key
> --
>
> Key: PIG-1038
> URL: https://issues.apache.org/jira/browse/PIG-1038
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Affects Versions: 0.4.0
> Reporter: Olga Natkovich
> Assignee: Daniel Dai
> Fix For: 0.6.0
>
>
> If a nested foreach plan contains sort/distinct, it is possible to use Hadoop 
> secondary sort instead of SortedDataBag and DistinctDataBag to optimize the 
> query. 
> Eg1:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = order A by $1;
>     generate group, D;
> }
> store C into 'myresult';
> We can specify a secondary sort on A.$1, and drop "order A by $1".
> Eg2:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = A.$1;
>     E = distinct D;
>     generate group, E;
> }
> store C into 'myresult';
> We can specify a secondary sort key on A.$1, and simplify "D=A.$1; E=distinct 
> D" to a special version of distinct, which does not do the sorting.
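
The idea in the two examples above can be illustrated outside of Pig. The following is only a minimal Python sketch, not Pig or Hadoop code: the function names and sample data are made up for illustration, and an ordinary in-process sort stands in for the shuffle. It assumes the shuffle delivers each group's rows already ordered by the secondary key, so the nested order has nothing left to do and the nested distinct only has to drop adjacent duplicates.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_with_secondary_sort(records, group_col=0, sort_col=1):
    """Stand-in for the shuffle: order by (group key, secondary key),
    then hand each group's rows to the 'reducer' in that order."""
    ordered = sorted(records, key=itemgetter(group_col, sort_col))
    for key, rows in groupby(ordered, key=itemgetter(group_col)):
        yield key, list(rows)


def nested_order(records):
    # Eg1: rows arrive already ordered by $1, so no per-group sort is needed.
    return [(key, rows) for key, rows in shuffle_with_secondary_sort(records)]


def nested_distinct(records):
    # Eg2: project $1 and drop adjacent duplicates; because the input is
    # sorted on $1, equal values are always next to each other.
    result = []
    for key, rows in shuffle_with_secondary_sort(records):
        distinct = []
        for row in rows:
            if not distinct or distinct[-1] != row[1]:
                distinct.append(row[1])
        result.append((key, distinct))
    return result


# Hypothetical sample data: ($0, $1) pairs.
data = [("a", 3), ("b", 1), ("a", 1), ("a", 3), ("b", 2)]
print(nested_order(data))
print(nested_distinct(data))
```

In this sketch the distinct step keeps only the last value it emitted, which is why a "special version of distinct" can skip both the sort and the in-memory bag of seen values.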

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1037) better memory layout and spill for sorted and distinct bags

2009-11-01 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772410#action_12772410
 ] 

Ashutosh Chauhan commented on PIG-1037:
---

I am a bit late on this, but I would appreciate it if someone could provide a 
brief description of how this patch improves the memory layout and alleviates 
the spill problem. I took a quick look at the patch. 
As I understand it, previously, when memory was about to be exhausted, Pig 
would start writing to disk one tuple at a time. With this new patch, once the 
memory limit is hit the whole bag is spilled to disk, at which point the 
in-memory bag contains no tuples. If the in-memory bag fills again, all of its 
contents are spilled to disk in their entirety again, and so on. So this patch 
ensures that we are not spilling one tuple at a time, but a full bag at a time. 
Is this correct, or am I missing something?
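
To make the reading above concrete, here is a minimal Python sketch of the "spill the whole bag at once" behaviour as described in this comment. It is not taken from the patch: the class name SpillableBag, the tuple-count limit, and the use of pickle and temp files are all assumptions for illustration, and it ignores the sorting and de-duplication that sorted and distinct bags would also need.

```python
import os
import pickle
import tempfile


class SpillableBag:
    """Toy bag (hypothetical) that spills its entire in-memory contents at once."""

    def __init__(self, max_in_memory):
        # Assumption: the memory limit is modelled as a simple tuple count.
        self.max_in_memory = max_in_memory
        self.in_memory = []
        self.spill_files = []

    def add(self, tup):
        self.in_memory.append(tup)
        if len(self.in_memory) >= self.max_in_memory:
            self.spill()

    def spill(self):
        # Write every in-memory tuple to one new file, then start empty again.
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(self.in_memory, f)
        self.spill_files.append(path)
        self.in_memory = []

    def __iter__(self):
        # Read spilled chunks back in spill order, then the in-memory tail.
        for path in self.spill_files:
            with open(path, "rb") as f:
                yield from pickle.load(f)
        yield from self.in_memory

    def close(self):
        for path in self.spill_files:
            os.remove(path)
        self.spill_files = []


bag = SpillableBag(max_in_memory=3)
for i in range(7):
    bag.add((i,))
```

With a limit of three tuples, the seven adds above produce two spill files and leave one tuple in memory, which matches the "full bag at a time" description.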

> better memory layout and spill for sorted and distinct bags
> ---
>
> Key: PIG-1037
> URL: https://issues.apache.org/jira/browse/PIG-1037
> Project: Pig
> Issue Type: Improvement
> Reporter: Olga Natkovich
> Assignee: Ying He
> Fix For: 0.6.0
>
> Attachments: PIG-1037.patch, PIG-1037.patch2, PIG-1037.patch3
>
>

