[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776729#action_12776729
 ] 

Pradeep Kamath commented on PIG-1038:
-------------------------------------

In JobControlCompiler:
======================
{code}
583                     jobConf.set("pig.secondarySortOrder",
584                             
ObjectSerializer.serialize(mro.getSecondarySortOrder()));
585                 }
{code}
Looks like above code should be set in the case of non order by Mro which uses
secondary key

{code}
638                 valuea = ((Tuple)wa.getValueAsPigType()).get(0);
{code}
We should put a comment explaining that we extract the first field out since 
that represents
the actual group by key.

In SecondaryKeyOptimizer:
========================
{code}
            } else if (mapLeaf instanceof POUnion || mapLeaf instanceof 
POSplit) {
{code}

The above should not contain POSplit since POSplit would only occur after multi 
query optimization
which happens later.

{code}
94             } else if (plan.getRoots().size() != 1) {
 95                 // POLocalRearrange plan contains more than 1 root.
 96                 // Probably there is an Expression operator in the local
 97                 // rearrangement plan,
 98                 // skip secondary key optimizing
 99                 return null;
{code}
Should we do continue nextPlan instead of return null here since this is 
similar to udf or constant in 
local rearrange case

{code}
105.                columnChainInfo.insert(false, columns, DataType.TUPLE);
{code}
It would useful to put a comment explaining this is put into the 
ColumnChainInfo only for comparing that
different components of SortKeyInfo are all coming from the same input index. 
Also should the datatype be
BAG?

{code}
118                        log.debug(node + " have more than 1 predecessor");
{code}
predecessor should change to successor.

{code}
217             if (currentNode instanceof POPackage
218                     || currentNode instanceof POFilter
219                     || currentNode instanceof POLimit) {
{code}
In line 217 we should ensure, we don't optimize when we encounter POJoinPackage 
using something like
{code}
if ((currentNode instanceof POPackage && !(currentNode instanceof 
POJoinPackage))
{code}

{code}
307.             int errorCode = 1000;
327                     int errorCode = 1000;
526.             int errorCode = 1000;
{code}
This error code is already in use 

{code}
336             } else if (mapLeaf instanceof POUnion || mapLeaf instanceof 
POSplit) {
337                 List<PhysicalOperator> preds = mr.mapPlan
338                         .getPredecessors(mapLeaf);
339                 for (PhysicalOperator pred : preds) {
340                     POLocalRearrange rearrange = (POLocalRearrange) pred;
341                     rearrange.setUseSecondaryKey(true);
342                     if (rearrange.getIndex() == indexOfRearrangeToChange) 
// Try
343                                                                           
// to
344                                                                           
// find
345                                                                           
// the
346                                                                           
// POLocalRearrange
347                                                                           
// for
348                                                                           
// the
349                                                                           
// secondary
350                                                                           
// key
351                         setSecondaryPlan(mr.mapPlan, rearrange,
352                                 secondarySortKeyInfo);
353                 }   
354             }
{code}
The above should not contain POSplit since POSplit would only occur after multi 
query optimization
which happens later.

Also in the if statement on line 342, what if the condition evaluates to false 
- should we throw an Exception like earlier in the same
method?

{code}
530             if (r)
531                 sawInvalidPhysicalOper = true;
..
557                 if (r) // if we saw physical operator other than project in 
sort
558                        // plan
559                     return;
{code}
At line 559 should we be setting sawInvalidPhysicalOper?

General comments:
=================
A comment on ColumnChainInfo and SortKeyInfo explaining how it tracks to 
POProjects in the plan would be useful

POMultiQueryPackage should not change since SecondaryKeyOptimizer runs before
MultiQueryOptimizer.




> Optimize nested distinct/sort to use secondary key
> --------------------------------------------------
>
>                 Key: PIG-1038
>                 URL: https://issues.apache.org/jira/browse/PIG-1038
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.4.0
>            Reporter: Olga Natkovich
>            Assignee: Daniel Dai
>             Fix For: 0.6.0
>
>         Attachments: PIG-1038-1.patch, PIG-1038-2.patch, PIG-1038-3.patch
>
>
> If nested foreach plan contains sort/distinct, it is possible to use hadoop 
> secondary sort instead of SortedDataBag and DistinctDataBag to optimize the 
> query. 
> Eg1:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = order A by $1;
>     generate group, D;
> }
> store C into 'myresult';
> We can specify a secondary sort on A.$1, and drop "order A by $1".
> Eg2:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = A.$1;
>     E = distinct D;
>     generate group, E;
> }
> store C into 'myresult';
> We can specify a secondary sort key on A.$1, and simplify "D=A.$1; E=distinct 
> D" to a special version of distinct, which does not do the sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to