[jira] Commented: (PIG-1482) Pig gets confused when more than one loader is involved

Thejas M Nair (JIRA) Fri, 27 Aug 2010 16:36:17 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903714#action_12903714
 ]


Thejas M Nair commented on PIG-1482:
------------------------------------

Patch review comments -
- Schema.java
{code}
        public FieldSchema(String a, Schema s, byte t)  throws 
FrontendException {
            alias = a;
            schema = s;
            log.debug("t: " + t + " Bag: " + DataType.BAG + " tuple: " + 
DataType.TUPLE);
            
            /*
             * The following check is removed because it may not be always 
true. As a matter of 
             * fact, the condition can be produced using other constructors 
anyway.
             *
            if ((null != s) && !(DataType.isSchemaType(t))) {
                int errCode = 1020;
                throw new FrontendException("Only a BAG or TUPLE can have 
schemas. Got "
                        + DataType.findTypeName(t), errCode, 
PigException.INPUT);
            }
            */
{code}
I think some other code paths might be relying on this constructor for error 
checking. It would be safer to create another constructor with a check boolean 
argument 
{code}
        public FieldSchema(String a, Schema s, byte t, boolean innerTypeCheck)  
{code}
 and call that from above constructor and from FieldSchema.copyAndLink(..)


- In LOStream.java.getSchema() mIsSchemaComputed is used to keep track of 
whether the fieldschema parents have been set.
I think it will be better to use a different variable for the purpose - it will 
be more readable, and also not likely to break any assumptions people are 
likely to make about this variable that is from the LogicalOperator class.

- TypeCheckingVisitor.java insertCastForUDF is called on input of udf , it 
seems like same logic should be used for other expressions as well (instead of 
insertCast(.) ). Also, insertCastForUDF(..) and insertCast(..) have only two 
lines different, we can share rest of the code.
{code}
private void insertCastForUDF(LOUserFunc udf,
                FieldSchema fromFS, FieldSchema toFs, ExpressionOperator 
predecessor){

toFs.setParent( fromFS.canonicalName, predecessor );
insertCast(udf, toFs.type, toFs, predecessor);
}
{code}

- TypeCheckingVisitor.java In visit(LOCast), it seems like we can just pick any 
of the matching predecessor load functions, shouldn't we check if all the 
FuncSpec returned are the same ?
{code}
                for( Map.Entry<String, LogicalOperator> entry : 
canonicalMap.entrySet() ) {
                    FuncSpec loadFuncSpec = getLoadFuncSpec( entry.getValue(), 
entry.getKey() );
                    cast.setLoadFuncSpec( loadFuncSpec );
                }
{code}

- LOProject.java
the commented line can be removed -
{code}
+//                         mFieldSchema.setParent(fs.canonicalName, 
expressionOperator);
{code}






> Pig gets confused when more than one loader is involved
> -------------------------------------------------------
>
>                 Key: PIG-1482
>                 URL: https://issues.apache.org/jira/browse/PIG-1482
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ankur
>            Assignee: Xuefu Zhang
>             Fix For: 0.8.0
>
>         Attachments: jira-1482-final.patch, jira-1482-final.patch, 
> jira-1482-final.patch
>
>
> In case of two relations being loaded using different loader, joined, grouped 
> and projected, pig gets confused in trying to find appropriate loader for the 
> requested cast. Consider the following script :-
> A = LOAD 'data1' USING PigStorage() AS (s, m, l);
> B = FOREACH A GENERATE s#'k1' as v1, m#'k2' as v2, l#'k3' as v3;
> C = FOREACH B GENERATE v1, (v2 == 'v2' ? 1L : 0L) as v2:long, (v3 == 'v3' ? 1 
> :0) as v3:int;
> D = LOAD 'data2' USING TextLoader() AS (a);
> E = JOIN C BY v1, D BY a USING 'replicated';
> F = GROUP E BY (v1, a);
> G = FOREACH F GENERATE (chararray)group.v1, group.a;
>         
> dump G;
> This throws the error, stack trace of which is in the next comment

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1482) Pig gets confused when more than one loader is involved

Reply via email to