[ 
https://issues.apache.org/jira/browse/PIG-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981624#action_12981624
 ] 

Daniel Dai commented on PIG-847:
--------------------------------

After code review and prototyping, I don't think we need twoLevelAccess 
(Schema.twoLevelAccessRequired). The reasons are:
1. twoLevelAccess exists only in the logical layer. The physical layer has no 
notion of twoLevelAccess, so we generate the same physical plan regardless of 
its value.
2. We do two-level access for all bag access. I have not found any case where 
we want to access the enclosing tuple of the bag directly.

Here is one example. Suppose we have a UDF which generates a bag:

{code}
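import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.DefaultBagFactory;
import org.apache.pig.data.DefaultTupleFactory;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema;

class GenBag extends EvalFunc<DataBag> {
    // Returns a bag holding one tuple (x, x*x), where x is the first input field.
    @Override
    public DataBag exec(Tuple input) throws IOException {
        DataBag result = DefaultBagFactory.getInstance().newDefaultBag();
        Tuple t = DefaultTupleFactory.getInstance().newTuple();
        t.append(input.get(0));
        t.append(((Integer)input.get(0))*((Integer)input.get(0)));
        result.add(t);
        return result;
    }
    // Declares the output as a bag containing a tuple of two int fields.
    @Override
    public Schema outputSchema(Schema input) {
        try {
            Schema tupleSchema = new Schema();
            for (int i=0;i<2;i++)
                tupleSchema.add(new FieldSchema(input.getField(0).alias, null,
                        DataType.INTEGER));
            Schema bagSchema = new Schema();
            bagSchema.add(new FieldSchema(null, tupleSchema, DataType.TUPLE));
            // Play with twoLevelAccess
            bagSchema.setTwoLevelAccessRequired(false);
            return new Schema(new FieldSchema(this.getClass().getSimpleName(),
                    bagSchema, DataType.BAG));
        } catch (FrontendException e) {
            return null;
        }
    }
}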
{code}
If we have a script: 
{code}
a = load '1.txt' as (a0:int, a1:int);
b = foreach a generate GenBag(a0, a1) as bg;
c = foreach b generate bg.$0;
dump c;
{code}
The goal of twoLevelAccess seems to be to control the meaning of bg.$0: whether 
it means the tuple or the first field of the tuple. In practice, however, we 
only see users project the items inside the tuple. In fact, in the current 
code we still cannot project the tuple even if we set twoLevelAccess to false. 
So keeping twoLevelAccess is meaningless and confusing. I propose to remove 
twoLevelAccess: every bag implicitly contains tuples, and bag projection 
implicitly goes to the items inside the tuple. 
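
To make the proposed rule concrete, here is a minimal sketch in plain Java. It 
does not use Pig's Schema API: the Field class and the projectFromBag and 
projectColumn methods are hypothetical names made up for illustration, and the 
alias a0_squared is invented as well. The sketch only shows the intended 
behavior, namely that a projection such as bg.$0 always looks through the single 
tuple schema inside the bag, with no flag involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class BagProjectionSketch {

    /** Hypothetical stand-in for a field schema (not Pig's FieldSchema). */
    static final class Field {
        final String alias;
        final String type;        // "int", "tuple", or "bag"
        final List<Field> inner;  // nested fields; empty for scalar types

        Field(String alias, String type, Field... inner) {
            this.alias = alias;
            this.type = type;
            this.inner = Arrays.asList(inner);
        }
    }

    /** Schema level: resolve bag.$index by always stepping through the enclosing tuple. */
    static Field projectFromBag(Field bag, int index) {
        if (!"bag".equals(bag.type)) {
            throw new IllegalArgumentException("not a bag: " + bag.alias);
        }
        if (bag.inner.size() != 1 || !"tuple".equals(bag.inner.get(0).type)) {
            throw new IllegalStateException("a bag must contain exactly one tuple schema");
        }
        Field tuple = bag.inner.get(0);
        return tuple.inner.get(index);
    }

    /** Data level: keep only column `index` of every tuple in the bag. */
    static List<List<Object>> projectColumn(List<List<Object>> bag, int index) {
        List<List<Object>> out = new ArrayList<>();
        for (List<Object> tuple : bag) {
            out.add(Collections.singletonList(tuple.get(index)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Schema shaped like the GenBag output: bag{tuple(int, int)}.
        Field bg = new Field("bg", "bag",
                new Field(null, "tuple",
                        new Field("a0", "int"),
                        new Field("a0_squared", "int")));
        Field first = projectFromBag(bg, 0);
        System.out.println(first.alias + ":" + first.type); // prints a0:int

        // A bag with the single tuple (3, 9); projecting $0 keeps only the 3.
        List<List<Object>> bag = Arrays.asList(Arrays.<Object>asList(3, 9));
        System.out.println(projectColumn(bag, 0)); // prints [[3]]
    }
}
```

Under this rule there is nothing for a UDF author to set: the result of the 
projection is still a bag of tuples, just with fewer fields per tuple.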

> Setting twoLevelAccessRequired field in a bag schema should not be required 
> to access fields in the tuples of the bag
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-847
>                 URL: https://issues.apache.org/jira/browse/PIG-847
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Pradeep Kamath
>            Assignee: Daniel Dai
>             Fix For: 0.9.0
>
>
> Currently Pig interprets the result type of a relation as a bag. However the 
> schema of the relation directly contains the schema describing the fields in 
> the tuples for the relation. However when a udf wants to return a bag or if 
> there is a bag in input data or if the user creates a bag constant, the 
> schema of the bag has one field schema which is that of the tuple. The 
> Tuple's schema has the types of the fields. To be able to access the fields 
> from the bag directly in such a case by using something like 
> <bagname>.<fieldname> or <bag>.<fieldposition>, the schema of the bag should 
> have the twoLevelAccess set to true so that pig's type system can traverse 
> the tuple schema and get to the field in question. This is confusing 
> - we should try and see if we can avoid needing this extra flag. A possible 
> solution is to treat bags the same way - whether they represent relations or 
> real bags. Another way is to introduce a special "relation" datatype for the 
> result type of a relation and bag type would be used only for true bags. In 
> this case, we would always need bag schema to have a tuple schema which would 
> describe the fields. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
