[ 
https://issues.apache.org/jira/browse/PIG-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699810#action_12699810
 ] 

Santhosh Srinivasan commented on PIG-767:
-----------------------------------------

First, the describe output is broken for bags in some cases: it omits the 
inner tuple (t1 in your example). This can be fixed, and it does not cause any 
problems for runtime execution.

Bags are containers of tuples. A bag that is not empty always contains 
tuples. As a result, a UDF that expects an inner bag of chararray will always 
receive a bag of tuples, each holding a chararray.
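
To make the point about bags concrete, here is a minimal sketch in plain 
Python. It is only a model and does not use the Pig API: a bag is represented 
as a list, a tuple as a Python tuple, and the function name first_fields is 
made up for illustration. It shows code that consumes a bag by unwrapping each 
inner tuple rather than expecting bare values.

```python
# Illustrative model only: a bag is a list and each element is a tuple.
# None of these names come from the Pig API; they are assumptions.

def first_fields(bag):
    """Return the first field of every tuple in the bag."""
    values = []
    for t in bag:
        # A non-empty bag only ever holds tuples, so anything else is an error.
        if not isinstance(t, tuple):
            raise TypeError("bag elements must be tuples, got %r" % (t,))
        values.append(t[0])
    return values


# Same shape as the inner bag in the first row of the 'dump c' output in
# this comment.
gpa_bag = [(3.8,), (3.1,)]
print(first_fields(gpa_bag))  # [3.8, 3.1]
print(first_fields([]))       # [] (an empty bag holds no tuples)
```

The empty bag is the only case where there are no tuples to unwrap, which 
matches the statement above.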

Below is the output of similar queries. The dump output shows the tuples 
inside the bags, while the describe output for the same bags does not show the 
inner tuples.

{code}

grunt> a = load '/user/sms/data/student_tab.data' as (name: chararray, age: int, gpa: float);
grunt> b = group a by age; 

grunt> describe b;
b: {group: int,a: {name: chararray,age: int,gpa: float}}

grunt> dump b;
(19,{(John,19,3.8F),(Jack,19,3.1F)})
(20,{(Joe,20,3.5F),(Harry,20,3.2F),(Govinda,20,4.0F)})

grunt> c = foreach b generate group, a.gpa;

grunt> describe c;
c: {group: int,gpa: {gpa: float}}

grunt> dump c;
(19,{(3.8F),(3.1F)})
(20,{(3.5F),(3.2F),(4.0F)})
{code}

> Schema reported from DESCRIBE and actual schema of inner bags are different.
> ----------------------------------------------------------------------------
>
>                 Key: PIG-767
>                 URL: https://issues.apache.org/jira/browse/PIG-767
>             Project: Pig
>          Issue Type: Bug
>            Reporter: George Mavromatis
>             Fix For: 0.2.0
>
>
> The following script:
> urlContents = LOAD 'inputdir' USING BinStorage() AS (url:bytearray, 
> pg:bytearray);
> -- describe and dump are in-sync
> DESCRIBE urlContents;
> DUMP urlContents;
> urlContentsG = GROUP urlContents BY url;
> DESCRIBE urlContentsG;
> urlContentsF = FOREACH urlContentsG GENERATE group,urlContents.pg;
> DESCRIBE urlContentsF;
> DUMP urlContentsF;
> Prints for the DESCRIBE commands:
> urlContents: {url: chararray,pg: chararray}
> urlContentsG: {group: chararray,urlContents: {url: chararray,pg: chararray}}
> urlContentsF: {group: chararray,pg: {pg: chararray}}
> The reported schemas for urlContentsG and urlContentsF are wrong. They are 
> also against the section "Schemas for Complex Data Types" in 
> http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_Schemas.
> As expected, actual data observed from DUMP urlContentsG and DUMP 
> urlContentsF do contain the tuple inside the inner bags.
> The correct schema for urlContentsG is:  {group: chararray,urlContents: 
> {t1:(url: chararray,pg: chararray)}}
> This may sound like a technicality, but it isn't. For instance, a UDF that 
> assumes an inner bag of {chararray} will not work with {(chararray)}. 

