[ 
https://issues.apache.org/jira/browse/PIG-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Li updated PIG-2824:
------------------------

    Attachment: 2824.png

Attached are the results of a benchmark that loads 10GB of data containing 
60 million records with 16 fields each. We compare three runs: MapReduce, Pig 
with a schema (no types), and Pig without a schema.

For MapReduce, an empty map function is specified.

For Pig, in order to isolate the loading time, we apply a filter that throws 
out all data after loading, and we disable the PushUpFilter optimization so 
that the Foreach is still processed after data loading. Also note that the 
schema has no types, so no type casting takes place here.

We can see that Pig without a schema is much faster than Pig with a schema, 
because it avoids the Foreach that checks the number of fields.

(We can also see the overhead Pig incurs compared with pure MapReduce in this 
case.)
                
> Pushing checking number of fields into LoadFunc
> -----------------------------------------------
>
>                 Key: PIG-2824
>                 URL: https://issues.apache.org/jira/browse/PIG-2824
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.9.0, 0.10.0
>            Reporter: Jie Li
>         Attachments: 2824.png
>
>
> As described in PIG-1188, if users define a schema (with or without types), 
> we need to check the number of fields after loading data: if there are fewer 
> fields we pad with null fields, and if there are more fields we throw the 
> extras away. 
> For a schema with types, Pig used to insert a Foreach after the loader for 
> type casting, which also checks the number of fields. For a schema without 
> types there was no such Foreach, so PIG-1188 inserted one just to check the 
> number of fields. Unfortunately, a Foreach is too expensive for such a 
> check, and ideally we can push the check into the loader.
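
To illustrate the padding and truncation rule in the description above, here 
is a minimal sketch in Python. It is not Pig's implementation and does not use 
the LoadFunc API; the names fit_to_schema and load_line and the tab delimiter 
are assumptions made up for this example. It only shows how the check could 
happen in the same pass that splits a record, instead of in a separate step.

```python
from typing import List, Optional, Sequence


def fit_to_schema(fields: Sequence[Optional[str]],
                  num_expected: int) -> List[Optional[str]]:
    """Return exactly num_expected fields (hypothetical helper).

    Surplus fields are dropped; missing fields are padded with None,
    which stands in for Pig's null here.
    """
    fitted = list(fields[:num_expected])                   # drop extras
    fitted.extend([None] * (num_expected - len(fitted)))   # pad with nulls
    return fitted


def load_line(line: str, num_expected: int,
              delimiter: str = "\t") -> List[Optional[str]]:
    """Hypothetical loader step: split one record and fit it in one pass."""
    return fit_to_schema(line.split(delimiter), num_expected)
```

For a schema with three fields, load_line("a\tb", 3) pads the record with one 
null, and load_line("a\tb\tc\td", 3) drops the fourth field.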


        
