[ 
https://issues.apache.org/jira/browse/PIG-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419629#comment-13419629
 ] 

Jie Li commented on PIG-2824:
-----------------------------

I also ran a comparison using TPC-H Q19:

{code}
lineitem = load '$input/lineitem' USING PigStorage('|') as (l_orderkey, 
l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, 
l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, 
l_receiptdate,l_shipinstruct, l_shipmode, l_comment);

part = load '$input/part' USING PigStorage('|') as (p_partkey, p_name, p_mfgr, 
p_brand, p_type, p_size, p_container, p_retailprice, p_comment);

lpart = JOIN lineitem BY l_partkey, part by p_partkey;

fltResult = FILTER lpart BY 
  (
    p_brand == 'Brand#12'
        and p_container matches 'SM CASE|SM BOX|SM PACK|SM PKG'
        and l_quantity >= 1 and l_quantity <= 11
        and p_size >= 1 and p_size <= 5
        and l_shipmode matches 'AIR|AIR REG'
        and l_shipinstruct == 'DELIVER IN PERSON'
  ) 
  or 
  (
    p_brand == 'Brand#23'
        and p_container matches 'MED BAG|MED BOX|MED PKG|MED PACK'
        and l_quantity >= 10 and l_quantity <= 20
        and p_size >= 1 and p_size <= 10
        and l_shipmode matches 'AIR|AIR REG'
        and l_shipinstruct == 'DELIVER IN PERSON'
  )
  or
  (
    p_brand == 'Brand#34'
        and p_container matches 'LG CASE|LG BOX|LG PACK|LG PKG'
        and l_quantity >= 20 and l_quantity <= 30
        and p_size >= 1 and p_size <= 15
        and l_shipmode matches 'AIR|AIR REG'
        and l_shipinstruct == 'DELIVER IN PERSON'
  );
volume = FOREACH fltResult GENERATE l_extendedprice * (1 - l_discount);
grpResult = GROUP volume ALL;
revenue = FOREACH grpResult GENERATE SUM(volume);

store revenue into '$output/Q19out' USING PigStorage('|');
{code}

The script consists of a join job, which dominates the running time, and a 
lightweight group job. Below is a comparison of the map phase time for 
processing 10 GB of data:

||trunk||this patch||
|7m54s|7m22s|

The improvement is smaller than in the previous mini benchmark because half of 
the fields are pruned, but we still see a speedup of about 32 seconds (roughly 7%).
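
For reference, the speedup can be recomputed from the two map phase timings in the table above. This is only a small arithmetic sketch; the helper name `to_seconds` is made up for illustration and is not part of Pig or the patch.

```python
def to_seconds(duration: str) -> int:
    """Convert a duration written like '7m54s' into whole seconds."""
    minutes, seconds = duration.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


trunk = to_seconds("7m54s")  # map phase time on trunk
patch = to_seconds("7m22s")  # map phase time with this patch

saved = trunk - patch                # seconds saved by the patch
speedup_pct = 100.0 * saved / trunk  # saving relative to trunk

print(f"saved {saved}s of {trunk}s ({speedup_pct:.1f}%)")
```

With the timings above this gives 32 seconds saved out of 474, or about 6.8% of the trunk map phase time.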
                
> Pushing checking number of fields into LoadFunc
> -----------------------------------------------
>
>                 Key: PIG-2824
>                 URL: https://issues.apache.org/jira/browse/PIG-2824
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.9.0, 0.10.0
>            Reporter: Jie Li
>         Attachments: 2824.patch, 2824.png
>
>
> As described in PIG-1188, if users define a schema (with or without types), 
> we need to check the number of fields after loading data: if there are fewer 
> fields, we need to pad with null fields, and if there are more fields, we need 
> to throw the extras away. 
> For a schema with types, Pig used to insert a Foreach after the loader for 
> type casting, which also checks the number of fields. For a schema without 
> types there was no such Foreach, so PIG-1188 inserted one just for checking 
> the number of fields. Unfortunately, Foreach is too expensive for such a 
> check, and ideally we can push it into the loader.
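
The pad-or-truncate check described in the issue can be sketched as below. This is a language-neutral illustration in Python, not the code in 2824.patch: Pig's loaders are Java classes, and the names `fit_to_schema` and `load_line` are hypothetical. It only assumes that the loader knows the number of fields declared in the schema when it splits a line.

```python
from typing import List, Optional


def fit_to_schema(fields: List[Optional[str]], expected: int) -> List[Optional[str]]:
    """Return a row with exactly `expected` fields.

    Short rows are padded with None (standing in for Pig's null);
    long rows have their extra trailing fields dropped.
    """
    if len(fields) < expected:
        return fields + [None] * (expected - len(fields))
    return fields[:expected]


def load_line(line: str, expected: int, delimiter: str = "|") -> List[Optional[str]]:
    """Split one input line and fix up the field count in the same step."""
    return fit_to_schema(line.split(delimiter), expected)


print(load_line("1|2", 3))      # short row: padded
print(load_line("1|2|3|4", 3))  # long row: truncated
```

Doing the check while the row is being built means no separate per-tuple operator is needed afterwards, which is the saving the issue is after.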


        
