[
https://issues.apache.org/jira/browse/PIG-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419629#comment-13419629
]
Jie Li commented on PIG-2824:
-----------------------------
I also ran a comparison using TPC-H Q19:
{code}
lineitem = load '$input/lineitem' USING PigStorage('|') as (l_orderkey,
l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount,
l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate,
l_receiptdate,l_shipinstruct, l_shipmode, l_comment);
part = load '$input/part' USING PigStorage('|') as (p_partkey, p_name, p_mfgr,
p_brand, p_type, p_size, p_container, p_retailprice, p_comment);
lpart = JOIN lineitem BY l_partkey, part by p_partkey;
fltResult = FILTER lpart BY
(
p_brand == 'Brand#12'
and p_container matches 'SM CASE|SM BOX|SM PACK|SM PKG'
and l_quantity >= 1 and l_quantity <= 11
and p_size >= 1 and p_size <= 5
and l_shipmode matches 'AIR|AIR REG'
and l_shipinstruct == 'DELIVER IN PERSON'
)
or
(
p_brand == 'Brand#23'
and p_container matches 'MED BAG|MED BOX|MED PKG|MED PACK'
and l_quantity >= 10 and l_quantity <= 20
and p_size >= 1 and p_size <= 10
and l_shipmode matches 'AIR|AIR REG'
and l_shipinstruct == 'DELIVER IN PERSON'
)
or
(
p_brand == 'Brand#34'
and p_container matches 'LG CASE|LG BOX|LG PACK|LG PKG'
and l_quantity >= 20 and l_quantity <= 30
and p_size >= 1 and p_size <= 15
and l_shipmode matches 'AIR|AIR REG'
and l_shipinstruct == 'DELIVER IN PERSON'
);
volume = FOREACH fltResult GENERATE l_extendedprice * (1 - l_discount);
grpResult = GROUP volume ALL;
revenue = FOREACH grpResult GENERATE SUM(volume);
store revenue into '$output/Q19out' USING PigStorage('|');
{code}
The script consists of a join job, which dominates the running time, and a
lightweight group job. Below is the map-phase time for processing 10 GB of
data:
||trunk||this patch||
|7m54s|7m22s|
The improvement is less significant than in the previous mini benchmark because
half of the fields are pruned, but we still see a speedup of about 32 seconds
(roughly 7%).
> Pushing checking number of fields into LoadFunc
> -----------------------------------------------
>
> Key: PIG-2824
> URL: https://issues.apache.org/jira/browse/PIG-2824
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.9.0, 0.10.0
> Reporter: Jie Li
> Attachments: 2824.patch, 2824.png
>
>
> As described in PIG-1188, if users define a schema (with or without types),
> we need to check the number of fields after loading data: if there are fewer
> fields, we need to pad with null fields, and if there are more fields, we
> need to discard the extras.
> For a schema with types, Pig used to insert a Foreach after the loader for
> type casting, which also checks the number of fields. For a schema without
> types there was no such Foreach, so PIG-1188 inserted one just to check the
> number of fields. Unfortunately, a Foreach is too expensive for such a check,
> and ideally we can push it into the loader.
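
For readers unfamiliar with the rule in the description above, here is a minimal sketch of the pad/truncate check in plain Java. It only illustrates the logic and is not the code in 2824.patch: the class `FieldCountAdjuster` and the method `adjust` are hypothetical names, and a real loader would work on Pig tuples rather than a `List<Object>`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Pig: illustrates the pad/truncate rule only.
final class FieldCountAdjuster {
    private FieldCountAdjuster() {}

    /**
     * Returns a copy of {@code fields} with exactly {@code expected} entries.
     * Missing trailing fields are padded with null; extra trailing fields are dropped.
     */
    static List<Object> adjust(List<Object> fields, int expected) {
        if (expected < 0) {
            throw new IllegalArgumentException("expected must be non-negative");
        }
        List<Object> out = new ArrayList<>(expected);
        for (int i = 0; i < expected; i++) {
            // Copy the field if the row has it, otherwise pad with null.
            out.add(i < fields.size() ? fields.get(i) : null);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object> shortRow = Arrays.asList("1", "2");
        List<Object> longRow = Arrays.asList("1", "2", "3", "4");
        System.out.println(adjust(shortRow, 3)); // prints [1, 2, null]
        System.out.println(adjust(longRow, 3));  // prints [1, 2, 3]
    }
}
```

Doing this once per record inside the loader, while the fields are being parsed, avoids running a separate Foreach operator over every tuple, which is the cost the issue aims to remove.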