[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance

Yan Zhou (JIRA) Tue, 31 Aug 2010 16:37:37 -0700

     [ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Yan Zhou updated PIG-1501:
--------------------------

    Release Note: 
This feature saves the HDFS space used to store the intermediate data that Pig 
generates and can potentially improve query execution speed. In general, the 
more intermediate data a query generates, the greater the storage savings and 
speedup.

There are no backward compatibility issues as a result of this feature.

Two Java properties control the behavior:

pig.tmpfilecompression (default: false) determines whether the temporary files 
are compressed. If it is true, then

pig.tmpfilecompression.codec specifies which compression codec to use. 
Currently, Pig accepts only "gz" and "lzo" as values. Because LZO is 
GPL-licensed, Hadoop may need to be configured separately to use the LZO codec. 
Please refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details.
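
As a sketch, the same two settings could also be kept in a properties file 
instead of being passed as -D flags. This assumes that Pig picks them up from 
its conf/pig.properties file, which this note does not state:

    # conf/pig.properties (assumed location, not confirmed by this note)
    # Compress the temporary files written between MapReduce jobs.
    pig.tmpfilecompression=true
    # Codec for those files: "gz" or "lzo".
    pig.tmpfilecompression.codec=gz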


An example is the following "test.pig" script:

register pigperf.jar;
-- Load the PigMix page_views data.
A = load '/user/pig/tests/data/pigmix/page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent:long, query_term, ip_addr, timestamp, 
estimated_revenue, page_info, page_links);
B1 = filter A by timespent == 4;
B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
-- The skewed join and the distinct each run with 300 reducers.
C = join B1 by query_term, B by query_term using 'skewed' parallel 300;
D = distinct C parallel 300;
store D into 'output.lzo';

which is launched as follows:

java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar \
-Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 \
-Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo \
org.apache.pig.Main ./test.pig

Yan Zhou added a comment - 26/Aug/10 11:14 AM, with the same text as the 
release note above minus the description of the two properties.


> need to investigate the impact of compression on pig performance
> ----------------------------------------------------------------
>
>                 Key: PIG-1501
>                 URL: https://issues.apache.org/jira/browse/PIG-1501
>             Project: Pig
>          Issue Type: Test
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
> PIG-1501.patch, PIG-1501.patch, PIG-1501.patch
>
>
> We would like to understand how compressing map results as well as 
> reducer output in a chain of MR jobs impacts performance. We can use PigMix 
> queries for this investigation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
