[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893746#action_12893746
 ] 

Yan Zhou commented on PIG-1501:
-------------------------------

gzip and lzo2 were tried as the compression codecs; TFile and RCFile were used 
as storage formats. The tests are PigMix's L3 and L11, plus a variation of L3 
with full projection, hereafter referred to as L3_1, which enlarges the 
temporary data size. (In some cases multiple runs were executed, particularly 
where system fluctuations were suspected.) End-to-end elapsed times were 
recorded.

The results were obtained on a 15-node cluster of 2 x Xeon L5420 2.50GHz/16G 
RAM boxes:

Test   Run  uncompressed       TFile(lzo)         TFile(gzip)        RCFile(lzo2)
L3     1    133684504  1'40"   19674398   1'45"   11513958   1'40"   18092681   1'56"
       2                                                             18094161   1'46"

L3_1   1    3889095541 3'10"   3697681875 4'4"    2637742581 3'25"   3675818160 3'58"
       2                       3697666122 3'10"                      3675816707 3'22"
       3                       3697674414 3'5"

L11    1    25878480   1'52"   21368784   1'52"   15233146   1'57"   21112892   1'59"
       2                                                             21112892   1'59"

Each entry is the temporary data size followed by the end-to-end elapsed time; 
runs 2 and 3 are repeat runs.

A few observations are in order:

1) L3 has the highest compression ratio, while L3_1 and L11 have much lower 
compression ratios;
2) gzip compresses better than LZO2, at a small performance cost;
3) RCFile should have seen much better compression since it is a columnar 
store, but the actual difference is marginal. This is probably because of 
L11's unique values and many of L3_1's random values such as timestamps, plus 
the presence of map-typed columns. The conclusion from this observation is that 
compressing temporary intermediate data is not guaranteed to save disk space to 
the desired degree; it depends on the values being compressed. As a result, 
this feature should be made configurable;
4) The performance implications from these tests seem to be negligible: within 
background noise, or within a few percent of the overall run times. But this 
is not conclusive yet. Larger and more realistic queries would be more suitable 
for the comparison;
5) As noted above, RCFile has not shown a clear advantage in columnar 
compression ratio, but this observation could be data-sensitive.
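
The compression ratios behind observations 1) and 2), and the elapsed-time 
differences behind observation 4), can be recomputed from the first-run figures 
in the table. The following is a minimal Python sketch, not part of Pig or 
PigMix: the dictionary layout and the function names (parse_elapsed, 
compression_ratio, slowdown_percent) are made up for illustration, and repeat 
runs are left out.

```python
# First-run sizes and elapsed times copied from the results table.
# The layout of this dictionary is an illustrative assumption.
RESULTS = {
    "L3": {
        "uncompressed": (133684504, "1'40\""),
        "TFile(lzo)": (19674398, "1'45\""),
        "TFile(gzip)": (11513958, "1'40\""),
        "RCFile(lzo2)": (18092681, "1'56\""),
    },
    "L3_1": {
        "uncompressed": (3889095541, "3'10\""),
        "TFile(lzo)": (3697681875, "4'4\""),
        "TFile(gzip)": (2637742581, "3'25\""),
        "RCFile(lzo2)": (3675818160, "3'58\""),
    },
    "L11": {
        "uncompressed": (25878480, "1'52\""),
        "TFile(lzo)": (21368784, "1'52\""),
        "TFile(gzip)": (15233146, "1'57\""),
        "RCFile(lzo2)": (21112892, "1'59\""),
    },
}


def parse_elapsed(text):
    """Convert a string such as 3'25" (minutes'seconds") to seconds."""
    minutes, seconds = text.rstrip('"').split("'")
    return int(minutes) * 60 + int(seconds)


def compression_ratio(test, fmt):
    """Uncompressed size divided by the size under the given format."""
    return RESULTS[test]["uncompressed"][0] / RESULTS[test][fmt][0]


def slowdown_percent(test, fmt):
    """Elapsed-time change relative to the uncompressed run, in percent."""
    base = parse_elapsed(RESULTS[test]["uncompressed"][1])
    return 100.0 * (parse_elapsed(RESULTS[test][fmt][1]) - base) / base


if __name__ == "__main__":
    for test, row in RESULTS.items():
        for fmt in row:
            if fmt == "uncompressed":
                continue
            print(f"{test:5} {fmt:13} "
                  f"ratio {compression_ratio(test, fmt):6.2f}  "
                  f"time {slowdown_percent(test, fmt):+6.1f}%")
```

On these figures, TFile(gzip) on L3 comes out at a ratio of about 11.6, while 
TFile(lzo) on L3_1 is about 1.05. Because only first runs are used, the L3_1 
TFile(lzo) time reflects the 4'4" run and not the later 3'10" and 3'5" runs.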

> need to investigate the impact of compression on pig performance
> ----------------------------------------------------------------
>
>                 Key: PIG-1501
>                 URL: https://issues.apache.org/jira/browse/PIG-1501
>             Project: Pig
>          Issue Type: Test
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>
> We would like to understand how compressing map results as well as 
> reducer output in a chain of MR jobs impacts performance. We can use PigMix 
> queries for this investigation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
