[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893746#action_12893746 ]
Yan Zhou commented on PIG-1501:
-------------------------------

gzip and lzo2 were tried as the compression codecs; TFile and RCFile were used as storage formats. The tests are PigMix's L3 and L11, plus a variation of L3 with full projection, hereafter referred to as L3_1, which expands the temporary data size. (In some cases multiple runs were executed, particularly where system fluctuations were suspected.) End-to-end elapsed times were recorded.

The results are from a 15-node cluster of 2 x Xeon L5420 2.50GHz/16G RAM boxes. Each entry gives the data size followed by the elapsed time; additional entries under a column are repeat runs:

        uncompressed   TFile(lzo)     TFile(gzip)    RCFile(lzo2)

L3      133684504      19674398       11513958       18092681
        1'40"          1'45"          1'40"          1'56"
                                                     18094161
                                                     1'46"

L3_1    3889095541     3697681875     2637742581     3675818160
        3'10"          4'4"           3'25"          3'58"
                       3697666122                    3675816707
                       3'10"                         3'22"
                       3697674414
                       3'5"

L11     25878480       21368784       15233146       21112892
        1'52"          1'52"          1'57"          1'59"
                                                     21112892
                                                     1'59"

A few observations are in order:

1) L3 has the highest compression ratio, while L3_1 and L11 have much lower compression ratios;

2) gzip compresses better than LZO2, at a small performance cost;

3) RCFile should have seen much better compression, as it is a columnar store, but the actual difference is marginal. This is probably because of L11's unique values and many of L3_1's random values such as time stamps, plus the presence of map-typed columns. The conclusion from this observation is that compression of temporary intermediate data is not guaranteed to save disk space to a desired degree; it depends on the temporary data values being compressed. As a result, this feature should be made configurable;

4) The performance implications from these tests seem to be negligible: within background noise, or within a few percent of the overall run times. But this is not conclusive yet. Larger and more real-life queries would be more suitable for the comparison;

5) RCFile, as noted above, has not shown a clear advantage in terms of better columnar compression ratio. But this observation could be data-sensitive.
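To make observations 1) to 3) easier to check, the sizes reported above can be turned into ratios. The following is only a minimal sketch: the names SIZES, compression_ratios and RATIOS are illustrative and not part of Pig or PigMix, it uses only the first run of each configuration, and it takes the ratio as compressed size divided by uncompressed size (smaller means better compression). The unit of the sizes is not stated in the comment, which does not matter for a ratio.

```python
# Sizes from the first run of each configuration, copied from the
# results reported in the comment (unit not stated there).
SIZES = {
    "L3": {
        "uncompressed": 133684504,
        "TFile(lzo)": 19674398,
        "TFile(gzip)": 11513958,
        "RCFile(lzo2)": 18092681,
    },
    "L3_1": {
        "uncompressed": 3889095541,
        "TFile(lzo)": 3697681875,
        "TFile(gzip)": 2637742581,
        "RCFile(lzo2)": 3675818160,
    },
    "L11": {
        "uncompressed": 25878480,
        "TFile(lzo)": 21368784,
        "TFile(gzip)": 15233146,
        "RCFile(lzo2)": 21112892,
    },
}


def compression_ratios(sizes):
    """Return compressed/uncompressed size per query and storage format."""
    ratios = {}
    for query, by_format in sizes.items():
        base = by_format["uncompressed"]
        ratios[query] = {
            fmt: size / base
            for fmt, size in by_format.items()
            if fmt != "uncompressed"
        }
    return ratios


RATIOS = compression_ratios(SIZES)

if __name__ == "__main__":
    # One line per query and format: ratio and fraction of space saved.
    for query, by_format in RATIOS.items():
        for fmt, ratio in by_format.items():
            print(f"{query:5s} {fmt:13s} ratio={ratio:.3f} saved={1 - ratio:.1%}")
```

On these numbers, gzip yields the smallest ratio for every query, and the RCFile(lzo2) and TFile(lzo) ratios differ by about one percentage point for each query, which is the marginal difference noted in observation 3).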
> need to investigate the impact of compression on pig performance
> ----------------------------------------------------------------
>
>                 Key: PIG-1501
>                 URL: https://issues.apache.org/jira/browse/PIG-1501
>             Project: Pig
>          Issue Type: Test
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>
> We would like to understand how compressing map results as well as
> reducer output in a chain of MR jobs impacts performance. We can use PigMix
> queries for this investigation.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.