[
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893746#action_12893746
]
Yan Zhou commented on PIG-1501:
-------------------------------
gzip and lzo2 were tried as the compression codecs; TFile and RCFile were used
as storage formats. The tests are PigMix's L3 and L11, plus a variation of L3
with full projection, hereafter referred to as L3_1, which expands the
temporary data size. (In some cases multiple runs were executed, particularly
when system fluctuations were suspected.) End-to-end elapsed times were
recorded.
The results are on a 15-node cluster of 2 x Xeon L5420 2.50GHz/16G RAM boxes:
                  uncompressed   TFile(lzo)    TFile(gzip)   RCFile(lzo2)
L3    run 1 size  133684504      19674398      11513958      18092681
            time  1'40"          1'45"         1'40"         1'56"
      run 2 size                                             18094161
            time                                             1'46"
L3_1  run 1 size  3889095541     3697681875    2637742581    3675818160
            time  3'10"          4'4"          3'25"         3'58"
      run 2 size                 3697666122                  3675816707
            time                 3'10"                       3'22"
      run 3 size                 3697674414
            time                 3'5"
L11   run 1 size  25878480       21368784      15233146      21112892
            time  1'52"          1'52"         1'57"         1'59"
      run 2 size                                             21112892
            time                                             1'59"
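For reference, the compression ratios implied by the first-run sizes in the table above can be worked out with a short script. This is only a sketch: the names `SIZES`, `compression_ratios` and `RATIOS` are made up for illustration, and the sizes are assumed to be in bytes, which the table does not state.

```python
# First-run temporary data sizes, copied from the results table.
# Assumption: the unit is bytes (the table itself does not say).
SIZES = {
    "L3": {
        "uncompressed": 133684504,
        "TFile(lzo)": 19674398,
        "TFile(gzip)": 11513958,
        "RCFile(lzo2)": 18092681,
    },
    "L3_1": {
        "uncompressed": 3889095541,
        "TFile(lzo)": 3697681875,
        "TFile(gzip)": 2637742581,
        "RCFile(lzo2)": 3675818160,
    },
    "L11": {
        "uncompressed": 25878480,
        "TFile(lzo)": 21368784,
        "TFile(gzip)": 15233146,
        "RCFile(lzo2)": 21112892,
    },
}


def compression_ratios(sizes):
    """Return {test: {format: uncompressed_size / compressed_size}}."""
    ratios = {}
    for test, by_format in sizes.items():
        base = by_format["uncompressed"]
        ratios[test] = {
            fmt: base / size
            for fmt, size in by_format.items()
            if fmt != "uncompressed"
        }
    return ratios


RATIOS = compression_ratios(SIZES)

if __name__ == "__main__":
    for test, by_format in RATIOS.items():
        for fmt, ratio in by_format.items():
            print(f"{test:5s} {fmt:13s} {ratio:5.2f}x")
```

With these numbers L3 comes out at roughly 6.8x for TFile(lzo) and 11.6x for TFile(gzip), while L3_1 stays near 1.05x and 1.47x respectively.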
A few observations are in order:
1) L3 has the highest compression ratio, while L3_1 and L11 have much lower
compression ratios;
2) gzip compresses better than LZO2, at a small performance cost;
3) RCFile should have seen much better compression, as it is a columnar store,
but the actual difference is marginal. This is probably because of L11's unique
values and many of L3_1's random values such as timestamps, plus the presence
of map-typed columns. The conclusion from this observation is that compression
of temporary intermediate data is not guaranteed to save disk space to the
desired degree; it depends on the values in the temporary data being
compressed. As a result, this feature should be made configurable;
4) The performance impact in these tests seems negligible: within background
noise, or within a few percent of the overall run times. But this is not
conclusive yet. Larger and more realistic queries would be more suitable for
the comparison;
5) As noted above, RCFile has not shown a clear advantage in columnar
compression ratio. But this observation could be data-sensitive.
> need to investigate the impact of compression on pig performance
> ----------------------------------------------------------------
>
> Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/PIG-1501
> Project: Pig
> Issue Type: Test
> Reporter: Olga Natkovich
> Assignee: Yan Zhou
> Fix For: 0.8.0
>
>
> We would like to understand how compressing map results as well as
> reducer output in a chain of MR jobs impacts performance. We can use PigMix
> queries for this investigation.