[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904848#action_12904848 ]

Olga Natkovich commented on PIG-1501:

Ashutosh,

Compression is off by default because the default codec is gzip, which is really slow and most of the time not what you want. Because of the licensing issue with lzo, users need to set it up on their own. Once they have done the setup, they can enable the compression.

need to investigate the impact of compression on pig performance

Key: PIG-1501
URL: https://issues.apache.org/jira/browse/PIG-1501
Project: Pig
Issue Type: Test
Reporter: Olga Natkovich
Assignee: Yan Zhou
Fix For: 0.8.0
Attachments: compress_perf_data.txt, compress_perf_data_2.txt, PIG-1501.patch, PIG-1501.patch, PIG-1501.patch

We would like to understand how compressing map results as well as reducer output in a chain of MR jobs impacts performance. We can use PigMix queries for this investigation.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902484#action_12902484 ]

Thejas M Nair commented on PIG-1501:

+1
RE: [jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
Thanks for the quick turnaround, Thejas.

Yan

-Original Message-
From: Thejas M Nair (JIRA) [mailto:j...@apache.org]
Sent: Wednesday, August 25, 2010 8:54 AM
To: pig-dev@hadoop.apache.org
Subject: [jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

Thejas M Nair commented on PIG-1501:

+1
[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902065#action_12902065 ]

Thejas M Nair commented on PIG-1501:

Comments on the patch -

TFileStorage.java
- The getSchema() code that determines the schema from the data is the same across TFileStorage and InterStorage. The code in BinStorage is also the same, except that it uses some deprecated functions. This code can be moved to a common util class. (Yes, I should have moved it to a util class when I created InterStorage.)

TestTmpFileCompression.java
- Both tests check whether TFile is getting used. I think one test can be changed to check that InterStorage gets used when compression is not turned on, or a check can be added to any other existing test case that runs an MR job, to see if InterStorage gets used there.
- The log setup code is duplicated between setup and resetLog(). It can be moved to a common function.

SampleOptimizer.java
- The following comment can be updated -
// check that it is using BinaryStorage.
to
// check that it is using the temp file storage format.

TFileRecordWriter.java
- The comment in the following section does not seem to be valid anymore -
{code}
public TFileRecordWriter(Path file, String codec, Configuration conf)
+throws IOException {
+// hardcoded to use gzip and 1M as block size: may wish to be made configurable
{code}
[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900950#action_12900950 ]

Yan Zhou commented on PIG-1501:

The internal Hudson results are as follows:

[exec] -1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 9 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] -1 javac. The applied patch generated 162 javac compiler warnings (more than the trunk's current 156 warnings).
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] -1 release audit. The applied patch generated 427 release audit warnings (more than the trunk's current 425 warnings).

The 6 new javac warnings come from the use of the deprecated PigMapReduce.sJobConf field. That deprecation is intended for external use only, so internal use should be ok.

The 2 new release audit warnings are on two html files, SampleOptimizer.html and org.apache.pig.impl.util.Utils.html.
[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897455#action_12897455 ]

Thejas M Nair commented on PIG-1501:

Why was TFile chosen over SequenceFile? I am wondering if the additional unused features of TFile (index, metadata) result in any overhead compared to SequenceFile.
[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897496#action_12897496 ]

Yan Zhou commented on PIG-1501:

Please refer to HADOOP-3315 for the overall SequenceFile vs TFile comparison. It appears that for compressed data, TFile performs better than SequenceFile.
[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12896993#action_12896993 ]

Alan Gates commented on PIG-1501:

It's not surprising that RCFile performs badly here, since in every case every column in the row is used. This is known to be a bad use case for columnar storage. While for some data sets the better compression may overcome this, I suspect that in the general case the stitching costs will overwhelm any compression wins (as shown here).

I'm +1 on going with lzo/TFile. As the lzo libs are GPL we cannot ship with that as the default. It wasn't clear to me from your last comment which one you were proposing as the default.
[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897005#action_12897005 ]

Yan Zhou commented on PIG-1501:

The default is *not* to compress the intermediate data, which is the existing behavior.

RCFile is just a bit better than TFile in terms of compression ratio. In terms of performance, the difference is within background noise. Stitching costs should be minimal. Actually, full projection is the biggest advantage of RCFile over other columnar storage like zebra. I was surprised to see that the compression improvement over TFile is marginal. The only cause I can think of is that the compression ratio is too sensitive to the data to pre-determine or even pre-estimate.

lzo is under GPL. But it appears that the Hadoop installation has it, at least in my test cluster.
[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897046#action_12897046 ]

Alan Gates commented on PIG-1501:

You can install lzo with Hadoop (as Yahoo does on its grids) but you cannot ship lzo with Hadoop or Pig.
[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12896620#action_12896620 ]

Yan Zhou commented on PIG-1501:

Unless an objection is raised in the coming week, I'll go with LZO compression on TFile, with compression disabled by default, which preserves the old behavior.
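To make the proposal concrete, here is a minimal sketch of how an off-by-default switch with a user-selected codec could be resolved. It is not taken from the patch. The property names `pig.tmpfilecompression` and `pig.tmpfilecompression.codec` are assumptions (they are not stated in this thread), and the class name, the method `chooseCodec`, and the accepted codec values are hypothetical.

```java
import java.util.Properties;

public class TmpFileCompressionConfig {
    // Assumed property names; not stated anywhere in this thread.
    static final String ENABLE_KEY = "pig.tmpfilecompression";
    static final String CODEC_KEY = "pig.tmpfilecompression.codec";

    /**
     * Returns the codec to use for intermediate data, or null when
     * compression is disabled (the default, matching the old behavior).
     */
    public static String chooseCodec(Properties props) {
        boolean enabled = Boolean.parseBoolean(props.getProperty(ENABLE_KEY, "false"));
        if (!enabled) {
            return null; // old behavior: uncompressed intermediate files
        }
        String codec = props.getProperty(CODEC_KEY);
        // Hypothetical validation: only the two codecs discussed in this thread.
        if (!"lzo".equals(codec) && !"gz".equals(codec)) {
            throw new IllegalArgumentException("Unsupported or missing codec: " + codec);
        }
        return codec;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println("default: " + chooseCodec(props));
        props.setProperty(ENABLE_KEY, "true");
        props.setProperty(CODEC_KEY, "lzo");
        System.out.println("enabled: " + chooseCodec(props));
    }
}
```

Failing loudly when compression is enabled without a usable codec keeps a cluster that lacks lzo from silently falling back to a slow codec.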
[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12893746#action_12893746 ]

Yan Zhou commented on PIG-1501:

gzip and lzo2 were tried as the compression codecs; TFile and RCFile were used as storage formats. The tests are PigMix's L3 and L11, plus a variation of L3 with full projection, hereafter referred to as L3_1, in order to expand the temporary data size. (In some cases multiple runs were executed, particularly when system fluctuations were suspected.) End-to-end elapsed times were recorded.

The results on a 15-node cluster of 2 x Xeon L5420 2.50GHz/16G RAM boxes:

            uncompressed  TFile(lzo)    TFile(gzip)   RCFile(lzo2)
L3    size  133684504     19674398      11513958      18092681
      time  1'40          1'45          1'40          1'56
      size                                            18094161
      time                                            1'46
L3_1  size  3889095541    3697681875    2637742581    3675818160
      time  3'10          4'4           3'25          3'58
      size                3697666122                  3675816707
      time                3'10                        3'22
      size                3697674414
      time                3'5
L11   size  25878480      21368784      15233146      21112892
      time  1'52          1'52          1'57          1'59
      size                                            21112892
      time                                            1'59

A few observations are in order:

1) L3 has the highest compression ratio, while L3_1 and L11 have much lower compression ratios;
2) gzip compresses better than LZO2, at a small performance cost;
3) RCFile should have seen much better compression as it's a columnar store, but the actual difference is marginal. This is probably because of L11's unique values and many of L3_1's random values such as time stamps, plus the presence of map-typed columns. The conclusion from this observation is that compressing temporary intermediate data is not guaranteed to save disk space to a desired degree; it depends on the values of the temporary data being compressed. As a result, this feature should be made configurable;
4) The performance implications in these tests seem to be negligible: within background noise, or within a few percent of the overall run times. But this is not conclusive yet. Larger and more real-life queries would be more suitable for the comparison;
5) As noted above, RCFile has not shown a clear advantage in terms of a better columnar compression ratio. But this observation could be data-sensitive.
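As a worked check of observations 1) and 2), the compression ratios can be computed from the first-run sizes reported above. This is only an illustrative sketch: the class and method names are hypothetical, and it assumes the size figures are directly comparable (same unit) within each query.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CompressionRatios {
    /** Ratio of uncompressed to compressed size; larger means better compression. */
    public static double ratio(long uncompressedSize, long compressedSize) {
        return (double) uncompressedSize / compressedSize;
    }

    public static void main(String[] args) {
        // First-run sizes per query: {uncompressed, TFile(lzo), TFile(gzip), RCFile(lzo2)}
        Map<String, long[]> sizes = new LinkedHashMap<>();
        sizes.put("L3", new long[] {133684504L, 19674398L, 11513958L, 18092681L});
        sizes.put("L3_1", new long[] {3889095541L, 3697681875L, 2637742581L, 3675818160L});
        sizes.put("L11", new long[] {25878480L, 21368784L, 15233146L, 21112892L});

        String[] formats = {"TFile(lzo)", "TFile(gzip)", "RCFile(lzo2)"};
        for (Map.Entry<String, long[]> entry : sizes.entrySet()) {
            long[] s = entry.getValue();
            for (int i = 0; i < formats.length; i++) {
                // s[0] is the uncompressed size; s[i + 1] is the size for formats[i]
                System.out.printf("%-5s %-13s %.2fx%n",
                        entry.getKey(), formats[i], ratio(s[0], s[i + 1]));
            }
        }
    }
}
```

With these inputs, L3 compresses several-fold under every codec, whereas L3_1 with lzo shrinks by only a few percent, which is what motivates making the feature configurable.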
[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888972#action_12888972 ]

Alan Gates commented on PIG-1501:

Enabling compression directly on BinStorage as it is will be bad. bzip is splittable but very slow, and gzip isn't splittable. To do this we need to look at using SequenceFiles for moving data between MR jobs. We can have a null key and a value type of Tuple, and use SequenceFileInput/OutputFormat. This will enable us to use the block-level compression in sequence files. For now we can continue with the same serialization used in BinStorage, though in the future we may want to change this as well.
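The point about block-level compression can be illustrated without Hadoop. The sketch below is not the SequenceFile format and uses no Hadoop API; it only shows the underlying idea with java.util.zip: records are grouped into blocks, each block is compressed as an independent gzip stream, and so any block can be read without decompressing the ones before it. The class and method names are hypothetical, and records are assumed to be non-empty strings without newlines.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class BlockCompressionSketch {

    /** Compresses each group of recordsPerBlock records as its own gzip stream. */
    public static List<byte[]> compressInBlocks(List<String> records, int recordsPerBlock) {
        if (recordsPerBlock <= 0) {
            throw new IllegalArgumentException("recordsPerBlock must be positive");
        }
        List<byte[]> blocks = new ArrayList<>();
        for (int start = 0; start < records.size(); start += recordsPerBlock) {
            int end = Math.min(start + recordsPerBlock, records.size());
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
                for (String record : records.subList(start, end)) {
                    gz.write((record + "\n").getBytes(StandardCharsets.UTF_8));
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            blocks.add(buf.toByteArray()); // gzip stream is closed, so the block is complete
        }
        return blocks;
    }

    /** Decompresses a single block without touching any other block. */
    public static List<String> readBlock(byte[] block) {
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(block))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                plain.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> records = new ArrayList<>();
        for (String line : new String(plain.toByteArray(), StandardCharsets.UTF_8).split("\n")) {
            if (!line.isEmpty()) {
                records.add(line);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("r1", "r2", "r3", "r4", "r5");
        List<byte[]> blocks = compressInBlocks(records, 2);
        // A reader assigned only the second block can start there directly.
        System.out.println(readBlock(blocks.get(1)));
    }
}
```

A single gzip stream over the whole file offers no such entry points, which is why a gzip-compressed BinStorage file cannot be split across map tasks.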