[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-31 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904848#action_12904848
 ] 

Olga Natkovich commented on PIG-1501:
-

Ashutosh,

It is off by default because the default compression codec is gzip, which is
really slow and most of the time not what you want. Because of the licensing
issue with lzo, users need to set it up on their own. Once they have done so,
they can enable compression.
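
As a rough illustration of what enabling it might look like once lzo is installed, the sketch below builds the two settings in Java. The property names pig.tmpfilecompression and pig.tmpfilecompression.codec come from the Pig documentation for later releases, not from this thread, and the helper name tmpFileCompression is hypothetical, so verify both against your Pig version.

```java
import java.util.Properties;

class TmpFileCompressionConfig {
    // Builds the settings that turn on compression of temporary files.
    // Assumption: property names are taken from later Pig documentation,
    // not from this thread.
    static Properties tmpFileCompression(String codec) {
        Properties props = new Properties();
        props.setProperty("pig.tmpfilecompression", "true");
        props.setProperty("pig.tmpfilecompression.codec", codec);
        return props;
    }

    public static void main(String[] args) {
        // "lzo" needs the separately installed lzo libraries; "gz" selects gzip.
        Properties props = tmpFileCompression("lzo");
        for (String name : props.stringPropertyNames()) {
            System.out.println(name + "=" + props.getProperty(name));
        }
    }
}
```

The printed key=value lines are in the form a properties file would use; leaving the first property unset keeps the default of no compression.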

 need to investigate the impact of compression on pig performance
 

 Key: PIG-1501
 URL: https://issues.apache.org/jira/browse/PIG-1501
 Project: Pig
  Issue Type: Test
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
 PIG-1501.patch, PIG-1501.patch, PIG-1501.patch


 We would like to understand how compressing map results as well as 
 reducer output in a chain of MR jobs impacts performance. We can use PigMix 
 queries for this investigation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-25 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902484#action_12902484
 ] 

Thejas M Nair commented on PIG-1501:


+1




RE: [jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-25 Thread Yan Zhou
Thanks for the quick turnaround, Thejas.

Yan

-Original Message-
From: Thejas M Nair (JIRA) [mailto:j...@apache.org] 
Sent: Wednesday, August 25, 2010 8:54 AM
To: pig-dev@hadoop.apache.org
Subject: [jira] Commented: (PIG-1501) need to investigate the impact of 
compression on pig performance


[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902484#action_12902484
 ] 

Thejas M Nair commented on PIG-1501:


+1




[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-24 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902065#action_12902065
 ] 

Thejas M Nair commented on PIG-1501:


Comments on the patch -
TFileStorage.java 
- The getSchema() code that determines the schema from data is the same in
TFileStorage and InterStorage. The code in BinStorage is also the same, except
that it uses some deprecated functions. That can be moved to a common util
class. (Yes, I should have moved it to a util class when I created InterStorage.)

TestTmpFileCompression.java
- Both tests check whether TFile is used. I think one test can be changed to
check that InterStorage is used when compression is not turned on, or a check
can be added to any existing test case that runs an MR job to see whether
InterStorage is used there.
- The log setup code is duplicated between setup and resetLog(). It can be
moved to a common function.

SampleOptimizer.java
- The following comment can be updated -
// check that it is using BinaryStorage.
to
// check that it is using the temp file storage format.


TFileRecordWriter.java
- The comment in the following section does not seem to be valid anymore -
{code}
 public TFileRecordWriter(Path file, String codec, Configuration conf)
+throws IOException {
+// hardcoded to use gzip and 1M as block size: may wish to be made configurable
{code}







[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-20 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900950#action_12900950
 ] 

Yan Zhou commented on PIG-1501:
---

The internal Hudson results are as follows:

 [exec] -1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 9 new or 
modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec]
 [exec] -1 javac.  The applied patch generated 162 javac compiler 
warnings (more than the trunk's current 156 warnings).
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec]
 [exec] -1 release audit.  The applied patch generated 427 release 
audit warnings (more than the trunk's current 425 warnings).

The 6 javac warnings are from the use of the deprecated PigMapReduce.sJobConf
field. But that deprecation is intended for external use only, and internal
use should be ok.

The 2 release audit warnings are on two html files, SampleOptimizer.html and 
org.apache.pig.impl.util.Utils.html.




[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-11 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897455#action_12897455
 ] 

Thejas M Nair commented on PIG-1501:


Why was TFile chosen over SequenceFile? I am wondering whether the additional
unused features of TFile (index, metadata) result in any overhead compared to
SequenceFile.





[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-11 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897496#action_12897496
 ] 

Yan Zhou commented on PIG-1501:
---

Please refer to HADOOP-3315 for an overall SequenceFile vs. TFile comparison.
It appears that, for compressed data, TFile performs better than SequenceFile.




[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-10 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896993#action_12896993
 ] 

Alan Gates commented on PIG-1501:
-

It's not surprising that RCFile performs badly here, since in every case every 
column in the row is used.  This is known to be a bad use case for columnar 
storage.  While for some data sets the better compression may overcome this, I 
suspect that in the general case the stitching costs will overwhelm any 
compression wins (as shown here).

I'm +1 on going with lzo/TFile. As the lzo libs are GPL, we cannot ship with
that as the default. It wasn't clear to me from your last comment which you
were proposing as the default.




[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-10 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897005#action_12897005
 ] 

Yan Zhou commented on PIG-1501:
---

The default is *not* to compress the intermediate data, which is the existing
behavior.

For RCFile, the compression ratio is just a bit better than for TFile. In
terms of performance, the difference is within background noise. Stitching
costs should be minimal. Actually, full projection is the biggest advantage
of RCFile over other columnar storage like Zebra. I was surprised to see that
the compression improvement over TFile is marginal. The only cause I can think
of is that the compression ratio is too sensitive to the data to pre-determine
or even pre-estimate.
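
To make the data-sensitivity point concrete, here is a small sketch that gzips two inputs of equal size, one highly repetitive and one pseudo-random, and compares the ratios. It uses only the JDK's java.util.zip (not lzo, TFile or RCFile), and the class and method names are made up for this illustration, so it shows the general effect rather than anything measured in these tests.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

class GzipRatioDemo {
    // Returns uncompressed size divided by gzip-compressed size.
    static double ratio(byte[] data) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return (double) data.length / buf.size();
    }

    // Highly repetitive input: the same byte repeated n times.
    static byte[] repetitive(int n) {
        byte[] data = new byte[n];
        Arrays.fill(data, (byte) 'a');
        return data;
    }

    // Pseudo-random input, standing in for unique or timestamp-like values.
    static byte[] random(int n, long seed) {
        byte[] data = new byte[n];
        new Random(seed).nextBytes(data);
        return data;
    }

    public static void main(String[] args) {
        int n = 1 << 20; // 1 MiB of input in both cases
        System.out.printf("repetitive: %.2f%n", ratio(repetitive(n)));
        System.out.printf("random:     %.2f%n", ratio(random(n, 42L)));
    }
}
```

The repetitive input shrinks by orders of magnitude while the random input does not shrink at all, which is why the achievable ratio cannot be known without looking at the data.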

lzo is under GPL. But it appears that the Hadoop installation has it, at least
on my test cluster.




[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-10 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897046#action_12897046
 ] 

Alan Gates commented on PIG-1501:
-

You can install lzo with Hadoop (as Yahoo does on its grids) but you cannot 
ship lzo with Hadoop or Pig.




[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-09 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896620#action_12896620
 ] 

Yan Zhou commented on PIG-1501:
---

Unless any objection is raised in the coming week, I'll go with LZO
compression on TFile, with compression disabled by default, which preserves
the old behavior.




[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-07-29 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893746#action_12893746
 ] 

Yan Zhou commented on PIG-1501:
---

gzip and lzo2 were tried as the compression codecs; TFile and RCFile were used
as storage formats. The tests are PigMix's L3 and L11, plus a variation of L3
with full projection, hereafter referred to as L3_1, intended to expand the
temporary data size. (In some cases multiple runs were executed, particularly
where system fluctuations were suspected.) End-to-end elapsed times were
recorded.

The results are from a 15-node cluster of 2 x Xeon L5420 2.50GHz/16G RAM boxes:

             uncompressed    TFile(lzo)   TFile(gzip)  RCFile(lzo2)
L3     size     133684504      19674398      11513958      18092681
       time          1'40          1'45          1'40          1'56
       size                                                18094161
       time                                                    1'46

L3_1   size    3889095541    3697681875    2637742581    3675818160
       time          3'10           4'4          3'25          3'58
       size                  3697666122                  3675816707
       time                        3'10                        3'22
       size                  3697674414
       time                         3'5

L11    size      25878480      21368784      15233146      21112892
       time          1'52          1'52          1'57          1'59
       size                                                21112892
       time                                                    1'59

A few observations are in order:

1) L3 has the highest compression ratio, while L3_1 and L11 have much lower
compression ratios;
2) gzip compresses better than LZO2, at a small performance cost;
3) RCFile should have seen much better compression, as it is a columnar store.
But the actual difference is marginal. This is probably because of L11's unique
values and many of L3_1's random values such as timestamps, plus the presence
of map-typed columns. The conclusion from this observation is that compression
of temporary intermediate data is not guaranteed to save disk space to the
desired degree; it depends on the values being compressed. As a result, this
feature should be made configurable;
4) The performance implications from these tests seem to be negligible: within
background noise or within a few percent of the overall run times. But this is
not conclusive yet. Larger and more realistic queries would be more suitable
for the comparison;
5) RCFile, as noted above, has not shown a clear advantage in columnar
compression ratio. But this observation could be data-sensitive.
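
For readers who want the ratios behind these observations, the sketch below divides the uncompressed size by each compressed size, using the first-run sizes as read from the results above. The units are not stated in this comment, and the class and method names are only illustrative.

```java
class CompressionRatios {
    // Ratio of uncompressed to compressed size; larger means better compression.
    static double ratio(long uncompressed, long compressed) {
        return (double) uncompressed / compressed;
    }

    public static void main(String[] args) {
        String[] queries = {"L3", "L3_1", "L11"};
        String[] formats = {"TFile(lzo)", "TFile(gzip)", "RCFile(lzo2)"};
        // First-run sizes as read from the results in this comment:
        // {uncompressed, TFile(lzo), TFile(gzip), RCFile(lzo2)}
        long[][] sizes = {
            {133684504L, 19674398L, 11513958L, 18092681L},
            {3889095541L, 3697681875L, 2637742581L, 3675818160L},
            {25878480L, 21368784L, 15233146L, 21112892L},
        };
        for (int q = 0; q < queries.length; q++) {
            for (int f = 0; f < formats.length; f++) {
                System.out.printf("%-5s %-13s %.2f%n", queries[q], formats[f],
                        ratio(sizes[q][0], sizes[q][f + 1]));
            }
        }
    }
}
```

On these figures L3 compresses several-fold under every codec, while L3_1 and L11 stay below 2 in every case, which matches observation 1.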




[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-07-15 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888972#action_12888972
 ] 

Alan Gates commented on PIG-1501:
-

Enabling compression directly on BinStorage as is will be bad.  bzip is 
splittable but very slow, and gzip isn't splittable.

To do this we need to look at using SequenceFiles for moving data between MR
jobs. We can use a null key and a value type of Tuple with
SequenceFileInput/OutputFormat. This will let us use the block-level
compression in sequence files. For now we can continue with the same
serialization used in BinStorage, though in the future we may want to change
this as well.
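
The reason block-level compression helps can be sketched with plain JDK classes: if records are compressed in independent blocks, a reader can decode any one block without the others, which is what a single gzip stream over the whole file does not allow. This is only a conceptual illustration; it is not the SequenceFile on-disk format, and the class and method names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class BlockCompressionSketch {
    // Compresses records in independent gzip blocks of at most recordsPerBlock records.
    static List<byte[]> writeBlocks(List<String> records, int recordsPerBlock) {
        List<byte[]> blocks = new ArrayList<>();
        for (int start = 0; start < records.size(); start += recordsPerBlock) {
            int end = Math.min(start + recordsPerBlock, records.size());
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(buf))) {
                out.writeInt(end - start); // record count for this block
                for (String record : records.subList(start, end)) {
                    out.writeUTF(record);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            blocks.add(buf.toByteArray());
        }
        return blocks;
    }

    // Decompresses one block on its own, without reading any other block.
    static List<String> readBlock(byte[] block) {
        List<String> records = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(block)))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                records.add(in.readUTF());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            records.add("tuple-" + i);
        }
        List<byte[]> blocks = writeBlocks(records, 4);
        // A reader assigned only the last block can decode it directly.
        System.out.println(readBlock(blocks.get(blocks.size() - 1)));
    }
}
```

Larger blocks generally compress better but make the unit of independent reading coarser, so the block size is a trade-off.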
