[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902065#action_12902065
 ] 

Thejas M Nair commented on PIG-1501:
------------------------------------

Comments on the patch -
TFileStorage.java 
- The getSchema() code that determines the schema from the data is the same in 
TFileStorage and InterStorage. The code in BinStorage is also the same, except 
that it uses some deprecated functions. This code can be moved to a common util 
class. (Yes, I should have moved it to a util class when I created InterStorage.)
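
A minimal sketch of the shape such a refactoring could take, not the actual Pig code. All names here (SchemaFromDataSketch, inferFromFirstRecord, StorageASketch, StorageBSketch) are hypothetical, and inferring field types from the first record is an assumption made only to keep the example self-contained:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical shared helper: the only copy of the "infer schema from data" logic.
final class SchemaFromDataSketch {
    private SchemaFromDataSketch() {}

    // Returns one type name per field of the first record, or null if there is no data.
    // Assumption: looking at the first record is enough for this illustration.
    static List<String> inferFromFirstRecord(Iterator<List<Object>> records) {
        if (!records.hasNext()) {
            return null;
        }
        List<String> types = new ArrayList<>();
        for (Object field : records.next()) {
            types.add(field == null ? "unknown" : field.getClass().getSimpleName());
        }
        return types;
    }
}

// Each storage class keeps its own reader but delegates the inference to the helper.
class StorageASketch {
    List<String> getSchema(Iterator<List<Object>> records) {
        return SchemaFromDataSketch.inferFromFirstRecord(records);
    }
}

class StorageBSketch {
    List<String> getSchema(Iterator<List<Object>> records) {
        return SchemaFromDataSketch.inferFromFirstRecord(records);
    }
}
```

With this shape, a fix to the inference logic (for example, replacing deprecated calls) is made once instead of in each storage class.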

TestTmpFileCompression.java
- Both tests check whether TFile is being used. I think one test can be changed 
to check that InterStorage is used when compression is not turned on, or a 
check can be added to any existing test case that runs an MR job, to see 
that InterStorage is used there.
- The log setup code is duplicated between setup and resetLog(). It can be moved 
to a common function.
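
For the log setup point, a minimal sketch of one way to remove the duplication. It uses java.util.logging only so that it stands alone; the logging library used by the real test may differ, and the names LogCaptureSketch, installCapture, log and captured are hypothetical:

```java
import java.io.ByteArrayOutputStream;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;
import java.util.logging.StreamHandler;

// Hypothetical stand-in for the log handling in a test class.
class LogCaptureSketch {
    private final Logger logger = Logger.getLogger("sketch.tmpfile");
    private ByteArrayOutputStream buffer;
    private Handler handler;

    // Called from the test's setup method.
    void setUp() {
        installCapture();
    }

    // Called between test steps to start again from an empty log.
    void resetLog() {
        logger.removeHandler(handler);
        installCapture();
    }

    // The single place that wires a fresh in-memory handler to the logger.
    private void installCapture() {
        buffer = new ByteArrayOutputStream();
        handler = new StreamHandler(buffer, new SimpleFormatter());
        handler.setLevel(Level.ALL);
        logger.setLevel(Level.ALL);
        logger.setUseParentHandlers(false);
        logger.addHandler(handler);
    }

    // Writes a message through the logger under test.
    void log(String message) {
        logger.info(message);
    }

    // Returns everything logged since the last setUp() or resetLog().
    String captured() {
        handler.flush();
        return buffer.toString();
    }
}
```

Both setUp() and resetLog() go through installCapture(), so a change to the capture configuration only has to be made in one method.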

SampleOptimizer.java
- The following comment can be updated -
// check that it is using BinaryStorage.
to
// check that it is using the temp file storage format.


TFileRecordWriter.java
- The comment in the following section no longer seems to be valid -
{code}
 public TFileRecordWriter(Path file, String codec, Configuration conf)
+                    throws IOException {
+        // hardcoded to use gzip and 1M as block size: may wish to be made configurable
{code}




> need to investigate the impact of compression on pig performance
> ----------------------------------------------------------------
>
>                 Key: PIG-1501
>                 URL: https://issues.apache.org/jira/browse/PIG-1501
>             Project: Pig
>          Issue Type: Test
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
> PIG-1501.patch, PIG-1501.patch
>
>
> We would like to understand how compressing map results as well as 
> reducer output in a chain of MR jobs impacts performance. We can use PigMix 
> queries for this investigation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
