[ 
https://issues.apache.org/jira/browse/PIG-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286769#comment-13286769
 ] 

Daniel Dai commented on PIG-2552:
---------------------------------

Sounds good.
                
> Better Property handling to deal with deprecation and variable substitution 
> of Hadoop config
> --------------------------------------------------------------------------------------------
>
>                 Key: PIG-2552
>                 URL: https://issues.apache.org/jira/browse/PIG-2552
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Daniel Dai
>            Assignee: Daniel Dai
>             Fix For: 0.11
>
>
> Traditionally, Pig handles Hadoop configuration through 
> PigContext.properties. The flow is:
> 1. Instantiate a Hadoop Configuration, read all entries, and save them into 
> PigContext.properties
> 2. Add system properties, pig.properties, and "set" commands in the script to 
> PigContext.properties
> 3. Every time we need to instantiate a Hadoop Configuration, iterate over 
> PigContext.properties and add each entry to the Hadoop Configuration
> This approach does not handle config options deprecated in Hadoop 0.23. E.g., 
> in Hadoop 0.23, "mapred.output.compression.codec" is replaced with 
> "mapreduce.output.fileoutputformat.compression.codec". mapred-default.xml 
> contains 
> "mapreduce.output.fileoutputformat.compression.codec=org.apache.hadoop.io.compress.DefaultCodec".
>  In a Pig script, the user may override it with "set 
> mapred.output.compression.codec 'org.apache.hadoop.io.compress.BZip2Codec'". 
> This is what happens:
> 1. Pig instantiates a Hadoop Configuration and puts 
> mapreduce.output.fileoutputformat.compression.codec=org.apache.hadoop.io.compress.DefaultCodec
>  into PigContext.properties
> 2. Pig adds 
> "mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec" to 
> PigContext.properties
> 3. When creating a Hadoop Configuration to submit the Hadoop job, Pig iterates 
> over PigContext.properties. It first sees 
> "mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec", 
> which the Hadoop Configuration translates into the new property 
> "mapreduce.output.fileoutputformat.compression.codec=org.apache.hadoop.io.compress.BZip2Codec". 
> This is correct up to this point. Then Pig sees 
> "mapreduce.output.fileoutputformat.compression.codec=org.apache.hadoop.io.compress.DefaultCodec" 
> and overwrites the previously correct entry.
> In PIG-2508, we addressed the issue by using a Configuration to handle system 
> properties, pig.properties, and the "set" command. The flow is:
> 1. Instantiate a Hadoop Configuration, add system properties and 
> pig.properties, then read all entries and save them into PigContext.properties
> 2. For every "set" command, instantiate a Hadoop Configuration from 
> PigContext.properties, set the property on the Configuration (which 
> translates the old option into the new one), then read all entries back into 
> PigContext.properties
> This works but is cumbersome when doing "set".
> In trunk, I want to use the following approach:
> 1. Write a subclass, PigProperties extends Properties, and use it as 
> PigContext.properties. The interface of PigContext remains the same
> 2. In PigProperties, we maintain a set of Hadoop properties, a set of system 
> properties, a set of pig.properties entries, and a set of "set" command 
> properties
> 3. Upon invoking PigContext.getProperties(), we instantiate a Hadoop 
> Configuration, apply all the property sets in sequence, then flatten the 
> result into a single combined Properties object
> 4. We can optimize to avoid recreating the combined properties every time 
> we call PigContext.getProperties()
> The benefits of this approach:
> 1. Handles deprecated Hadoop config options more cleanly
> 2. Separates the different layers of properties to ease debugging, and makes 
> it possible to show properties to the user at different levels
> 3. Makes it possible to add job-level properties in the future
> 4. Introduces no backward incompatibility
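
The layered approach proposed in the description can be sketched roughly as follows. This is a minimal, hypothetical illustration in plain Java, not the actual PigProperties implementation: the class name LayeredProperties, the Layer enum, and the DEPRECATED map are all assumptions made for the example, and the map only stands in for the old-key to new-key translation that Hadoop's Configuration performs. Each layer is kept separately and flattened in a fixed precedence order, so a "set" on a deprecated key always wins over the Hadoop default stored under the new key.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical sketch of layered property handling; not Pig's real code. */
class LayeredProperties {
    /** Layers in increasing precedence; later layers override earlier ones. */
    enum Layer { HADOOP, SYSTEM, PIG_PROPERTIES, SET_COMMAND }

    // Assumption: a stand-in for Hadoop's deprecation table (old key -> new key).
    static final Map<String, String> DEPRECATED = Map.of(
            "mapred.output.compression.codec",
            "mapreduce.output.fileoutputformat.compression.codec");

    // EnumMap iterates in enum declaration order, which gives the precedence.
    private final Map<Layer, Properties> layers = new EnumMap<>(Layer.class);

    /** Records a property in the given layer without touching other layers. */
    void put(Layer layer, String key, String value) {
        layers.computeIfAbsent(layer, l -> new Properties()).setProperty(key, value);
    }

    /** Applies the layers in order, translating deprecated keys to new keys. */
    Properties flatten() {
        Properties combined = new Properties();
        for (Properties layer : layers.values()) {
            for (String key : layer.stringPropertyNames()) {
                String canonical = DEPRECATED.getOrDefault(key, key);
                combined.setProperty(canonical, layer.getProperty(key));
            }
        }
        return combined;
    }

    public static void main(String[] args) {
        LayeredProperties props = new LayeredProperties();
        // Default as it would come from mapred-default.xml.
        props.put(Layer.HADOOP,
                "mapreduce.output.fileoutputformat.compression.codec",
                "org.apache.hadoop.io.compress.DefaultCodec");
        // User override in the script, using the deprecated key.
        props.put(Layer.SET_COMMAND,
                "mapred.output.compression.codec",
                "org.apache.hadoop.io.compress.BZip2Codec");
        // Prints org.apache.hadoop.io.compress.BZip2Codec
        System.out.println(props.flatten().getProperty(
                "mapreduce.output.fileoutputformat.compression.codec"));
    }
}
```

Because precedence comes from the layer rather than from iteration order over one flat Properties object, the overwrite described in the example above cannot occur. The sketch stores only the new key; it does not handle the case where both the old and new key appear within the same layer.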


        
