[jira] [Updated] (PIG-2552) Better Property handling to deal with deprecation and variable substitution of Hadoop config

Daniel Dai (Updated) (JIRA) Fri, 24 Feb 2012 14:58:20 -0800

     [ https://issues.apache.org/jira/browse/PIG-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Daniel Dai updated PIG-2552:
----------------------------

    Description: 
Traditionally, Pig handles Hadoop configuration through PigContext.properties. 
The flow is:
1. Instantiate a Hadoop Configuration, read all of its entries, and save them 
into PigContext.properties
2. Add system properties, pig.properties, and "set" commands from the script to 
PigContext.properties
3. Every time we need to instantiate a Hadoop Configuration, iterate over 
PigContext.properties and add each entry to the Hadoop Configuration

This approach does not handle config options deprecated in Hadoop 23. E.g., in 
Hadoop 23, "mapred.output.compression.codec" is replaced with 
"mapreduce.output.fileoutputformat.compression.codec". mapred-default.xml 
contains 
"mapreduce.output.fileoutputformat.compression.codec=org.apache.hadoop.io.compress.DefaultCodec".
In a Pig script, the user may override it with "set mapred.output.compression.codec 
'org.apache.hadoop.io.compress.BZip2Codec'". This is what happens:
1. Pig instantiates a Hadoop Configuration and puts 
mapreduce.output.fileoutputformat.compression.codec=org.apache.hadoop.io.compress.DefaultCodec
 into PigContext.properties
2. Pig adds 
"mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec" to 
PigContext.properties
3. When creating a Hadoop Configuration to submit the Hadoop job, Pig iterates 
over PigContext.properties. It first sees 
"mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec", 
which Hadoop Configuration translates into the new property 
"mapreduce.output.fileoutputformat.compression.codec=org.apache.hadoop.io.compress.BZip2Codec".
This is correct up to this point. Then Pig sees 
"mapreduce.output.fileoutputformat.compression.codec=org.apache.hadoop.io.compress.DefaultCodec"
and overwrites the previous, correct entry, so the user's setting is lost.
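
The overwrite in step 3 can be reproduced with a minimal sketch. MiniConf 
below is a hypothetical stand-in for Hadoop's Configuration that models only 
the old-key to new-key translation; it is not the real class. A LinkedHashMap 
is used so the iteration order from the description is deterministic, which 
is also an assumption made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for Hadoop's Configuration: it models only the
// translation of one deprecated key to its replacement.
class MiniConf {
    static final String OLD_KEY = "mapred.output.compression.codec";
    static final String NEW_KEY = "mapreduce.output.fileoutputformat.compression.codec";

    private final Map<String, String> values = new LinkedHashMap<>();

    // A value set under the deprecated key is stored under the new key.
    void set(String key, String value) {
        values.put(OLD_KEY.equals(key) ? NEW_KEY : key, value);
    }

    String get(String key) {
        return values.get(OLD_KEY.equals(key) ? NEW_KEY : key);
    }
}

class OverwriteDemo {
    // Replays a flat property map into a fresh MiniConf, in iteration order,
    // the way step 3 replays PigContext.properties.
    static MiniConf replay(Map<String, String> flat) {
        MiniConf conf = new MiniConf();
        for (Map.Entry<String, String> e : flat.entrySet()) {
            conf.set(e.getKey(), e.getValue());
        }
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> flat = new LinkedHashMap<>();
        // Order chosen to match the description: the user's old key is seen first.
        flat.put(MiniConf.OLD_KEY, "org.apache.hadoop.io.compress.BZip2Codec");
        flat.put(MiniConf.NEW_KEY, "org.apache.hadoop.io.compress.DefaultCodec");

        // The default, replayed second, wins over the user's setting.
        // prints org.apache.hadoop.io.compress.DefaultCodec
        System.out.println(replay(flat).get(MiniConf.NEW_KEY));
    }
}
```

With the two entries in the opposite order the user's value would survive, 
which is why the outcome depends on iteration order rather than on intent.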

In PIG-2508, we addressed the issue by using a Configuration to handle system 
properties, pig.properties, and the "set" command. The flow is:
1. Instantiate a Hadoop Configuration, add system properties and 
pig.properties, then read all entries and save them into PigContext.properties
2. For every "set" command, instantiate a Hadoop Configuration from 
PigContext.properties, set the property on the Configuration (Configuration 
translates the old option into the new option), then read all entries back into 
PigContext.properties

This works, but it is cumbersome for "set": every "set" command rebuilds a 
Configuration and copies all entries back.
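
A rough sketch of the per-"set" round trip follows. The canonical() helper 
and the TreeMap are hypothetical stand-ins for Hadoop's Configuration and 
its deprecation handling; RoundTripSet and applySet are illustrative names, 
not code from PIG-2508.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

class RoundTripSet {
    static final String OLD_KEY = "mapred.output.compression.codec";
    static final String NEW_KEY = "mapreduce.output.fileoutputformat.compression.codec";

    // Hypothetical stand-in for Configuration's translation of a deprecated key.
    static String canonical(String key) {
        return OLD_KEY.equals(key) ? NEW_KEY : key;
    }

    // Applies one "set" command: rebuild a configuration from the properties,
    // set the new value on it, then copy every entry back.
    static void applySet(Properties props, String key, String value) {
        Map<String, String> conf = new TreeMap<>(); // stand-in for a new Configuration
        for (String name : props.stringPropertyNames()) {
            conf.put(canonical(name), props.getProperty(name));
        }
        conf.put(canonical(key), value);            // old option becomes new option
        props.putAll(conf);                         // read all entries back
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(NEW_KEY, "org.apache.hadoop.io.compress.DefaultCodec");

        applySet(props, OLD_KEY, "org.apache.hadoop.io.compress.BZip2Codec");

        // prints org.apache.hadoop.io.compress.BZip2Codec
        System.out.println(props.getProperty(NEW_KEY));
    }
}
```

Each call copies the whole property set twice, which is the overhead that 
makes this flow awkward when a script issues many "set" commands.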

In trunk, I want to use the following approach:
1. Write a subclass, PigProperties extends Properties, and use it as 
PigContext.properties. The interface of PigContext remains the same
2. In PigProperties, maintain separate sets of properties: Hadoop properties, 
system properties, pig.properties, and "set" command properties
3. Upon invoking PigContext.getProperties(), instantiate a Hadoop 
Configuration, apply all the property sets in sequence, then flatten the 
result into a single combined set of properties
4. As an optimization, avoid recreating the combined properties on every call 
to PigContext.getProperties()
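
The layering and caching idea can be sketched as below. This is an 
illustration under assumptions, not the proposed PigProperties class: it 
uses composition instead of subclassing Properties to stay short, the names 
LayeredProps, Layer, put and flatten are hypothetical, and canonical() is a 
stand-in for the translation that a real Hadoop Configuration would perform.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Properties;

class LayeredProps {
    // Layers in increasing priority; later layers override earlier ones.
    enum Layer { HADOOP, SYSTEM, PIG_PROPERTIES, SET_COMMAND }

    static final String OLD_KEY = "mapred.output.compression.codec";
    static final String NEW_KEY = "mapreduce.output.fileoutputformat.compression.codec";

    private final Map<Layer, Properties> layers = new EnumMap<>(Layer.class);
    private Properties combined; // cached flattened view, null when stale

    LayeredProps() {
        for (Layer l : Layer.values()) {
            layers.put(l, new Properties());
        }
    }

    // Hypothetical stand-in for Configuration's deprecated-key translation.
    static String canonical(String key) {
        return OLD_KEY.equals(key) ? NEW_KEY : key;
    }

    void put(Layer layer, String key, String value) {
        layers.get(layer).setProperty(key, value);
        combined = null; // invalidate the cache on any change
    }

    // Applies the layers in priority order and returns the combined view.
    Properties flatten() {
        if (combined == null) {
            Properties out = new Properties();
            // EnumMap iterates in enum declaration order, i.e. lowest priority first.
            for (Properties layer : layers.values()) {
                for (String name : layer.stringPropertyNames()) {
                    out.setProperty(canonical(name), layer.getProperty(name));
                }
            }
            combined = out;
        }
        return combined;
    }

    public static void main(String[] args) {
        LayeredProps props = new LayeredProps();
        props.put(Layer.HADOOP, NEW_KEY, "org.apache.hadoop.io.compress.DefaultCodec");
        props.put(Layer.SET_COMMAND, OLD_KEY, "org.apache.hadoop.io.compress.BZip2Codec");

        // prints org.apache.hadoop.io.compress.BZip2Codec
        System.out.println(props.flatten().getProperty(NEW_KEY));
    }
}
```

Because the layers are replayed in a fixed priority order, the user's "set" 
value wins regardless of which key name it was written under. If one layer 
contains both the old and the new key, the order within that layer is still 
unspecified in this sketch.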

The benefits of this approach:
1. Handles deprecated Hadoop config options more cleanly
2. Separates the different layers of properties, which eases debugging and 
makes it possible to show properties to the user at different levels
3. Makes it possible to add job-level properties in the future
4. Introduces no backward incompatibility

  was:
Traditionally Pig handles hadoop configuration using PigContext.properties, the 
flow is:
1. Instantiate Hadoop Configuration, read all entries and save into 
PigContext.properties
2. adding system properties, pig.properties and "set" command in script into 
PigContext.properties
3. Every time we need to instantiate a Hadoop Configuration, we iterate 
PigContext.properties and add to Hadoop Configuration

This approach does not deal with hadoop 23 deprecated config option. Eg, in 
hadoop 23, "mapred.output.compression.codec" is replaced with 
"mapreduce.output.fileoutputformat.compression.codec". mapred-default.xml 
contains 
"mapreduce.output.fileoutputformat.compression.codec=org.apache.hadoop.io.compress.DefaultCodec".
 In Pig script, user may override it with "set mapred.output.compression.codec 
'org.apache.hadoop.io.compress.BZip2Codec'". This is what happen:
1. Pig instantiate Hadoop Configuration, and put 
mapreduce.output.fileoutputformat.compression.codec=org.apache.hadoop.io.compress.DefaultCodec
 into PigContext.properties
2. Adding 
"mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec" to 
PigContext.properties
3. When creating Hadoop Configuration to submit Hadoop job, Pig iterate 
PigContext.properties, it first see 
"mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec", 
Hadoop Configuration translate it into the new property 
"mapreduce.output.fileoutputformat.compression.codec=org.apache.hadoop.io.compress.BZip2Codec",
 which is right until this point. Then Pig see 
"mapreduce.output.fileoutputformat.compression.codec=org.apache.hadoop.io.compress.DefaultCodec",
 and overwrite the previous right entry. 

In PIG-2508, we address the issue by using a Configuration to handle system 
properties, pig.properties and "set" command, the flow is:
1. Instantiate Hadoop Configuration, adding system properties, pig.properties, 
then read all entries and save into PigContext.properties
2. For every set command, instantiate Hadoop Configuration with 
PigContext.properties, set the property to Configuration (Configuration 
translate the old option into new option), then read all entries back into 
PigContext.properies

This works but is cumbersome when doing "set".

In trunk, I want to use the following approach:
1. Write a subclass PigProperties extends Properties, and use it as 
PigContext.properties. The interface for PigContext remains the same
2. In PigProperties, we maintain a set of hadoop properties, a set of system 
properties, a set of pig.properties and a set of "set" command properties
3. Upon invoking PigContext.getProperties(), we instantiate hadoop 
configuration, put all properties in a sequence, then flatten it into a 
combined properties
4. We can do optimization to avoid recreating combined properties every time we 
call PigContext.getProperties()

The benefit for this approach:
1. Solve deprecate Hadoop config in a more clear way
2. Separate different layer of properties to ease the debugging, also provide 
potential to show properties to the user at different level
3. No backward incompatibility introduced

    
> Better Property handling to deal with deprecation and variable substitution 
> of Hadoop config
> --------------------------------------------------------------------------------------------
>
>                 Key: PIG-2552
>                 URL: https://issues.apache.org/jira/browse/PIG-2552
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Daniel Dai
>            Assignee: Daniel Dai
>             Fix For: 0.11
