[ 
https://issues.apache.org/jira/browse/PIG-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Weise updated PIG-2508:
------------------------------

    Attachment: PIG-2508.4.patch

New patch for 0.9: before jobConf is used to set new values from the script, 
all current properties from pigContext are set on it first.

                
> PIG can unpredictably ignore deprecated Hadoop config options
> -------------------------------------------------------------
>
>                 Key: PIG-2508
>                 URL: https://issues.apache.org/jira/browse/PIG-2508
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.2, 0.10
>            Reporter: Anupam Seth
>            Assignee: Thomas Weise
>            Priority: Blocker
>             Fix For: 0.10, 0.9.3
>
>         Attachments: PIG-2508.3.patch, PIG-2508.4.patch, PIG-2508.patch
>
>
> When deprecated config options are passed to a Pig job, Pig can unpredictably 
> ignore them and override them with values provided in the defaults, due to a 
> "race condition"-like issue.
> This problem was first noticed as part of MAPREDUCE-3665, which was re-filed 
> as HADOOP-7993 so that it would fall under the component whose code was 
> being fixed. That JIRA fixed the bug on the Hadoop side that caused 
> deprecated config options to be ignored when they were also specified in the 
> defaults XML file under the newer config name, or vice versa.
> However, the problem seemed to persist with Pig jobs and HADOOP-8021 was 
> filed to address the issue. 
> A careful step-by-step execution of the code in a debugger reveals a second, 
> overlapping bug caused by the way Pig handles the configs.
> Not sure how / why this was not seen earlier, but the code in 
> HExecutionEngine.java#recomputeProperties currently mashes together the 
> default Hadoop configs and the user-specified properties into a Properties 
> object. Given that it uses a Hashtable to store the properties, if a config 
> called "old.config.name" is now deprecated and replaced by 
> "new.config.name", and one name is specified in the defaults and the other 
> by the user, we get a strange condition in which the repopulated Properties 
> object contains the following [in an unpredictable order]:
> {code}
> config1.name=config1.value
> config2.name=config2.value
> ...
> old.config.name=old.config.value
> ...
> new.config.name=new.config.value
> ...
> configx.name=configx.value
> {code}
> When this Properties object gets converted into a Configuration object by the 
> ConfigurationUtil#toConfiguration() routine, the deprecation handling kicks 
> in and tries to resolve all old configs. Because the ordering is not 
> guaranteed (and because in the case of compress, the hash function 
> consistently yields the new config loaded from the defaults after the old 
> one), the user-specified config is ignored in favor of the default config. 
> From the point of view of the Hadoop Configuration object, replacing an 
> earlier specification of a config value with a later one is expected, 
> standard behavior.
> The fix for this is probably straightforward, but will require a rewrite of 
> a chunk of code in HExecutionEngine.java. Instead of mashing together a 
> JobConf object and a Properties object into a Configuration object that is 
> finally re-converted into a JobConf object, the code simply needs to 
> consistently and correctly populate a JobConf / Configuration object that can 
> handle deprecation instead of a "dumb" Java Properties object.
> We recently saw another potential occurrence of this bug, where Pig seems to 
> honor only the mapreduce.job.queuename parameter for specifying the queue 
> name and ignores the mapred.job.queue.name parameter.
> Since this can break a lot of existing jobs that run fine on 0.20, marking 
> this as a blocker.
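
As a rough illustration of the ordering problem described in the report above, 
here is a minimal, self-contained Java sketch. DeprecationAwareConf is a 
hypothetical stand-in written for this example, not Hadoop's Configuration 
class, and the key names and values are placeholders in the style of the 
report. The sketch assumes that a deprecated key and its replacement share one 
slot and that the later set() call wins.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical stand-in for a deprecation-aware config (NOT Hadoop's Configuration).
class DeprecationAwareConf {
    private final Map<String, String> deprecated = new HashMap<>();
    private final Map<String, String> values = new HashMap<>();

    // Registers oldKey as a deprecated alias of newKey.
    void addDeprecation(String oldKey, String newKey) {
        deprecated.put(oldKey, newKey);
    }

    private String canonical(String key) {
        return deprecated.getOrDefault(key, key);
    }

    // Old and new names resolve to the same slot, so the later call wins.
    void set(String key, String value) {
        values.put(canonical(key), value);
    }

    String get(String key) {
        return values.get(canonical(key));
    }
}

class DeprecatedKeyDemo {
    static DeprecationAwareConf newConf() {
        DeprecationAwareConf conf = new DeprecationAwareConf();
        conf.addDeprecation("old.config.name", "new.config.name");
        return conf;
    }

    // Copies entries in whatever order Properties happens to iterate them.
    static void copyInto(DeprecationAwareConf conf, Properties props) {
        for (String name : props.stringPropertyNames()) {
            conf.set(name, props.getProperty(name));
        }
    }

    // Problematic path: merge everything into one Properties object first.
    // Which value survives depends on hash iteration order, not on precedence.
    static String mergeThroughProperties(Properties defaults, Properties user) {
        Properties merged = new Properties();
        merged.putAll(defaults);
        merged.putAll(user);
        DeprecationAwareConf conf = newConf();
        copyInto(conf, merged);
        return conf.get("new.config.name");
    }

    // Layered path: defaults first, user settings second, so the user always wins.
    static String applyInLayers(Properties defaults, Properties user) {
        DeprecationAwareConf conf = newConf();
        copyInto(conf, defaults);
        copyInto(conf, user);
        return conf.get("new.config.name");
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        defaults.setProperty("new.config.name", "default.value");
        Properties user = new Properties();
        user.setProperty("old.config.name", "user.value");

        // Result here is either value, depending on iteration order.
        System.out.println("merged:  " + mergeThroughProperties(defaults, user));
        // Result here is always the user's value.
        System.out.println("layered: " + applyInLayers(defaults, user));
    }
}
```

Applying the defaults first and the user's settings second makes the outcome 
independent of hash ordering, which is the general direction the report 
proposes; the actual patch may differ in detail.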

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
