[ 
https://issues.apache.org/jira/browse/HCATALOG-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated HCATALOG-577:
--------------------------------

    Assignee: Sushanth Sowmyan
    
> HCatContext causes persistence of undesired jobConf parameters
> --------------------------------------------------------------
>
>                 Key: HCATALOG-577
>                 URL: https://issues.apache.org/jira/browse/HCATALOG-577
>             Project: HCatalog
>          Issue Type: Bug
>    Affects Versions: 0.5
>            Reporter: Sushanth Sowmyan
>            Assignee: Sushanth Sowmyan
>            Priority: Blocker
>             Fix For: 0.5
>
>         Attachments: HCAT-577.patch
>
>
> I've found a fairly interesting bug while experimenting with an e2e test case.
> Consider the following Pig query:
> {code}
> a = load 'studenttab10k' using org.apache.hcatalog.pig.HCatLoader();
> b = foreach a generate name;
> c = distinct b;
> d = group c all;
> e = foreach d generate $1 as a;
> store e into 'pig_complex_6' using org.apache.hcatalog.pig.HCatStorer();
> exec;
> f = load 'pig_complex_6' using org.apache.hcatalog.pig.HCatLoader();
> g = foreach f generate flatten(a);
> {code}
> This query groups all the distinct names into a single row holding one 
> array<string>.
> Suppose the expected result is:
> –
> {(bob king),(bob ovid),(bob polk)}
> –
> what we actually get is:
> –
> {(bob king)}
> –
> The interesting part is that the problem appears only when "e" is written 
> out using HCatStorer. If we instead store "e" using PigStorage and then, in 
> a separate Pig job, load it and execute the rest, it works.
> Comparing the jobConfs of the two stores, one using HCatStorer and the other 
> PigStorage, the important difference is that the HCatStorer case has an 
> extra key, mapreduce.combine.class, with value 
> "org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.DistinctCombiner$Combine"
> That combiner basically just picks the first entry from the bag in order to 
> perform a "distinct" operation. Pig injected it into the previous load job 
> done by HCatLoader, because we perform a distinct operation on "b" to get 
> "c". But since HCat stores JobConfs so that they are usable across multiple 
> setLocation calls (and to cache things like tokens), we wind up with the 
> previous job's JobConf as well, so the distinct combiner is also applied to 
> the HCatStorer output.
> This is bad behaviour, and we need to clear out HCatContext.INSTANCE between 
> Pig Loader / Storer executions.
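
The leak described above can be illustrated with a minimal, self-contained Java sketch. It does not use the real HCatalog or Pig classes: CachedContext, LeakDemo, mergeAndGet and store are hypothetical names invented for this illustration, and the "combiner" is reduced to keeping the first entry of the bag, following the description in the report. The sketch only shows how a configuration cached in a process-wide singleton carries a key from one job into the next, and how clearing the singleton between jobs avoids that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a process-wide context that caches job configuration. */
final class CachedContext {
    static final CachedContext INSTANCE = new CachedContext();
    private final Map<String, String> conf = new HashMap<>();

    private CachedContext() {}

    /** Merges a job's settings into the cache and returns a copy of the merged view. */
    Map<String, String> mergeAndGet(Map<String, String> jobConf) {
        conf.putAll(jobConf);
        return new HashMap<>(conf);
    }

    /** Drops everything cached so far. */
    void clear() {
        conf.clear();
    }
}

/** Hypothetical two-job scenario: a load with distinct, followed by a store. */
final class LeakDemo {
    static final String COMBINER_KEY = "mapreduce.combine.class";

    /** Simplified store: if a combiner is configured, only the first bag entry survives. */
    static List<String> store(Map<String, String> effectiveConf, List<String> bag) {
        if (effectiveConf.containsKey(COMBINER_KEY) && !bag.isEmpty()) {
            return new ArrayList<>(bag.subList(0, 1));
        }
        return new ArrayList<>(bag);
    }

    static List<String> run(boolean clearBetweenJobs) {
        CachedContext.INSTANCE.clear(); // start each demo run from a clean cache

        // Job 1: the load job, where the distinct step sets a combiner.
        Map<String, String> loadConf = new HashMap<>();
        loadConf.put(COMBINER_KEY, "DistinctCombiner$Combine");
        CachedContext.INSTANCE.mergeAndGet(loadConf);

        if (clearBetweenJobs) {
            CachedContext.INSTANCE.clear(); // the fix proposed in the report
        }

        // Job 2: the store job, whose own configuration sets no combiner.
        Map<String, String> effective = CachedContext.INSTANCE.mergeAndGet(new HashMap<>());
        return store(effective, Arrays.asList("bob king", "bob ovid", "bob polk"));
    }
}
```

Calling LeakDemo.run(false) leaves only the first name in the stored bag, because the combiner key set by the first job is still in the cache; LeakDemo.run(true) stores all three names.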
