Sushanth Sowmyan created HCATALOG-577:
-----------------------------------------

             Summary: HCatContext causes persistence of undesired jobConf parameters
                 Key: HCATALOG-577
                 URL: https://issues.apache.org/jira/browse/HCATALOG-577
             Project: HCatalog
          Issue Type: Bug
    Affects Versions: 0.5
            Reporter: Sushanth Sowmyan
            Priority: Blocker
             Fix For: 0.5


I've found a fairly interesting bug while experimenting with an e2e test case.

Consider the following Pig query:

{code}
a = load 'studenttab10k' using org.apache.hcatalog.pig.HCatLoader();
b = foreach a generate name;
c = distinct b;
d = group c all;
e = foreach d generate $1 as a;
store e into 'pig_complex_6' using org.apache.hcatalog.pig.HCatStorer();
exec;
f = load 'pig_complex_6' using org.apache.hcatalog.pig.HCatLoader();
g = foreach f generate flatten(a);
{code}

With this query, we wind up grouping all the names into one array<string> in a 
single row.

Say the result was supposed to be:

–
{(bob king),(bob ovid),(bob polk)}
–

What we actually get is:
–
{(bob king)}
–

The interesting thing is that the problem appears only when "e" is written out 
using HCatStorer. If we instead store "e" using PigStorage and then, in a 
separate Pig job, load it and execute the rest of the script, it works.
On comparing the jobConfs of the two stores, one using HCatStorer and the other 
using PigStorage, the important difference we noticed was that the HCatStorer 
case has an extra key, mapreduce.combine.class, with the value 
"org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.DistinctCombiner$Combine".
Looking at that class, we see that it basically just picks the first entry from 
the bag in order to perform a "distinct" operation. Pig injected it into the 
previous load job, run through HCatLoader, because we perform a distinct 
operation on "b" to get "c". But since HCat stores JobConfs so that they are 
usable across multiple setLocation calls (and to cache things like tokens), we 
wind up carrying over the previous job's JobConf as well, so the distinct 
combiner is also applied to the HCatStorer output.
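
To make the effect on the output concrete, here is a minimal sketch in plain 
Java of what a "keep the first entry per key" combiner does to the grouped 
names. It is not Pig's DistinctCombiner code; the class name 
DistinctCombinerSketch and the method combineFirstPerKey are hypothetical, and 
the only assumption taken from the description above is that one entry 
survives per key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DistinctCombinerSketch {

    /**
     * Hypothetical stand-in for a "distinct"-style combiner: for each key,
     * only the first value seen is kept and later values are dropped.
     */
    public static Map<String, String> combineFirstPerKey(List<String[]> records) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String[] keyValue : records) {
            // putIfAbsent ignores every value after the first one for a key.
            out.putIfAbsent(keyValue[0], keyValue[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        // After "group c all", every name shares the single group key "all".
        List<String[]> records = new ArrayList<>();
        records.add(new String[] {"all", "bob king"});
        records.add(new String[] {"all", "bob ovid"});
        records.add(new String[] {"all", "bob polk"});

        // prints {all=bob king}
        System.out.println(combineFirstPerKey(records));
    }
}
```

Since all names share one key after the "group ... all", such a combiner 
leaves a single name in the bag, which matches the {(bob king)} result above.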

This is bad behaviour, and we need to clear out HCatContext.INSTANCE between 
Pig Loader / Storer executions.
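
The leak and the proposed clearing can be sketched in plain Java, using a Map 
as a stand-in for JobConf. Everything here is hypothetical and for 
illustration only: CachedContext is not the real HCatContext, and 
mergeAndGet, clear, loadJobConf, storeJobConf and the "example.output.table" 
key are invented names. Only the mapreduce.combine.class key and its value 
come from the report above.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextLeakSketch {

    public static final String COMBINER_KEY = "mapreduce.combine.class";

    /** Hypothetical stand-in for HCatContext: a process-wide settings cache. */
    public enum CachedContext {
        INSTANCE;

        private final Map<String, String> conf = new HashMap<>();

        /** Merges a job's settings into the cache and returns the merged view. */
        public Map<String, String> mergeAndGet(Map<String, String> jobConf) {
            conf.putAll(jobConf);
            return new HashMap<>(conf);
        }

        /** Drops everything cached from earlier jobs. */
        public void clear() {
            conf.clear();
        }
    }

    /** Settings of the load job, where Pig has injected the distinct combiner. */
    public static Map<String, String> loadJobConf() {
        Map<String, String> conf = new HashMap<>();
        conf.put(COMBINER_KEY,
            "org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.DistinctCombiner$Combine");
        return conf;
    }

    /** Settings of the store job, which never asked for a combiner. */
    public static Map<String, String> storeJobConf() {
        Map<String, String> conf = new HashMap<>();
        conf.put("example.output.table", "pig_complex_6"); // illustrative key only
        return conf;
    }

    public static void main(String[] args) {
        // Without clearing: the store job inherits the load job's combiner.
        CachedContext.INSTANCE.mergeAndGet(loadJobConf());
        Map<String, String> leaked = CachedContext.INSTANCE.mergeAndGet(storeJobConf());
        System.out.println(leaked.containsKey(COMBINER_KEY)); // prints true

        // Clearing between executions: the store job sees only its own settings.
        CachedContext.INSTANCE.clear();
        Map<String, String> clean = CachedContext.INSTANCE.mergeAndGet(storeJobConf());
        System.out.println(clean.containsKey(COMBINER_KEY)); // prints false
    }
}
```

A real fix would also have to keep whatever the cache legitimately needs 
across setLocation calls, such as tokens; this sketch does not model that.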


