Sushanth Sowmyan created HCATALOG-577:
-----------------------------------------
Summary: HCatContext causes persistence of undesired jobConf parameters
Key: HCATALOG-577
URL: https://issues.apache.org/jira/browse/HCATALOG-577
Project: HCatalog
Issue Type: Bug
Affects Versions: 0.5
Reporter: Sushanth Sowmyan
Priority: Blocker
Fix For: 0.5
I've found a fairly interesting bug while experimenting with an e2e test case.
Consider the following Pig query:
{code}
a = load 'studenttab10k' using org.apache.hcatalog.pig.HCatLoader();
b = foreach a generate name;
c = distinct b;
d = group c all;
e = foreach d generate $1 as a;
store e into 'pig_complex_6' using org.apache.hcatalog.pig.HCatStorer();
exec;
f = load 'pig_complex_6' using org.apache.hcatalog.pig.HCatLoader();
g = foreach f generate flatten(a);
{code}
With this query, we wind up grouping all the names into a single array<string>
in one row.
Say the result was supposed to be:
–
{(bob king),(bob ovid),(bob polk)}
–
what we actually get is:
–
{(bob king)}
–
The interesting thing is that the problem appears only when "e" is written out
using HCatStorer. If we instead store "e" using PigStorage and then, in a
separate Pig job, load it and execute the rest, it works.
On comparing the jobConfs of the two stores, one using HCatStorer and the other
using PigStorage, the important difference we noticed was that the HCatStorer
case has an extra key, mapreduce.combine.class, with the value
"org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.DistinctCombiner$Combine"
Looking at that class, we see that it basically just picks the first entry from
the bag in order to perform a "distinct" operation. Pig injected it into the
previous load job run by HCatLoader, because we perform a distinct operation on
"b" to get "c". However, since HCat stores JobConfs so that they are usable
across multiple setLocation calls (and to cache things like tokens), the store
job winds up with the previous job's JobConf settings as well. As a result, the
distinct combiner is applied to the HCatStorer output too.
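To make the effect concrete, here is a minimal sketch of why a "keep the first value per key" combiner is harmless for a distinct job but destructive for the group-all job. It is plain Java, not Pig's actual DistinctCombiner code, and the class and method names (FirstValueCombinerDemo, combineFirstPerKey) are made up for illustration. The assumption is that the distinct job is keyed by the whole tuple, while "group c all" puts every name under one key.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: models a combiner that keeps the first value per key.
class FirstValueCombinerDemo {

    // Each pair is {key, value}; only the first value seen for a key survives.
    static Map<String, String> combineFirstPerKey(List<String[]> pairs) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String[] kv : pairs) {
            out.putIfAbsent(kv[0], kv[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Distinct job (assumed keyed by the tuple itself): duplicates collapse, which is intended.
        List<String[]> distinctInput = Arrays.asList(
                new String[] {"bob king", ""},
                new String[] {"bob king", ""},
                new String[] {"bob ovid", ""});
        System.out.println(combineFirstPerKey(distinctInput).keySet());

        // Group-all job: every name shares the key "all", so only the first name survives.
        List<String[]> groupAllInput = Arrays.asList(
                new String[] {"all", "bob king"},
                new String[] {"all", "bob ovid"},
                new String[] {"all", "bob polk"});
        System.out.println(combineFirstPerKey(groupAllInput));
    }
}
```

In the second call the three names collapse to a single entry, which matches the truncated bag seen in the HCatStorer output.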
This is bad behaviour, and we need to clear out HCatContext.INSTANCE between
Pig Loader/Storer executions.
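A rough sketch of the leak and of the proposed fix, under loud assumptions: SharedContext below is a hypothetical stand-in for HCatContext.INSTANCE, a plain Map stands in for JobConf, and remember/applyTo/clear are invented names, not HCatalog APIs. It only models "a singleton caches one job's settings and hands them to the next job".

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: simulates a load job followed by a store job in one JVM.
class StaleConfDemo {
    static final String COMBINER_KEY = "mapreduce.combine.class";

    // Returns the store job's conf after a load job has populated the shared context.
    static Map<String, String> runLoadThenStore(boolean clearBetweenJobs) {
        SharedContext.INSTANCE.clear(); // start each simulation from a clean cache

        // Load job: the distinct on "b" causes a combiner class to be set on its conf.
        Map<String, String> loadConf = new HashMap<>();
        loadConf.put(COMBINER_KEY,
                "org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.DistinctCombiner$Combine");
        SharedContext.INSTANCE.remember(loadConf);

        // Proposed fix: drop the cached settings between loader and storer executions.
        if (clearBetweenJobs) {
            SharedContext.INSTANCE.clear();
        }

        // Store job: starts with an empty conf, then picks up whatever the context cached.
        Map<String, String> storeConf = new HashMap<>();
        SharedContext.INSTANCE.applyTo(storeConf);
        return storeConf;
    }

    public static void main(String[] args) {
        System.out.println("without clear: " + runLoadThenStore(false).containsKey(COMBINER_KEY));
        System.out.println("with clear: " + runLoadThenStore(true).containsKey(COMBINER_KEY));
    }
}

// Hypothetical stand-in for a process-wide singleton such as HCatContext.INSTANCE.
enum SharedContext {
    INSTANCE;

    private final Map<String, String> cachedConf = new HashMap<>();

    // Cache a job's settings so later calls can reuse them.
    void remember(Map<String, String> jobConf) {
        cachedConf.putAll(jobConf);
    }

    // Copy cached settings into a job's conf without overwriting existing keys.
    void applyTo(Map<String, String> jobConf) {
        for (Map.Entry<String, String> e : cachedConf.entrySet()) {
            jobConf.putIfAbsent(e.getKey(), e.getValue());
        }
    }

    void clear() {
        cachedConf.clear();
    }
}
```

Without the clear, the store job's conf contains the combiner key inherited from the load job; with the clear, it does not. Whether the real fix should clear everything or only selected keys (tokens may still be worth caching) is left open here.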