[
https://issues.apache.org/jira/browse/HCATALOG-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alan Gates updated HCATALOG-577:
--------------------------------
Assignee: Sushanth Sowmyan
> HCatContext causes persistance of undesired jobConf parameters
> --------------------------------------------------------------
>
> Key: HCATALOG-577
> URL: https://issues.apache.org/jira/browse/HCATALOG-577
> Project: HCatalog
> Issue Type: Bug
> Affects Versions: 0.5
> Reporter: Sushanth Sowmyan
> Assignee: Sushanth Sowmyan
> Priority: Blocker
> Fix For: 0.5
>
> Attachments: HCAT-577.patch
>
>
> I've found an interesting bug while experimenting with an e2e test case.
> Consider the following Pig query:
> {code}
> a = load 'studenttab10k' using org.apache.hcatalog.pig.HCatLoader();
> b = foreach a generate name;
> c = distinct b;
> d = group c all;
> e = foreach d generate $1 as a;
> store e into 'pig_complex_6' using org.apache.hcatalog.pig.HCatStorer();
> exec;
> f = load 'pig_complex_6' using org.apache.hcatalog.pig.HCatLoader();
> g = foreach f generate flatten(a);
> {code}
> With this query, we wind up grouping the names into an array<string> in one
> line.
> Say the result was supposed to be:
> –
> {(bob king),(bob ovid),(bob polk)}
> –
> what we actually get is:
> –
> {(bob king)}
> –
> The interesting thing is that the problem appears only when "e" is written
> out using HCatStorer. If, however, we store "e" using PigStorage and then,
> in another Pig job, load "e" and execute the rest, it works.
> On comparing the jobConfs of the two stores, one using HCatStorer and the
> other PigStorage, the important difference we noticed was that the
> HCatStorer case has an extra key, mapreduce.combine.class, with value
> "org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.DistinctCombiner$Combine"
> On looking at that class, we see that it basically just picks the first
> entry from the bag in order to perform a "distinct" operation. Pig injected
> it into the earlier load job run through HCatLoader, because we perform a
> distinct operation on "b" to get "c". But since HCat tries to store JobConfs
> so that they are usable across multiple setLocation calls (and to cache
> things like tokens), we wind up with the previous job's JobConf as well. As
> a result, the distinct combiner is applied to the HCatStorer output too.
> This is bad behaviour, and we need to clear out HCatContext.INSTANCE between
> Pig Loader / Storer executions.
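
To illustrate the mechanism described above, here is a minimal Java sketch. It is not the HCatalog code and not the attached patch. The names CachedContext, remember, snapshot, clear, runStage and the key hypothetical.output.table are all assumptions made up for illustration, and plain maps stand in for JobConf. It only shows how a process-wide singleton that caches configuration can carry the combiner key from a load stage into a later store stage, and how clearing the singleton between stages avoids that.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextLeakSketch {
    // Key and value taken from the jobConf comparison in the report.
    static final String COMBINER_KEY = "mapreduce.combine.class";
    static final String COMBINER =
        "org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.DistinctCombiner$Combine";

    /** Hypothetical stand-in for a singleton such as HCatContext.INSTANCE. */
    enum CachedContext {
        INSTANCE;

        private final Map<String, String> cached = new HashMap<>();

        /** Remember every setting seen so far (the caching behaviour). */
        void remember(Map<String, String> conf) {
            cached.putAll(conf);
        }

        /** Return a copy of everything remembered so far. */
        Map<String, String> snapshot() {
            return new HashMap<>(cached);
        }

        /** Forget everything; the proposed remedy between stages. */
        void clear() {
            cached.clear();
        }
    }

    /**
     * Hypothetical stage runner: the stage's own settings win, but any
     * remembered setting it does not define is inherited from the cache.
     */
    static Map<String, String> runStage(Map<String, String> stageConf) {
        Map<String, String> effective = new HashMap<>(stageConf);
        CachedContext.INSTANCE.snapshot().forEach(effective::putIfAbsent);
        CachedContext.INSTANCE.remember(effective);
        return effective;
    }

    public static void main(String[] args) {
        // The load stage carries the combiner that Pig set for the distinct.
        Map<String, String> loadConf = new HashMap<>();
        loadConf.put(COMBINER_KEY, COMBINER);

        // The store stage never asked for a combiner.
        Map<String, String> storeConf = new HashMap<>();
        storeConf.put("hypothetical.output.table", "pig_complex_6");

        // Without clearing, the store stage inherits the combiner.
        runStage(loadConf);
        System.out.println("leaked without clear: "
            + runStage(storeConf).containsKey(COMBINER_KEY)); // true

        // Clearing between stages keeps the store stage clean.
        CachedContext.INSTANCE.clear();
        runStage(loadConf);
        CachedContext.INSTANCE.clear();
        System.out.println("leaked with clear: "
            + runStage(storeConf).containsKey(COMBINER_KEY)); // false
    }
}
```

In this sketch the store stage's own settings always take precedence, so the leak only shows up for keys the store stage never sets. That matches the report, where mapreduce.combine.class was absent from the PigStorage jobConf.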