[ 
https://issues.apache.org/jira/browse/HCATALOG-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538571#comment-13538571
 ] 

Sushanth Sowmyan commented on HCATALOG-577:
-------------------------------------------

I haven't been following the recent changes to HCatContext closely, so I'd 
like to hear what the others here think, but from reading the code: is there 
any reason HCatContext should preserve parameters that aren't prefixed with 
"hcat."?

I've experimented, and if I patch it so that only hcat.* parameters are kept, 
this e2e test passes and the other tests are not negatively affected.

So, the following patch fixes this issue:

{code}
diff --git a/core/src/main/java/org/apache/hcatalog/common/HCatContext.java b/core/src/main/java/org/apache/hcatalog/common/HCatContext.java
index df14dda..34e1af9 100644
--- a/core/src/main/java/org/apache/hcatalog/common/HCatContext.java
+++ b/core/src/main/java/org/apache/hcatalog/common/HCatContext.java
@@ -65,7 +65,7 @@ public enum HCatContext {
 
         if (conf != newConf) {
             for (Map.Entry<String, String> entry : conf) {
-                if (newConf.get(entry.getKey()) == null) {
+                if ((entry.getKey().matches("hcat.*")) && (newConf.get(entry.getKey()) == null)) {
                     newConf.set(entry.getKey(), entry.getValue());
                 }
             }
{code}
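
As a rough illustration of the filtering idea described above, here is a minimal, self-contained sketch. It is not the HCatContext code: plain java.util.Map objects stand in for the Hadoop configuration, the class name ConfMergeSketch and the method mergeHcatOnly are hypothetical, the key "hcat.example.key" is made up for the example, and a strict startsWith("hcat.") check is assumed in place of the regex in the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ConfMergeSketch {
    // Hypothetical helper: copy entries from the cached conf into the new conf,
    // but only keys that start with "hcat." and that the new conf does not define.
    static void mergeHcatOnly(Map<String, String> cached, Map<String, String> fresh) {
        for (Map.Entry<String, String> entry : cached.entrySet()) {
            String key = entry.getKey();
            if (key.startsWith("hcat.") && !fresh.containsKey(key)) {
                fresh.put(key, entry.getValue());
            }
        }
    }

    public static void main(String[] args) {
        // Conf left over from the earlier load job, including the combiner pig injected.
        Map<String, String> cached = new LinkedHashMap<>();
        cached.put("hcat.example.key", "cached-value"); // made-up key for illustration
        cached.put("mapreduce.combine.class",
            "org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.DistinctCombiner$Combine");

        // Conf of the later store job, which should not inherit the combiner.
        Map<String, String> fresh = new LinkedHashMap<>();
        mergeHcatOnly(cached, fresh);

        System.out.println(fresh.containsKey("hcat.example.key"));        // true
        System.out.println(fresh.containsKey("mapreduce.combine.class")); // false
    }
}
```

With the prefix check in place, the stale mapreduce.combine.class entry stays behind, while hcat.* settings still carry over to the new conf.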
                
> HCatContext causes persistence of undesired jobConf parameters
> --------------------------------------------------------------
>
>                 Key: HCATALOG-577
>                 URL: https://issues.apache.org/jira/browse/HCATALOG-577
>             Project: HCatalog
>          Issue Type: Bug
>    Affects Versions: 0.5
>            Reporter: Sushanth Sowmyan
>            Priority: Blocker
>             Fix For: 0.5
>
>
> I've found a fairly interesting bug while experimenting with an e2e test case.
> Consider the following pig query:
> {code}
> a = load 'studenttab10k' using org.apache.hcatalog.pig.HCatLoader();
> b = foreach a generate name;
> c = distinct b;
> d = group c all;
> e = foreach d generate $1 as a;
> store e into 'pig_complex_6' using org.apache.hcatalog.pig.HCatStorer();
> exec;
> f = load 'pig_complex_6' using org.apache.hcatalog.pig.HCatLoader();
> g = foreach f generate flatten(a);
> {code}
> With this query, we wind up grouping the names into an array<string> in one 
> line.
> Say the result was supposed to be:
> –
> {(bob king),(bob ovid),(bob polk)}
> –
> what we actually get is:
> –
> {(bob king)}
> –
> The interesting thing is that the problem appears only when "e" is written 
> out using HCatStorer. If we instead store "e" using PigStorage and then, in 
> another pig job, load it and execute the rest, it works.
> On comparing the jobConfs of the two stores, one using HCatStorer and the 
> other PigStorage, the important difference we noticed was that the 
> HCatStorer case has an extra key, mapreduce.combine.class, with the value 
> "org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.DistinctCombiner$Combine".
> Looking at that class, we see that it basically just picks the first entry 
> from the bag in order to perform a "distinct" operation. Pig injected it into 
> the previous load job done by HCatLoader, because we perform a distinct 
> operation on "b" to get "c". But since HCat stores JobConfs so that they are 
> usable across multiple setLocation calls (and to cache things like tokens), 
> we wind up with the previous job's JobConf as well, and the distinct combiner 
> ends up being applied to the HCatStorer output too.
> This is bad behaviour, and we need to clear out HCatContext.INSTANCE between 
> pig Loader / Storer executions.
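
To make the effect of the leaked combiner concrete, here is a simplified model of what the description above reports. It is not Pig's DistinctCombiner: the class StaleCombinerSketch and the method firstOnlyCombine are hypothetical stand-ins that only mimic the "pick the first entry from the bag" behaviour, assuming the grouped names arrive as the values of a single key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StaleCombinerSketch {
    // Simplified stand-in for a distinct-style combiner: it assumes every value
    // under one key is a duplicate, so it forwards only the first value.
    static List<String> firstOnlyCombine(List<String> valuesForKey) {
        List<String> out = new ArrayList<>();
        if (!valuesForKey.isEmpty()) {
            out.add(valuesForKey.get(0));
        }
        return out;
    }

    public static void main(String[] args) {
        // After "group c all", every name sits under the same key.
        List<String> grouped = Arrays.asList("bob king", "bob ovid", "bob polk");

        // A combiner that is harmless for "distinct" drops data when it is
        // applied to the grouped bag of the store job.
        System.out.println(firstOnlyCombine(grouped)); // prints [bob king]
    }
}
```

That is safe in the distinct job, where all values under a key really are duplicates, but in the store job it reduces the bag to its first tuple, which matches the truncated output shown in the description.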

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
