[jira] [Commented] (HIVE-10752) Revert HIVE-5193

2015-05-24 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557830#comment-14557830
 ] 

Mithun Radhakrishnan commented on HIVE-10752:
-

Sorry, chaps. I'm on vacation. Tagging [~cdrome], [~viraj] (who worked on the 
original bug HIVE-5193). I'm afraid I won't be able to look at this till 
Wednesday. 

> Revert HIVE-5193
> 
>
> Key: HIVE-10752
> URL: https://issues.apache.org/jira/browse/HIVE-10752
> Project: Hive
>  Issue Type: Sub-task
>  Components: HCatalog
>Affects Versions: 1.2.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
> Attachments: HIVE-10752.patch
>
>
> Revert HIVE-5193, since it causes Pig + HCatalog to stop working.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10752) Revert HIVE-5193

2015-05-28 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563191#comment-14563191
 ] 

Mithun Radhakrishnan commented on HIVE-10752:
-

[~aihuaxu], doesn't the patch that [~viraj] posted on HIVE-10720 sort this out?

> Revert HIVE-5193
> 
>
> Key: HIVE-10752
> URL: https://issues.apache.org/jira/browse/HIVE-10752
> Project: Hive
>  Issue Type: Sub-task
>  Components: HCatalog
>Affects Versions: 1.2.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
> Attachments: HIVE-10752.patch
>
>
> Revert HIVE-5193, since it causes Pig + HCatalog to stop working.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10752) Revert HIVE-5193

2015-05-29 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14565401#comment-14565401
 ] 

Mithun Radhakrishnan commented on HIVE-10752:
-


bq. Given that HIVE-5193 broke some functionality and it was just for columnar 
table performance improvement, in addition that patch provided in HIVE-10720 
did still not solve the issue.

While I agree that HIVE-5193 did introduce a bug, I can't yet agree that we 
should revert it. [~viraj] is currently testing whether the one-liner posted in 
HIVE-10720 resolves the issue. (My understanding is that it does.) I'll let him 
confirm shortly.

In the meantime, please consider that the fix 
({{ColumnProjectionUtils.setReadColumnIDs(job.getConfiguration(), null);}}) is 
only applied when {{requiredFieldsInfo == null}}, which is shorthand for Pig 
requiring all columns. So the full deserialization happens only when every 
field is required anyway; there is no loss of performance in that case.
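
As a sketch of that guard (hypothetical wrapper method; only the guarded call 
itself is from the actual code):

{code:java}
import java.util.List;

import org.apache.hadoop.hive.serde2.ColumnProjectionUtils;
import org.apache.hadoop.mapreduce.Job;

public class ProjectionGuardSketch {
  // 'requiredFieldsInfo' stands in for Pig's pushed-down projection;
  // null means "no projection was pushed: all fields are required".
  static void signalReadAllColumns(Job job, List<Integer> requiredFieldsInfo) {
    if (requiredFieldsInfo == null) {
      // Every column must be read anyway, so this costs nothing extra.
      ColumnProjectionUtils.setReadColumnIDs(job.getConfiguration(), null);
    }
  }
}
{code}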

Am I missing something?

> Revert HIVE-5193
> 
>
> Key: HIVE-10752
> URL: https://issues.apache.org/jira/browse/HIVE-10752
> Project: Hive
>  Issue Type: Sub-task
>  Components: HCatalog
>Affects Versions: 1.2.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
> Attachments: HIVE-10752.patch
>
>
> Revert HIVE-5193, since it causes Pig + HCatalog to stop working.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10720) Pig using HCatLoader to access RCFile and perform join but get incorrect result.

2015-05-29 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14565426#comment-14565426
 ] 

Mithun Radhakrishnan commented on HIVE-10720:
-

Hey, [~aihuaxu]. Could you please post a stack-trace for the NPE?

> Pig using HCatLoader to access RCFile and perform join but get incorrect 
> result.
> 
>
> Key: HIVE-10720
> URL: https://issues.apache.org/jira/browse/HIVE-10720
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.3.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
> Attachments: HIVE-10720.patch
>
>
> {noformat}
> Create table tbl1 (key string, value string) stored as rcfile;
> Create table tbl2 (key string, value string);
> insert into tbl1 values('1', 'value1');
> insert into tbl2 values('1', 'value2');
> {noformat}
> Pig script:
> {noformat}
> tbl1 = LOAD 'tbl1' USING org.apache.hive.hcatalog.pig.HCatLoader();
> tbl2 = LOAD 'tbl2' USING org.apache.hive.hcatalog.pig.HCatLoader();
> src_tbl1 = FILTER tbl1 BY (key == '1');
> prj_tbl1 = FOREACH src_tbl1 GENERATE
>key as tbl1_key,
>value as tbl1_value,
>'333' as tbl1_v1;
>
> src_tbl2 = FILTER tbl2 BY (key == '1');
> prj_tbl2 = FOREACH src_tbl2 GENERATE
>key as tbl2_key,
>value as tbl2_value;
>
> result = JOIN prj_tbl1 BY (tbl1_key), prj_tbl2 BY (tbl2_key);
> prj_result = FOREACH result 
>   GENERATE  prj_tbl1::tbl1_key AS key1,
> prj_tbl1::tbl1_value AS value1,
> prj_tbl1::tbl1_v1 AS v1,
> prj_tbl2::tbl2_key AS key2,
> prj_tbl2::tbl2_value AS value2;
>
> dump prj_result;
> {noformat}
> We see varying invalid results, or even missing results that should have been returned.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10720) Pig using HCatLoader to access RCFile and perform join but get incorrect result.

2015-05-29 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14565594#comment-14565594
 ] 

Mithun Radhakrishnan commented on HIVE-10720:
-

:] Frustration aside, I'm completely open to reverting the patch if it's the 
right thing to do. (Incorrect results are a critical bug.) We're trying to make 
sure that we won't have to revert the revert.

Viraj has confirmed that there was a bug in his patch. He's uploading a new one 
shortly. If this doesn't sort out the issue you're facing, let's revert and 
postpone debate to a later time.

> Pig using HCatLoader to access RCFile and perform join but get incorrect 
> result.
> 
>
> Key: HIVE-10720
> URL: https://issues.apache.org/jira/browse/HIVE-10720
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.3.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
> Attachments: HIVE-10720.patch
>
>
> {noformat}
> Create table tbl1 (key string, value string) stored as rcfile;
> Create table tbl2 (key string, value string);
> insert into tbl1 values('1', 'value1');
> insert into tbl2 values('1', 'value2');
> {noformat}
> Pig script:
> {noformat}
> tbl1 = LOAD 'tbl1' USING org.apache.hive.hcatalog.pig.HCatLoader();
> tbl2 = LOAD 'tbl2' USING org.apache.hive.hcatalog.pig.HCatLoader();
> src_tbl1 = FILTER tbl1 BY (key == '1');
> prj_tbl1 = FOREACH src_tbl1 GENERATE
>key as tbl1_key,
>value as tbl1_value,
>'333' as tbl1_v1;
>
> src_tbl2 = FILTER tbl2 BY (key == '1');
> prj_tbl2 = FOREACH src_tbl2 GENERATE
>key as tbl2_key,
>value as tbl2_value;
>
> result = JOIN prj_tbl1 BY (tbl1_key), prj_tbl2 BY (tbl2_key);
> prj_result = FOREACH result 
>   GENERATE  prj_tbl1::tbl1_key AS key1,
> prj_tbl1::tbl1_value AS value1,
> prj_tbl1::tbl1_v1 AS v1,
> prj_tbl2::tbl2_key AS key2,
> prj_tbl2::tbl2_value AS value2;
>
> dump prj_result;
> {noformat}
> We see varying invalid results, or even missing results that should have been returned.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10720) Pig using HCatLoader to access RCFile and perform join but get incorrect result.

2015-05-29 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14565768#comment-14565768
 ] 

Mithun Radhakrishnan commented on HIVE-10720:
-

Ok. Looks like I have held you up long enough. If you've verified that this 
code path works without HIVE-5193, let's roll it back, and revisit this fix in 
a separate JIRA. We will try to identify how this works correctly on our 
internal branch. Viraj, does that sound ok?

Sorry for the delay. I applaud your diligence and patience, Aihua. Thank you. :]

> Pig using HCatLoader to access RCFile and perform join but get incorrect 
> result.
> 
>
> Key: HIVE-10720
> URL: https://issues.apache.org/jira/browse/HIVE-10720
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.3.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
> Attachments: HIVE-10720.patch
>
>
> {noformat}
> Create table tbl1 (key string, value string) stored as rcfile;
> Create table tbl2 (key string, value string);
> insert into tbl1 values('1', 'value1');
> insert into tbl2 values('1', 'value2');
> {noformat}
> Pig script:
> {noformat}
> tbl1 = LOAD 'tbl1' USING org.apache.hive.hcatalog.pig.HCatLoader();
> tbl2 = LOAD 'tbl2' USING org.apache.hive.hcatalog.pig.HCatLoader();
> src_tbl1 = FILTER tbl1 BY (key == '1');
> prj_tbl1 = FOREACH src_tbl1 GENERATE
>key as tbl1_key,
>value as tbl1_value,
>'333' as tbl1_v1;
>
> src_tbl2 = FILTER tbl2 BY (key == '1');
> prj_tbl2 = FOREACH src_tbl2 GENERATE
>key as tbl2_key,
>value as tbl2_value;
>
> result = JOIN prj_tbl1 BY (tbl1_key), prj_tbl2 BY (tbl2_key);
> prj_result = FOREACH result 
>   GENERATE  prj_tbl1::tbl1_key AS key1,
> prj_tbl1::tbl1_value AS value1,
> prj_tbl1::tbl1_v1 AS v1,
> prj_tbl2::tbl2_key AS key2,
> prj_tbl2::tbl2_value AS value2;
>
> dump prj_result;
> {noformat}
> We see varying invalid results, or even missing results that should have been returned.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10752) Revert HIVE-5193

2015-06-01 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567892#comment-14567892
 ] 

Mithun Radhakrishnan commented on HIVE-10752:
-

Yes, of course. +1, as per 
[HIVE-10720|https://issues.apache.org/jira/browse/HIVE-10720?focusedCommentId=14565768&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14565768].

Let's circle back, after Viraj and I have identified why this isn't a problem 
with our internal Hive 0.13-0.14 branch. 

> Revert HIVE-5193
> 
>
> Key: HIVE-10752
> URL: https://issues.apache.org/jira/browse/HIVE-10752
> Project: Hive
>  Issue Type: Sub-task
>  Components: HCatalog
>Affects Versions: 1.2.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
> Attachments: HIVE-10752.patch
>
>
> Revert HIVE-5193, since it causes Pig + HCatalog to stop working.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10754) Pig+Hcatalog doesn't work properly since we need to clone the Job instance in HCatLoader

2015-06-04 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573232#comment-14573232
 ] 

Mithun Radhakrishnan commented on HIVE-10754:
-

Hello, Aihua. I'm all for switching from the deprecated {{Job}} constructor to 
using {{Job.getInstance()}}.

But I am unable to understand how this changes/fixes anything. Both {{new 
Job(Configuration)}} and {{Job.getInstance(Configuration)}} seem to eventually 
use the package-private {{Job(JobConf)}} constructor. No later references to 
{{clone}} or {{job}} have been modified in {{HCatLoader.setLocation()}}.

Could you please explain your intention?

> Pig+Hcatalog doesn't work properly since we need to clone the Job instance in 
> HCatLoader
> 
>
> Key: HIVE-10754
> URL: https://issues.apache.org/jira/browse/HIVE-10754
> Project: Hive
>  Issue Type: Sub-task
>  Components: HCatalog
>Affects Versions: 1.2.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
> Attachments: HIVE-10754.patch
>
>
> {noformat}
> Create table tbl1 (key string, value string) stored as rcfile;
> Create table tbl2 (key string, value string);
> insert into tbl1 values( '1', '111');
> insert into tbl2 values('1', '2');
> {noformat}
> Pig script:
> {noformat}
> src_tbl1 = FILTER tbl1 BY (key == '1');
> prj_tbl1 = FOREACH src_tbl1 GENERATE
>key as tbl1_key,
>value as tbl1_value,
>'333' as tbl1_v1;
>
> src_tbl2 = FILTER tbl2 BY (key == '1');
> prj_tbl2 = FOREACH src_tbl2 GENERATE
>key as tbl2_key,
>value as tbl2_value;
>
> dump prj_tbl1;
> dump prj_tbl2;
> result = JOIN prj_tbl1 BY (tbl1_key), prj_tbl2 BY (tbl2_key);
> prj_result = FOREACH result 
>   GENERATE  prj_tbl1::tbl1_key AS key1,
> prj_tbl1::tbl1_value AS value1,
> prj_tbl1::tbl1_v1 AS v1,
> prj_tbl2::tbl2_key AS key2,
> prj_tbl2::tbl2_value AS value2;
>
> dump prj_result;
> {noformat}
> The expected result is (1,111,333,1,2), while the actual result is (1,2,333,1,2). 
> We need to clone the job instance in HCatLoader.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10754) Pig+Hcatalog doesn't work properly since we need to clone the Job instance in HCatLoader

2015-06-04 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573596#comment-14573596
 ] 

Mithun Radhakrishnan commented on HIVE-10754:
-

I see what we're trying to achieve, but I still need help understanding how 
this change fixes the problem. (Sorry. :/) 

Here's the relevant code from {{Job.java}} from Hadoop 2.6.

{code:java|title=Job.java|borderStyle=solid}
  @Deprecated
  public Job(Configuration conf) throws IOException {
    this(new JobConf(conf));
  }

  Job(JobConf conf) throws IOException {
    super(conf, null);
    // propagate existing user credentials to job
    this.credentials.mergeAll(this.ugi.getCredentials());
    this.cluster = null;
  }

  public static Job getInstance(Configuration conf) throws IOException {
    // create with a null Cluster
    JobConf jobConf = new JobConf(conf);
    return new Job(jobConf);
  }
{code}

# The current implementation of {{HCatLoader.setLocation()}} calls {{new 
Job(Configuration)}}, which clones the {{JobConf}} inline and calls the 
package-private constructor {{Job(JobConf)}}.
# Your improved implementation of {{HCatLoader.setLocation()}} calls 
{{Job.getInstance()}}. This method clones the {{JobConf}} explicitly, and then 
calls the same package-private constructor {{Job(JobConf)}}.

bq. These two are different (JobConf is not cloned when we call new Job(conf)).
Both of these seem identical in effect to me. :/ There's no way for 
{{HCatLoader.setLocation()}} to call the {{Job(JobConf)}} constructor directly, 
because it's package-private, right?
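
To double-check the equivalence claim, here's a small self-contained test 
(illustrative only; assumes Hadoop 2.6 on the classpath):

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobCloneCheck {
  public static void main(String[] args) throws IOException {
    Configuration original = new Configuration();
    original.set("marker", "before");

    Job viaCtor = new Job(original);            // deprecated: clones via new JobConf(conf)
    Job viaFactory = Job.getInstance(original); // also clones via new JobConf(conf)

    viaCtor.getConfiguration().set("marker", "ctor");
    viaFactory.getConfiguration().set("marker", "factory");

    // Prints "before": neither construction path shares state with the original conf.
    System.out.println(original.get("marker"));
  }
}
{code}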


> Pig+Hcatalog doesn't work properly since we need to clone the Job instance in 
> HCatLoader
> 
>
> Key: HIVE-10754
> URL: https://issues.apache.org/jira/browse/HIVE-10754
> Project: Hive
>  Issue Type: Sub-task
>  Components: HCatalog
>Affects Versions: 1.2.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
> Attachments: HIVE-10754.patch
>
>
> {noformat}
> Create table tbl1 (key string, value string) stored as rcfile;
> Create table tbl2 (key string, value string);
> insert into tbl1 values( '1', '111');
> insert into tbl2 values('1', '2');
> {noformat}
> Pig script:
> {noformat}
> src_tbl1 = FILTER tbl1 BY (key == '1');
> prj_tbl1 = FOREACH src_tbl1 GENERATE
>key as tbl1_key,
>value as tbl1_value,
>'333' as tbl1_v1;
>
> src_tbl2 = FILTER tbl2 BY (key == '1');
> prj_tbl2 = FOREACH src_tbl2 GENERATE
>key as tbl2_key,
>value as tbl2_value;
>
> dump prj_tbl1;
> dump prj_tbl2;
> result = JOIN prj_tbl1 BY (tbl1_key), prj_tbl2 BY (tbl2_key);
> prj_result = FOREACH result 
>   GENERATE  prj_tbl1::tbl1_key AS key1,
> prj_tbl1::tbl1_value AS value1,
> prj_tbl1::tbl1_v1 AS v1,
> prj_tbl2::tbl2_key AS key2,
> prj_tbl2::tbl2_value AS value2;
>
> dump prj_result;
> {noformat}
> The expected result is (1,111,333,1,2), while the actual result is (1,2,333,1,2). 
> We need to clone the job instance in HCatLoader.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10761) Create codahale-based metrics system for Hive

2015-06-04 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573613#comment-14573613
 ] 

Mithun Radhakrishnan commented on HIVE-10761:
-

Hey, Sush, Szehon. I can confirm that Yahoo cares about HS2 metrics. :p 

I'm not familiar with Codahale, but if it works with JMX, that's cool. Lemme do 
some homework. Thanks for the heads-up and the nifty addition, chaps.
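
For anyone else doing the same homework, the basic shape of the library looks 
like this (a sketch against Codahale Metrics 3.x, not Hive's integration):

{code:java}
import com.codahale.metrics.JmxReporter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class MetricsJmxSketch {
  public static void main(String[] args) throws InterruptedException {
    MetricRegistry registry = new MetricRegistry();

    // Expose every metric registered below as a JMX MBean:
    JmxReporter reporter = JmxReporter.forRegistry(registry).build();
    reporter.start();

    // A Timer records rates plus a latency distribution (max/mean/stddev/percentiles):
    Timer timer = registry.timer("api.calls");
    try (Timer.Context ignored = timer.time()) {
      Thread.sleep(100);  // stand-in for real work being measured
    }
  }
}
{code}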

> Create codahale-based metrics system for Hive
> -
>
> Key: HIVE-10761
> URL: https://issues.apache.org/jira/browse/HIVE-10761
> Project: Hive
>  Issue Type: New Feature
>  Components: Diagnosability
>Reporter: Szehon Ho
>Assignee: Szehon Ho
> Fix For: 1.3.0
>
> Attachments: HIVE-10761.2.patch, HIVE-10761.3.patch, 
> HIVE-10761.4.patch, HIVE-10761.5.patch, HIVE-10761.6.patch, HIVE-10761.patch, 
> hms-metrics.json
>
>
> There is a current Hive metrics system that hooks up to JMX reporting, but 
> all its measurements and models are custom.
> This is to make another metrics system based on Codahale (i.e. Yammer, 
> Dropwizard), which has the following advantages:
> * A well-defined metric model for frequently-needed metrics (e.g. JVM metrics)
> * Well-defined measurements for all metrics (e.g. max, mean, stddev, 
> mean_rate, etc.)
> * Built-in reporting frameworks like JMX, Console, Log, and a JSON webserver
> It is used by many projects, including several Apache projects like Oozie. 
> Overall, monitoring tools should find it easier to understand these common 
> metric, measurement, and reporting models.
> The existing metrics subsystem will be kept and can be enabled if backward 
> compatibility is desired.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10761) Create codahale-based metrics system for Hive

2015-06-04 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573617#comment-14573617
 ] 

Mithun Radhakrishnan commented on HIVE-10761:
-

Question: Are we proposing to deprecate the old metrics system on trunk? In 
which release are we considering deprecation and removal?

> Create codahale-based metrics system for Hive
> -
>
> Key: HIVE-10761
> URL: https://issues.apache.org/jira/browse/HIVE-10761
> Project: Hive
>  Issue Type: New Feature
>  Components: Diagnosability
>Reporter: Szehon Ho
>Assignee: Szehon Ho
> Fix For: 1.3.0
>
> Attachments: HIVE-10761.2.patch, HIVE-10761.3.patch, 
> HIVE-10761.4.patch, HIVE-10761.5.patch, HIVE-10761.6.patch, HIVE-10761.patch, 
> hms-metrics.json
>
>
> There is a current Hive metrics system that hooks up to JMX reporting, but 
> all its measurements and models are custom.
> This is to make another metrics system based on Codahale (i.e. Yammer, 
> Dropwizard), which has the following advantages:
> * A well-defined metric model for frequently-needed metrics (e.g. JVM metrics)
> * Well-defined measurements for all metrics (e.g. max, mean, stddev, 
> mean_rate, etc.)
> * Built-in reporting frameworks like JMX, Console, Log, and a JSON webserver
> It is used by many projects, including several Apache projects like Oozie. 
> Overall, monitoring tools should find it easier to understand these common 
> metric, measurement, and reporting models.
> The existing metrics subsystem will be kept and can be enabled if backward 
> compatibility is desired.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-4166) closeAllForUGI causes failure in hiveserver2 when fetching large amount of data

2016-12-02 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15716009#comment-15716009
 ] 

Mithun Radhakrishnan commented on HIVE-4166:


Argh. This patch has gone stale. I'll get a rebased version of this shortly.
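
The failure mode, in miniature (a sketch, not the patch; it assumes HDFS is 
the default FileSystem, since it's the DFSClient that becomes unusable):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class CloseAllForUgiSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    UserGroupInformation ugi = UserGroupInformation.getCurrentUser();

    // The handle a ResultSet fetch keeps using, cached per (scheme, authority, ugi):
    FileSystem fs = FileSystem.get(conf);

    // What TUGIAssumingProcessor.process() does unconditionally at the end:
    FileSystem.closeAllForUGI(ugi);

    // Subsequent fetches fail, because the cached client underneath is closed:
    fs.exists(new Path("/"));  // on HDFS: IOException "Filesystem closed"
  }
}
{code}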

> closeAllForUGI causes failure in hiveserver2 when fetching large amount of 
> data
> ---
>
> Key: HIVE-4166
> URL: https://issues.apache.org/jira/browse/HIVE-4166
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2, Security, Shims
>Affects Versions: 0.10.0, 0.11.0
>Reporter: Chris Drome
>Assignee: Chris Drome
> Attachments: HIVE-4166-0.10.patch, HIVE-4166-trunk.patch
>
>
> HiveServer2 configured to use Kerberos authentication with doAs enabled 
> throws an exception when fetching a large amount of data from a query.
> The exception is caused because FileSystem.closeAllForUGI is always called at 
> the end of TUGIAssumingProcessor.process. This affects requests on the 
> ResultSet for data from a SELECT query when the amount of data exceeds a 
> certain size. At that point any subsequent calls to fetch more data throw an 
> exception because the underlying DFSClient has been closed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11553) use basic file metadata cache in ETLSplitStrategy-related paths

2016-12-08 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15734229#comment-15734229
 ] 

Mithun Radhakrishnan commented on HIVE-11553:
-

This patch seems to have renamed 
{{HiveConf.ConfVars.METASTORE_BATCH_RETRIEVE_TABLE_PARTITION_MAX}} to 
{{HiveConf.ConfVars.METASTORE_BATCH_RETRIEVE_OBJECTS_MAX}}, instead of 
deprecating it in favour of a new constant. :/
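
For the record, the compatible alternative would look something like this (a 
hypothetical sketch; the key string and default are illustrative, not Hive's 
actual values):

{code:java}
public enum ConfVars {
  /** @deprecated renamed; use {@link #METASTORE_BATCH_RETRIEVE_OBJECTS_MAX}. */
  @Deprecated
  METASTORE_BATCH_RETRIEVE_TABLE_PARTITION_MAX("hive.metastore.batch.retrieve.max", 300),
  // The new name points at the same key, so old call-sites keep compiling and behaving:
  METASTORE_BATCH_RETRIEVE_OBJECTS_MAX("hive.metastore.batch.retrieve.max", 300);

  public final String varname;
  public final int defaultIntVal;

  ConfVars(String varname, int defaultIntVal) {
    this.varname = varname;
    this.defaultIntVal = defaultIntVal;
  }
}
{code}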

> use basic file metadata cache in ETLSplitStrategy-related paths
> ---
>
> Key: HIVE-11553
> URL: https://issues.apache.org/jira/browse/HIVE-11553
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>  Labels: TODOC2.0
> Fix For: 2.0.0
>
> Attachments: HIVE-11553.01.patch, HIVE-11553.02.patch, 
> HIVE-11553.03.patch, HIVE-11553.04.patch, HIVE-11553.06.patch, 
> HIVE-11553.07.patch, HIVE-11553.patch
>
>
> This is the first step; uses the simple footer-getting API, without PPD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-11475) Bad rename of directory during commit, when using HCat dynamic-partitioning.

2016-12-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-11475:

Status: Open  (was: Patch Available)

> Bad rename of directory during commit, when using HCat dynamic-partitioning.
> 
>
> Key: HIVE-11475
> URL: https://issues.apache.org/jira/browse/HIVE-11475
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
>Priority: Critical
> Attachments: HIVE-11475.1.patch
>
>
> Here's one that [~knoguchi] found and root-caused. This one's a doozy. 
> Under seemingly random conditions, the temporary output (under 
> {{_SCRATCH1.234*}}) for HCat's dynamic partitioner isn't promoted correctly 
> to the final table directory.
> The namenode logs indicated a botched directory-rename:
> {noformat}
> 2015-08-02 03:24:29,090 INFO FSNamesystem.audit: allowed=true ugi=myth 
> (auth:TOKEN) via wrkf...@grid.myth.net (auth:TOKEN) ip=/10.192.100.117 
> cmd=rename 
> src=/projects/hive/myth.db/myth_table_15m/_SCRATCH2.8772158158263395E-4/tc=1/utc_time=201508020145/part-r-0
>  
> dst=/projects/hive/myth.db/myth_table_15mE-4/tc=1/utc_time=201508020145/part-r-0
>  perm=myth:madcaps:rw-r--r-- proto=rpc
> {noformat}
> Note that the table-directory name {{"myth_table_15m"}} is appended with 
> {{"E-4"}}. This'll break anything that uses HDFS-based polling.
> [~knoguchi] points out the following code:
> {code:title=HCatOutputFormat.java}
> if ((idHash = conf.get(HCatConstants.HCAT_OUTPUT_ID_HASH)) == null) {
>   idHash = String.valueOf(Math.random());
> }
> {code}
> {code:title=FileOutputCommitterContainer.java}
> String finalLocn = jobLocation.replaceAll(Path.SEPARATOR + 
> SCRATCH_DIR_NAME + "\\d\\.?\\d+","");
> {code}
> The problem is that when {{Math.random()}} produces a number less than 
> 10^-3^, {{String.valueOf(double)}} switches to exponential notation. The 
> regex doesn't capture or handle this notation.
> The fix belies the debugging-effort.
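
The notation switch is easy to demonstrate in isolation (an illustrative 
sketch, using a simplified form of the regex above):

{code:java}
public class ScratchDirRegexDemo {
  public static void main(String[] args) {
    // Double.toString() switches to exponential notation below 1.0E-3:
    String normal = String.valueOf(0.287);    // "0.287"
    String small  = String.valueOf(0.000287); // "2.87E-4"

    String regex = "_SCRATCH" + "\\d\\.?\\d+";

    System.out.println(("_SCRATCH" + normal).replaceAll(regex, "")); // "" -- fully stripped
    System.out.println(("_SCRATCH" + small).replaceAll(regex, ""));  // "E-4" -- residue left behind
  }
}
{code}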



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11693) CommonMergeJoinOperator throws exception with tez

2016-07-06 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15364709#comment-15364709
 ] 

Mithun Radhakrishnan commented on HIVE-11693:
-

[~rajesh.balamohan], et al., [~selinazh]'s analysis here seems accurate. 
Wouldn't her suggestion (i.e. to move {{posBigTable = (byte) 
conf.getBigTablePosition();}} to {{initializeOp()}}) fix the problem?

Would anyone else like to comment?

> CommonMergeJoinOperator throws exception with tez
> -
>
> Key: HIVE-11693
> URL: https://issues.apache.org/jira/browse/HIVE-11693
> Project: Hive
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
> Attachments: HIVE-11693.1.patch
>
>
> Got this when executing a simple query with latest hive build + tez latest 
> version.
> {noformat}
> Error: Failure while running task: 
> attempt_1439860407967_0291_2_03_45_0:java.lang.RuntimeException: 
> java.lang.RuntimeException: Hive Runtime Error while closing operators: 
> java.lang.RuntimeException: java.io.IOException: Please check if you are 
> invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
> at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:349)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:71)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:60)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:60)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:35)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Hive Runtime Error while closing 
> operators: java.lang.RuntimeException: java.io.IOException: Please check if 
> you are invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:316)
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:162)
> ... 14 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.RuntimeException: java.io.IOException: Please check if you are 
> invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchOneRow(CommonMergeJoinOperator.java:412)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchNextGroup(CommonMergeJoinOperator.java:375)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.doFirstFetchIfNeeded(CommonMergeJoinOperator.java:482)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinFinalLeftData(CommonMergeJoinOperator.java:434)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.closeOp(CommonMergeJoinOperator.java:384)
> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:616)
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:292)
> ... 15 more
> Caused by: java.lang.RuntimeException: java.io.IOException: Please check if 
> you are invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:291)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchOneRow(CommonMergeJoinOperator.java:400)
> ... 21 more
> Caused by: java.io.IOException: Please check if you are invoking moveToNext() 
> even after it returned false.
> at 
> org.apache.tez.runtime.library.common.ValuesIterator.hasCompletedProcessing(ValuesIterator.java:223)
> at 
> org.apache.tez.runtime.library.common.ValuesIterator.moveToNext(ValuesIterator.java:105)
> at 
> org.apache.tez.runtime.library.input.OrderedGroupedKVInput$OrderedGroupedKeyValuesReader.next(OrderedGroupedKVInput.java:308)
> at 
> org.apache.hadoop.hive.ql.exec.tez.KeyValuesFromKeyValues.next(KeyValuesFromKeyValues.java:46)
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:249)
> ... 22 more
> {noformat}
> Not sure if this is related to HIVE-11016. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HIVE-11693) CommonMergeJoinOperator throws exception with tez

2016-07-06 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15364738#comment-15364738
 ] 

Mithun Radhakrishnan commented on HIVE-11693:
-

Kewl. Assigned to [~selinazh]. Should have our patch out soon.

> CommonMergeJoinOperator throws exception with tez
> -
>
> Key: HIVE-11693
> URL: https://issues.apache.org/jira/browse/HIVE-11693
> Project: Hive
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Assignee: Selina Zhang
> Attachments: HIVE-11693.1.patch
>
>
> Got this when executing a simple query with latest hive build + tez latest 
> version.
> {noformat}
> Error: Failure while running task: 
> attempt_1439860407967_0291_2_03_45_0:java.lang.RuntimeException: 
> java.lang.RuntimeException: Hive Runtime Error while closing operators: 
> java.lang.RuntimeException: java.io.IOException: Please check if you are 
> invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
> at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:349)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:71)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:60)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:60)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:35)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Hive Runtime Error while closing 
> operators: java.lang.RuntimeException: java.io.IOException: Please check if 
> you are invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:316)
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:162)
> ... 14 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.RuntimeException: java.io.IOException: Please check if you are 
> invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchOneRow(CommonMergeJoinOperator.java:412)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchNextGroup(CommonMergeJoinOperator.java:375)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.doFirstFetchIfNeeded(CommonMergeJoinOperator.java:482)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinFinalLeftData(CommonMergeJoinOperator.java:434)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.closeOp(CommonMergeJoinOperator.java:384)
> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:616)
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:292)
> ... 15 more
> Caused by: java.lang.RuntimeException: java.io.IOException: Please check if 
> you are invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:291)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchOneRow(CommonMergeJoinOperator.java:400)
> ... 21 more
> Caused by: java.io.IOException: Please check if you are invoking moveToNext() 
> even after it returned false.
> at 
> org.apache.tez.runtime.library.common.ValuesIterator.hasCompletedProcessing(ValuesIterator.java:223)
> at 
> org.apache.tez.runtime.library.common.ValuesIterator.moveToNext(ValuesIterator.java:105)
> at 
> org.apache.tez.runtime.library.input.OrderedGroupedKVInput$OrderedGroupedKeyValuesReader.next(OrderedGroupedKVInput.java:308)
> at 
> org.apache.hadoop.hive.ql.exec.tez.KeyValuesFromKeyValues.next(KeyValuesFromKeyValues.java:46)
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:249)
> ... 22 more
> {noformat}
> Not sure if this is related to HIVE-11016. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-11693) CommonMergeJoinOperator throws exception with tez

2016-07-06 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-11693:

Assignee: Selina Zhang

> CommonMergeJoinOperator throws exception with tez
> -
>
> Key: HIVE-11693
> URL: https://issues.apache.org/jira/browse/HIVE-11693
> Project: Hive
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Assignee: Selina Zhang
> Attachments: HIVE-11693.1.patch
>
>
> Got this when executing a simple query with latest hive build + tez latest 
> version.
> {noformat}
> Error: Failure while running task: 
> attempt_1439860407967_0291_2_03_45_0:java.lang.RuntimeException: 
> java.lang.RuntimeException: Hive Runtime Error while closing operators: 
> java.lang.RuntimeException: java.io.IOException: Please check if you are 
> invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
> at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:349)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:71)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:60)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:60)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:35)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Hive Runtime Error while closing 
> operators: java.lang.RuntimeException: java.io.IOException: Please check if 
> you are invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:316)
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:162)
> ... 14 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.RuntimeException: java.io.IOException: Please check if you are 
> invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchOneRow(CommonMergeJoinOperator.java:412)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchNextGroup(CommonMergeJoinOperator.java:375)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.doFirstFetchIfNeeded(CommonMergeJoinOperator.java:482)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinFinalLeftData(CommonMergeJoinOperator.java:434)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.closeOp(CommonMergeJoinOperator.java:384)
> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:616)
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:292)
> ... 15 more
> Caused by: java.lang.RuntimeException: java.io.IOException: Please check if 
> you are invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:291)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchOneRow(CommonMergeJoinOperator.java:400)
> ... 21 more
> Caused by: java.io.IOException: Please check if you are invoking moveToNext() 
> even after it returned false.
> at 
> org.apache.tez.runtime.library.common.ValuesIterator.hasCompletedProcessing(ValuesIterator.java:223)
> at 
> org.apache.tez.runtime.library.common.ValuesIterator.moveToNext(ValuesIterator.java:105)
> at 
> org.apache.tez.runtime.library.input.OrderedGroupedKVInput$OrderedGroupedKeyValuesReader.next(OrderedGroupedKVInput.java:308)
> at 
> org.apache.hadoop.hive.ql.exec.tez.KeyValuesFromKeyValues.next(KeyValuesFromKeyValues.java:46)
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:249)
> ... 22 more
> {noformat}
> Not sure if this is related to HIVE-11016. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13756) Map failure attempts to delete reducer _temporary directory on multi-query pig query

2016-07-15 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15380352#comment-15380352
 ] 

Mithun Radhakrishnan commented on HIVE-13756:
-

IMHO, the qtest failures here are irrelevant. This is a fix in HCat.

> Map failure attempts to delete reducer _temporary directory on multi-query 
> pig query
> 
>
> Key: HIVE-13756
> URL: https://issues.apache.org/jira/browse/HIVE-13756
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.1, 2.0.0
>Reporter: Chris Drome
>Assignee: Chris Drome
> Attachments: HIVE-13756-branch-1.patch, HIVE-13756.1-branch-1.patch, 
> HIVE-13756.1.patch, HIVE-13756.patch
>
>
> A Pig script, executed with multi-query enabled, that reads the source data 
> and writes it as-is into TABLE_A, while also performing a group-by operation 
> on the data and writing the result into TABLE_B, can produce erroneous 
> results if any map fails. The plan runs as a single MR job that writes the 
> map output to a scratch directory relative to TABLE_A and the reducer output 
> to a scratch directory relative to TABLE_B.
> If one or more maps fail, the cleanup deletes the attempt data relative to 
> TABLE_A, but it also deletes the _temporary directory relative to TABLE_B. 
> This has the unintended side-effect of preventing subsequent maps from 
> committing their data. This means that any map which completed successfully 
> before the first failure has its data committed as expected, while later 
> maps do not, resulting in an incomplete result set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-14274) When columns are added to structs in a Hive table, HCatLoader breaks.

2016-07-18 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-14274:

Status: Patch Available  (was: Open)

> When columns are added to structs in a Hive table, HCatLoader breaks.
> -
>
> Key: HIVE-14274
> URL: https://issues.apache.org/jira/browse/HIVE-14274
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 2.1.0, 1.2.1
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14274.1.patch
>
>
> Consider this sequence of table/partition creation and schema evolution:
> {code:sql}
> -- Create table.
> CREATE EXTERNAL TABLE `simple_text` (
> foo STRING,
> bar STRUCT<goo:STRING>
> )
> PARTITIONED BY ( dt STRING )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> COLLECTION ITEMS TERMINATED BY ':'
> STORED AS TEXTFILE ;
> -- Add partition.
> ALTER TABLE simple_text ADD PARTITION ( dt='0' );
> -- Alter the struct-column to add a new sub-field.
> ALTER TABLE simple_text CHANGE COLUMN bar bar STRUCT<goo:STRING, zoo:STRING>;
> {code}
> The {{dt='0'}} partition's schema indicates 2 fields in {{bar}}. The data can 
> be read using Hive, but not through HCatLoader. The error looks as follows:
> {noformat}
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception 
> while executing (Name: data_raw: 
> Store(hdfs://dilithiumblue-nn1.blue.ygrid.yahoo.com:8020/tmp/temp-643668868/tmp-1639945319:org.apache.pig.impl.io.TFileStorage)
>  - scope-1 Operator Key: scope-1): 
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
> org.apache.pig.backend.executionengine.ExecException: ERROR 6018: Error 
> converting read value to tuple
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POStoreTez.getNextTuple(POStoreTez.java:123)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.runPipeline(PigProcessor.java:376)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.run(PigProcessor.java:241)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:362)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
> org.apache.pig.backend.executionengine.ExecException: ERROR 6018: Error 
> converting read value to tuple
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POSimpleTezLoad.getNextTuple(POSimpleTezLoad.java:160)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:305)
>   ... 16 more
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 6018: 
> Error converting read value to tuple
>   at 
> org.apache.hive.hcatalog.pig.HCatBaseLoader.getNext(HCatBaseLoader.java:76)
>   at org.apache.hive.hcatalog.pig.HCatLoader.getNext(HCatLoader.java:63)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:204)
>   at 
> org.apache.tez.mapreduce.lib.MRReaderMapReduce.next(MRReaderMapReduce.java:118)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POSimpleTezLoad.getNextTuple(POSimpleTezLoad.java:140)
>   ... 17 more
> Caused by: java.lang.IndexOutOfBoundsException: Index: 2, Size: 2
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> org.apache.hive.hcatalog.pig.PigHCatUtil.transf

[jira] [Updated] (HIVE-14274) When columns are added to structs in a Hive table, HCatLoader breaks.

2016-07-18 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-14274:

Attachment: HIVE-14274.1.patch

This solution allows for columns to be added to the end of structs. It looks 
like adding support for arbitrary column-schema evolution in structs would be 
very tricky.

(Note: The solution doesn't change {{HCatRecordReader}} at all, since the 
entire struct is projected correctly by the reader.)
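
To illustrate the tail-padding idea (a hypothetical sketch; the helper name is 
mine, not the patch's):

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StructPadSketch {
  // Pad newly-added trailing struct fields with null, so a partition written
  // under the old schema can still be projected against the new one.
  static List<Object> padToSchema(List<Object> partitionValues, int schemaFieldCount) {
    List<Object> padded = new ArrayList<>(partitionValues);
    while (padded.size() < schemaFieldCount) {
      padded.add(null);
    }
    return padded;
  }

  public static void main(String[] args) {
    // The dt='0' partition wrote one sub-field; the table schema now declares two.
    System.out.println(padToSchema(Arrays.asList("goo-value"), 2)); // [goo-value, null]
  }
}
{code}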

> When columns are added to structs in a Hive table, HCatLoader breaks.
> -
>
> Key: HIVE-14274
> URL: https://issues.apache.org/jira/browse/HIVE-14274
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.1, 2.1.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14274.1.patch
>
>
> Consider this sequence of table/partition creation and schema evolution:
> {code:sql}
> -- Create table.
> CREATE EXTERNAL TABLE `simple_text` (
> foo STRING,
> bar STRUCT<goo:STRING>
> )
> PARTITIONED BY ( dt STRING )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> COLLECTION ITEMS TERMINATED BY ':'
> STORED AS TEXTFILE ;
> -- Add partition.
> ALTER TABLE simple_text ADD PARTITION ( dt='0' );
> -- Alter the struct-column to add a new sub-field.
> ALTER TABLE simple_text CHANGE COLUMN bar bar STRUCT<goo:STRING, zoo:STRING>;
> {code}
> The {{dt='0'}} partition's schema indicates 2 fields in {{bar}}. The data can 
> be read using Hive, but not through HCatLoader. The error looks as follows:
> {noformat}
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception 
> while executing (Name: data_raw: 
> Store(hdfs://dilithiumblue-nn1.blue.ygrid.yahoo.com:8020/tmp/temp-643668868/tmp-1639945319:org.apache.pig.impl.io.TFileStorage)
>  - scope-1 Operator Key: scope-1): 
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
> org.apache.pig.backend.executionengine.ExecException: ERROR 6018: Error 
> converting read value to tuple
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POStoreTez.getNextTuple(POStoreTez.java:123)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.runPipeline(PigProcessor.java:376)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.run(PigProcessor.java:241)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:362)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
>   at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
> org.apache.pig.backend.executionengine.ExecException: ERROR 6018: Error 
> converting read value to tuple
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POSimpleTezLoad.getNextTuple(POSimpleTezLoad.java:160)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:305)
>   ... 16 more
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 6018: 
> Error converting read value to tuple
>   at 
> org.apache.hive.hcatalog.pig.HCatBaseLoader.getNext(HCatBaseLoader.java:76)
>   at org.apache.hive.hcatalog.pig.HCatLoader.getNext(HCatLoader.java:63)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:204)
>   at 
> org.apache.tez.mapreduce.lib.MRReaderMapReduce.next(MRReaderMapReduce.java:118)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POSimpleTezLoad.getNextTupl

[jira] [Updated] (HIVE-14380) Queries on tables with remote HDFS paths fail in "encryption" checks.

2016-07-28 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-14380:

Description: 
If a table has table/partition locations set to remote HDFS paths, querying 
them will cause the following IAException:

{noformat}
2016-07-26 01:16:27,471 ERROR parse.CalcitePlanner 
(SemanticAnalyzer.java:getMetaData(1867)) - 
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to determine if 
hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table is encrypted: 
java.lang.IllegalArgumentException: Wrong FS: 
hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table, expected: 
hdfs://bar.ygrid.yahoo.com:8020
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:2204)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:2274)
...
{noformat}

This is because of the following code in {{SessionState}}:
{code:title=SessionState.java|borderStyle=solid}
public HadoopShims.HdfsEncryptionShim getHdfsEncryptionShim() throws HiveException {
  if (hdfsEncryptionShim == null) {
    try {
      FileSystem fs = FileSystem.get(sessionConf);
      if ("hdfs".equals(fs.getUri().getScheme())) {
        hdfsEncryptionShim = ShimLoader.getHadoopShims().createHdfsEncryptionShim(fs, sessionConf);
      } else {
        LOG.debug("Could not get hdfsEncryptionShim, it is only applicable to hdfs filesystem.");
      }
    } catch (Exception e) {
      throw new HiveException(e);
    }
  }
  return hdfsEncryptionShim;
}
{code}

When the {{FileSystem}} instance is created, using the {{sessionConf}} implies 
that the current HDFS is going to be used. This call should instead fetch the 
{{FileSystem}} instance corresponding to the path being checked.

A fix is forthcoming...

  was:
If a table has table/partition locations set to remote HDFS paths, querying 
them will cause the following IAException:

{noformat}
2016-07-26 01:16:27,471 ERROR parse.CalcitePlanner 
(SemanticAnalyzer.java:getMetaData(1867)) - 
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to deter
mine if hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table is encrypted: 
java.lang.IllegalArgumentException: Wrong FS: 
hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table, expected: 
hdfs://bar.ygrid.yahoo.com:8020
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:2204)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:2274)
...
{noformat}

This is because of the following code in {{SessionState}}:
{code:title=SessionState.java|borderStyle=solid}
public HadoopShims.HdfsEncryptionShim getHdfsEncryptionShim() throws HiveException {
  if (hdfsEncryptionShim == null) {
    try {
      FileSystem fs = FileSystem.get(sessionConf);
      if ("hdfs".equals(fs.getUri().getScheme())) {
        hdfsEncryptionShim = ShimLoader.getHadoopShims().createHdfsEncryptionShim(fs, sessionConf);
      } else {
        LOG.debug("Could not get hdfsEncryptionShim, it is only applicable to hdfs filesystem.");
      }
    } catch (Exception e) {
      throw new HiveException(e);
    }
  }
  return hdfsEncryptionShim;
}
{code}

When the {{FileSystem}} instance is created, using the {{sessionConf}} implies 
that the current HDFS is going to be used. This call should instead fetch the 
{{FileSystem}} instance corresponding to the path being checked.

A fix is forthcoming...


> Queries on tables with remote HDFS paths fail in "encryption" checks.
> -
>
> Key: HIVE-14380
> URL: https://issues.apache.org/jira/browse/HIVE-14380
> Project: Hive
>  Issue Type: Bug
>  Components: Encryption
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
>
> If a table has table/partition locations set to remote HDFS paths, querying 
> them will cause the following IAException:
> {noformat}
> 2016-07-26 01:16:27,471 ERROR parse.CalcitePlanner 
> (SemanticAnalyzer.java:getMetaData(1867)) - 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to determine if 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table is encrypted: 
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table, expected: 
> hdfs://bar.ygrid.yahoo.com:8020
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:2204)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:2274)
> ...
> {noformat}
> This is because of the following code in {{SessionState}}:
> {code:title=SessionState.java|borderSty

[jira] [Updated] (HIVE-14380) Queries on tables with remote HDFS paths fail in "encryption" checks.

2016-07-28 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-14380:

Status: Patch Available  (was: Open)

> Queries on tables with remote HDFS paths fail in "encryption" checks.
> -
>
> Key: HIVE-14380
> URL: https://issues.apache.org/jira/browse/HIVE-14380
> Project: Hive
>  Issue Type: Bug
>  Components: Encryption
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14380.1.patch
>
>
> If a table has table/partition locations set to remote HDFS paths, querying 
> them will cause the following IAException:
> {noformat}
> 2016-07-26 01:16:27,471 ERROR parse.CalcitePlanner 
> (SemanticAnalyzer.java:getMetaData(1867)) - 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to determine if 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table is encrypted: 
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table, expected: 
> hdfs://bar.ygrid.yahoo.com:8020
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:2204)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:2274)
> ...
> {noformat}
> This is because of the following code in {{SessionState}}:
> {code:title=SessionState.java|borderStyle=solid}
> public HadoopShims.HdfsEncryptionShim getHdfsEncryptionShim() throws HiveException {
>   if (hdfsEncryptionShim == null) {
>     try {
>       FileSystem fs = FileSystem.get(sessionConf);
>       if ("hdfs".equals(fs.getUri().getScheme())) {
>         hdfsEncryptionShim = ShimLoader.getHadoopShims().createHdfsEncryptionShim(fs, sessionConf);
>       } else {
>         LOG.debug("Could not get hdfsEncryptionShim, it is only applicable to hdfs filesystem.");
>       }
>     } catch (Exception e) {
>       throw new HiveException(e);
>     }
>   }
>   return hdfsEncryptionShim;
> }
> {code}
> When the {{FileSystem}} instance is created, using the {{sessionConf}} 
> implies that the current HDFS is going to be used. This call should instead 
> fetch the {{FileSystem}} instance corresponding to the path being checked.
> A fix is forthcoming...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-14380) Queries on tables with remote HDFS paths fail in "encryption" checks.

2016-07-28 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-14380:

Attachment: HIVE-14380.1.patch

The tentative fix.

> Queries on tables with remote HDFS paths fail in "encryption" checks.
> -
>
> Key: HIVE-14380
> URL: https://issues.apache.org/jira/browse/HIVE-14380
> Project: Hive
>  Issue Type: Bug
>  Components: Encryption
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14380.1.patch
>
>
> If a table has table/partition locations set to remote HDFS paths, querying 
> them will cause the following IAException:
> {noformat}
> 2016-07-26 01:16:27,471 ERROR parse.CalcitePlanner 
> (SemanticAnalyzer.java:getMetaData(1867)) - 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to determine if 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table is encrypted: 
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table, expected: 
> hdfs://bar.ygrid.yahoo.com:8020
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:2204)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:2274)
> ...
> {noformat}
> This is because of the following code in {{SessionState}}:
> {code:title=SessionState.java|borderStyle=solid}
>  public HadoopShims.HdfsEncryptionShim getHdfsEncryptionShim() throws 
> HiveException {
> if (hdfsEncryptionShim == null) {
>   try {
> FileSystem fs = FileSystem.get(sessionConf);
> if ("hdfs".equals(fs.getUri().getScheme())) {
>   hdfsEncryptionShim = 
> ShimLoader.getHadoopShims().createHdfsEncryptionShim(fs, sessionConf);
> } else {
>   LOG.debug("Could not get hdfsEncryptionShim, it is only applicable 
> to hdfs filesystem.");
> }
>   } catch (Exception e) {
> throw new HiveException(e);
>   }
> }
> return hdfsEncryptionShim;
>   }
> {code}
> When the {{FileSystem}} instance is created, using the {{sessionConf}} 
> implies that the current HDFS is going to be used. This call should instead 
> fetch the {{FileSystem}} instance corresponding to the path being checked.
> A fix is forthcoming...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14380) Queries on tables with remote HDFS paths fail in "encryption" checks.

2016-07-29 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15399853#comment-15399853
 ] 

Mithun Radhakrishnan commented on HIVE-14380:
-

Thanks for reviewing, [~spena]. :] Also, yikes. I'm not sure how 2 JIRAs got 
raised. :/

> Queries on tables with remote HDFS paths fail in "encryption" checks.
> -
>
> Key: HIVE-14380
> URL: https://issues.apache.org/jira/browse/HIVE-14380
> Project: Hive
>  Issue Type: Bug
>  Components: Encryption
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14380.1.patch
>
>
> If a table has table/partition locations set to remote HDFS paths, querying 
> them will cause the following IAException:
> {noformat}
> 2016-07-26 01:16:27,471 ERROR parse.CalcitePlanner 
> (SemanticAnalyzer.java:getMetaData(1867)) - 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to determine if 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table is encrypted: 
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table, expected: 
> hdfs://bar.ygrid.yahoo.com:8020
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:2204)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:2274)
> ...
> {noformat}
> This is because of the following code in {{SessionState}}:
> {code:title=SessionState.java|borderStyle=solid}
>  public HadoopShims.HdfsEncryptionShim getHdfsEncryptionShim() throws 
> HiveException {
> if (hdfsEncryptionShim == null) {
>   try {
> FileSystem fs = FileSystem.get(sessionConf);
> if ("hdfs".equals(fs.getUri().getScheme())) {
>   hdfsEncryptionShim = 
> ShimLoader.getHadoopShims().createHdfsEncryptionShim(fs, sessionConf);
> } else {
>   LOG.debug("Could not get hdfsEncryptionShim, it is only applicable 
> to hdfs filesystem.");
> }
>   } catch (Exception e) {
> throw new HiveException(e);
>   }
> }
> return hdfsEncryptionShim;
>   }
> {code}
> When the {{FileSystem}} instance is created, using the {{sessionConf}} 
> implies that the current HDFS is going to be used. This call should instead 
> fetch the {{FileSystem}} instance corresponding to the path being checked.
> A fix is forthcoming...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14380) Queries on tables with remote HDFS paths fail in "encryption" checks.

2016-08-01 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15403411#comment-15403411
 ] 

Mithun Radhakrishnan commented on HIVE-14380:
-

I'll confirm, but I think these failures might be unrelated.

> Queries on tables with remote HDFS paths fail in "encryption" checks.
> -
>
> Key: HIVE-14380
> URL: https://issues.apache.org/jira/browse/HIVE-14380
> Project: Hive
>  Issue Type: Bug
>  Components: Encryption
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14380.1.patch
>
>
> If a table has table/partition locations set to remote HDFS paths, querying 
> them will cause the following IAException:
> {noformat}
> 2016-07-26 01:16:27,471 ERROR parse.CalcitePlanner 
> (SemanticAnalyzer.java:getMetaData(1867)) - 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to determine if 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table is encrypted: 
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table, expected: 
> hdfs://bar.ygrid.yahoo.com:8020
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:2204)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:2274)
> ...
> {noformat}
> This is because of the following code in {{SessionState}}:
> {code:title=SessionState.java|borderStyle=solid}
>  public HadoopShims.HdfsEncryptionShim getHdfsEncryptionShim() throws 
> HiveException {
> if (hdfsEncryptionShim == null) {
>   try {
> FileSystem fs = FileSystem.get(sessionConf);
> if ("hdfs".equals(fs.getUri().getScheme())) {
>   hdfsEncryptionShim = 
> ShimLoader.getHadoopShims().createHdfsEncryptionShim(fs, sessionConf);
> } else {
>   LOG.debug("Could not get hdfsEncryptionShim, it is only applicable 
> to hdfs filesystem.");
> }
>   } catch (Exception e) {
> throw new HiveException(e);
>   }
> }
> return hdfsEncryptionShim;
>   }
> {code}
> When the {{FileSystem}} instance is created, using the {{sessionConf}} 
> implies that the current HDFS is going to be used. This call should instead 
> fetch the {{FileSystem}} instance corresponding to the path being checked.
> A fix is forthcoming...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14380) Queries on tables with remote HDFS paths fail in "encryption" checks.

2016-08-01 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15403482#comment-15403482
 ] 

Mithun Radhakrishnan commented on HIVE-14380:
-

Yeah, looks like these tests are busted on master. :/ Just checked on a fresh 
checkout.

All except {{TestHiveMetaStoreTxns}}. That test seems to run for me (even with 
my patch applied).

> Queries on tables with remote HDFS paths fail in "encryption" checks.
> -
>
> Key: HIVE-14380
> URL: https://issues.apache.org/jira/browse/HIVE-14380
> Project: Hive
>  Issue Type: Bug
>  Components: Encryption
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14380.1.patch
>
>
> If a table has table/partition locations set to remote HDFS paths, querying 
> them will cause the following IAException:
> {noformat}
> 2016-07-26 01:16:27,471 ERROR parse.CalcitePlanner 
> (SemanticAnalyzer.java:getMetaData(1867)) - 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to determine if 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table is encrypted: 
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table, expected: 
> hdfs://bar.ygrid.yahoo.com:8020
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:2204)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:2274)
> ...
> {noformat}
> This is because of the following code in {{SessionState}}:
> {code:title=SessionState.java|borderStyle=solid}
>  public HadoopShims.HdfsEncryptionShim getHdfsEncryptionShim() throws 
> HiveException {
> if (hdfsEncryptionShim == null) {
>   try {
> FileSystem fs = FileSystem.get(sessionConf);
> if ("hdfs".equals(fs.getUri().getScheme())) {
>   hdfsEncryptionShim = 
> ShimLoader.getHadoopShims().createHdfsEncryptionShim(fs, sessionConf);
> } else {
>   LOG.debug("Could not get hdfsEncryptionShim, it is only applicable 
> to hdfs filesystem.");
> }
>   } catch (Exception e) {
> throw new HiveException(e);
>   }
> }
> return hdfsEncryptionShim;
>   }
> {code}
> When the {{FileSystem}} instance is created, using the {{sessionConf}} 
> implies that the current HDFS is going to be used. This call should instead 
> fetch the {{FileSystem}} instance corresponding to the path being checked.
> A fix is forthcoming...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14380) Queries on tables with remote HDFS paths fail in "encryption" checks.

2016-08-03 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15406177#comment-15406177
 ] 

Mithun Radhakrishnan commented on HIVE-14380:
-

Thank you very much, [~spena]. I have a related fix on the metastore 
server-side. I hope to make time to raise a JIRA for this soon.

> Queries on tables with remote HDFS paths fail in "encryption" checks.
> -
>
> Key: HIVE-14380
> URL: https://issues.apache.org/jira/browse/HIVE-14380
> Project: Hive
>  Issue Type: Bug
>  Components: Encryption
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Fix For: 2.2.0
>
> Attachments: HIVE-14380.1.patch
>
>
> If a table has table/partition locations set to remote HDFS paths, querying 
> them will cause the following IAException:
> {noformat}
> 2016-07-26 01:16:27,471 ERROR parse.CalcitePlanner 
> (SemanticAnalyzer.java:getMetaData(1867)) - 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to determine if 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table is encrypted: 
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table, expected: 
> hdfs://bar.ygrid.yahoo.com:8020
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:2204)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:2274)
> ...
> {noformat}
> This is because of the following code in {{SessionState}}:
> {code:title=SessionState.java|borderStyle=solid}
>  public HadoopShims.HdfsEncryptionShim getHdfsEncryptionShim() throws 
> HiveException {
> if (hdfsEncryptionShim == null) {
>   try {
> FileSystem fs = FileSystem.get(sessionConf);
> if ("hdfs".equals(fs.getUri().getScheme())) {
>   hdfsEncryptionShim = 
> ShimLoader.getHadoopShims().createHdfsEncryptionShim(fs, sessionConf);
> } else {
>   LOG.debug("Could not get hdfsEncryptionShim, it is only applicable 
> to hdfs filesystem.");
> }
>   } catch (Exception e) {
> throw new HiveException(e);
>   }
> }
> return hdfsEncryptionShim;
>   }
> {code}
> When the {{FileSystem}} instance is created, using the {{sessionConf}} 
> implies that the current HDFS is going to be used. This call should instead 
> fetch the {{FileSystem}} instance corresponding to the path being checked.
> A fix is forthcoming...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13754) Fix resource leak in HiveClientCache

2016-08-08 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412206#comment-15412206
 ] 

Mithun Radhakrishnan commented on HIVE-13754:
-

+1. The test failures don't seem related to the fix.
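
To make the reference-counting hazard below concrete, a toy model (names 
and structure are illustrative, not the actual {{HiveClientCache}} code):

{code:title=Toy model of the leak|borderStyle=solid}
import java.util.concurrent.atomic.AtomicInteger;

class CountedClient {
  private final AtomicInteger users = new AtomicInteger(0);
  private boolean connectionOpen = true;

  void acquire() { users.incrementAndGet(); }

  void release() {
    users.decrementAndGet();
    tearDownIfUnused();
  }

  // reconnect() reaches close() outside any acquire()/release() pairing,
  // which is what drives the count negative.
  void close() {
    users.decrementAndGet();  // goes to -1 when no acquire() preceded it
    tearDownIfUnused();
  }

  private void tearDownIfUnused() {
    if (users.get() == 0 && connectionOpen) {  // never true once users == -1
      connectionOpen = false;  // stand-in for closing the metastore connection
    }
  }
}
{code}

Once {{users}} sits at -1, acquire/release cycles oscillate between 0 and -1, 
the teardown condition is never met at release time, and only {{finalize}} 
ever closes the connection.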

> Fix resource leak in HiveClientCache
> 
>
> Key: HIVE-13754
> URL: https://issues.apache.org/jira/browse/HIVE-13754
> Project: Hive
>  Issue Type: Bug
>  Components: Clients
>Affects Versions: 1.2.1, 2.0.0
>Reporter: Chris Drome
>Assignee: Chris Drome
> Attachments: HIVE-13754-branch-1.patch, HIVE-13754.1-branch-1.patch, 
> HIVE-13754.1.patch, HIVE-13754.patch
>
>
> Found that the {{users}} reference count can go into negative values, which 
> prevents {{tearDownIfUnused}} from closing the client connection when called.
> This leads to a build-up of clients which have been evicted from the cache, 
> are no longer in use, but have not been shut down.
> GC will eventually call {{finalize}}, which forcibly closes the connection 
> and cleans up the client, but I have seen as many as several hundred open 
> client connections as a result.
> The main source of this is {{RetryingMetaStoreClient}}, which calls 
> {{reconnect}} on acquire, and {{reconnect}} in turn calls {{close}}. This 
> decrements {{users}} to -1 on the reconnect; acquire then raises it to 0 
> while the client is in use, and it falls back to -1 on release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13756) Map failure attempts to delete reducer _temporary directory on multi-query pig query

2016-08-09 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414342#comment-15414342
 ] 

Mithun Radhakrishnan commented on HIVE-13756:
-

+1. 
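
To illustrate the failure mode described below (all paths hypothetical):

{noformat}
Single MR job produced by the multi-query plan:
  .../TABLE_A/_SCRATCH0.../_temporary/attempt_m_000042_0/  <- map-side output
  .../TABLE_B/_SCRATCH0.../_temporary/attempt_r_000007_0/  <- reduce-side output

A failed map should remove only its own attempt directory under TABLE_A;
deleting TABLE_B's _temporary directory as well stops the surviving maps
from committing their output.
{noformat}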

> Map failure attempts to delete reducer _temporary directory on multi-query 
> pig query
> 
>
> Key: HIVE-13756
> URL: https://issues.apache.org/jira/browse/HIVE-13756
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.1, 2.0.0
>Reporter: Chris Drome
>Assignee: Chris Drome
> Attachments: HIVE-13756-branch-1.patch, HIVE-13756.1-branch-1.patch, 
> HIVE-13756.1.patch, HIVE-13756.patch
>
>
> A pig script, executed with multi-query enabled, that reads the source data 
> and writes it as-is into TABLE_A, while also performing a group-by operation 
> on the data and writing the result into TABLE_B, can produce erroneous 
> results if any map fails. This results in a single MR job that writes the 
> map output to a scratch directory relative to TABLE_A and the reducer output 
> to a scratch directory relative to TABLE_B.
> If one or more maps fail, the cleanup deletes the attempt data relative to 
> TABLE_A, but it also deletes the _temporary directory relative to TABLE_B. 
> This has the unintended side-effect of preventing subsequent maps from 
> committing their data. This means that any maps which completed successfully 
> before the first map failure will have their data committed as expected, 
> while the rest will not, resulting in an incomplete result set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13756) Map failure attempts to delete reducer _temporary directory on multi-query pig query

2016-08-10 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15415574#comment-15415574
 ] 

Mithun Radhakrishnan commented on HIVE-13756:
-

Committed to master. Thanks, [~cdrome]!

> Map failure attempts to delete reducer _temporary directory on multi-query 
> pig query
> 
>
> Key: HIVE-13756
> URL: https://issues.apache.org/jira/browse/HIVE-13756
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.1, 2.0.0
>Reporter: Chris Drome
>Assignee: Chris Drome
> Attachments: HIVE-13756-branch-1.patch, HIVE-13756.1-branch-1.patch, 
> HIVE-13756.1.patch, HIVE-13756.patch
>
>
> A pig script, executed with multi-query enabled, that reads the source data 
> and writes it as-is into TABLE_A, while also performing a group-by operation 
> on the data and writing the result into TABLE_B, can produce erroneous 
> results if any map fails. This results in a single MR job that writes the 
> map output to a scratch directory relative to TABLE_A and the reducer output 
> to a scratch directory relative to TABLE_B.
> If one or more maps fail, the cleanup deletes the attempt data relative to 
> TABLE_A, but it also deletes the _temporary directory relative to TABLE_B. 
> This has the unintended side-effect of preventing subsequent maps from 
> committing their data. This means that any maps which completed successfully 
> before the first map failure will have their data committed as expected, 
> while the rest will not, resulting in an incomplete result set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-13756) Map failure attempts to delete reducer _temporary directory on multi-query pig query

2016-08-10 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-13756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-13756:

  Resolution: Fixed
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0, 1.2.1  (was: 1.2.1, 2.0.0)
  Status: Resolved  (was: Patch Available)

> Map failure attempts to delete reducer _temporary directory on multi-query 
> pig query
> 
>
> Key: HIVE-13756
> URL: https://issues.apache.org/jira/browse/HIVE-13756
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.1, 2.0.0
>Reporter: Chris Drome
>Assignee: Chris Drome
> Fix For: 2.0.0
>
> Attachments: HIVE-13756-branch-1.patch, HIVE-13756.1-branch-1.patch, 
> HIVE-13756.1.patch, HIVE-13756.patch
>
>
> A pig script, executed with multi-query enabled, that reads the source data 
> and writes it as-is into TABLE_A, while also performing a group-by operation 
> on the data and writing the result into TABLE_B, can produce erroneous 
> results if any map fails. This results in a single MR job that writes the 
> map output to a scratch directory relative to TABLE_A and the reducer output 
> to a scratch directory relative to TABLE_B.
> If one or more maps fail, the cleanup deletes the attempt data relative to 
> TABLE_A, but it also deletes the _temporary directory relative to TABLE_B. 
> This has the unintended side-effect of preventing subsequent maps from 
> committing their data. This means that any maps which completed successfully 
> before the first map failure will have their data committed as expected, 
> while the rest will not, resulting in an incomplete result set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13754) Fix resource leak in HiveClientCache

2016-08-10 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15415599#comment-15415599
 ] 

Mithun Radhakrishnan commented on HIVE-13754:
-

Committed to master. Thanks, [~cdrome]!

> Fix resource leak in HiveClientCache
> 
>
> Key: HIVE-13754
> URL: https://issues.apache.org/jira/browse/HIVE-13754
> Project: Hive
>  Issue Type: Bug
>  Components: Clients
>Affects Versions: 1.2.1, 2.0.0
>Reporter: Chris Drome
>Assignee: Chris Drome
> Fix For: 2.0.0
>
> Attachments: HIVE-13754-branch-1.patch, HIVE-13754.1-branch-1.patch, 
> HIVE-13754.1.patch, HIVE-13754.patch
>
>
> Found that the {{users}} reference count can go into negative values, which 
> prevents {{tearDownIfUnused}} from closing the client connection when called.
> This leads to a build-up of clients which have been evicted from the cache, 
> are no longer in use, but have not been shut down.
> GC will eventually call {{finalize}}, which forcibly closes the connection 
> and cleans up the client, but I have seen as many as several hundred open 
> client connections as a result.
> The main source of this is {{RetryingMetaStoreClient}}, which calls 
> {{reconnect}} on acquire, and {{reconnect}} in turn calls {{close}}. This 
> decrements {{users}} to -1 on the reconnect; acquire then raises it to 0 
> while the client is in use, and it falls back to -1 on release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-13754) Fix resource leak in HiveClientCache

2016-08-10 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-13754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-13754:

  Resolution: Fixed
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0, 1.2.1  (was: 1.2.1, 2.0.0)
  Status: Resolved  (was: Patch Available)

> Fix resource leak in HiveClientCache
> 
>
> Key: HIVE-13754
> URL: https://issues.apache.org/jira/browse/HIVE-13754
> Project: Hive
>  Issue Type: Bug
>  Components: Clients
>Affects Versions: 1.2.1, 2.0.0
>Reporter: Chris Drome
>Assignee: Chris Drome
> Fix For: 2.0.0
>
> Attachments: HIVE-13754-branch-1.patch, HIVE-13754.1-branch-1.patch, 
> HIVE-13754.1.patch, HIVE-13754.patch
>
>
> Found that the {{users}} reference count can go into negative values, which 
> prevents {{tearDownIfUnused}} from closing the client connection when called.
> This leads to a build-up of clients which have been evicted from the cache, 
> are no longer in use, but have not been shut down.
> GC will eventually call {{finalize}}, which forcibly closes the connection 
> and cleans up the client, but I have seen as many as several hundred open 
> client connections as a result.
> The main source of this is {{RetryingMetaStoreClient}}, which calls 
> {{reconnect}} on acquire, and {{reconnect}} in turn calls {{close}}. This 
> decrements {{users}} to -1 on the reconnect; acquire then raises it to 0 
> while the client is in use, and it falls back to -1 on release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-13754) Fix resource leak in HiveClientCache

2016-08-11 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-13754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-13754:

Fix Version/s: (was: 2.0.0)
   2.2.0

> Fix resource leak in HiveClientCache
> 
>
> Key: HIVE-13754
> URL: https://issues.apache.org/jira/browse/HIVE-13754
> Project: Hive
>  Issue Type: Bug
>  Components: Clients
>Affects Versions: 1.2.1, 2.0.0
>Reporter: Chris Drome
>Assignee: Chris Drome
> Fix For: 2.2.0
>
> Attachments: HIVE-13754-branch-1.patch, HIVE-13754.1-branch-1.patch, 
> HIVE-13754.1.patch, HIVE-13754.patch
>
>
> Found that the {{users}} reference count can go into negative values, which 
> prevents {{tearDownIfUnused}} from closing the client connection when called.
> This leads to a build-up of clients which have been evicted from the cache, 
> are no longer in use, but have not been shut down.
> GC will eventually call {{finalize}}, which forcibly closes the connection 
> and cleans up the client, but I have seen as many as several hundred open 
> client connections as a result.
> The main source of this is {{RetryingMetaStoreClient}}, which calls 
> {{reconnect}} on acquire, and {{reconnect}} in turn calls {{close}}. This 
> decrements {{users}} to -1 on the reconnect; acquire then raises it to 0 
> while the client is in use, and it falls back to -1 on release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13754) Fix resource leak in HiveClientCache

2016-08-11 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15416684#comment-15416684
 ] 

Mithun Radhakrishnan commented on HIVE-13754:
-

Right you are, [~leftylev]. I've fixed (aha!) the fix version.

> Fix resource leak in HiveClientCache
> 
>
> Key: HIVE-13754
> URL: https://issues.apache.org/jira/browse/HIVE-13754
> Project: Hive
>  Issue Type: Bug
>  Components: Clients
>Affects Versions: 1.2.1, 2.0.0
>Reporter: Chris Drome
>Assignee: Chris Drome
> Fix For: 2.2.0
>
> Attachments: HIVE-13754-branch-1.patch, HIVE-13754.1-branch-1.patch, 
> HIVE-13754.1.patch, HIVE-13754.patch
>
>
> Found that the {{users}} reference count can go into negative values, which 
> prevents {{tearDownIfUnused}} from closing the client connection when called.
> This leads to a build-up of clients which have been evicted from the cache, 
> are no longer in use, but have not been shut down.
> GC will eventually call {{finalize}}, which forcibly closes the connection 
> and cleans up the client, but I have seen as many as several hundred open 
> client connections as a result.
> The main source of this is {{RetryingMetaStoreClient}}, which calls 
> {{reconnect}} on acquire, and {{reconnect}} in turn calls {{close}}. This 
> decrements {{users}} to -1 on the reconnect; acquire then raises it to 0 
> while the client is in use, and it falls back to -1 on release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-13756) Map failure attempts to delete reducer _temporary directory on multi-query pig query

2016-08-11 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-13756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-13756:

Fix Version/s: (was: 2.0.0)
   2.2.0

> Map failure attempts to delete reducer _temporary directory on multi-query 
> pig query
> 
>
> Key: HIVE-13756
> URL: https://issues.apache.org/jira/browse/HIVE-13756
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.1, 2.0.0
>Reporter: Chris Drome
>Assignee: Chris Drome
> Fix For: 2.2.0
>
> Attachments: HIVE-13756-branch-1.patch, HIVE-13756.1-branch-1.patch, 
> HIVE-13756.1.patch, HIVE-13756.patch
>
>
> A pig script, executed with multi-query enabled, that reads the source data 
> and writes it as-is into TABLE_A, while also performing a group-by operation 
> on the data and writing the result into TABLE_B, can produce erroneous 
> results if any map fails. This results in a single MR job that writes the 
> map output to a scratch directory relative to TABLE_A and the reducer output 
> to a scratch directory relative to TABLE_B.
> If one or more maps fail, the cleanup deletes the attempt data relative to 
> TABLE_A, but it also deletes the _temporary directory relative to TABLE_B. 
> This has the unintended side-effect of preventing subsequent maps from 
> committing their data. This means that any maps which completed successfully 
> before the first map failure will have their data committed as expected, 
> while the rest will not, resulting in an incomplete result set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11693) CommonMergeJoinOperator throws exception with tez

2016-08-11 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15417699#comment-15417699
 ] 

Mithun Radhakrishnan commented on HIVE-11693:
-

[~selinazh], could we please post our solution to this JIRA?

> CommonMergeJoinOperator throws exception with tez
> -
>
> Key: HIVE-11693
> URL: https://issues.apache.org/jira/browse/HIVE-11693
> Project: Hive
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Assignee: Selina Zhang
> Attachments: HIVE-11693.1.patch
>
>
> Got this when executing a simple query with latest hive build + tez latest 
> version.
> {noformat}
> Error: Failure while running task: 
> attempt_1439860407967_0291_2_03_45_0:java.lang.RuntimeException: 
> java.lang.RuntimeException: Hive Runtime Error while closing operators: 
> java.lang.RuntimeException: java.io.IOException: Please check if you are 
> invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
> at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:349)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:71)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:60)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:60)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:35)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Hive Runtime Error while closing 
> operators: java.lang.RuntimeException: java.io.IOException: Please check if 
> you are invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:316)
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:162)
> ... 14 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.RuntimeException: java.io.IOException: Please check if you are 
> invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchOneRow(CommonMergeJoinOperator.java:412)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchNextGroup(CommonMergeJoinOperator.java:375)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.doFirstFetchIfNeeded(CommonMergeJoinOperator.java:482)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinFinalLeftData(CommonMergeJoinOperator.java:434)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.closeOp(CommonMergeJoinOperator.java:384)
> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:616)
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:292)
> ... 15 more
> Caused by: java.lang.RuntimeException: java.io.IOException: Please check if 
> you are invoking moveToNext() even after it returned false.
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:291)
> at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchOneRow(CommonMergeJoinOperator.java:400)
> ... 21 more
> Caused by: java.io.IOException: Please check if you are invoking moveToNext() 
> even after it returned false.
> at 
> org.apache.tez.runtime.library.common.ValuesIterator.hasCompletedProcessing(ValuesIterator.java:223)
> at 
> org.apache.tez.runtime.library.common.ValuesIterator.moveToNext(ValuesIterator.java:105)
> at 
> org.apache.tez.runtime.library.input.OrderedGroupedKVInput$OrderedGroupedKeyValuesReader.next(OrderedGroupedKVInput.java:308)
> at 
> org.apache.hadoop.hive.ql.exec.tez.KeyValuesFromKeyValues.next(KeyValuesFromKeyValues.java:46)
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:249)
> ... 22 more
> {noformat}
> Not sure if this is related to HIVE-11016. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HIVE-14789) Avro Table-reads bork when using SerDe-generated table-schema.

2016-09-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan reassigned HIVE-14789:
---

Assignee: Mithun Radhakrishnan

> Avro Table-reads bork when using SerDe-generated table-schema.
> --
>
> Key: HIVE-14789
> URL: https://issues.apache.org/jira/browse/HIVE-14789
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>Affects Versions: 1.2.1, 2.0.1
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
>
> AvroSerDe allows one to skip the table-columns in a table-definition when 
> creating a table, as long as the TBLPROPERTIES includes a valid 
> {{avro.schema.url}} or {{avro.schema.literal}}. The table-columns are 
> inferred from processing the Avro schema file/literal.
> The problem is that the inferred schema might not be congruent with the 
> actual schema in the Avro schema file/literal. Consider the following table 
> definition:
> {code:sql}
> CREATE TABLE avro_schema_break_1
> ROW FORMAT
> SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> STORED AS
> INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> TBLPROPERTIES ('avro.schema.literal'='{
>   "type": "record",
>   "name": "Messages",
>   "namespace": "net.myth",
>   "fields": [
> {
>   "name": "header",
>   "type": [
> "null",
> {
>   "type": "record",
>   "name": "HeaderInfo",
>   "fields": [
> {
>   "name": "inferred_event_type",
>   "type": [
> "null",
> "string"
>   ],
>   "default": null
> },
> {
>   "name": "event_type",
>   "type": [
> "null",
> "string"
>   ],
>   "default": null
> },
> {
>   "name": "event_version",
>   "type": [
> "null",
> "string"
>   ],
>   "default": null
> }
>   ]
> }
>   ]
> },
> {
>   "name": "messages",
>   "type": {
> "type": "array",
> "items": {
>   "name": "MessageInfo",
>   "type": "record",
>   "fields": [
> {
>   "name": "message_id",
>   "type": [
> "null",
> "string"
>   ],
>   "doc": "Message-ID"
> },
> {
>   "name": "received_date",
>   "type": [
> "null",
> "long"
>   ],
>   "doc": "Received Date"
> },
> {
>   "name": "sent_date",
>   "type": [
> "null",
> "long"
>   ]
> },
> {
>   "name": "from_name",
>   "type": [
> "null",
> "string"
>   ]
> },
> {
>   "name": "flags",
>   "type": [
> "null",
> {
>   "type": "record",
>   "name": "Flags",
>   "fields": [
> {
>   "name": "is_seen",
>   "type": [
> "null",
> "boolean"
>   ],
>   "default": null
> },
> {
>   "name": "is_read",
>   "type": [
> "null",
> "boolean"
>   ],
>   "default": null
> },
> {
>   "name": "is_flagged",
>   "type": [
> "null",
> "boolean"
>   ],
>   "default": null
> }
>   ]
> }
>   ],
>   "default": null
> }
>   ]
> }
>   }
> }
>   ]
> }');
> {code}
> This produces a table with the following schema:
> {noformat}
> 2016-09-19T13:23:42,934 DEBUG [0ce7e586-13ea-4390-ac2a-6dac36e8a216 main] 
> hive.log: DDL: struct avro_schema_break_1 { 
> struct<inferred_event_type:string,event_type:string,event_version:string> 
> header, 
> list<struct<message_id:string,received_date:bigint,sent_date:bigint,from_name:string,flags:struct<is_seen:boolean,is_read:boolean,is_flagged:boolean>>>
>  messages}
> {noformat}
> Data written to this table with the Avro schema from {{avro.schema.literal}} 
> (via Pig's {{AvroStorage}}) cannot be read back in Hive using the generated 
> table schema. This is the exception one sees:
> {noforma

[jira] [Updated] (HIVE-14789) Avro Table-reads bork when using SerDe-generated table-schema.

2016-09-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-14789:

Attachment: HIVE-14789-reproduce.patch

This attachment has a qfile-test that reproduces the error I'm talking about, 
including a scrubbed data-file that's readable with the schema-literal, but not 
without it. 

This was a fairly common failure at Yahoo. Our current recommendation is for 
users to only use Avro tables with the schema-file with which they were 
produced. The metastore-based schema is to be ignored entirely.

I've already tried modifying how the Avro schema is generated from 
{{columns.list.types}}, but I find that the conversions (to and fro) are lossy, 
brittle and unreliable. :/
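
To make "lossy round-trip" concrete, a small check one can run (here, 
{{regeneratedFromHiveColumns}} stands in for whatever converts the metastore 
column types back into an Avro schema; it is hypothetical, not a Hive API):

{code:title=Round-trip congruence check|borderStyle=solid}
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;

class RoundTripCheck {
  // Returns true only if data written with the schema-literal (the writer
  // schema) is readable through the schema regenerated from Hive columns
  // (the reader schema). Record names, namespaces, defaults, and union
  // shapes are dropped by the Hive-type round-trip, so this can fail even
  // when the column names and types look identical.
  static boolean congruent(String schemaLiteral, Schema regeneratedFromHiveColumns) {
    Schema writer = new Schema.Parser().parse(schemaLiteral);
    return SchemaCompatibility
        .checkReaderWriterCompatibility(regeneratedFromHiveColumns, writer)
        .getType() == SchemaCompatibilityType.COMPATIBLE;
  }
}
{code}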

> Avro Table-reads bork when using SerDe-generated table-schema.
> --
>
> Key: HIVE-14789
> URL: https://issues.apache.org/jira/browse/HIVE-14789
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>Affects Versions: 1.2.1, 2.0.1
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14789-reproduce.patch
>
>
> AvroSerDe allows one to skip the table-columns in a table-definition when 
> creating a table, as long as the TBLPROPERTIES includes a valid 
> {{avro.schema.url}} or {{avro.schema.literal}}. The table-columns are 
> inferred from processing the Avro schema file/literal.
> The problem is that the inferred schema might not be congruent with the 
> actual schema in the Avro schema file/literal. Consider the following table 
> definition:
> {code:sql}
> CREATE TABLE avro_schema_break_1
> ROW FORMAT
> SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> STORED AS
> INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> TBLPROPERTIES ('avro.schema.literal'='{
>   "type": "record",
>   "name": "Messages",
>   "namespace": "net.myth",
>   "fields": [
> {
>   "name": "header",
>   "type": [
> "null",
> {
>   "type": "record",
>   "name": "HeaderInfo",
>   "fields": [
> {
>   "name": "inferred_event_type",
>   "type": [
> "null",
> "string"
>   ],
>   "default": null
> },
> {
>   "name": "event_type",
>   "type": [
> "null",
> "string"
>   ],
>   "default": null
> },
> {
>   "name": "event_version",
>   "type": [
> "null",
> "string"
>   ],
>   "default": null
> }
>   ]
> }
>   ]
> },
> {
>   "name": "messages",
>   "type": {
> "type": "array",
> "items": {
>   "name": "MessageInfo",
>   "type": "record",
>   "fields": [
> {
>   "name": "message_id",
>   "type": [
> "null",
> "string"
>   ],
>   "doc": "Message-ID"
> },
> {
>   "name": "received_date",
>   "type": [
> "null",
> "long"
>   ],
>   "doc": "Received Date"
> },
> {
>   "name": "sent_date",
>   "type": [
> "null",
> "long"
>   ]
> },
> {
>   "name": "from_name",
>   "type": [
> "null",
> "string"
>   ]
> },
> {
>   "name": "flags",
>   "type": [
> "null",
> {
>   "type": "record",
>   "name": "Flags",
>   "fields": [
> {
>   "name": "is_seen",
>   "type": [
> "null",
> "boolean"
>   ],
>   "default": null
> },
> {
>   "name": "is_read",
>   "type": [
> "null",
> "boolean"
>   ],
>   "default": null
> },
> {
>   "name": "is_flagged",
>   "type": [
> "null",
> "boolean"
>   ],
>   "default": null
> }
> 

[jira] [Updated] (HIVE-14792) AvroSerde reads the remote schema-file at least once per mapper, per table reference.

2016-09-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-14792:

Attachment: HIVE-14792.1.patch

This patch introduces an optimizer that prefetches the {{avro.schema.url}} 
contents, and modifies the table-info stored in the query-plan to contain the 
schema (as the {{avro.schema.literal}} property). The {{AvroSerDe}} is almost 
completely unchanged, and handles {{avro.schema.literal}} transparently.
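
At a high level, the pre-fetch amounts to the following (a simplified sketch; 
the property keys are the standard Avro ones, but the surrounding method and 
its placement in the optimizer are illustrative):

{code:title=Pre-fetch sketch|borderStyle=solid}
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

class AvroSchemaPrefetch {
  // Read avro.schema.url once at planning time and inline the contents as
  // avro.schema.literal in the table-properties carried by the query-plan,
  // so no mapper ever opens the remote schema file.
  static void prefetch(Configuration conf, Properties tableProps) throws Exception {
    String url = tableProps.getProperty("avro.schema.url");
    if (url == null || url.isEmpty()) {
      return;
    }
    Path schemaPath = new Path(url);
    FileSystem fs = schemaPath.getFileSystem(conf);
    byte[] schemaBytes = new byte[(int) fs.getFileStatus(schemaPath).getLen()];
    try (InputStream in = fs.open(schemaPath)) {
      IOUtils.readFully(in, schemaBytes, 0, schemaBytes.length);
    }
    tableProps.setProperty("avro.schema.literal",
        new String(schemaBytes, StandardCharsets.UTF_8));
  }
}
{code}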

> AvroSerde reads the remote schema-file at least once per mapper, per table 
> reference.
> -
>
> Key: HIVE-14792
> URL: https://issues.apache.org/jira/browse/HIVE-14792
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.2.1, 2.1.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14792.1.patch
>
>
> Avro tables that use "external" schema files stored on HDFS can cause 
> excessive calls to {{FileSystem::open()}}, especially for queries that spawn 
> large numbers of mappers.
> This is because of the following code in {{AvroSerDe::initialize()}}:
> {code:title=AvroSerDe.java|borderStyle=solid}
> public void initialize(Configuration configuration, Properties properties) 
> throws SerDeException {
> // ...
> if (hasExternalSchema(properties)
> || columnNameProperty == null || columnNameProperty.isEmpty()
> || columnTypeProperty == null || columnTypeProperty.isEmpty()) {
>   schema = determineSchemaOrReturnErrorSchema(configuration, properties);
> } else {
>   // Get column names and sort order
>   columnNames = Arrays.asList(columnNameProperty.split(","));
>   columnTypes = 
> TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);
>   schema = getSchemaFromCols(properties, columnNames, columnTypes, 
> columnCommentProperty);
>  
> properties.setProperty(AvroSerdeUtils.AvroTableProperties.SCHEMA_LITERAL.getPropName(),
>  schema.toString());
> }
> // ...
> }
> {code}
> For files using {{avro.schema.url}}, every time the SerDe is initialized 
> (i.e. at least once per mapper), the schema file is read remotely. For 
> queries with thousands of mappers, this leads to a stampede to the handful 
> (3?) datanodes that host the schema-file. In the best case, this causes 
> slowdowns.
> It would be preferable to distribute the Avro-schema to all mappers as part 
> of the job-conf. The alternatives aren't exactly appealing:
> # One can't rely solely on the {{column.list.types}} stored in the Hive 
> metastore. (HIVE-14789).
> # {{avro.schema.literal}} might not always be usable, because of the 
> size-limit on table-parameters. The typical size of the Avro-schema file is 
> between 0.5-3MB, in my limited experience. Bumping the max table-parameter 
> size isn't a great solution.
> If the {{avro.schema.file}} were read during query-planning, and made 
> available as part of table-properties (but not serialized into the 
> metastore), the downstream logic will remain largely intact. I have a patch 
> that does this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-14792) AvroSerde reads the remote schema-file at least once per mapper, per table reference.

2016-09-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-14792:

Description: 
Avro tables that use "external" schema files stored on HDFS can cause excessive 
calls to {{FileSystem::open()}}, especially for queries that spawn large 
numbers of mappers.

This is because of the following code in {{AvroSerDe::initialize()}}:

{code:title=AvroSerDe.java|borderStyle=solid}
public void initialize(Configuration configuration, Properties properties) 
throws SerDeException {
// ...
if (hasExternalSchema(properties)
|| columnNameProperty == null || columnNameProperty.isEmpty()
|| columnTypeProperty == null || columnTypeProperty.isEmpty()) {
  schema = determineSchemaOrReturnErrorSchema(configuration, properties);
} else {
  // Get column names and sort order
  columnNames = Arrays.asList(columnNameProperty.split(","));
  columnTypes = 
TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);

  schema = getSchemaFromCols(properties, columnNames, columnTypes, 
columnCommentProperty);
 
properties.setProperty(AvroSerdeUtils.AvroTableProperties.SCHEMA_LITERAL.getPropName(),
 schema.toString());
}
// ...
}
{code}

For tables using {{avro.schema.url}}, every time the SerDe is initialized (i.e. 
at least once per mapper), the schema file is read remotely. For queries with 
thousands of mappers, this leads to a stampede to the handful (3?) datanodes 
that host the schema-file. In the best case, this causes slowdowns.

It would be preferable to distribute the Avro-schema to all mappers as part of 
the job-conf. The alternatives aren't exactly appealing:
# One can't rely solely on the {{column.list.types}} stored in the Hive 
metastore. (HIVE-14789).
# {{avro.schema.literal}} might not always be usable, because of the size-limit 
on table-parameters. The typical size of the Avro-schema file is between 
0.5-3MB, in my limited experience. Bumping the max table-parameter size isn't a 
great solution.

If the {{avro.schema.file}} were read during query-planning, and made available 
as part of table-properties (but not serialized into the metastore), the 
downstream logic will remain largely intact. I have a patch that does this.



  was:
Avro tables that use "external" schema files stored on HDFS can cause excessive 
calls to {{FileSystem::open()}}, especially for queries that spawn large 
numbers of mappers.

This is because of the following code in {{AvroSerDe::initialize()}}:

{code:title=AvroSerDe.java|borderStyle=solid}
public void initialize(Configuration configuration, Properties properties) 
throws SerDeException {
// ...
if (hasExternalSchema(properties)
|| columnNameProperty == null || columnNameProperty.isEmpty()
|| columnTypeProperty == null || columnTypeProperty.isEmpty()) {
  schema = determineSchemaOrReturnErrorSchema(configuration, properties);
} else {
  // Get column names and sort order
  columnNames = Arrays.asList(columnNameProperty.split(","));
  columnTypes = 
TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);

  schema = getSchemaFromCols(properties, columnNames, columnTypes, 
columnCommentProperty);
 
properties.setProperty(AvroSerdeUtils.AvroTableProperties.SCHEMA_LITERAL.getPropName(),
 schema.toString());
}
// ...
}
{code}

For files using {{avro.schema.url}}, every time the SerDe is initialized (i.e. 
at least once per mapper), the schema file is read remotely. For queries with 
thousands of mappers, this leads to a stampede to the handful (3?) datanodes 
that host the schema-file. In the best case, this causes slowdowns.

It would be preferable to distribute the Avro-schema to all mappers as part of 
the job-conf. The alternatives aren't exactly appealing:
# One can't rely solely on the {{column.list.types}} stored in the Hive 
metastore. (HIVE-14789).
# {{avro.schema.literal}} might not always be usable, because of the size-limit 
on table-parameters. The typical size of the Avro-schema file is between 
0.5-3MB, in my limited experience. Bumping the max table-parameter size isn't a 
great solution.

If the {{avro.schema.file}} were read during query-planning, and made available 
as part of table-properties (but not serialized into the metastore), the 
downstream logic will remain largely intact. I have a patch that does this.




> AvroSerde reads the remote schema-file at least once per mapper, per table 
> reference.
> -
>
> Key: HIVE-14792
> URL: https://issues.apache.org/jira/browse/HIVE-14792
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.2.1, 2.1.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14792.1.patch

[jira] [Updated] (HIVE-14792) AvroSerde reads the remote schema-file at least once per mapper, per table reference.

2016-09-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-14792:

Attachment: (was: HIVE-14792.1.patch)

> AvroSerde reads the remote schema-file at least once per mapper, per table 
> reference.
> -
>
> Key: HIVE-14792
> URL: https://issues.apache.org/jira/browse/HIVE-14792
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.2.1, 2.1.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
>
> Avro tables that use "external" schema files stored on HDFS can cause 
> excessive calls to {{FileSystem::open()}}, especially for queries that spawn 
> large numbers of mappers.
> This is because of the following code in {{AvroSerDe::initialize()}}:
> {code:title=AvroSerDe.java|borderStyle=solid}
> public void initialize(Configuration configuration, Properties properties) 
> throws SerDeException {
> // ...
> if (hasExternalSchema(properties)
> || columnNameProperty == null || columnNameProperty.isEmpty()
> || columnTypeProperty == null || columnTypeProperty.isEmpty()) {
>   schema = determineSchemaOrReturnErrorSchema(configuration, properties);
> } else {
>   // Get column names and sort order
>   columnNames = Arrays.asList(columnNameProperty.split(","));
>   columnTypes = 
> TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);
>   schema = getSchemaFromCols(properties, columnNames, columnTypes, 
> columnCommentProperty);
>  
> properties.setProperty(AvroSerdeUtils.AvroTableProperties.SCHEMA_LITERAL.getPropName(),
>  schema.toString());
> }
> // ...
> }
> {code}
> For tables using {{avro.schema.url}}, every time the SerDe is initialized 
> (i.e. at least once per mapper), the schema file is read remotely. For 
> queries with thousands of mappers, this leads to a stampede to the handful 
> (3?) datanodes that host the schema-file. In the best case, this causes 
> slowdowns.
> It would be preferable to distribute the Avro-schema to all mappers as part 
> of the job-conf. The alternatives aren't exactly appealing:
> # One can't rely solely on the {{column.list.types}} stored in the Hive 
> metastore. (HIVE-14789).
> # {{avro.schema.literal}} might not always be usable, because of the 
> size-limit on table-parameters. The typical size of the Avro-schema file is 
> between 0.5-3MB, in my limited experience. Bumping the max table-parameter 
> size isn't a great solution.
> If the {{avro.schema.file}} were read during query-planning, and made 
> available as part of table-properties (but not serialized into the 
> metastore), the downstream logic will remain largely intact. I have a patch 
> that does this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-14792) AvroSerde reads the remote schema-file at least once per mapper, per table reference.

2016-09-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-14792:

Attachment: HIVE-14792.1.patch

> AvroSerde reads the remote schema-file at least once per mapper, per table 
> reference.
> -
>
> Key: HIVE-14792
> URL: https://issues.apache.org/jira/browse/HIVE-14792
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.2.1, 2.1.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14792.1.patch
>
>
> Avro tables that use "external" schema files stored on HDFS can cause 
> excessive calls to {{FileSystem::open()}}, especially for queries that spawn 
> large numbers of mappers.
> This is because of the following code in {{AvroSerDe::initialize()}}:
> {code:title=AvroSerDe.java|borderStyle=solid}
> public void initialize(Configuration configuration, Properties properties) 
> throws SerDeException {
> // ...
> if (hasExternalSchema(properties)
> || columnNameProperty == null || columnNameProperty.isEmpty()
> || columnTypeProperty == null || columnTypeProperty.isEmpty()) {
>   schema = determineSchemaOrReturnErrorSchema(configuration, properties);
> } else {
>   // Get column names and sort order
>   columnNames = Arrays.asList(columnNameProperty.split(","));
>   columnTypes = 
> TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);
>   schema = getSchemaFromCols(properties, columnNames, columnTypes, 
> columnCommentProperty);
>  
> properties.setProperty(AvroSerdeUtils.AvroTableProperties.SCHEMA_LITERAL.getPropName(),
>  schema.toString());
> }
> // ...
> }
> {code}
> For tables using {{avro.schema.url}}, every time the SerDe is initialized 
> (i.e. at least once per mapper), the schema file is read remotely. For 
> queries with thousands of mappers, this leads to a stampede to the handful 
> (3?) datanodes that host the schema-file. In the best case, this causes 
> slowdowns.
> It would be preferable to distribute the Avro-schema to all mappers as part 
> of the job-conf. The alternatives aren't exactly appealing:
> # One can't rely solely on the {{column.list.types}} stored in the Hive 
> metastore. (HIVE-14789).
> # {{avro.schema.literal}} might not always be usable, because of the 
> size-limit on table-parameters. The typical size of the Avro-schema file is 
> between 0.5-3MB, in my limited experience. Bumping the max table-parameter 
> size isn't a great solution.
> If the {{avro.schema.file}} were read during query-planning, and made 
> available as part of table-properties (but not serialized into the 
> metastore), the downstream logic will remain largely intact. I have a patch 
> that does this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-14794) HCatalog support to pre-fetch schema for Avro tables that use avro.schema.url.

2016-09-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-14794:

Summary: HCatalog support to pre-fetch schema for Avro tables that use 
avro.schema.url.  (was: HCatalog support to pre-fetch for Avro tables that use 
avro.schema.url.)

> HCatalog support to pre-fetch schema for Avro tables that use avro.schema.url.
> --
>
> Key: HIVE-14794
> URL: https://issues.apache.org/jira/browse/HIVE-14794
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.1, 2.1.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
>
> HIVE-14792 introduces support to modify and add properties to 
> table-parameters during query-planning. It prefetches remote Avro-schema 
> information and stores it in TBLPROPERTIES, under {{avro.schema.literal}}.
> We'll need similar support in {{HCatLoader}} to prevent excessive reads of 
> schema-files in Pig queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-14794) HCatalog support to pre-fetch schema for Avro tables that use avro.schema.url.

2016-09-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-14794:

Attachment: HIVE-14794.1.patch

This patch builds on HIVE-14792. It uses {{SpecialCases}} to prefetch Avro 
schema.

> HCatalog support to pre-fetch schema for Avro tables that use avro.schema.url.
> --
>
> Key: HIVE-14794
> URL: https://issues.apache.org/jira/browse/HIVE-14794
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.1, 2.1.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14794.1.patch
>
>
> HIVE-14792 introduces support to modify and add properties to 
> table-parameters during query-planning. It prefetches remote Avro-schema 
> information and stores it in TBLPROPERTIES, under {{avro.schema.literal}}.
> We'll need similar support in {{HCatLoader}} to prevent excessive reads of 
> schema-files in Pig queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-14792) AvroSerde reads the remote schema-file at least once per mapper, per table reference.

2016-09-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-14792:

Status: Patch Available  (was: Open)

Submitting, to run tests.

> AvroSerde reads the remote schema-file at least once per mapper, per table 
> reference.
> -
>
> Key: HIVE-14792
> URL: https://issues.apache.org/jira/browse/HIVE-14792
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.1.0, 1.2.1
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14792.1.patch
>
>
> Avro tables that use "external" schema files stored on HDFS can cause 
> excessive calls to {{FileSystem::open()}}, especially for queries that spawn 
> large numbers of mappers.
> This is because of the following code in {{AvroSerDe::initialize()}}:
> {code:title=AvroSerDe.java|borderStyle=solid}
> public void initialize(Configuration configuration, Properties properties) throws SerDeException {
>   // ...
>   if (hasExternalSchema(properties)
>       || columnNameProperty == null || columnNameProperty.isEmpty()
>       || columnTypeProperty == null || columnTypeProperty.isEmpty()) {
>     schema = determineSchemaOrReturnErrorSchema(configuration, properties);
>   } else {
>     // Get column names and sort order
>     columnNames = Arrays.asList(columnNameProperty.split(","));
>     columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);
>     schema = getSchemaFromCols(properties, columnNames, columnTypes, columnCommentProperty);
>     properties.setProperty(AvroSerdeUtils.AvroTableProperties.SCHEMA_LITERAL.getPropName(),
>         schema.toString());
>   }
>   // ...
> }
> {code}
> For tables using {{avro.schema.url}}, every time the SerDe is initialized 
> (i.e. at least once per mapper), the schema file is read remotely. For 
> queries with thousands of mappers, this leads to a stampede to the handful 
> (3?) datanodes that host the schema-file. In the best case, this causes 
> slowdowns.
> It would be preferable to distribute the Avro-schema to all mappers as part 
> of the job-conf. The alternatives aren't exactly appealing:
> # One can't rely solely on the {{column.list.types}} stored in the Hive 
> metastore. (HIVE-14789).
> # {{avro.schema.literal}} might not always be usable, because of the 
> size-limit on table-parameters. The typical size of the Avro-schema file is 
> between 0.5-3MB, in my limited experience. Bumping the max table-parameter 
> size isn't a great solution.
> If the {{avro.schema.file}} were read during query-planning, and made 
> available as part of table-properties (but not serialized into the 
> metastore), the downstream logic would remain largely intact. I have a patch 
> that does this.
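
To make the proposed planning-time fix concrete, here is a minimal sketch, under assumed names (the class and method below are illustrative, not the actual patch): read the remote schema-file once while planning the query, and pin the result into the in-memory table properties so that it rides into every mapper via the job-conf.

{code:title=AvroSchemaPrefetch.java (sketch)|borderStyle=solid}
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class AvroSchemaPrefetch {

  // Read the remote schema-file once, at query-planning time, and pin it
  // into the table properties as avro.schema.literal. No mapper then needs
  // to re-open the schema-file.
  public static void pinSchemaLiteral(Configuration conf, Properties tableProps)
      throws IOException {
    String schemaUrl = tableProps.getProperty("avro.schema.url");
    if (schemaUrl == null
        || tableProps.getProperty("avro.schema.literal") != null) {
      return; // Nothing to fetch, or a literal is already present.
    }
    Path schemaPath = new Path(schemaUrl);
    FileSystem fs = schemaPath.getFileSystem(conf);
    try (InputStream in = fs.open(schemaPath)) {
      // Set only in the in-memory properties; never written to the metastore.
      tableProps.setProperty("avro.schema.literal",
          IOUtils.toString(in, StandardCharsets.UTF_8));
    }
  }

  private AvroSchemaPrefetch() {}
}
{code}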



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-12158) Add methods to HCatClient for partition synchronization

2016-03-19 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198449#comment-15198449
 ] 

Mithun Radhakrishnan commented on HIVE-12158:
-

[~sushanth], [~nahguam], sorry for the delay.

I agree with the spirit of this patch. Thank you for working on this. (I just 
came across a user who needed this as well.)

Please change HCatClientHMSImpl.java::Line#537 to compare table-name instead of 
db-name. I'll take a closer look at your tests. (Tests! Thank you!)

> Add methods to HCatClient for partition synchronization
> ---
>
> Key: HIVE-12158
> URL: https://issues.apache.org/jira/browse/HIVE-12158
> Project: Hive
>  Issue Type: Improvement
>  Components: HCatalog
>Affects Versions: 2.0.0
>Reporter: David Maughan
>Assignee: David Maughan
>Priority: Minor
>  Labels: hcatalog
> Attachments: HIVE-12158.1.patch
>
>
> We have a use case where we have a list of partitions that are created as a 
> result of a batch job (new or updated) outside of Hive and would like to 
> synchronize them with the Hive MetaStore. We would like to use the HCatalog 
> {{HCatClient}} but it currently does not seem to support this. However it is 
> possible with the {{HiveMetaStoreClient}} directly. I am proposing to add the 
> following method to {{HCatClient}} and {{HCatClientHMSImpl}}:
> A method for altering partitions. The implementation would delegate to 
> {{HiveMetaStoreClient#alter_partitions}}. I've used "update" instead of 
> "alter" in the name so it's consistent with the 
> {{HCatClient#updateTableSchema}} method.
> {code}
> public void updatePartitions(List<HCatPartition> partitions) throws HCatException
> {code}
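
A minimal sketch of the proposed delegation (illustrative, not the committed patch), shown here against the raw metastore {{Partition}} type; the class name is an assumption:

{code:title=PartitionSync.java (sketch)|borderStyle=solid}
import java.util.List;

import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Partition;
import org.apache.thrift.TException;

public class PartitionSync {

  private final IMetaStoreClient msc;

  public PartitionSync(IMetaStoreClient msc) {
    this.msc = msc;
  }

  // Push externally-modified partitions back to the metastore in a single
  // alter_partitions() call, rather than altering them one at a time.
  public void updatePartitions(String dbName, String tableName,
      List<Partition> partitions) throws TException {
    if (partitions == null || partitions.isEmpty()) {
      return;
    }
    msc.alter_partitions(dbName, tableName, partitions);
  }
}
{code}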



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13370) Add test for HIVE-11470

2016-03-28 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214840#comment-15214840
 ] 

Mithun Radhakrishnan commented on HIVE-13370:
-

Thanks for adding the test, [~sushanth]. 
+1. Looks good.

> Add test for HIVE-11470
> ---
>
> Key: HIVE-13370
> URL: https://issues.apache.org/jira/browse/HIVE-13370
> Project: Hive
>  Issue Type: Bug
>Reporter: Sushanth Sowmyan
>Assignee: Sushanth Sowmyan
> Attachments: HIVE-13370.patch
>
>
> HIVE-11470 added the capability to handle NULL dynamic-partitioning keys 
> properly. However, it did not add a test for the case; we should have one, so 
> we don't have future regressions of the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path

2016-04-13 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240190#comment-15240190
 ] 

Mithun Radhakrishnan commented on HIVE-13509:
-

+[~daijy], [~rohini].

One possible concern is the disconnect between Hive and Pig:
# When one attempts to consume a *non-existent* directory (i.e. not just an 
empty directory) through Pig, one gets a failure.
# When one attempts to consume a non-existent partition (e.g. 
{{dt='3016-04-13'}}) in Hive, via an unsatisfied partition-predicate, the 
query runs successfully (and returns nothing).

In ETL jobs using Pig, we might actually prefer a failure when the input data 
isn't available. Wouldn't this fix break those semantics for Pig?

> HCatalog getSplits should ignore the partition with invalid path
> 
>
> Key: HIVE-13509
> URL: https://issues.apache.org/jira/browse/HIVE-13509
> Project: Hive
>  Issue Type: Improvement
>  Components: HCatalog
>Reporter: Chaoyu Tang
>Assignee: Chaoyu Tang
> Attachments: HIVE-13509.patch
>
>
> It is quite common that there is a discrepancy between a partition directory 
> and its HMS metadata, simply because the directory could be added/deleted 
> externally using hdfs shell commands. Technically this should be fixed by MSCK 
> and alter table .. add/drop commands etc., but sometimes that might not be 
> practical, especially in a multi-tenant env. This discrepancy does not cause 
> any problem for Hive: Hive returns no rows for a partition with an invalid 
> (e.g. non-existing) path. It does, however, fail the Pig load with HCatLoader, 
> because HCatBaseInputFormat's getSplits throws an error when getting a split 
> for a non-existing path. The error message might look like:
> {code}
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does 
> not exist: 
> hdfs://xyz.com:8020/user/hive/warehouse/xyz/date=2016-01-01/country=BR
>   at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>   at 
> org.apache.hive.hcatalog.mapreduce.HCatBaseInputFormat.getSplits(HCatBaseInputFormat.java:162)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path

2016-04-14 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242058#comment-15242058
 ] 

Mithun Radhakrishnan commented on HIVE-13509:
-

I knew this would be a sticking point with the Pig folks. ([~rohini], et al.) 
I'm afraid I agree with their assessment as well. 

Changing the default behaviour of {{HCatLoader}} to break Pig semantics would 
be incorrect, and would hide problems with missing data. We've run into 
failures/bugs in the {{FileOutputCommitterContainer}} that thankfully didn't 
propagate downstream, thanks to the current behaviour.

Can we keep the default behaviour, with a client-side option to ignore missing 
data?

> HCatalog getSplits should ignore the partition with invalid path
> 
>
> Key: HIVE-13509
> URL: https://issues.apache.org/jira/browse/HIVE-13509
> Project: Hive
>  Issue Type: Improvement
>  Components: HCatalog
>Reporter: Chaoyu Tang
>Assignee: Chaoyu Tang
> Attachments: HIVE-13509.patch
>
>
> It is quite common that there is a discrepancy between a partition directory 
> and its HMS metadata, simply because the directory could be added/deleted 
> externally using hdfs shell commands. Technically this should be fixed by MSCK 
> and alter table .. add/drop commands etc., but sometimes that might not be 
> practical, especially in a multi-tenant env. This discrepancy does not cause 
> any problem for Hive: Hive returns no rows for a partition with an invalid 
> (e.g. non-existing) path. It does, however, fail the Pig load with HCatLoader, 
> because HCatBaseInputFormat's getSplits throws an error when getting a split 
> for a non-existing path. The error message might look like:
> {code}
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does 
> not exist: 
> hdfs://xyz.com:8020/user/hive/warehouse/xyz/date=2016-01-01/country=BR
>   at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>   at 
> org.apache.hive.hcatalog.mapreduce.HCatBaseInputFormat.getSplits(HCatBaseInputFormat.java:162)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path

2016-04-15 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243789#comment-15243789
 ] 

Mithun Radhakrishnan commented on HIVE-13509:
-

I'm stuck on production-support at the moment. I'll review this on Monday. 
Sorry for the delay.

> HCatalog getSplits should ignore the partition with invalid path
> 
>
> Key: HIVE-13509
> URL: https://issues.apache.org/jira/browse/HIVE-13509
> Project: Hive
>  Issue Type: Improvement
>  Components: HCatalog
>Reporter: Chaoyu Tang
>Assignee: Chaoyu Tang
> Attachments: HIVE-13509.1.patch, HIVE-13509.patch
>
>
> It is quite common that there is a discrepancy between a partition directory 
> and its HMS metadata, simply because the directory could be added/deleted 
> externally using hdfs shell commands. Technically this should be fixed by MSCK 
> and alter table .. add/drop commands etc., but sometimes that might not be 
> practical, especially in a multi-tenant env. This discrepancy does not cause 
> any problem for Hive: Hive returns no rows for a partition with an invalid 
> (e.g. non-existing) path. It does, however, fail the Pig load with HCatLoader, 
> because HCatBaseInputFormat's getSplits throws an error when getting a split 
> for a non-existing path. The error message might look like:
> {code}
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does 
> not exist: 
> hdfs://xyz.com:8020/user/hive/warehouse/xyz/date=2016-01-01/country=BR
>   at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>   at 
> org.apache.hive.hcatalog.mapreduce.HCatBaseInputFormat.getSplits(HCatBaseInputFormat.java:162)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path

2016-04-25 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256936#comment-15256936
 ] 

Mithun Radhakrishnan commented on HIVE-13509:
-

Sorry for delaying you on this. If I don't have feedback for you tomorrow, 
please go ahead and check in as is. I'll trust [~szehon]'s review. :] Thanks 
for keeping the default behavior. 

> HCatalog getSplits should ignore the partition with invalid path
> 
>
> Key: HIVE-13509
> URL: https://issues.apache.org/jira/browse/HIVE-13509
> Project: Hive
>  Issue Type: Improvement
>  Components: HCatalog
>Reporter: Chaoyu Tang
>Assignee: Chaoyu Tang
> Attachments: HIVE-13509.1.patch, HIVE-13509.patch
>
>
> It is quite common that there is a discrepancy between a partition directory 
> and its HMS metadata, simply because the directory could be added/deleted 
> externally using hdfs shell commands. Technically this should be fixed by MSCK 
> and alter table .. add/drop commands etc., but sometimes that might not be 
> practical, especially in a multi-tenant env. This discrepancy does not cause 
> any problem for Hive: Hive returns no rows for a partition with an invalid 
> (e.g. non-existing) path. It does, however, fail the Pig load with HCatLoader, 
> because HCatBaseInputFormat's getSplits throws an error when getting a split 
> for a non-existing path. The error message might look like:
> {code}
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does 
> not exist: 
> hdfs://xyz.com:8020/user/hive/warehouse/xyz/date=2016-01-01/country=BR
>   at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>   at 
> org.apache.hive.hcatalog.mapreduce.HCatBaseInputFormat.getSplits(HCatBaseInputFormat.java:162)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path

2016-04-26 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258454#comment-15258454
 ] 

Mithun Radhakrishnan commented on HIVE-13509:
-

Reviewing your patch now. On the face of it, it looks good. Looking at it a 
little more closely...

A couple of observations:
# {{hcat.input.ignore.invalid.path}} is well-named, and would make sense to 
anyone who'd want to override the default. (I thought we'd go with 
{{hcat.input.allow.invalid.path=true}}, but your version is better.)
# Consider replacing {{(pathString == null || pathString.trim().isEmpty())}} 
with {{StringUtils.isBlank(pathString)}}.
# Nitpick: Consider replacing the loop at {{HCatBaseInputFormat.java:Line#335}} 
with Google Guava's {{Iterators.filter()}}. Then, depending on whether 
{{ignoreInvalidPath}} is set, the erstwhile loop at Line#329 will either loop 
on {{paths}} or on {{filteredPaths}}. This will be more readable.
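
To make #2 (and the flag's intent) concrete, here is a minimal sketch under assumed names; only {{hcat.input.ignore.invalid.path}} comes from the patch, the rest is illustrative:

{code:title=HCatBaseInputFormat.java (sketch)|borderStyle=solid}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class InvalidPathFilter {

  // Keep only usable partition directories. Blank and non-existent paths
  // are skipped when the opt-in flag is set; otherwise they pass through,
  // preserving the default fail-on-missing-input behaviour.
  static List<Path> filterPaths(JobConf jobConf, List<String> pathStrings)
      throws IOException {
    boolean ignoreInvalidPath =
        jobConf.getBoolean("hcat.input.ignore.invalid.path", false);
    List<Path> dirs = new ArrayList<Path>();
    for (String pathString : pathStrings) {
      if (ignoreInvalidPath && StringUtils.isBlank(pathString)) {
        continue; // isBlank() covers null, empty, and whitespace-only.
      }
      Path path = new Path(pathString);
      FileSystem fs = path.getFileSystem(jobConf);
      if (ignoreInvalidPath && !fs.exists(path)) {
        continue; // Partition directory was removed out-of-band.
      }
      dirs.add(fs.makeQualified(path));
    }
    return dirs;
  }
}
{code}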

> HCatalog getSplits should ignore the partition with invalid path
> 
>
> Key: HIVE-13509
> URL: https://issues.apache.org/jira/browse/HIVE-13509
> Project: Hive
>  Issue Type: Improvement
>  Components: HCatalog
>Reporter: Chaoyu Tang
>Assignee: Chaoyu Tang
> Attachments: HIVE-13509.1.patch, HIVE-13509.patch
>
>
> It is quite common that there is a discrepancy between a partition directory 
> and its HMS metadata, simply because the directory could be added/deleted 
> externally using hdfs shell commands. Technically this should be fixed by MSCK 
> and alter table .. add/drop commands etc., but sometimes that might not be 
> practical, especially in a multi-tenant env. This discrepancy does not cause 
> any problem for Hive: Hive returns no rows for a partition with an invalid 
> (e.g. non-existing) path. It does, however, fail the Pig load with HCatLoader, 
> because HCatBaseInputFormat's getSplits throws an error when getting a split 
> for a non-existing path. The error message might look like:
> {code}
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does 
> not exist: 
> hdfs://xyz.com:8020/user/hive/warehouse/xyz/date=2016-01-01/country=BR
>   at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>   at 
> org.apache.hive.hcatalog.mapreduce.HCatBaseInputFormat.getSplits(HCatBaseInputFormat.java:162)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path

2016-04-26 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258527#comment-15258527
 ] 

Mithun Radhakrishnan commented on HIVE-13509:
-

bq. ... with Google Guava's {{Iterators.filter()}}.
Actually, please ignore comment#3, above. 

I was trying to avoid checking {{ignoreInvalidPath}} multiple times. I tried 
writing it out myself (to illustrate), and saw that the call to 
{{fs.makeQualified()}} implies that we'll need to use both 
{{Iterators.filter()}} and {{Iterators.transform()}}, at which point it's no 
longer short and sweet. 

Please fix #2 above, and I will +1.

Also, thanks for adding tests.

> HCatalog getSplits should ignore the partition with invalid path
> 
>
> Key: HIVE-13509
> URL: https://issues.apache.org/jira/browse/HIVE-13509
> Project: Hive
>  Issue Type: Improvement
>  Components: HCatalog
>Reporter: Chaoyu Tang
>Assignee: Chaoyu Tang
> Attachments: HIVE-13509.1.patch, HIVE-13509.patch
>
>
> It is quite common that there is a discrepancy between a partition directory 
> and its HMS metadata, simply because the directory could be added/deleted 
> externally using hdfs shell commands. Technically this should be fixed by MSCK 
> and alter table .. add/drop commands etc., but sometimes that might not be 
> practical, especially in a multi-tenant env. This discrepancy does not cause 
> any problem for Hive: Hive returns no rows for a partition with an invalid 
> (e.g. non-existing) path. It does, however, fail the Pig load with HCatLoader, 
> because HCatBaseInputFormat's getSplits throws an error when getting a split 
> for a non-existing path. The error message might look like:
> {code}
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does 
> not exist: 
> hdfs://xyz.com:8020/user/hive/warehouse/xyz/date=2016-01-01/country=BR
>   at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>   at 
> org.apache.hive.hcatalog.mapreduce.HCatBaseInputFormat.getSplits(HCatBaseInputFormat.java:162)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path

2016-04-26 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258780#comment-15258780
 ] 

Mithun Radhakrishnan commented on HIVE-13509:
-

Yes, sir. +1.

> HCatalog getSplits should ignore the partition with invalid path
> 
>
> Key: HIVE-13509
> URL: https://issues.apache.org/jira/browse/HIVE-13509
> Project: Hive
>  Issue Type: Improvement
>  Components: HCatalog
>Reporter: Chaoyu Tang
>Assignee: Chaoyu Tang
> Attachments: HIVE-13509.1.patch, HIVE-13509.2.patch, HIVE-13509.patch
>
>
> It is quite common that there is a discrepancy between a partition directory 
> and its HMS metadata, simply because the directory could be added/deleted 
> externally using hdfs shell commands. Technically this should be fixed by MSCK 
> and alter table .. add/drop commands etc., but sometimes that might not be 
> practical, especially in a multi-tenant env. This discrepancy does not cause 
> any problem for Hive: Hive returns no rows for a partition with an invalid 
> (e.g. non-existing) path. It does, however, fail the Pig load with HCatLoader, 
> because HCatBaseInputFormat's getSplits throws an error when getting a split 
> for a non-existing path. The error message might look like:
> {code}
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does 
> not exist: 
> hdfs://xyz.com:8020/user/hive/warehouse/xyz/date=2016-01-01/country=BR
>   at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>   at 
> org.apache.hive.hcatalog.mapreduce.HCatBaseInputFormat.getSplits(HCatBaseInputFormat.java:162)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9736) StorageBasedAuthProvider should batch namenode-calls where possible.

2015-02-23 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9736:
---
Attachment: HIVE-9736.1.patch

I've added overloads to methods in {{FileUtils}}, {{HadoopShims}} and 
{{DefaultFileAccess}} to work on more than a single Path/FileStatus at a time.
I've also grouped {{listStatus}} calls into batches, to avoid frequent 
round-trips to the NN.
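
In rough strokes, the batching looks like the sketch below (illustrative names; the batch size is an assumption, not taken from the patch):

{code:title=DefaultFileAccess.java (sketch)|borderStyle=solid}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class BatchedFileAccess {

  private static final int BATCH_SIZE = 1000; // illustrative batch size

  // Fetch FileStatus objects for many paths, a batch at a time, so that
  // the caller can authorize whole batches instead of one path per call.
  public static List<FileStatus> listStatusInBatches(FileSystem fs, List<Path> paths)
      throws IOException {
    List<FileStatus> statuses = new ArrayList<FileStatus>(paths.size());
    for (int i = 0; i < paths.size(); i += BATCH_SIZE) {
      List<Path> batch = paths.subList(i, Math.min(i + BATCH_SIZE, paths.size()));
      // One listStatus() invocation per batch keeps client memory bounded,
      // and lets the authorizer examine multiple FileStatus objects at once.
      for (FileStatus status : fs.listStatus(batch.toArray(new Path[0]))) {
        statuses.add(status);
      }
    }
    return statuses;
  }

  private BatchedFileAccess() {}
}
{code}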

> StorageBasedAuthProvider should batch namenode-calls where possible.
> 
>
> Key: HIVE-9736
> URL: https://issues.apache.org/jira/browse/HIVE-9736
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore, Security
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9736.1.patch
>
>
> Consider a table partitioned by 2 keys (dt, region). Say a dt partition could 
> have many associated regions. Consider that the user does:
> {code:sql}
> ALTER TABLE my_table DROP PARTITION (dt='20150101');
> {code}
> As things stand now, {{StorageBasedAuthProvider}} will make individual 
> {{DistributedFileSystem.listStatus()}} calls for each partition-directory, 
> and authorize each one separately. It'd be faster to batch the calls, and 
> examine multiple FileStatus objects at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9629) HCatClient.dropPartitions() needs speeding up.

2015-02-23 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333651#comment-14333651
 ] 

Mithun Radhakrishnan commented on HIVE-9629:


Just an update on performance numbers: (A follow-on to those quoted in 
HIVE-9588)

1. Dropping 2K partitions from a managed Hive table took 204 seconds on my 
Hive/HCat test setup (with remote metastore, backed with Oracle).
2. HIVE-9588 reduced this to 83 seconds.
3. The combination of HIVE-9631, HIVE-9681 and HIVE-9736 has reduced this now 
to 16 seconds.
(The patch for HIVE-9631 isn't currently up. Selina has an internal patch that 
works with Oracle.)

I'll be testing this some more. In the meantime, I'd be grateful if the patches 
(other than HIVE-9631) could be reviewed.
 

> HCatClient.dropPartitions() needs speeding up.
> --
>
> Key: HIVE-9629
> URL: https://issues.apache.org/jira/browse/HIVE-9629
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog, Metastore
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
>
> This is an über JIRA for the work required to speed up 
> HCatClient.dropPartitions().
> As it stands right now, {{dropPartitions()}} is slow because it takes N 
> thrift-calls to drop N partitions, and attempts to store all N partitions in 
> memory while it executes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9681) Extend HiveAuthorizationProvider to support partition-sets.

2015-02-23 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334155#comment-14334155
 ] 

Mithun Radhakrishnan commented on HIVE-9681:


Tagging [~thejas], since this patch adds on top of your security work.

> Extend HiveAuthorizationProvider to support partition-sets.
> ---
>
> Key: HIVE-9681
> URL: https://issues.apache.org/jira/browse/HIVE-9681
> Project: Hive
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 0.14.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9681.1.patch
>
>
> {{HiveAuthorizationProvider}} allows only for the authorization of a single 
> partition at a time. For instance, when the {{StorageBasedAuthProvider}} must 
> authorize an operation on a set of partitions (say from a 
> PreDropPartitionEvent), each partition's data-directory needs to be checked 
> individually. For N partitions, this results in N namenode calls.
> I'd like to add {{authorize()}} overloads that accept multiple partitions. 
> This will allow StorageBasedAuthProvider to make batched namenode calls. 
> P.S. There are 2 further optimizations that are possible:
> 1. In the ideal case, we'd have a single call in 
> {{org.apache.hadoop.fs.FileSystem}} to check access for an array of Paths, 
> something like:
> {code:title=FileSystem.java|borderStyle=solid}
> @InterfaceAudience.LimitedPrivate({"HDFS", "Hive"})
> public void access(Path[] paths, FsAction mode)
>     throws AccessControlException, FileNotFoundException, IOException {...}
> {code}
> 2. We can go one better if we could retrieve partition-locations in DirectSQL 
> and use those for authorization. The EventListener-abstraction behind which 
> the AuthProviders operate makes this difficult. I can attempt to solve this 
> using a PartitionSpec and a call-back into the ObjectStore from 
> StorageBasedAuthProvider. I'll save this rigmarole for later.
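
For illustration only, a batched overload might take the following shape (the interface name and parameters are assumptions, not the committed API):

{code:title=HiveAuthorizationProvider.java (sketch)|borderStyle=solid}
import java.util.List;

import org.apache.hadoop.hive.ql.metadata.AuthorizationException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.metadata.Partition;
import org.apache.hadoop.hive.ql.metadata.Table;
import org.apache.hadoop.hive.ql.security.authorization.Privilege;

public interface BatchedAuthorizationProvider {

  // Authorize an operation over a whole set of partitions in one shot,
  // allowing implementations (e.g. StorageBasedAuthProvider) to batch
  // their NameNode calls.
  void authorize(Table table, List<Partition> partitions,
      Privilege[] readRequiredPriv, Privilege[] writeRequiredPriv)
      throws HiveException, AuthorizationException;
}
{code}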



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9736) StorageBasedAuthProvider should batch namenode-calls where possible.

2015-02-23 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334157#comment-14334157
 ] 

Mithun Radhakrishnan commented on HIVE-9736:


Tagging [~thejas], since this marches all over the security-work (and SBAP).

Sorry the patch looks a little big... The actual changes aren't, really.

> StorageBasedAuthProvider should batch namenode-calls where possible.
> 
>
> Key: HIVE-9736
> URL: https://issues.apache.org/jira/browse/HIVE-9736
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore, Security
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9736.1.patch
>
>
> Consider a table partitioned by 2 keys (dt, region). Say a dt partition could 
> have many associated regions. Consider that the user does:
> {code:sql}
> ALTER TABLE my_table DROP PARTITION (dt='20150101');
> {code}
> As things stand now, {{StorageBasedAuthProvider}} will make individual 
> {{DistributedFileSystem.listStatus()}} calls for each partition-directory, 
> and authorize each one separately. It'd be faster to batch the calls, and 
> examine multiple FileStatus objects at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9083) New metastore API to support to purge partition-data directly in dropPartitions().

2015-02-24 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335206#comment-14335206
 ] 

Mithun Radhakrishnan commented on HIVE-9083:


Sure thing. I'll rebase the set.

> New metastore API to support to purge partition-data directly in 
> dropPartitions().
> --
>
> Key: HIVE-9083
> URL: https://issues.apache.org/jira/browse/HIVE-9083
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.15.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9083.3.patch, HIVE-9083.4.patch
>
>
> HIVE-7100 adds the option to purge table-data when dropping a table (from 
> Hive CLI.)
> This patch adds HiveMetaStoreClient APIs to support the same for 
> {{dropPartitions()}}.
> (I'll add a follow-up to support a command-line option for the same.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9083) New metastore API to support to purge partition-data directly in dropPartitions().

2015-02-24 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335793#comment-14335793
 ] 

Mithun Radhakrishnan commented on HIVE-9083:


This'll take longer than expected. I'm trying to do something about the 
increasing number of booleans in {{HMSC.dropPartitions()}}. Will post shortly. 

> New metastore API to support to purge partition-data directly in 
> dropPartitions().
> --
>
> Key: HIVE-9083
> URL: https://issues.apache.org/jira/browse/HIVE-9083
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.15.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9083.3.patch, HIVE-9083.4.patch
>
>
> HIVE-7100 adds the option to purge table-data when dropping a table (from 
> Hive CLI.)
> This patch adds HiveMetaStoreClient APIs to support the same for 
> {{dropPartitions()}}.
> (I'll add a follow-up to support a command-line option for the same.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9086) Add language support to PURGE data while dropping partitions.

2015-02-25 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336991#comment-14336991
 ] 

Mithun Radhakrishnan commented on HIVE-9086:


@[~ekoifman]: I'd like to keep the grammar as it is, if that's acceptable. 
We'll adjust the documentation accordingly.

I'm not quite done rebasing HIVE-9083 (that this patch depends on), so we'll 
have to wait a tad before checking this in.
(/CaptainObvious)

> Add language support to PURGE data while dropping partitions.
> -
>
> Key: HIVE-9086
> URL: https://issues.apache.org/jira/browse/HIVE-9086
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.15.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9086.1.patch
>
>
> HIVE-9083 adds metastore-support to skip-trash while dropping partitions. 
> This patch includes language support to do the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9086) Add language support to PURGE data while dropping partitions.

2015-02-25 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337520#comment-14337520
 ] 

Mithun Radhakrishnan commented on HIVE-9086:


Judging from [the 
patch|https://issues.apache.org/jira/secure/attachment/12670435/HIVE-7100.11.patch#file-12],
 HIVE-7100 added the "drop-table-purge" functionality to read thus:

{code:sql}
DROP TABLE IF EXISTS my_doomed_table PURGE;
{code}

The current "alter table drop partitions" reads as follows:

{code:sql}
ALTER TABLE my_doomed_table DROP IF EXISTS PARTITION (part_key = "sayonara") 
IGNORE PROTECTION;
{code}

HIVE-9086 extends HIVE-7100's purge-functionality to partitions, and suggests 
that the {{PURGE}} keyword go at the end, thus:

{code:sql}
ALTER TABLE my_doomed_table DROP IF EXISTS PARTITION (part_key = "sayonara") 
IGNORE PROTECTION PURGE;
{code}

Should {{PURGE}} sit before/after {{IF EXISTS}} or after {{IGNORE PROTECTION}}?

We can't break backward compatibility, so we shouldn't be changing what we 
released in 0.14.

> Add language support to PURGE data while dropping partitions.
> -
>
> Key: HIVE-9086
> URL: https://issues.apache.org/jira/browse/HIVE-9086
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.15.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9086.1.patch
>
>
> HIVE-9083 adds metastore-support to skip-trash while dropping partitions. 
> This patch includes language support to do the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9083) New metastore API to support to purge partition-data directly in dropPartitions().

2015-03-01 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9083:
---
Attachment: HIVE-9083.5.patch

Here's the patch rebased for trunk. I've lumped all the boolean flags for 
drop-partitions (and their default values) into a common class. The code's an 
easier read, and less prone to misfires.

(Apologies for the delay.)
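
Roughly, the options class has this shape (a sketch based on this description, not a verbatim copy of the committed class; the defaults shown are assumptions):

{code:title=PartitionDropOptions.java (sketch)|borderStyle=solid}
// Chainable setters replace a growing list of positional booleans in
// dropPartitions(), and keep the default values in one place.
public class PartitionDropOptions {

  public boolean deleteData = true;    // remove partition data from HDFS
  public boolean ifExists = false;     // don't fail if the partition is absent
  public boolean returnResults = true; // return the dropped Partition objects
  public boolean purgeData = false;    // skip the trash directory

  public static PartitionDropOptions instance() {
    return new PartitionDropOptions();
  }

  public PartitionDropOptions deleteData(boolean deleteData) {
    this.deleteData = deleteData;
    return this;
  }

  public PartitionDropOptions ifExists(boolean ifExists) {
    this.ifExists = ifExists;
    return this;
  }

  public PartitionDropOptions returnResults(boolean returnResults) {
    this.returnResults = returnResults;
    return this;
  }

  public PartitionDropOptions purgeData(boolean purgeData) {
    this.purgeData = purgeData;
    return this;
  }
}
{code}

A call-site then reads as something like {{dropPartitions(dbName, tableName, partExprs, PartitionDropOptions.instance().ifExists(true).purgeData(true))}}, rather than a string of positional booleans.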

> New metastore API to support to purge partition-data directly in 
> dropPartitions().
> --
>
> Key: HIVE-9083
> URL: https://issues.apache.org/jira/browse/HIVE-9083
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.15.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9083.3.patch, HIVE-9083.4.patch, HIVE-9083.5.patch
>
>
> HIVE-7100 adds the option to purge table-data when dropping a table (from 
> Hive CLI.)
> This patch adds HiveMetaStoreClient APIs to support the same for 
> {{dropPartitions()}}.
> (I'll add a follow-up to support a command-line option for the same.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9083) New metastore API to support to purge partition-data directly in dropPartitions().

2015-03-01 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9083:
---
Attachment: (was: HIVE-9083.5.patch)

> New metastore API to support to purge partition-data directly in 
> dropPartitions().
> --
>
> Key: HIVE-9083
> URL: https://issues.apache.org/jira/browse/HIVE-9083
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.15.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9083.3.patch, HIVE-9083.4.patch, HIVE-9083.5.patch
>
>
> HIVE-7100 adds the option to purge table-data when dropping a table (from 
> Hive CLI.)
> This patch adds HiveMetaStoreClient APIs to support the same for 
> {{dropPartitions()}}.
> (I'll add a follow-up to support a command-line option for the same.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9083) New metastore API to support to purge partition-data directly in dropPartitions().

2015-03-01 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9083:
---
Attachment: HIVE-9083.5.patch

Removed some dead code.

> New metastore API to support to purge partition-data directly in 
> dropPartitions().
> --
>
> Key: HIVE-9083
> URL: https://issues.apache.org/jira/browse/HIVE-9083
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.15.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9083.3.patch, HIVE-9083.4.patch, HIVE-9083.5.patch
>
>
> HIVE-7100 adds the option to purge table-data when dropping a table (from 
> Hive CLI.)
> This patch adds HiveMetaStoreClient APIs to support the same for 
> {{dropPartitions()}}.
> (I'll add a follow-up to support a command-line option for the same.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9086) Add language support to PURGE data while dropping partitions.

2015-03-01 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342753#comment-14342753
 ] 

Mithun Radhakrishnan commented on HIVE-9086:


Done rebasing HIVE-9083. It turns out HIVE-9086 doesn't need rebasing.

> Add language support to PURGE data while dropping partitions.
> -
>
> Key: HIVE-9086
> URL: https://issues.apache.org/jira/browse/HIVE-9086
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.15.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9086.1.patch
>
>
> HIVE-9083 adds metastore-support to skip-trash while dropping partitions. 
> This patch includes language support to do the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9118) Support auto-purge for tables, when dropping tables/partitions.

2015-03-01 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9118:
---
Attachment: HIVE-9118.3.patch

Rebased this patch, following the rebase of HIVE-9083.

> Support auto-purge for tables, when dropping tables/partitions.
> ---
>
> Key: HIVE-9118
> URL: https://issues.apache.org/jira/browse/HIVE-9118
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.15.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9118.1.patch, HIVE-9118.2.patch, HIVE-9118.3.patch
>
>
> HIVE-7100 introduced a way to skip the trash directory, when deleting 
> table-data, while dropping tables.
> In HIVE-9083/HIVE-9086, I extended this to work when partitions are dropped.
> Here, I propose a table-parameter ({{"auto.purge"}}) to set up tables to 
> skip-trash when table/partition data is deleted, without needing to say 
> "PURGE" on the Hive CLI. Apropos, on {{dropTable()}} and {{dropPartition()}}, 
> table data is deleted directly (and not moved to trash) if the following hold 
> true:
> # The table is MANAGED.
> # The {{deleteData}} parameter to the {{HMSC.drop*()}} methods is true.
> # Either PURGE is explicitly specified on the command-line (or rather, 
> {{"ifPurge"}} is set in the environment-context), OR
> # TBLPROPERTIES contains {{"auto.purge"="true"}}
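
For example (illustrative DDL; the table name is made up):

{code:sql}
-- With "auto.purge" set, dropped table/partition data skips the trash,
-- without saying PURGE on the command-line:
CREATE TABLE my_table (id INT)
PARTITIONED BY (dt STRING)
TBLPROPERTIES ("auto.purge"="true");

ALTER TABLE my_table DROP PARTITION (dt='20150101');  -- deleted directly
DROP TABLE my_table;                                  -- likewise
{code}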



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8626) Extend HDFS super-user checks to dropPartitions

2015-03-01 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-8626:
---
Attachment: HIVE-8626.2.patch

Rebased, and simplified slightly. Removed unused imports, etc.

> Extend HDFS super-user checks to dropPartitions
> ---
>
> Key: HIVE-8626
> URL: https://issues.apache.org/jira/browse/HIVE-8626
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.14.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-8626.1.patch, HIVE-8626.2.patch
>
>
> HIVE-6392 takes care of allowing HDFS super-user accounts to register 
> partitions in tables whose HDFS paths don't explicitly grant 
> write-permissions to the super-user.
> However, the dropPartitions()/dropTable()/dropDatabase() use-cases don't 
> handle this at all. i.e. An HDFS super-user ({{kal...@dev.grid.myth.net}}) 
> can't drop the very partitions that were added to a table-directory owned by 
> the user ({{mithunr}}). The following error is the result:
> {quote}
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Table metadata 
> not deleted since 
> hdfs://mythcluster-nn1.grid.myth.net:8020/user/mithunr/myth.db/myth_table is 
> not writable by kal...@dev.grid.myth.net)
> {quote}
> This is the result of redundant checks in 
> {{HiveMetaStore::dropPartitionsAndGetLocations()}}:
> {code:title=HiveMetaStore.java|borderStyle=solid}
> if (!wh.isWritable(partPath.getParent())) {
>   throw new MetaException("Table metadata not deleted since the partition "
>       + Warehouse.makePartName(partitionKeys, part.getValues())
>       + " has parent location " + partPath.getParent()
>       + " which is not writable by " + hiveConf.getUser());
> }
> {code}
> This check is already made in StorageBasedAuthorizationProvider. If the 
> argument is that the SBAP isn't guaranteed to be in play, then this shouldn't 
> be checked in HMS either. If HDFS permissions need to be checked in addition 
> to say, ACLs, then perhaps a recursively-composed auth-provider ought to be 
> used.
> For the moment, I'll get {{Warehouse.isWritable()}} to handle HDFS 
> super-users. But I think {{isWritable()}} checks oughtn't to be in 
> HiveMetaStore. (Perhaps fix this in another JIRA?)
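
One hypothetical shape for that accommodation (a sketch mirroring NameNode group semantics; this is not the actual patch, and the helper name is made up):

{code:title=SuperUserCheck.java (sketch)|borderStyle=solid}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public final class SuperUserCheck {

  // Treat members of the configured HDFS superusergroup as write-capable
  // everywhere, the way the NameNode itself would.
  public static boolean isHdfsSuperUser(Configuration conf) throws IOException {
    String superGroup = conf.get("dfs.permissions.superusergroup", "supergroup");
    UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
    for (String group : ugi.getGroupNames()) {
      if (superGroup.equals(group)) {
        return true;
      }
    }
    return false;
  }

  private SuperUserCheck() {}
}
{code}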



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9736) StorageBasedAuthProvider should batch namenode-calls where possible.

2015-03-01 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342835#comment-14342835
 ] 

Mithun Radhakrishnan commented on HIVE-9736:


The review on RB is [r/31615|https://reviews.apache.org/r/31615/]. Apologies 
for the delay.

> StorageBasedAuthProvider should batch namenode-calls where possible.
> 
>
> Key: HIVE-9736
> URL: https://issues.apache.org/jira/browse/HIVE-9736
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore, Security
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9736.1.patch
>
>
> Consider a table partitioned by 2 keys (dt, region). Say a dt partition could 
> have many associated regions. Consider that the user does:
> {code:sql}
> ALTER TABLE my_table DROP PARTITION (dt='20150101');
> {code}
> As things stand now, {{StorageBasedAuthProvider}} will make individual 
> {{DistributedFileSystem.listStatus()}} calls for each partition-directory, 
> and authorize each one separately. It'd be faster to batch the calls, and 
> examine multiple FileStatus objects at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9674) *DropPartitionEvent should handle partition-sets.

2015-03-02 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9674:
---
Attachment: HIVE-9736.3.patch

[~cdrome] let me know (thank you!) that I'd neglected to change 
{{TestMetaStoreEventListener}} for this change. Here's the emended patch.

> *DropPartitionEvent should handle partition-sets.
> -
>
> Key: HIVE-9674
> URL: https://issues.apache.org/jira/browse/HIVE-9674
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.14.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9674.2.patch, HIVE-9736.3.patch
>
>
> Dropping a set of N partitions from a table currently results in N 
> DropPartitionEvents (and N PreDropPartitionEvents) being fired serially. This 
> is wasteful, especially so for large N. It also makes it impossible to even 
> try to run authorization-checks on all partitions in a batch.
> Taking the cue from HIVE-9609, we should compose an {{Iterable<Partition>}} 
> in the event, and expose them via an {{Iterator}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9118) Support auto-purge for tables, when dropping tables/partitions.

2015-03-03 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345436#comment-14345436
 ] 

Mithun Radhakrishnan commented on HIVE-9118:


Thanks for the review and commit, sir. Much appreciated.

Could I please bother you for advice on HIVE-9086? We're having trouble 
reaching consensus on what the grammar should look like, for {{DROP PARTITIONS 
... PURGE}}.

> Support auto-purge for tables, when dropping tables/partitions.
> ---
>
> Key: HIVE-9118
> URL: https://issues.apache.org/jira/browse/HIVE-9118
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 1.0.0, 1.1
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Fix For: 1.2.0
>
> Attachments: HIVE-9118.1.patch, HIVE-9118.2.patch, HIVE-9118.3.patch
>
>
> HIVE-7100 introduced a way to skip the trash directory, when deleting 
> table-data, while dropping tables.
> In HIVE-9083/HIVE-9086, I extended this to work when partitions are dropped.
> Here, I propose a table-parameter ({{"auto.purge"}}) to set up tables to 
> skip-trash when table/partition data is deleted, without needing to say 
> "PURGE" on the Hive CLI. Apropos, on {{dropTable()}} and {{dropPartition()}}, 
> table data is deleted directly (and not moved to trash) if the following hold 
> true:
> # The table is MANAGED.
> # The {{deleteData}} parameter to the {{HMSC.drop*()}} methods is true.
> # Either PURGE is explicitly specified on the command-line (or rather, 
> {{"ifPurge"}} is set in the environment-context), OR
> # TBLPROPERTIES contains {{"auto.purge"="true"}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9083) New metastore API to support to purge partition-data directly in dropPartitions().

2015-03-03 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345443#comment-14345443
 ] 

Mithun Radhakrishnan commented on HIVE-9083:


[~leftylev]: Perhaps we should wait for HIVE-9086 to firm up. This is only the 
API change, which the javadoc hopefully covers.

> New metastore API to support to purge partition-data directly in 
> dropPartitions().
> --
>
> Key: HIVE-9083
> URL: https://issues.apache.org/jira/browse/HIVE-9083
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Fix For: 1.2.0
>
> Attachments: HIVE-9083.3.patch, HIVE-9083.4.patch, HIVE-9083.5.patch
>
>
> HIVE-7100 adds the option to purge table-data when dropping a table (from 
> Hive CLI.)
> This patch adds HiveMetaStoreClient APIs to support the same for 
> {{dropPartitions()}}.
> (I'll add a follow-up to support a command-line option for the same.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9674) *DropPartitionEvent should handle partition-sets.

2015-03-04 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9674:
---
Attachment: HIVE-9736.4.patch

Here's an updated patch to decouple from HIVE-9609. One function is duplicated 
in {{JSONMessageFactory}}. (Sorry, Sush.) 

> *DropPartitionEvent should handle partition-sets.
> -
>
> Key: HIVE-9674
> URL: https://issues.apache.org/jira/browse/HIVE-9674
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.14.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9674.2.patch, HIVE-9736.3.patch, HIVE-9736.4.patch
>
>
> Dropping a set of N partitions from a table currently results in N 
> DropPartitionEvents (and N PreDropPartitionEvents) being fired serially. This 
> is wasteful, especially so for large N. It also makes it impossible to even 
> try to run authorization-checks on all partitions in a batch.
> Taking the cue from HIVE-9609, we should compose an {{Iterable<Partition>}} 
> in the event, and expose them via an {{Iterator}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9588) Reimplement HCatClientHMSImpl.dropPartitions() with HMSC.dropPartitions()

2015-03-08 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9588:
---
Attachment: HIVE-9588.4.patch

Hey, [~thejas]. Thanks for reviewing. I've added the comment you suggested.

> Reimplement HCatClientHMSImpl.dropPartitions() with HMSC.dropPartitions()
> -
>
> Key: HIVE-9588
> URL: https://issues.apache.org/jira/browse/HIVE-9588
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog, Metastore, Thrift API
>Affects Versions: 0.14.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9588.1.patch, HIVE-9588.2.patch, HIVE-9588.3.patch, 
> HIVE-9588.4.patch
>
>
> {{HCatClientHMSImpl.dropPartitions()}} currently has an embarrassingly 
> inefficient implementation. The partial partition-spec is converted into a 
> filter-string. The partitions are fetched from the server, and then dropped 
> one by one.
> Here's a reimplementation that uses the {{ExprNode}}-based 
> {{HiveMetaStoreClient.dropPartitions()}}. It cuts out the excessive 
> back-and-forth between the HMS and the client-side. It also reduces the 
> memory footprint (from loading all the partitions that are to be dropped). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HIVE-9845) HCatSplit repeats information making input split data size huge

2015-03-18 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan reassigned HIVE-9845:
--

Assignee: Mithun Radhakrishnan

> HCatSplit repeats information making input split data size huge
> ---
>
> Key: HIVE-9845
> URL: https://issues.apache.org/jira/browse/HIVE-9845
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Rohini Palaniswamy
>Assignee: Mithun Radhakrishnan
>
> Pig on Tez jobs with larger tables hit PIG-4443. Running on HDFS data which 
> has even triple the number of splits (100K+ splits and tasks) does not hit 
> that issue.
> {code}
> HCatBaseInputFormat.java:
>   //Call getSplits on the InputFormat, create an
>   //HCatSplit for each underlying split
>   //NumSplits is 0 for our purposes
>   org.apache.hadoop.mapred.InputSplit[] baseSplits = inputFormat.getSplits(jobConf, 0);
>   for (org.apache.hadoop.mapred.InputSplit split : baseSplits) {
>     splits.add(new HCatSplit(partitionInfo, split, allCols));
>   }
> {code}
> Each HCatSplit duplicates the partition schema and the table schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9845) HCatSplit repeats information making input split data size huge

2015-03-18 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368588#comment-14368588
 ] 

Mithun Radhakrishnan commented on HIVE-9845:


There are several problems:

1. {{PartInfo}} stores and serializes the full {{TableInfo}} for every 
{{HCatSplit}} instance, even though that information is immutable;
2. {{PartInfo}} stores class-names for StorageHandler, SerDe, InputFormat and 
OutputFormat. It's likely that a lot of this is identical to the Table's info. 
I've changed the serialization not to include it if it doesn't differ from the 
Table.
3. Every {{HCatSplit}} stores the table's column schema separately, in spite of 
having this information available both in the {{InputJobInfo}} in the 
configuration, and in the {{TableInfo}} within the {{PartInfo}}. Again, 
redundant and wasteful.

I've changed the above. My testing with a Pig-script on a wide table (75 
columns) with a query spanning 45 partitions and 1000 splits looks promising:

{code}
# Before:
-rw-r--r--  10 mithunr users   55335466 2015-03-19 05:56 
/user/mithunr/.staging/job_1426414782073_303408/job.split
# After:
-rw-r--r--  10 mithunr users2643046 2015-03-19 06:01 
/user/mithunr/.staging/job_1426414782073_303697/job.split
{code}

Will post patch after cleaning it up a tad.
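
To illustrate the de-duplication in #2 (with simplified stand-in classes; this is a sketch of the idea, not the actual patch), the serialization side looks something like:

{code:title=SplitInfoSketch.java (sketch)|borderStyle=solid}
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Simplified stand-ins for TableInfo/PartInfo, for illustration only.
class TableInfoSketch implements Serializable {
  String inputFormatClassName;
}

public class SplitInfoSketch implements Serializable {

  private String inputFormatClassName; // usually identical to the table's
  private TableInfoSketch tableInfo;

  private void writeObject(ObjectOutputStream out) throws IOException {
    String saved = inputFormatClassName;
    if (saved != null && saved.equals(tableInfo.inputFormatClassName)) {
      inputFormatClassName = null; // don't repeat the table-level value
    }
    out.defaultWriteObject();
    inputFormatClassName = saved;  // leave the in-memory object untouched
  }

  private void readObject(ObjectInputStream in)
      throws IOException, ClassNotFoundException {
    in.defaultReadObject();
    if (inputFormatClassName == null) {
      inputFormatClassName = tableInfo.inputFormatClassName; // restore
    }
  }
}
{code}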

> HCatSplit repeats information making input split data size huge
> ---
>
> Key: HIVE-9845
> URL: https://issues.apache.org/jira/browse/HIVE-9845
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Rohini Palaniswamy
>Assignee: Mithun Radhakrishnan
>
> Pig on Tez jobs with larger tables hit PIG-4443. Running on HDFS data which 
> has even triple the number of splits (100K+ splits and tasks) does not hit 
> that issue.
> {code}
> HCatBaseInputFormat.java:
>   //Call getSplits on the InputFormat, create an
>   //HCatSplit for each underlying split
>   //NumSplits is 0 for our purposes
>   org.apache.hadoop.mapred.InputSplit[] baseSplits = inputFormat.getSplits(jobConf, 0);
>   for (org.apache.hadoop.mapred.InputSplit split : baseSplits) {
>     splits.add(new HCatSplit(partitionInfo, split, allCols));
>   }
> {code}
> Each HCatSplit duplicates the partition schema and the table schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9845) HCatSplit repeats information making input split data size huge

2015-03-20 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9845:
---
Attachment: HIVE-9845.1.patch

The proposed fix.

> HCatSplit repeats information making input split data size huge
> ---
>
> Key: HIVE-9845
> URL: https://issues.apache.org/jira/browse/HIVE-9845
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Rohini Palaniswamy
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9845.1.patch
>
>
> Pig on Tez jobs with larger tables hit PIG-4443. Running on HDFS data which 
> has even triple the number of splits (100K+ splits and tasks) does not hit 
> that issue.
> {code}
> HCatBaseInputFormat.java:
>   //Call getSplits on the InputFormat, create an
>   //HCatSplit for each underlying split
>   //NumSplits is 0 for our purposes
>   org.apache.hadoop.mapred.InputSplit[] baseSplits = inputFormat.getSplits(jobConf, 0);
>   for (org.apache.hadoop.mapred.InputSplit split : baseSplits) {
>     splits.add(new HCatSplit(partitionInfo, split, allCols));
>   }
> {code}
> Each HCatSplit duplicates the partition schema and the table schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9674) *DropPartitionEvent should handle partition-sets.

2015-03-25 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380320#comment-14380320
 ] 

Mithun Radhakrishnan commented on HIVE-9674:


Sush, could you please review this one? I'd like to avoid another rebase.

> *DropPartitionEvent should handle partition-sets.
> -
>
> Key: HIVE-9674
> URL: https://issues.apache.org/jira/browse/HIVE-9674
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.14.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9674.2.patch, HIVE-9736.3.patch, HIVE-9736.4.patch
>
>
> Dropping a set of N partitions from a table currently results in N 
> DropPartitionEvents (and N PreDropPartitionEvents) being fired serially. This 
> is wasteful, especially so for large N. It also makes it impossible to even 
> try to run authorization-checks on all partitions in a batch.
> Taking the cue from HIVE-9609, we should compose an {{Iterable}} 
> in the event, and expose them via an {{Iterator}}.
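
A minimal sketch of the proposed shape (illustrative names, not a patch): the whole batch is carried in one event and exposed lazily, so listeners can authorize all partitions in a single pass.

{code:java}
import java.util.Iterator;
import java.util.List;

// Sketch only: one event per dropped batch, instead of N events fired serially.
class DropPartitionBatchEvent<P> {
  private final Iterable<P> partitions;

  DropPartitionBatchEvent(List<P> partitions) {
    this.partitions = partitions;
  }

  // Listeners walk the batch; an authorizer can check everything in one pass.
  Iterator<P> getPartitionIterator() {
    return partitions.iterator();
  }
}
{code}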



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9582) HCatalog should use IMetaStoreClient interface

2015-03-25 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381017#comment-14381017
 ] 

Mithun Radhakrishnan commented on HIVE-9582:


I wonder, should {{HCatUtils}} be package-protected?

> HCatalog should use IMetaStoreClient interface
> --
>
> Key: HIVE-9582
> URL: https://issues.apache.org/jira/browse/HIVE-9582
> Project: Hive
>  Issue Type: Sub-task
>  Components: HCatalog, Metastore
>Affects Versions: 0.14.0, 0.13.1
>Reporter: Thiruvel Thirumoolan
>Assignee: Thiruvel Thirumoolan
>  Labels: hcatalog, metastore, rolling_upgrade
> Attachments: HIVE-9582.1.patch, HIVE-9582.2.patch, HIVE-9582.3.patch, 
> HIVE-9582.4.patch, HIVE-9582.5.patch, HIVE-9583.1.patch
>
>
> Hive uses IMetaStoreClient and it makes using RetryingMetaStoreClient easy. 
> Hence during a failure, the client retries and possibly succeeds. But 
> HCatalog has long been using HiveMetaStoreClient directly and hence failures 
> are costly, especially if they are during the commit stage of a job. It's also 
> not possible to do a rolling upgrade of the MetaStore Server.
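
A hedged sketch of what the switch looks like (the factory signature here is from 0.14-era code and is an assumption; exact arguments vary by version):

{code:java}
// Sketch: obtain the client through the IMetaStoreClient interface so the
// retrying proxy can wrap it; a raw HiveMetaStoreClient gets no retries.
IMetaStoreClient client = RetryingMetaStoreClient.getProxy(
    hiveConf,
    new HiveMetaHookLoader() {
      @Override
      public HiveMetaHook getHook(Table tbl) throws MetaException {
        return null;  // no storage-handler hook needed for this sketch
      }
    },
    HiveMetaStoreClient.class.getName());
{code}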



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9845) HCatSplit repeats information making input split data size huge

2015-03-30 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9845:
---
Attachment: HIVE-9845.2.patch

> HCatSplit repeats information making input split data size huge
> ---
>
> Key: HIVE-9845
> URL: https://issues.apache.org/jira/browse/HIVE-9845
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Rohini Palaniswamy
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9845.1.patch, HIVE-9845.2.patch
>
>
> Pig on Tez jobs with larger tables hit PIG-4443. Running on HDFS data which 
> has even triple the number of splits (100K+ splits and tasks) does not hit 
> that issue.
> {code}
> HCatBaseInputFormat.java:
>  //Call getSplit on the InputFormat, create an
>   //HCatSplit for each underlying split
>   //NumSplits is 0 for our purposes
>   org.apache.hadoop.mapred.InputSplit[] baseSplits = 
> inputFormat.getSplits(jobConf, 0);
>   for(org.apache.hadoop.mapred.InputSplit split : baseSplits) {
> splits.add(new HCatSplit(
> partitionInfo,
> split,allCols));
>   }
> {code}
> Each hcatSplit duplicates partition schema and table schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9845) HCatSplit repeats information making input split data size huge

2015-03-31 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9845:
---
Attachment: (was: HIVE-9845.2.patch)

> HCatSplit repeats information making input split data size huge
> ---
>
> Key: HIVE-9845
> URL: https://issues.apache.org/jira/browse/HIVE-9845
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Rohini Palaniswamy
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9845.1.patch
>
>
> Pig on Tez jobs with larger tables hit PIG-4443. Running on HDFS data which 
> has even triple the number of splits (100K+ splits and tasks) does not hit 
> that issue.
> {code}
> HCatBaseInputFormat.java:
>  //Call getSplit on the InputFormat, create an
>   //HCatSplit for each underlying split
>   //NumSplits is 0 for our purposes
>   org.apache.hadoop.mapred.InputSplit[] baseSplits = 
> inputFormat.getSplits(jobConf, 0);
>   for(org.apache.hadoop.mapred.InputSplit split : baseSplits) {
> splits.add(new HCatSplit(
> partitionInfo,
> split,allCols));
>   }
> {code}
> Each hcatSplit duplicates partition schema and table schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9845) HCatSplit repeats information making input split data size huge

2015-03-31 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9845:
---
Attachment: HIVE-9845.3.patch

Another take on the first patch, except with more logging and a correction to 
{{TestHCatOutputFormat}}.

> HCatSplit repeats information making input split data size huge
> ---
>
> Key: HIVE-9845
> URL: https://issues.apache.org/jira/browse/HIVE-9845
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Rohini Palaniswamy
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9845.1.patch, HIVE-9845.3.patch
>
>
> Pig on Tez jobs with larger tables hit PIG-4443. Running on HDFS data which 
> has even triple the number of splits (100K+ splits and tasks) does not hit 
> that issue.
> {code}
> HCatBaseInputFormat.java:
>  //Call getSplit on the InputFormat, create an
>   //HCatSplit for each underlying split
>   //NumSplits is 0 for our purposes
>   org.apache.hadoop.mapred.InputSplit[] baseSplits = 
> inputFormat.getSplits(jobConf, 0);
>   for(org.apache.hadoop.mapred.InputSplit split : baseSplits) {
> splits.add(new HCatSplit(
> partitionInfo,
> split,allCols));
>   }
> {code}
> Each hcatSplit duplicates partition schema and table schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9845) HCatSplit repeats information making input split data size huge

2015-03-31 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388925#comment-14388925
 ] 

Mithun Radhakrishnan commented on HIVE-9845:


Bah, finally. Unrelated test-failures.

> HCatSplit repeats information making input split data size huge
> ---
>
> Key: HIVE-9845
> URL: https://issues.apache.org/jira/browse/HIVE-9845
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Rohini Palaniswamy
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9845.1.patch, HIVE-9845.3.patch
>
>
> Pig on Tez jobs with larger tables hit PIG-4443. Running on HDFS data which 
> has even triple the number of splits (100K+ splits and tasks) does not hit 
> that issue.
> {code}
> HCatBaseInputFormat.java:
>  //Call getSplit on the InputFormat, create an
>   //HCatSplit for each underlying split
>   //NumSplits is 0 for our purposes
>   org.apache.hadoop.mapred.InputSplit[] baseSplits = 
> inputFormat.getSplits(jobConf, 0);
>   for(org.apache.hadoop.mapred.InputSplit split : baseSplits) {
> splits.add(new HCatSplit(
> partitionInfo,
> split,allCols));
>   }
> {code}
> Each hcatSplit duplicates partition schema and table schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9845) HCatSplit repeats information making input split data size huge

2015-03-31 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9845:
---
Attachment: HIVE-9845.3.patch

> HCatSplit repeats information making input split data size huge
> ---
>
> Key: HIVE-9845
> URL: https://issues.apache.org/jira/browse/HIVE-9845
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Rohini Palaniswamy
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9845.1.patch, HIVE-9845.3.patch
>
>
> Pig on Tez jobs with larger tables hit PIG-4443. Running on HDFS data which 
> has even triple the number of splits (100K+ splits and tasks) does not hit 
> that issue.
> {code}
> HCatBaseInputFormat.java:
>  //Call getSplit on the InputFormat, create an
>   //HCatSplit for each underlying split
>   //NumSplits is 0 for our purposes
>   org.apache.hadoop.mapred.InputSplit[] baseSplits = 
> inputFormat.getSplits(jobConf, 0);
>   for(org.apache.hadoop.mapred.InputSplit split : baseSplits) {
> splits.add(new HCatSplit(
> partitionInfo,
> split,allCols));
>   }
> {code}
> Each hcatSplit duplicates partition schema and table schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9845) HCatSplit repeats information making input split data size huge

2015-03-31 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9845:
---
Attachment: (was: HIVE-9845.3.patch)

> HCatSplit repeats information making input split data size huge
> ---
>
> Key: HIVE-9845
> URL: https://issues.apache.org/jira/browse/HIVE-9845
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Rohini Palaniswamy
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9845.1.patch, HIVE-9845.3.patch
>
>
> Pig on Tez jobs with larger tables hit PIG-4443. Running on HDFS data which 
> has even triple the number of splits (100K+ splits and tasks) does not hit 
> that issue.
> {code}
> HCatBaseInputFormat.java:
>  //Call getSplit on the InputFormat, create an
>   //HCatSplit for each underlying split
>   //NumSplits is 0 for our purposes
>   org.apache.hadoop.mapred.InputSplit[] baseSplits = 
> inputFormat.getSplits(jobConf, 0);
>   for(org.apache.hadoop.mapred.InputSplit split : baseSplits) {
> splits.add(new HCatSplit(
> partitionInfo,
> split,allCols));
>   }
> {code}
> Each hcatSplit duplicates partition schema and table schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10213) MapReduce jobs using dynamic-partitioning fail on commit.

2015-04-03 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-10213:

Attachment: HIVE-10213.1.patch

> MapReduce jobs using dynamic-partitioning fail on commit.
> -
>
> Key: HIVE-10213
> URL: https://issues.apache.org/jira/browse/HIVE-10213
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-10213.1.patch
>
>
> I recently ran into a problem in {{TaskCommitContextRegistry}}, when using 
> dynamic-partitions.
> Consider a MapReduce program that reads HCatRecords from a table (using 
> HCatInputFormat), and then writes to another table (with identical schema), 
> using HCatOutputFormat. The Map-task fails with the following exception:
> {code}
> Error: java.io.IOException: No callback registered for 
> TaskAttemptID:attempt_1426589008676_509707_m_00_0@hdfs://crystalmyth.myth.net:8020/user/mithunr/mythdb/target/_DYN0.6784154320609959/grid=__HIVE_DEFAULT_PARTITION__/dt=__HIVE_DEFAULT_PARTITION__
> at 
> org.apache.hive.hcatalog.mapreduce.TaskCommitContextRegistry.commitTask(TaskCommitContextRegistry.java:56)
> at 
> org.apache.hive.hcatalog.mapreduce.FileOutputCommitterContainer.commitTask(FileOutputCommitterContainer.java:139)
> at org.apache.hadoop.mapred.Task.commit(Task.java:1163)
> at org.apache.hadoop.mapred.Task.done(Task.java:1025)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:345)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> {code}
> {{TaskCommitContextRegistry::commitTask()}} uses call-backs registered from 
> {{DynamicPartitionFileRecordWriter}}. But when {{HCatInputFormat}} and 
> {{HCatOutputFormat}} are both used in the same job, the 
> {{DynamicPartitionFileRecordWriter}} might only be exercised in the Reducer.
> I'm relaxing the IOException, and logging a warning message instead of just 
> failing.
> (I'll post the fix shortly.)
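
As a sketch of that relaxation (names such as {{TaskCommitterProxy}} and {{generateKey}} are stand-ins, not the committed patch):

{code:java}
// Sketch only: tolerate a missing callback instead of failing the task.
public void commitTask(TaskAttemptContext context) throws IOException {
  String key = generateKey(context);                    // assumed helper
  TaskCommitterProxy committer = taskCommitters.get(key);
  if (committer == null) {
    // Previously: throw new IOException("No callback registered for " + key);
    LOG.warn("No callback registered for " + key + ". Skipping task-commit.");
    return;
  }
  committer.commitTask(context);
}
{code}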



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10213) MapReduce jobs using dynamic-partitioning fail on commit.

2015-04-05 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14396319#comment-14396319
 ] 

Mithun Radhakrishnan commented on HIVE-10213:
-

The borked tests are unrelated, I think. 

> MapReduce jobs using dynamic-partitioning fail on commit.
> -
>
> Key: HIVE-10213
> URL: https://issues.apache.org/jira/browse/HIVE-10213
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-10213.1.patch
>
>
> I recently ran into a problem in {{TaskCommitContextRegistry}}, when using 
> dynamic-partitions.
> Consider a MapReduce program that reads HCatRecords from a table (using 
> HCatInputFormat), and then writes to another table (with identical schema), 
> using HCatOutputFormat. The Map-task fails with the following exception:
> {code}
> Error: java.io.IOException: No callback registered for 
> TaskAttemptID:attempt_1426589008676_509707_m_00_0@hdfs://crystalmyth.myth.net:8020/user/mithunr/mythdb/target/_DYN0.6784154320609959/grid=__HIVE_DEFAULT_PARTITION__/dt=__HIVE_DEFAULT_PARTITION__
> at 
> org.apache.hive.hcatalog.mapreduce.TaskCommitContextRegistry.commitTask(TaskCommitContextRegistry.java:56)
> at 
> org.apache.hive.hcatalog.mapreduce.FileOutputCommitterContainer.commitTask(FileOutputCommitterContainer.java:139)
> at org.apache.hadoop.mapred.Task.commit(Task.java:1163)
> at org.apache.hadoop.mapred.Task.done(Task.java:1025)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:345)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> {code}
> {{TaskCommitContextRegistry::commitTask()}} uses call-backs registered from 
> {{DynamicPartitionFileRecordWriter}}. But when {{HCatInputFormat}} and 
> {{HCatOutputFormat}} are both used in the same job, the 
> {{DynamicPartitionFileRecordWriter}} might only be exercised in the Reducer.
> I'm relaxing the IOException, and logging a warning message instead of just 
> failing.
> (I'll post the fix shortly.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9609) AddPartitionMessage.getPartitions() can return null

2015-04-07 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484171#comment-14484171
 ] 

Mithun Radhakrishnan commented on HIVE-9609:


@[~sushanth]: 

1-2. I agree with you, and hence, me again. (?!) {{List>}} might 
be doable, but we can hit that with a separate JIRA. The rest of the iterator 
stuff is pretty neat. I'll read through the updated patch more closely before 
+1-ing.
3. That was likely my (IDE's) doing. Much obliged, and many simultaneous 
apologies.

I had recommended a change to 
{{AuthorizationPreEventListener.authorizeAddPartition}} to use the alternative 
{{PartitionWrapper}} constructor. (It's way faster.) But again, it's possible 
that that change distracts from our objective here. Separate JIRA?

> AddPartitionMessage.getPartitions() can return null
> ---
>
> Key: HIVE-9609
> URL: https://issues.apache.org/jira/browse/HIVE-9609
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Sushanth Sowmyan
>Assignee: Sushanth Sowmyan
> Attachments: HIVE-9609.2.patch, HIVE-9609.3.patch, HIVE-9609.patch
>
>
> DbNotificationListener and NotificationListener both depend on 
> AddPartitionEvent.getPartitions() to get their partitions to trigger a 
> message, but this can be null if an AddPartitionEvent was initialized on a 
> PartitionSpec rather than a List.
> Also, AddPartitionEvent seems to have a duality, where getPartitions() works 
> only if instantiated on a List, and getPartitionIterator() works 
> only if instantiated on a PartitionSpec.
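
Until the duality is resolved, a listener can defend itself with something like the following sketch (accessor names taken from the description above; the handler call is an assumed placeholder):

{code:java}
// Sketch: tolerate both construction paths of AddPartitionEvent.
Iterator<Partition> parts =
    (event.getPartitions() != null)
        ? event.getPartitions().iterator()   // built on a List
        : event.getPartitionIterator();      // built on a PartitionSpec
while (parts.hasNext()) {
  handlePartition(parts.next());             // assumed listener-specific work
}
{code}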



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10250) Optimize AuthorizationPreEventListener to reuse TableWrapper objects

2015-04-07 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-10250:

Attachment: HIVE-10250.1.patch

Tentative fix. I'll have to rebase this once HIVE-9609 goes in.

> Optimize AuthorizationPreEventListener to reuse TableWrapper objects
> 
>
> Key: HIVE-10250
> URL: https://issues.apache.org/jira/browse/HIVE-10250
> Project: Hive
>  Issue Type: Bug
>  Components: Authorization
>Reporter: Mithun Radhakrishnan
> Attachments: HIVE-10250.1.patch
>
>
> Here's the {{PartitionWrapper}} class in {{AuthorizationPreEventListener}}:
> {code:java|title=AuthorizationPreEventListener.java}
>  public static class PartitionWrapper extends 
> org.apache.hadoop.hive.ql.metadata.Partition {
> ...
> public PartitionWrapper(org.apache.hadoop.hive.metastore.api.Partition 
> mapiPart, PreEventContext context) throws ... {
>  Partition wrapperApiPart   = mapiPart.deepCopy();
>  Table t = context.getHandler().get_table_core(
>  mapiPart.getDbName(), 
>  mapiPart.getTableName());
> ...
> }
> {code}
> {{PreAddPartitionEvent}} (and soon, {{PreDropPartitionEvent}}) correspond not 
> just to a single partition, but to an entire set of partitions added atomically. 
> When the event is authorized, {{HMSHandler.get_table_core()}} will be called 
> once for every partition in the Event instance.
> Since we already make the assumption that the partition-sets correspond to a 
> single table, we might as well make a single call.
> I'll have a patch for this, shortly.
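
A minimal sketch of the intended optimization (the constructor overload and loop shape are assumptions): fetch the {{Table}} once per event and reuse it for every partition in the batch.

{code:java}
// Sketch only: one metastore lookup per event, not one per partition.
Table table = context.getHandler().get_table_core(dbName, tableName);
for (org.apache.hadoop.hive.metastore.api.Partition p : event.getPartitions()) {
  partitions.add(new PartitionWrapper(p, table));  // assumed overload taking Table
}
{code}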



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9609) AddPartitionMessage.getPartitions() can return null

2015-04-07 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484479#comment-14484479
 ] 

Mithun Radhakrishnan commented on HIVE-9609:


Looking good, sir. +1.

FYI, I've raised HIVE-10250 for the {{AuthPreEventListener}} problem. I'll 
rebase HIVE-10250 and HIVE-9674 after we commit this one.

> AddPartitionMessage.getPartitions() can return null
> ---
>
> Key: HIVE-9609
> URL: https://issues.apache.org/jira/browse/HIVE-9609
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Sushanth Sowmyan
>Assignee: Sushanth Sowmyan
> Attachments: HIVE-9609.2.patch, HIVE-9609.3.patch, HIVE-9609.patch
>
>
> DbNotificationListener and NotificationListener both depend on 
> AddPartitionEvent.getPartitions() to get their partitions to trigger a 
> message, but this can be null if an AddPartitionEvent was initialized on a 
> PartitionSpec rather than a List.
> Also, AddPartitionEvent seems to have a duality, where getPartitions() works 
> only if instantiated on a List, and getPartitionIterator() works 
> only if instantiated on a PartitionSpec.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HIVE-10250) Optimize AuthorizationPreEventListener to reuse TableWrapper objects

2015-04-08 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan reassigned HIVE-10250:
---

Assignee: Mithun Radhakrishnan

> Optimize AuthorizationPreEventListener to reuse TableWrapper objects
> 
>
> Key: HIVE-10250
> URL: https://issues.apache.org/jira/browse/HIVE-10250
> Project: Hive
>  Issue Type: Bug
>  Components: Authorization
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-10250.1.patch
>
>
> Here's the {{PartitionWrapper}} class in {{AuthorizationPreEventListener}}:
> {code:java|title=AuthorizationPreEventListener.java}
>  public static class PartitionWrapper extends 
> org.apache.hadoop.hive.ql.metadata.Partition {
> ...
> public PartitionWrapper(org.apache.hadoop.hive.metastore.api.Partition 
> mapiPart, PreEventContext context) throws ... {
>  Partition wrapperApiPart   = mapiPart.deepCopy();
>  Table t = context.getHandler().get_table_core(
>  mapiPart.getDbName(), 
>  mapiPart.getTableName());
> ...
> }
> {code}
> {{PreAddPartitionEvent}} (and soon, {{PreDropPartitionEvent}}) correspond not 
> just to a single partition, but to an entire set of partitions added atomically. 
> When the event is authorized, {{HMSHandler.get_table_core()}} will be called 
> once for every partition in the Event instance.
> Since we already make the assumption that the partition-sets correspond to a 
> single table, we might as well make a single call.
> I'll have a patch for this, shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9674) *DropPartitionEvent should handle partition-sets.

2015-04-08 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486145#comment-14486145
 ] 

Mithun Radhakrishnan commented on HIVE-9674:


Actually, [~sushanth], let's hold off for right now, on this one. I'll rebase 
this under the assumption that HIVE-9609 is good to go.

> *DropPartitionEvent should handle partition-sets.
> -
>
> Key: HIVE-9674
> URL: https://issues.apache.org/jira/browse/HIVE-9674
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.14.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9674.2.patch
>
>
> Dropping a set of N partitions from a table currently results in N 
> DropPartitionEvents (and N PreDropPartitionEvents) being fired serially. This 
> is wasteful, especially so for large N. It also makes it impossible to even 
> try to run authorization-checks on all partitions in a batch.
> Taking the cue from HIVE-9609, we should compose an {{Iterable}} 
> in the event, and expose them via an {{Iterator}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9674) *DropPartitionEvent should handle partition-sets.

2015-04-08 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9674:
---
Attachment: HIVE-9674.3.patch

Rebased to accommodate HIVE-9609.

> *DropPartitionEvent should handle partition-sets.
> -
>
> Key: HIVE-9674
> URL: https://issues.apache.org/jira/browse/HIVE-9674
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.14.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9674.2.patch, HIVE-9674.3.patch
>
>
> Dropping a set of N partitions from a table currently results in N 
> DropPartitionEvents (and N PreDropPartitionEvents) being fired serially. This 
> is wasteful, especially so for large N. It also makes it impossible to even 
> try to run authorization-checks on all partitions in a batch.
> Taking the cue from HIVE-9609, we should compose an {{Iterable}} 
> in the event, and expose them via an {{Iterator}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9681) Extend HiveAuthorizationProvider to support partition-sets.

2015-04-08 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9681:
---
Attachment: HIVE-9681.2.patch

Rebased to accommodate the rebase of HIVE-9674.

> Extend HiveAuthorizationProvider to support partition-sets.
> ---
>
> Key: HIVE-9681
> URL: https://issues.apache.org/jira/browse/HIVE-9681
> Project: Hive
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 0.14.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9681.1.patch, HIVE-9681.2.patch
>
>
> {{HiveAuthorizationProvider}} allows only for the authorization of a single 
> partition at a time. For instance, when the {{StorageBasedAuthProvider}} must 
> authorize an operation on a set of partitions (say from a 
> PreDropPartitionEvent), each partition's data-directory needs to be checked 
> individually. For N partitions, this results in N namenode calls.
> I'd like to add {{authorize()}} overloads that accept multiple partitions. 
> This will allow StorageBasedAuthProvider to make batched namenode calls. 
> P.S. There are two further optimizations possible:
> 1. In the ideal case, we'd have a single call in 
> {{org.apache.hadoop.fs.FileSystem}} to check access for an array of Paths, 
> something like:
> {code:title=FileSystem.java|borderStyle=solid}
> @InterfaceAudience.LimitedPrivate({"HDFS", "Hive"})
>   public void access(Path [] paths, FsAction mode) throws 
> AccessControlException, FileNotFoundException, IOException 
> {...}
> {code}
> 2. We can go one better if we could retrieve partition-locations in DirectSQL 
> and use those for authorization. The EventListener-abstraction behind which 
> the AuthProviders operate makes this difficult. I can attempt to solve this 
> using a PartitionSpec and a call-back into the ObjectStore from 
> StorageBasedAuthProvider. I'll save this rigmarole for later.
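
For illustration, the batched overload might look like this (the signature is an assumption, mirroring the existing single-partition {{authorize()}}):

{code:java}
// Sketch: batched authorization, letting implementations such as
// StorageBasedAuthProvider gather all partition paths and check them together.
void authorize(Table table, Iterable<Partition> partitions,
    Privilege[] readRequiredPriv, Privilege[] writeRequiredPriv)
    throws HiveException, AuthorizationException;
{code}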



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9674) *DropPartitionEvent should handle partition-sets.

2015-04-08 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9674:
---
Attachment: (was: HIVE-9736.3.patch)

> *DropPartitionEvent should handle partition-sets.
> -
>
> Key: HIVE-9674
> URL: https://issues.apache.org/jira/browse/HIVE-9674
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.14.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9674.2.patch
>
>
> Dropping a set of N partitions from a table currently results in N 
> DropPartitionEvents (and N PreDropPartitionEvents) being fired serially. This 
> is wasteful, especially so for large N. It also makes it impossible to even 
> try to run authorization-checks on all partitions in a batch.
> Taking the cue from HIVE-9609, we should compose an {{Iterable}} 
> in the event, and expose them via an {{Iterator}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9736) StorageBasedAuthProvider should batch namenode-calls where possible.

2015-04-08 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486159#comment-14486159
 ] 

Mithun Radhakrishnan commented on HIVE-9736:


Ok, I'd better rebase this change.

> StorageBasedAuthProvider should batch namenode-calls where possible.
> 
>
> Key: HIVE-9736
> URL: https://issues.apache.org/jira/browse/HIVE-9736
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore, Security
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9736.1.patch
>
>
> Consider a table partitioned by 2 keys (dt, region). Say a dt partition could 
> have many associated regions. Consider that the user does:
> {code:sql}
> ALTER TABLE my_table DROP PARTITION (dt='20150101');
> {code}
> As things stand now, {{StorageBasedAuthProvider}} will make individual 
> {{DistributedFileSystem.listStatus()}} calls for each partition-directory, 
> and authorize each one separately. It'd be faster to batch the calls, and 
> examine multiple FileStatus objects at once.
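
A sketch of the batching (helper names are stand-ins; {{FileSystem.listStatus(Path[])}} does exist in the Hadoop API):

{code:java}
// Sketch only: one batched listStatus() call instead of one RPC per partition.
List<Path> partDirs = new ArrayList<Path>();
for (Partition part : partsBeingDropped) {
  partDirs.add(new Path(part.getSd().getLocation()));
}
FileStatus[] statuses = fs.listStatus(partDirs.toArray(new Path[partDirs.size()]));
for (FileStatus status : statuses) {
  checkWritePermission(status);  // assumed: same per-directory check as today
}
{code}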



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9674) *DropPartitionEvent should handle partition-sets.

2015-04-08 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9674:
---
Attachment: (was: HIVE-9736.4.patch)

> *DropPartitionEvent should handle partition-sets.
> -
>
> Key: HIVE-9674
> URL: https://issues.apache.org/jira/browse/HIVE-9674
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.14.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9674.2.patch
>
>
> Dropping a set of N partitions from a table currently results in N 
> DropPartitionEvents (and N PreDropPartitionEvents) being fired serially. This 
> is wasteful, especially so for large N. It also makes it impossible to even 
> try to run authorization-checks on all partitions in a batch.
> Taking the cue from HIVE-9609, we should compose an {{Iterable}} 
> in the event, and expose them via an {{Iterator}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9736) StorageBasedAuthProvider should batch namenode-calls where possible.

2015-04-08 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9736:
---
Attachment: HIVE-9736.2.patch

The rebased patch, as promised. 

> StorageBasedAuthProvider should batch namenode-calls where possible.
> 
>
> Key: HIVE-9736
> URL: https://issues.apache.org/jira/browse/HIVE-9736
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore, Security
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9736.1.patch, HIVE-9736.2.patch
>
>
> Consider a table partitioned by 2 keys (dt, region). Say a dt partition could 
> have many associated regions. Consider that the user does:
> {code:sql}
> ALTER TABLE my_table DROP PARTITION (dt='20150101');
> {code}
> As things stand now, {{StorageBasedAuthProvider}} will make individual 
> {{DistributedFileSystem.listStatus()}} calls for each partition-directory, 
> and authorize each one separately. It'd be faster to batch the calls, and 
> examine multiple FileStatus objects at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9736) StorageBasedAuthProvider should batch namenode-calls where possible.

2015-04-08 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486260#comment-14486260
 ] 

Mithun Radhakrishnan commented on HIVE-9736:


@[~cnauroth]: Good to meet you, sir. I'd value your input on this change, given 
that you've worked on the SBAP already.

bq. Great ideas in this patch!
Aww, shucks... You're only saying that because it's true. ;p 

I should have a rebased version for you shortly. I'd better sort HIVE-9674 out 
first.

> StorageBasedAuthProvider should batch namenode-calls where possible.
> 
>
> Key: HIVE-9736
> URL: https://issues.apache.org/jira/browse/HIVE-9736
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore, Security
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Attachments: HIVE-9736.1.patch
>
>
> Consider a table partitioned by 2 keys (dt, region). Say a dt partition could 
> have many associated regions. Consider that the user does:
> {code:sql}
> ALTER TABLE my_table DROP PARTITION (dt='20150101');
> {code}
> As things stand now, {{StorageBasedAuthProvider}} will make individual 
> {{DistributedFileSystem.listStatus()}} calls for each partition-directory, 
> and authorize each one separately. It'd be faster to batch the calls, and 
> examine multiple FileStatus objects at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

