[jira] [Created] (HIVE-12678) BETWEEN relational operator sometimes returns incorrect results against PARQUET tables

2015-12-15 Thread Nicholas Brenwald (JIRA)
Nicholas Brenwald created HIVE-12678:


 Summary: BETWEEN relational operator sometimes returns incorrect 
results against PARQUET tables
 Key: HIVE-12678
 URL: https://issues.apache.org/jira/browse/HIVE-12678
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Nicholas Brenwald


When querying a Parquet table, the BETWEEN relational operator returns 
incorrect results when hive.optimize.index.filter and hive.optimize.ppd.storage 
are both enabled.

Create a parquet table:
{code}
create table t(c string) stored as parquet;
{code}

Insert some strings representing dates:
{code}
insert into t select '2015-12-09' from default.dual limit 1;
insert into t select '2015-12-10' from default.dual limit 1;
insert into t select '2015-12-11' from default.dual limit 1;
{code}

h3. Example 1
This query correctly returns 3:
{code}
set hive.optimize.index.filter=true;
set hive.optimize.ppd.storage=true;
select count(*) from t where c >= '2015-12-09' and c <= '2015-12-11';
+--+--+
| _c0  |
+--+--+
| 3|
+--+--+
{code}

This query incorrectly returns 1:
{code}
set hive.optimize.index.filter=true;
set hive.optimize.ppd.storage=true;
select count(*) from t where c between '2015-12-09' and '2015-12-11';
+--+--+
| _c0  |
+--+--+
| 1|
+--+--+
{code}

Disabling hive.optimize.index.filter resolves the problem. This query now 
correctly returns 3:
{code}
set hive.optimize.index.filter=false;
set hive.optimize.ppd.storage=true;
select count(*) from t where c between '2015-12-09' and '2015-12-11';
+--+--+
| _c0  |
+--+--+
| 3|
+--+--+
{code}

Disabling hive.optimize.ppd.storage resolves the problem. This query now 
correctly returns 3:
{code}
set hive.optimize.index.filter=true;
set hive.optimize.ppd.storage=false;
select count(*) from t where c between '2015-12-09' and '2015-12-11';
+--+--+
| _c0  |
+--+--+
| 3|
+--+--+
{code}

h3. Example 2
This query correctly returns 1:
{code}
set hive.optimize.index.filter=true;
set hive.optimize.ppd.storage=true;
select count(*) from t where c >= '2015-12-10' and c <= '2015-12-10';
+--+--+
| _c0  |
+--+--+
| 1|
+--+--+
{code}

This query incorrectly returns 0:
{code}
set hive.optimize.index.filter=true;
set hive.optimize.ppd.storage=true;
select count(*) from t where c between '2015-12-10' and '2015-12-10';
+--+--+
| _c0  |
+--+--+
| 0|
+--+--+
{code}

Disabling hive.optimize.index.filter resolves the problem. This query now 
correctly returns 1:
{code}
set hive.optimize.index.filter=false;
set hive.optimize.ppd.storage=true;
select count(*) from t where c >= '2015-12-10' and c <= '2015-12-10';
+--+--+
| _c0  |
+--+--+
| 1|
+--+--+
{code}

Disabling hive.optimize.ppd.storage resolves the problem. This query now 
correctly returns 1:
{code}
set hive.optimize.index.filter=true;
set hive.optimize.ppd.storage=false;
select count(*) from t where c >= '2015-12-10' and c <= '2015-12-10';
+--+--+
| _c0  |
+--+--+
| 1|
+--+--+
{code}
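For reference, the semantics the optimizer should preserve can be checked outside Hive. The following is a minimal Python sketch (not Hive code) of the expected inclusive-range behavior, using the same three rows as the examples above: `c BETWEEN lo AND hi` must give the same answer as `c >= lo AND c <= hi`.

```python
# Reference semantics for BETWEEN on strings: an inclusive lexicographic
# range check. The bug above is that the predicate pushed down to the
# Parquet reader drops rows this reference semantics would keep.

rows = ["2015-12-09", "2015-12-10", "2015-12-11"]

def between(c, lo, hi):
    # Inclusive range, lexicographic comparison on strings
    return lo <= c <= hi

def explicit(c, lo, hi):
    # The equivalent pair of comparisons
    return c >= lo and c <= hi

# Both forms must agree on every row and every pair of bounds.
assert sum(between(c, "2015-12-09", "2015-12-11") for c in rows) == 3
assert sum(explicit(c, "2015-12-09", "2015-12-11") for c in rows) == 3
assert sum(between(c, "2015-12-10", "2015-12-10") for c in rows) == 1
assert sum(explicit(c, "2015-12-10", "2015-12-10") for c in rows) == 1
```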



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Allow other implementations of IMetaStoreClient in Hive

2015-12-15 Thread Austin Lee
Thank you so much Alan for your prompt responses and for the information
you provided.  I will have a look at the HBase work.

I am new to the process and it's not 100% clear to me, but the wiki seems
to suggest I should use this forum to get to consensus on a proposal before
creating a JIRA ticket.  If the "why" is clear on my proposal, I would like
to create a JIRA ticket and take this through the rest of the process via
JIRA.  Does that sound good?

Thanks,
Austin

On Tue, Dec 15, 2015 at 11:04 AM, Alan Gates  wrote:

> For work along the same lines you should check out the HBase metastore
> work in Hive 2.0.  It still uses the thrift server and RawStore but puts
> HBase behind it instead of an RDBMS.  We did this because we found that
> most of the inefficiencies of Hive's metadata access had to do with the
> layout of the RDBMS and the way it was accessed.  In the same work I built
> short-circuit options in to avoid using thrift and enable sharing of
> objects across HiveMetaStore and HiveMetaStoreClient.
>
> On the backwards incompatibilities, yes IMetaStoreClient evolves in lock
> step with the thrift interface.  My point was we often add calls, add new
> fields to structs, etc.  Your code would still compile in these cases, new
> features just wouldn't work.  Given that a couple major Hadoop support
> vendors now support rolling upgrades, there are devs interested in making
> sure that client version x works properly with server version x+1.
>
> Still, we don't test for the use case you are proposing so we could end up
> breaking your code without knowing it.
>
> When I said it wasn't external, I meant we did not expect end users to
> write code against it (like say the UDF interface).  Yes it's external to
> the metastore package as you point out.
>
> Alan.
>
> Austin Lee 
> December 15, 2015 at 10:46
> Yes, a more efficient implementation is what I am trying to achieve.  I
> also want to retain the ability to talk to a remote metastore that is not
> necessarily thrift.
>
> To be more precise, what I would like is a more efficient metastore.  In
> looking at the current architecture, I came to a conclusion that there are
> three logical boundaries where I can inject an improved implementation or
> alternative to what Hive offers in the metastore space.
>
> 1) RawStore
> I think the existing mechanism that Hive offers users to choose from major
> RDBMSes works fine.  I suppose there's still room for improvement here, but
> the impact of those improvements would be limited to the storage aspects of
> metadata.
>
> 2) Thrift server
> An alternative HiveMetaStore that talks Hive Metastore Thrift.  It's
> almost a coin toss between this and #3, but I think for the reasons I will
> state below, #3 is preferable.
>
> 3) IMetaStoreClient
> I feel this gives me the most freedom since I can be embedded or remote.
> I am not tied to the Thrift interface or the RawStore interface, if I
> choose to roll my own.
>
> One thing that does concern me is your statement about IMetaStoreClient
> being an internal interface, which is true.  Do the changes to this
> interface really happen ad-hoc?  Doesn't it evolve in lock step with the
> Thrift interface?  If so, wouldn't backward compatibility guarantees for
> Thrift translate to backward compatibility guarantees for this interface as
> well?  From the way it is used by Query Planning, I think it could be made
> an "external" interface that belongs in hive-metastore.
>
>
> Alan Gates 
> December 15, 2015 at 10:14
> I don't see an issue with this, it seems fine.  One caveat though is we
> see this as an internal interface and we change it all the time.  I
> wouldn't want to be pushed into making backwards compatibility guarantees
> for IMetaStoreClient.  Which means that if you develop a different
> implementation of it outside Hive it will likely break on every upgrade.
>
> I don't understand your example use case.  You can run Hive now without
> the thrift server, so I'm guessing that's not what you're really trying to
> do.  Are you just interested in building a more efficient implementation or
> do you have another use case in mind?
>
> Alan.
>
> Austin Lee 
> December 14, 2015 at 20:48
> Hi,
>
> I would like to propose a change that would make it possible for users to
> choose an implementation of IMetaStoreClient via HiveConf, i.e.
> hive-site.xml. Currently, in Hive the choice is hard coded to be
> SessionHiveMetaStoreClient in org.apache.hadoop.hive.ql.metadata.Hive.
> There is no other direct reference to SessionHiveMetaStoreClient other than
> the hard coded class name in Hive.java and the QL component operates only
> on the IMetaStoreClient interface so the change would be minimal and it
> would be quite similar to how an implementation of RawStore is specified
> and loaded in hive-metastore. One use case this change would serve would
> be one where a user wishes to 

Re: Allow other implementations of IMetaStoreClient in Hive

2015-12-15 Thread Alan Gates
I don't see an issue with this, it seems fine.  One caveat though is we 
see this as an internal interface and we change it all the time.  I 
wouldn't want to be pushed into making backwards compatibility 
guarantees for IMetaStoreClient.  Which means that if you develop a 
different implementation of it outside Hive it will likely break on 
every upgrade.


I don't understand your example use case.  You can run Hive now without 
the thrift server, so I'm guessing that's not what you're really trying 
to do.  Are you just interested in building a more efficient 
implementation or do you have another use case in mind?


Alan.


Austin Lee 
December 14, 2015 at 20:48
Hi,

I would like to propose a change that would make it possible for users to
choose an implementation of IMetaStoreClient via HiveConf, i.e.
hive-site.xml. Currently, in Hive the choice is hard coded to be
SessionHiveMetaStoreClient in org.apache.hadoop.hive.ql.metadata.Hive.
There is no other direct reference to SessionHiveMetaStoreClient other than
the hard coded class name in Hive.java and the QL component operates only
on the IMetaStoreClient interface so the change would be minimal and it
would be quite similar to how an implementation of RawStore is specified
and loaded in hive-metastore. One use case this change would serve would
be one where a user wishes to use an implementation of this interface
without the dependency on the Thrift server. I would appreciate the
community's input and feedback on this proposal.

Thank you,
Austin



Re: Allow other implementations of IMetaStoreClient in Hive

2015-12-15 Thread Alan Gates
For work along the same lines you should check out the HBase metastore 
work in Hive 2.0.  It still uses the thrift server and RawStore but puts 
HBase behind it instead of an RDBMS.  We did this because we found that 
most of the inefficiencies of Hive's metadata access had to do with the 
layout of the RDBMS and the way it was accessed.  In the same work I 
built short-circuit options in to avoid using thrift and enable sharing 
of objects across HiveMetaStore and HiveMetaStoreClient.


On the backwards incompatibilities, yes IMetaStoreClient evolves in lock 
step with the thrift interface.  My point was we often add calls, add 
new fields to structs, etc.  Your code would still compile in these 
cases, new features just wouldn't work.  Given that a couple major 
Hadoop support vendors now support rolling upgrades, there are devs 
interested in making sure that client version x works properly with 
server version x+1.


Still, we don't test for the use case you are proposing so we could end 
up breaking your code without knowing it.


When I said it wasn't external, I meant we did not expect end users to 
write code against it (like say the UDF interface).  Yes it's external 
to the metastore package as you point out.


Alan.


Austin Lee 
December 15, 2015 at 10:46
Yes, a more efficient implementation is what I am trying to achieve.  
I also want to retain the ability to talk to a remote metastore that 
is not necessarily thrift.


To be more precise, what I would like is a more efficient metastore.  
In looking at the current architecture, I came to a conclusion that 
there are three logical boundaries where I can inject an improved 
implementation or alternative to what Hive offers in the metastore space.


1) RawStore
I think the existing mechanism that Hive offers users to choose from 
major RDBMSes works fine.  I suppose there's still room for 
improvement here, but the impact of those improvements would be 
limited to the storage aspects of metadata.


2) Thrift server
An alternative HiveMetaStore that talks Hive Metastore Thrift.  It's 
almost a coin toss between this and #3, but I think for the reasons I 
will state below, #3 is preferable.


3) IMetaStoreClient
I feel this gives me the most freedom since I can be embedded or 
remote.  I am not tied to the Thrift interface or the RawStore 
interface, if I choose to roll my own.


One thing that does concern me is your statement about 
IMetaStoreClient being an internal interface, which is true.  Do the 
changes to this interface really happen ad-hoc?  Doesn't it evolve in 
lock step with the Thrift interface?  If so, wouldn't backward 
compatibility guarantees for Thrift translate to backward 
compatibility guarantees for this interface as well?  From the way it 
is used by Query Planning, I think it could be made an "external" 
interface that belongs in hive-metastore.



Alan Gates 
December 15, 2015 at 10:14
I don't see an issue with this, it seems fine.  One caveat though is 
we see this as an internal interface and we change it all the time.  I 
wouldn't want to be pushed into making backwards compatibility 
guarantees for IMetaStoreClient.  Which means that if you develop a 
different implementation of it outside Hive it will likely break on 
every upgrade.


I don't understand your example use case.  You can run Hive now 
without the thrift server, so I'm guessing that's not what you're 
really trying to do.  Are you just interested in building a more 
efficient implementation or do you have another use case in mind?


Alan.

Austin Lee 
December 14, 2015 at 20:48
Hi,

I would like to propose a change that would make it possible for users to
choose an implementation of IMetaStoreClient via HiveConf, i.e.
hive-site.xml. Currently, in Hive the choice is hard coded to be
SessionHiveMetaStoreClient in org.apache.hadoop.hive.ql.metadata.Hive.
There is no other direct reference to SessionHiveMetaStoreClient other than
the hard coded class name in Hive.java and the QL component operates only
on the IMetaStoreClient interface so the change would be minimal and it
would be quite similar to how an implementation of RawStore is specified
and loaded in hive-metastore. One use case this change would serve would
be one where a user wishes to use an implementation of this interface
without the dependency on the Thrift server. I would appreciate the
community's input and feedback on this proposal.

Thank you,
Austin



Re: Allow other implementations of IMetaStoreClient in Hive

2015-12-15 Thread Austin Lee
Yes, a more efficient implementation is what I am trying to achieve.  I
also want to retain the ability to talk to a remote metastore that is not
necessarily thrift.

To be more precise, what I would like is a more efficient metastore.  In
looking at the current architecture, I came to a conclusion that there are
three logical boundaries where I can inject an improved implementation or
alternative to what Hive offers in the metastore space.

1) RawStore
I think the existing mechanism that Hive offers users to choose from major
RDBMSes works fine.  I suppose there's still room for improvement here, but
the impact of those improvements would be limited to the storage aspects of
metadata.

2) Thrift server
An alternative HiveMetaStore that talks Hive Metastore Thrift.  It's almost
a coin toss between this and #3, but I think for the reasons I will state
below, #3 is preferable.

3) IMetaStoreClient
I feel this gives me the most freedom since I can be embedded or remote.  I
am not tied to the Thrift interface or the RawStore interface, if I choose
to roll my own.

One thing that does concern me is your statement about IMetaStoreClient
being an internal interface, which is true.  Do the changes to this
interface really happen ad-hoc?  Doesn't it evolve in lock step with the
Thrift interface?  If so, wouldn't backward compatibility guarantees for
Thrift translate to backward compatibility guarantees for this interface as
well?  From the way it is used by Query Planning, I think it could be made
an "external" interface that belongs in hive-metastore.

On Tue, Dec 15, 2015 at 10:14 AM, Alan Gates  wrote:

> I don't see an issue with this, it seems fine.  One caveat though is we
> see this as an internal interface and we change it all the time.  I
> wouldn't want to be pushed into making backwards compatibility guarantees
> for IMetaStoreClient.  Which means that if you develop a different
> implementation of it outside Hive it will likely break on every upgrade.
>
> I don't understand your example use case.  You can run Hive now without
> the thrift server, so I'm guessing that's not what you're really trying to
> do.  Are you just interested in building a more efficient implementation or
> do you have another use case in mind?
>
> Alan.
>
> Austin Lee 
> December 14, 2015 at 20:48
> Hi,
>
> I would like to propose a change that would make it possible for users to
> choose an implementation of IMetaStoreClient via HiveConf, i.e.
> hive-site.xml. Currently, in Hive the choice is hard coded to be
> SessionHiveMetaStoreClient in org.apache.hadoop.hive.ql.metadata.Hive.
> There is no other direct reference to SessionHiveMetaStoreClient other than
> the hard coded class name in Hive.java and the QL component operates only
> on the IMetaStoreClient interface so the change would be minimal and it
> would be quite similar to how an implementation of RawStore is specified
> and loaded in hive-metastore. One use case this change would serve would
> be one where a user wishes to use an implementation of this interface
> without the dependency on the Thrift server. I would appreciate the
> community's input and feedback on this proposal.
>
> Thank you,
> Austin
>
>


[jira] [Created] (HIVE-12676) [hive+impala] Alter table Rename to + Set location in a single step

2015-12-15 Thread Egmont Koblinger (JIRA)
Egmont Koblinger created HIVE-12676:
---

 Summary: [hive+impala] Alter table Rename to + Set location in a 
single step
 Key: HIVE-12676
 URL: https://issues.apache.org/jira/browse/HIVE-12676
 Project: Hive
  Issue Type: Improvement
  Components: hpl/sql
Reporter: Egmont Koblinger
Assignee: Dmitry Tolpeko
Priority: Minor


Assume a nonstandard table location, let's say /foo/bar/table1. You might want 
to rename from table1 to table2 and move the underlying data accordingly to 
/foo/bar/table2.

The "alter table ... rename to ..." clause alters the table name, but in the 
same step moves the data into the standard location 
/user/hive/warehouse/table2. Then a subsequent "alter table ... set location 
..." can move it back to the desired location /foo/bar/table2.

This is problematic if there is any permission problem involved, e.g. not 
being able to write to /user/hive/warehouse. So it should be possible to move 
the underlying data to its desired final place without intermediate locations 
in between.

A workaround, though probably hard to discover, is to set the table to 
external, then rename it, then set it back to internal, and then change its 
location.
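Sketched in HiveQL, the workaround looks like the following. This is illustrative only: the table names and path mirror the example above, the 'EXTERNAL' table-property trick may behave differently across Hive versions, and since "set location" only updates metadata, the directory itself still has to be moved separately (e.g. with hdfs dfs -mv).

{code}
-- Temporarily mark the table external so RENAME leaves the data in place
alter table table1 set tblproperties('EXTERNAL'='TRUE');
alter table table1 rename to table2;
-- Restore managed status, then point the table at the desired directory
alter table table2 set tblproperties('EXTERNAL'='FALSE');
alter table table2 set location '/foo/bar/table2';
{code}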

It would be great to be able to do an "alter table ... rename to ... set 
location ..." operation in a single step.





Can you help us Hive Community?

2015-12-15 Thread Igor Wiese
Hi, Hive Community.

My name is Igor Wiese, a PhD student from Brazil. I sent an email a week
ago about my research. We received some visits inspecting the results,
but no feedback was provided.

In my research I am investigating two important questions: What makes two
files change together? Can we predict when they are going to co-change
again?


I have tried to investigate these questions on the Hive project. I
collected data from issue reports, discussions and commits, and used
some machine learning techniques to build a prediction model.

I collected a total of 721 commits in which a pair of files changed
together and could correctly predict 53% of those commits. The most
useful information for predicting co-changes of files was:

- sum of number of lines of code added, modified and removed,

- number of words used to describe and discuss the issues,

- number of comments in each issue,

- median value of closeness, a social network measure obtained from
issue comments, and

- median value of effective size, a social network measure obtained
from issue comments.


To illustrate, consider the following example from our analysis. For
release 0.14, the files "metastore/MetaStoreDirectSql.java" and
"metastore/ObjectStore.java" changed together in 4 commits. In another
2 commits, only the first file changed, but not the second. Collecting
contextual information for each commit made to the first file in the
previous release, we were able to predict 4 commits in which both files
changed together in release 0.14, with 0 false positives and two wrong
predictions. For this pair of files, the most important contextual
information was the number of lines of code added; the sum of lines of
code added, removed and modified; and two social network metrics
(constraint, ties) obtained from issue comments.
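As an aside, the kind of model described above can be sketched very roughly in Python. The feature names follow the message; the values and the nearest-neighbour rule are invented placeholders for illustration, not the study's actual data or classifier.

```python
# Illustrative sketch only: feature vectors for a co-change classifier.
# One vector per commit touching the first file of a file pair:
# [loc_changed, discussion_words, n_comments, closeness, effective_size]
training = [
    ([120, 340, 5, 0.8, 2.1], True),   # the pair co-changed
    ([10,  40,  1, 0.1, 0.5], False),  # it did not
]

def predict(features, train):
    # Nearest neighbour by Euclidean distance: a minimal stand-in for
    # the machine-learning step described in the message.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(train, key=lambda t: dist(features, t[0]))[1]

# Commits resembling past co-change commits are predicted as co-changes.
assert predict([100, 300, 4, 0.7, 2.0], training) is True
assert predict([12, 50, 1, 0.2, 0.4], training) is False
```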


- Do these results surprise you? Can you think of any explanation for
the results?

- Do you think that our rate of prediction is good enough to be used
for building tool support for the software community?

- Do you have any suggestion on what can be done to improve the change
recommendation?


You can visit a webpage to inspect the results in detail:
http://flosscoach.com/index.php/17-cochanges/72-hive


All the best,
Igor Wiese

Phd Candidate


-- 
=
Igor Scaliante Wiese
PhD Candidate - Computer Science @ IME/USP
Faculty in Dept. of Computing at Universidade Tecnológica Federal do Paraná


[jira] [Created] (HIVE-12677) StackOverflowError with kryo

2015-12-15 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created HIVE-12677:
---

 Summary: StackOverflowError with kryo
 Key: HIVE-12677
 URL: https://issues.apache.org/jira/browse/HIVE-12677
 Project: Hive
  Issue Type: Bug
Reporter: Rajesh Balamohan


{noformat}

explain formatted insert overwrite table default.test  select entry_date,
regexp_replace(
regexp_replace(
regexp_replace(
regexp_replace(
regexp_replace(random_string
,"\\b(AS|A|Tours)\\b","Destination Services")
,"\\b(PPV/3PP)\\b","Third Party Package")
,"\\b(Flight)\\b","Air")
,"\\b(Rail)\\b","Train")
,"\\b(Hotel)\\b","Lodging") as rn from transactions where 
effective_date between '2015-12-01' AND '2015-12-31' limit 10;

{"STAGE DEPENDENCIES":{"Stage-1":{"ROOT STAGE":"TRUE"},"Stage-2":{"DEPENDENT 
STAGES":"Stage-1"},"Stage-0":{"DEPENDENT 
STAGES":"Stage-2"},"Stage-3":{"DEPENDENT STAGES":"Stage-0"}},"STAGE 
PLANS":{"Stage-1":{"Tez":{"Edges:":{"Reducer 2":{"parent":"Map 
1","type":"SIMPLE_EDGE"}},"DagName:":"rajesh_20151215120344_69fa6465-22ed-4fe2-83b5-20782e45d3f7:2","Vertices:":{"Map
 1":{"Map Operator 
Tree:":[{"TableScan":{"alias:":"transactions","filterExpr:":"effective_date 
BETWEEN '2015-12-01' AND '2015-12-31' (type: boolean)","Statistics:":"Num rows: 
197642628 Data size: 59095145772 Basic stats: COMPLETE Column stats: 
COMPLETE","children":{"Select Operator":{"expressions:":"entry_date (type: 
date), 
regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(random_string,
 '\\b(AS|A|Tours)\\b', 'Destination Services'), '\\b(PPV/3PP)\\b', 
'Third Party Package'), '\\b(Flight)\\b', 'Air'), '\\b(Rail)\\b', 'Train'), 
'\\b(Hotel)\\b', 'Lodging') (type: 
string)","outputColumnNames:":["_col0","_col1"],"Statistics:":"Num rows: 
197642628 Data size: 47434230720 Basic stats: COMPLETE Column stats: 
COMPLETE","children":{"Limit":{"Number of rows:":"10","Statistics:":"Num rows: 
10 Data size: 2400 Basic stats: COMPLETE Column stats: 
COMPLETE","children":{"Reduce Output Operator":{"sort 
order:":"","Statistics:":"Num rows: 10 Data size: 2400 Basic stats: COMPLETE 
Column stats: COMPLETE","TopN Hash Memory Usage:":"0.04","value 
expressions:":"_col0 (type: date), _col1 (type: string)"]},"Reducer 
2":{"Execution mode:":"vectorized","Reduce Operator Tree:":{"Select 
Operator":{"expressions:":"VALUE._col0 (type: date), VALUE._col1 (type: 
string)","outputColumnNames:":["_col0","_col1"],"Statistics:":"Num rows: 10 
Data size: 2400 Basic stats: COMPLETE Column stats: 
COMPLETE","children":{"Limit":{"Number of rows:":"10","Statistics:":"Num rows: 
10 Data size: 2400 Basic stats: COMPLETE Column stats: 
COMPLETE","children":{"Select Operator":{"expressions:":"UDFToString(_col0) 
(type: string), _col1 (type: 
string)","outputColumnNames:":["_col0","_col1"],"Statistics:":"Num rows: 10 
Data size: 3680 Basic stats: COMPLETE Column stats: COMPLETE","children":{"File 
Output Operator":{"compressed:":"false","Statistics:":"Num rows: 10 Data size: 
3680 Basic stats: COMPLETE Column stats: COMPLETE","table:":{"input 
format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output 
format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde","name:":"default.test"},"Stage-2":{"Dependency
 Collection":{}},"Stage-0":{"Move 
Operator":{"tables:":{"replace:":"true","table:":{"input 
format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output 
format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat","serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde","name:":"default.test","Stage-3":{"Stats-Aggr
 Operator":{

{noformat}

{noformat}
childOperators (org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator)
childOperators (org.apache.hadoop.hive.ql.exec.vector.VectorLimitOperator)
childOperators (org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator)
reducer (org.apache.hadoop.hive.ql.plan.ReduceWork)
at 
org.apache.hadoop.hive.ql.exec.Utilities.getBaseWork(Utilities.java:450)
at 
org.apache.hadoop.hive.ql.exec.Utilities.getReduceWork(Utilities.java:305)
at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor$1.call(ReduceRecordProcessor.java:106)
at 
org.apache.hadoop.hive.ql.exec.tez.ObjectCache.retrieve(ObjectCache.java:75)
... 16 more
Caused by: org.apache.hive.com.esotericsoftware.kryo.KryoException: 
java.lang.IllegalArgumentException: Unable to create serializer 
"org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer" for 
class: org.apache.hadoop.hive.ql.exec.vector.VectorFileSinkOperator
Serialization trace:
childOperators (org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator)
childOperators 

Re: Allow other implementations of IMetaStoreClient in Hive

2015-12-15 Thread Austin Lee
FYI - I have created the following JIRA for this:

https://issues.apache.org/jira/browse/HIVE-12679

Thanks,
Austin

On Tue, Dec 15, 2015 at 12:49 PM, Alan Gates  wrote:

> I think opening a JIRA is a good next step.
>
> Alan.
>
> Austin Lee 
> December 15, 2015 at 11:19
> Thank you so much Alan for your prompt responses and for the information
> you provided.  I will have a look at the HBase work.
>
> I am new to the process and it's not 100% clear to me, but the wiki seems
> to suggest I should use this forum to get to consensus on a proposal before
> creating a JIRA ticket.  If the "why" is clear on my proposal, I would like
> to create a JIRA ticket and take this through the rest of the process via
> JIRA.  Does that sound good?
>
> Thanks,
> Austin
>
>
> Alan Gates 
> December 15, 2015 at 11:04
> For work along the same lines you should check out the HBase metastore
> work in Hive 2.0.  It still uses the thrift server and RawStore but puts
> HBase behind it instead of an RDBMS.  We did this because we found that
> most of the inefficiencies of Hive's metadata access had to do with the
> layout of the RDBMS and the way it was accessed.  In the same work I built
> short-circuit options in to avoid using thrift and enable sharing of
> objects across HiveMetaStore and HiveMetaStoreClient.
>
> On the backwards incompatibilities, yes IMetaStoreClient evolves in lock
> step with the thrift interface.  My point was we often add calls, add new
> fields to structs, etc.  Your code would still compile in these cases, new
> features just wouldn't work.  Given that a couple major Hadoop support
> vendors now support rolling upgrades, there are devs interested in making
> sure that client version x works properly with server version x+1.
>
> Still, we don't test for the use case you are proposing so we could end up
> breaking your code without knowing it.
>
> When I said it wasn't external, I meant we did not expect end users to
> write code against it (like say the UDF interface).  Yes it's external to
> the metastore package as you point out.
>
> Alan.
>
> Austin Lee 
> December 15, 2015 at 10:46
> Yes, a more efficient implementation is what I am trying to achieve.  I
> also want to retain the ability to talk to a remote metastore that is not
> necessarily thrift.
>
> To be more precise, what I would like is a more efficient metastore.  In
> looking at the current architecture, I came to a conclusion that there are
> three logical boundaries where I can inject an improved implementation or
> alternative to what Hive offers in the metastore space.
>
> 1) RawStore
> I think the existing mechanism that Hive offers users to choose from major
> RDBMSes works fine.  I suppose there's still room for improvement here, but
> the impact of those improvements would be limited to the storage aspects of
> metadata.
>
> 2) Thrift server
> An alternative HiveMetaStore that talks Hive Metastore Thrift.  It's
> almost a coin toss between this and #3, but I think for the reasons I will
> state below, #3 is preferable.
>
> 3) IMetaStoreClient
> I feel this gives me the most freedom since I can be embedded or remote.
> I am not tied to the Thrift interface or the RawStore interface, if I
> choose to roll my own.
>
> One thing that does concern me is your statement about IMetaStoreClient
> being an internal interface, which is true.  Do the changes to this
> interface really happen ad-hoc?  Doesn't it evolve in lock step with the
> Thrift interface?  If so, wouldn't backward compatibility guarantees for
> Thrift translate to backward compatibility guarantees for this interface as
> well?  From the way it is used by Query Planning, I think it could be made
> an "external" interface that belongs in hive-metastore.
>
>
> Alan Gates 
> December 15, 2015 at 10:14
> I don't see an issue with this, it seems fine.  One caveat though is we
> see this as an internal interface and we change it all the time.  I
> wouldn't want to be pushed into making backwards compatibility guarantees
> for IMetaStoreClient.  Which means that if you develop a different
> implementation of it outside Hive it will likely break on every upgrade.
>
> I don't understand your example use case.  You can run Hive now without
> the thrift server, so I'm guessing that's not what you're really trying to
> do.  Are you just interested in building a more efficient implementation or
> do you have another use case in mind?
>
> Alan.
>
> Austin Lee 
> December 14, 2015 at 20:48
> Hi,
>
> I would like to propose a change that would make it possible for users to
> choose an implementation of IMetaStoreClient via HiveConf, i.e.
> hive-site.xml. Currently, in Hive the choice is hard coded to be
> SessionHiveMetaStoreClient in org.apache.hadoop.hive.ql.metadata.Hive.
> There is no other direct reference to SessionHiveMetaStoreClient other than
> 

[jira] [Created] (HIVE-12679) Allow users to be able to specify an implementation of IMetaStoreClient via HiveConf

2015-12-15 Thread Austin Lee (JIRA)
Austin Lee created HIVE-12679:
-

 Summary: Allow users to be able to specify an implementation of 
IMetaStoreClient via HiveConf
 Key: HIVE-12679
 URL: https://issues.apache.org/jira/browse/HIVE-12679
 Project: Hive
  Issue Type: Improvement
  Components: Configuration, Metastore, Query Planning
Reporter: Austin Lee
Assignee: Austin Lee
Priority: Minor


Hi,

I would like to propose a change that would make it possible for users to 
choose an implementation of IMetaStoreClient via HiveConf, i.e. hive-site.xml.  
Currently, the choice is hard coded in Hive to be SessionHiveMetaStoreClient, in 
org.apache.hadoop.hive.ql.metadata.Hive.  There is no other direct reference to 
SessionHiveMetaStoreClient besides the hard coded class name in Hive.java, and 
the QL component operates only on the IMetaStoreClient interface, so the change 
would be minimal and quite similar to how an implementation of RawStore is 
specified and loaded in hive-metastore.  One use case this change would serve is 
one where a user wishes to use an implementation of this interface without the 
dependency on the Thrift server.
  
Thank you,
Austin
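The reflective-loading pattern the proposal describes (resolve an implementation class by a name found in configuration, the way hive-metastore loads RawStore implementations) can be sketched as follows. The interface, class names, and config value here are illustrative stand-ins, not Hive's actual API:

```java
// Hedged sketch of configuration-driven implementation loading.
// All names below are illustrative, not Hive's real classes.
public class ClientLoader {
    public interface MetaStoreClient {
        String name();
    }

    public static class DefaultClient implements MetaStoreClient {
        public String name() { return "default"; }
    }

    // Load the class named in configuration and instantiate it via its
    // no-arg constructor, as RawStore implementations are loaded.
    static MetaStoreClient load(String className) {
        try {
            Class<?> cls = Class.forName(className);
            return (MetaStoreClient) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("cannot load " + className, e);
        }
    }

    public static void main(String[] args) {
        // In Hive, this value would come from HiveConf / hive-site.xml.
        MetaStoreClient client = load("ClientLoader$DefaultClient");
        System.out.println(client.name());
    }
}
```

Because callers depend only on the interface, a different implementation can be dropped in without touching the calling code.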



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Allow other implementations of IMetaStoreClient in Hive

2015-12-15 Thread Alan Gates

I think opening a JIRA is a good next step.

Alan.


Austin Lee 
December 15, 2015 at 11:19
Thank you so much Alan for your prompt responses and for the 
information you provided.  I will have a look at the HBase work.


I am new to the process and it's not 100% clear to me, but the wiki 
seems to suggest I should use this forum to get to consensus on a 
proposal before creating a JIRA ticket.  If the "why" is clear on my 
proposal, I would like to create a JIRA ticket and take this through 
the rest of the process via JIRA.  Does that sound good?


Thanks,
Austin


Alan Gates 
December 15, 2015 at 11:04
For work along the same lines you should check out the HBase metastore 
work in Hive 2.0.  It still uses the thrift server and RawStore but 
puts HBase behind it instead of an RDBMS.  We did this because we 
found that most of the inefficiencies of Hive's metadata access had to 
do with the layout of the RDBMS and the way it was accessed.  In the 
same work I built short-circuit options in to avoid using thrift and 
enable sharing of objects across HiveMetaStore and HiveMetaStoreClient.


On the backwards incompatibilities, yes, IMetaStoreClient evolves in 
lock step with the thrift interface.  My point was we often add calls, 
add new fields to structs, etc.  Your code would still compile in 
these cases; new features just wouldn't work.  Given that a couple of 
major Hadoop support vendors now support rolling upgrades, there are 
devs interested in making sure that client version x works properly 
with server version x+1.


Still, we don't test for the use case you are proposing so we could 
end up breaking your code without knowing it.


When I said it wasn't external, I meant we did not expect end users to 
write code against it (like say the UDF interface).  Yes it's external 
to the metastore package as you point out.


Alan.

Austin Lee 
December 15, 2015 at 10:46
Yes, a more efficient implementation is what I am trying to achieve.  
I also want to retain the ability to talk to a remote metastore that 
is not necessarily thrift.


To be more precise, what I would like is a more efficient metastore.  
In looking at the current architecture, I came to a conclusion that 
there are three logical boundaries where I can inject an improved 
implementation or alternative to what Hive offers in the metastore space.


1) RawStore
I think the existing mechanism that Hive offers users to choose from 
major RDBMSes works fine.  I suppose there's still room for 
improvement here, but the impact of those improvements would be 
limited to the storage aspects of metadata.


2) Thrift server
An alternative HiveMetaStore that talks Hive Metastore Thrift.  It's 
almost a coin toss between this and #3, but I think for the reasons I 
will state below, #3 is preferable.


3) IMetaStoreClient
I feel this gives me the most freedom since I can be embedded or 
remote.  I am not tied to the Thrift interface or the RawStore 
interface, if I choose to roll my own.


One thing that does concern me is your statement about 
IMetaStoreClient being an internal interface, which is true.  Do the 
changes to this interface really happen ad-hoc?  Doesn't it evolve in 
lock step with the Thrift interface?  If so, wouldn't backward 
compatibility guarantees for Thrift translate to backward 
compatibility guarantees for this interface as well?  From the way it 
is used by Query Planning, I think it could be made an "external" 
interface that belongs in hive-metastore.



Alan Gates 
December 15, 2015 at 10:14
I don't see an issue with this, it seems fine.  One caveat though is 
we see this as an internal interface and we change it all the time.  I 
wouldn't want to be pushed into making backwards compatibility 
guarantees for IMetaStoreClient.  Which means that if you develop a 
different implementation of it outside Hive it will likely break on 
every upgrade.


I don't understand your example use case.  You can run Hive now 
without the thrift server, so I'm guessing that's not what you're 
really trying to do.  Are you just interested in building a more 
efficient implementation or do you have another use case in mind?


Alan.

Austin Lee 
December 14, 2015 at 20:48
Hi,

I would like to propose a change that would make it possible for users to
choose an implementation of IMetaStoreClient via HiveConf, i.e.
hive-site.xml. Currently, in Hive the choice is hard coded to be
SessionHiveMetaStoreClient in org.apache.hadoop.hive.ql.metadata.Hive.
There is no other direct reference to SessionHiveMetaStoreClient other 
than

the hard coded class name in Hive.java and the QL component operates only
on the IMetaStoreClient interface so the change would be minimal and it
would be quite similar to how an implementation of RawStore is specified
and loaded in hive-metastore. One use case this change would 

[jira] [Created] (HIVE-12681) Improvements on HIVE-11107 to remove template for PerfCliDriver and update stats data file.

2015-12-15 Thread Hari Sankar Sivarama Subramaniyan (JIRA)
Hari Sankar Sivarama Subramaniyan created HIVE-12681:


 Summary: Improvements on HIVE-11107 to remove template for 
PerfCliDriver and update stats data file.
 Key: HIVE-12681
 URL: https://issues.apache.org/jira/browse/HIVE-12681
 Project: Hive
  Issue Type: Sub-task
Reporter: Hari Sankar Sivarama Subramaniyan
Assignee: Hari Sankar Sivarama Subramaniyan








[jira] [Created] (HIVE-12684) NPE in stats annotation when all values in decimal column are NULLs

2015-12-15 Thread Prasanth Jayachandran (JIRA)
Prasanth Jayachandran created HIVE-12684:


 Summary: NPE in stats annotation when all values in decimal column 
are NULLs
 Key: HIVE-12684
 URL: https://issues.apache.org/jira/browse/HIVE-12684
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.3.0, 2.0.0, 2.1.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran


When all values in a decimal column are NULL and column stats exist, the 
AnnotateWithStatistics optimization can throw an NPE. Following is the 
exception trace:

{code}
java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.stats.StatsUtils.getColStatistics(StatsUtils.java:712)
at 
org.apache.hadoop.hive.ql.stats.StatsUtils.convertColStats(StatsUtils.java:764)
at 
org.apache.hadoop.hive.ql.stats.StatsUtils.getTableColumnStats(StatsUtils.java:750)
at 
org.apache.hadoop.hive.ql.stats.StatsUtils.collectStatistics(StatsUtils.java:197)
at 
org.apache.hadoop.hive.ql.stats.StatsUtils.collectStatistics(StatsUtils.java:143)
at 
org.apache.hadoop.hive.ql.stats.StatsUtils.collectStatistics(StatsUtils.java:131)
at 
org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$TableScanStatsRule.process(StatsRulesProcFactory.java:114)
at 
org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
at 
org.apache.hadoop.hive.ql.lib.LevelOrderWalker.walk(LevelOrderWalker.java:143)
at 
org.apache.hadoop.hive.ql.lib.LevelOrderWalker.startWalking(LevelOrderWalker.java:122)
at 
org.apache.hadoop.hive.ql.optimizer.stats.annotation.AnnotateWithStatistics.transform(AnnotateWithStatistics.java:78)
at 
org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:228)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10156)
at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:225)
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:237)
at 
org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:74)
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:237)

{code}
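The failure mode can be illustrated with a hedged sketch: when every value in a decimal column is NULL, the stored min/max for the column can themselves be null, and stats code that dereferences them unconditionally throws an NPE. The method and guard below are illustrative, not Hive's actual StatsUtils code:

```java
import java.math.BigDecimal;

public class DecimalStats {
    // Guarded range computation: an all-NULL column has no min/max, so the
    // stored bounds can themselves be null and must not be dereferenced.
    static double rangeWidth(BigDecimal min, BigDecimal max) {
        if (min == null || max == null) {
            return 0.0;  // treat a column with no values as an empty range
        }
        return max.subtract(min).doubleValue();
    }

    public static void main(String[] args) {
        System.out.println(rangeWidth(null, null));                     // 0.0
        System.out.println(rangeWidth(BigDecimal.ONE, BigDecimal.TEN)); // 9.0
    }
}
```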





[jira] [Created] (HIVE-12685) Remove invalid property in common/src/test/resources/hive-site.xml

2015-12-15 Thread Wei Zheng (JIRA)
Wei Zheng created HIVE-12685:


 Summary: Remove invalid property in 
common/src/test/resources/hive-site.xml
 Key: HIVE-12685
 URL: https://issues.apache.org/jira/browse/HIVE-12685
 Project: Hive
  Issue Type: Bug
Affects Versions: 2.0.0, 2.1.0
Reporter: Wei Zheng
Assignee: Wei Zheng


Currently there is such a property as below, which is obviously wrong:
{code}
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>hive-site.xml</value>
  <description>Override ConfVar defined in HiveConf</description>
</property>
{code}





Re: Review Request 40467: HIVE-12075 analyze for file metadata

2015-12-15 Thread Sergey Shelukhin

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40467/
---

(Updated Dec. 16, 2015, 4:17 a.m.)


Review request for hive, Alan Gates and Prasanth_J.


Repository: hive-git


Description
---

see jira


Diffs (updated)
-

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 243f281 
  itests/src/test/resources/testconfiguration.properties 1e7dce3 
  metastore/if/hive_metastore.thrift bb754f1 
  metastore/src/java/org/apache/hadoop/hive/metastore/FileFormatProxy.java 
PRE-CREATION 
  metastore/src/java/org/apache/hadoop/hive/metastore/FileMetadataHandler.java 
7c3525a 
  metastore/src/java/org/apache/hadoop/hive/metastore/FileMetadataManager.java 
PRE-CREATION 
  metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 
0940fd7 
  metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java 
c5e7a5f 
  metastore/src/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java 
aa96f77 
  metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java 
23068f8 
  metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java abfe2b8 
  
metastore/src/java/org/apache/hadoop/hive/metastore/PartitionExpressionProxy.java
 ed59829 
  metastore/src/java/org/apache/hadoop/hive/metastore/RawStore.java e118a3b 
  
metastore/src/java/org/apache/hadoop/hive/metastore/filemeta/OrcFileMetadataHandler.java
 14189da 
  metastore/src/java/org/apache/hadoop/hive/metastore/hbase/HBaseReadWrite.java 
287394e 
  metastore/src/java/org/apache/hadoop/hive/metastore/hbase/HBaseStore.java 
b9509ab 
  metastore/src/java/org/apache/hadoop/hive/metastore/hbase/MetadataStore.java 
PRE-CREATION 
  
metastore/src/test/org/apache/hadoop/hive/metastore/DummyRawStoreControlledCommit.java
 c1156b3 
  
metastore/src/test/org/apache/hadoop/hive/metastore/DummyRawStoreForJdoConnection.java
 bf20e99 
  
metastore/src/test/org/apache/hadoop/hive/metastore/MockPartitionExpressionForMetastore.java
 d72bf76 
  metastore/src/test/org/apache/hadoop/hive/metastore/TestObjectStore.java 
9089d1c 
  metastore/src/test/org/apache/hadoop/hive/metastore/hbase/MockUtils.java 
983129a 
  ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 4fb6c00 
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcFileFormatProxy.java 
PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java c682df2 
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/ppr/PartitionExpressionForMetastore.java
 7cddcc9 
  ql/src/java/org/apache/hadoop/hive/ql/parse/AnalyzeCommandUtils.java 
PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsSemanticAnalyzer.java 
832a5bc 
  ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java c407aae 
  ql/src/java/org/apache/hadoop/hive/ql/parse/HiveLexer.g 1c72b1c 
  ql/src/java/org/apache/hadoop/hive/ql/parse/HiveParser.g d5051ce 
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzerFactory.java 
0affe84 
  ql/src/java/org/apache/hadoop/hive/ql/plan/CacheMetadataDesc.java 
PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/plan/DDLWork.java a4c3db1 
  ql/src/java/org/apache/hadoop/hive/ql/plan/HiveOperation.java af7e43e 
  ql/src/test/queries/clientpositive/stats_filemetadata.q PRE-CREATION 
  ql/src/test/results/clientpositive/tez/stats_filemetadata.q.out PRE-CREATION 
  shims/common/src/main/java/org/apache/hadoop/hive/io/HdfsUtils.java 
PRE-CREATION 

Diff: https://reviews.apache.org/r/40467/diff/


Testing
---


Thanks,

Sergey Shelukhin



[jira] [Created] (HIVE-12688) HIVE-11826 makes hive unusable in properly secured cluster

2015-12-15 Thread Thejas M Nair (JIRA)
Thejas M Nair created HIVE-12688:


 Summary: HIVE-11826 makes hive unusable in properly secured cluster
 Key: HIVE-12688
 URL: https://issues.apache.org/jira/browse/HIVE-12688
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.3.0, 2.0.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair


HIVE-11826 makes a change to restrict connections to the metastore to users who 
belong to groups under 'hadoop.proxyuser.hive.groups'.
That property was only meant to be a Hadoop property, which controls which 
users the hive user can impersonate. What this change does is also use that 
property to restrict who can connect to the metastore server. This is new 
functionality, not a bug fix. There is value to this functionality.

However, this change makes hive unusable in a properly secured cluster. If 
'hadoop.proxyuser.hive.hosts' is set to the proper set of hosts that run 
Metastore and Hiveserver2 (instead of a very open "*"), then users will be able 
to connect to metastore only from those hosts.








[jira] [Created] (HIVE-12680) Binary type partition column values are incorrectly serialized and deserialized

2015-12-15 Thread Venki Korukanti (JIRA)
Venki Korukanti created HIVE-12680:
--

 Summary: Binary type partition column values are incorrectly 
serialized and deserialized
 Key: HIVE-12680
 URL: https://issues.apache.org/jira/browse/HIVE-12680
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 1.2.1
Reporter: Venki Korukanti
Priority: Minor


Here are the repro steps:

{code}
CREATE TABLE kv_binary(key INT, value STRING) PARTITIONED BY (binary_part 
BINARY);
INSERT INTO TABLE kv_binary PARTITION (binary_part='somevalue') SELECT * FROM 
kv LIMIT 1;
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Job running in-process (local Hadoop)
2015-12-15 13:34:15,758 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local1142919541_0001
Loading data to table default.kv_binary partition (binary_part=[B@15871)
Partition default.kv_binary{binary_part=[B@15871} stats: [numFiles=1, 
numRows=1, totalSize=13, rawDataSize=12]
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 8192 HDFS Write: 11733 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
{code}

Partition created has java object reference as value in FileSystem:
{code}
hadoop fs -ls /user/hive/warehouse/kv_binary
Found 1 items
drwxr-xr-x   - hadoop supergroup  0 2015-12-15 13:34 
/user/hive/warehouse/kv_binary/binary_part=%5BB@15871
{code}

Selecting from the same table:
{code}
hive> SELECT * FROM kv_binary;
OK
238 val/238=[B@15871
{code}

This makes binary partitions unusable, but binary partitions don't seem to be 
commonly used. Logging the bug for tracking purposes. It seems like somewhere 
we are calling toString on a byte[].

BTW, this is working fine in Hive 1.0.0. 
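The "[B@15871" in the partition path is the default Object.toString() of a Java byte[]. A small standalone sketch of the difference between that implicit toString and an explicit decode (illustrative, not Hive's serialization code):

```java
import java.nio.charset.StandardCharsets;

public class BinaryPartitionDemo {
    public static void main(String[] args) {
        byte[] value = "somevalue".getBytes(StandardCharsets.UTF_8);

        // Implicit toString on a byte[] yields the JVM object reference,
        // e.g. "[B@15871" -- which is what ends up in the partition path.
        String wrong = value.toString();
        System.out.println(wrong.startsWith("[B@"));  // true

        // Decoding the bytes explicitly gives the intended value.
        String right = new String(value, StandardCharsets.UTF_8);
        System.out.println(right);  // somevalue
    }
}
```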





[jira] [Created] (HIVE-12682) Reducers in dynamic partitioning job spend a lot of time running hadoop.conf.Configuration.getOverlay

2015-12-15 Thread Carter Shanklin (JIRA)
Carter Shanklin created HIVE-12682:
--

 Summary: Reducers in dynamic partitioning job spend a lot of time 
running hadoop.conf.Configuration.getOverlay
 Key: HIVE-12682
 URL: https://issues.apache.org/jira/browse/HIVE-12682
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 1.2.1
Reporter: Carter Shanklin
 Attachments: reducer.png

I tested this on Hive 1.2.1, but it looks like it's still applicable to 2.0.

I ran this query:
{code}
create table flights (
…
)
PARTITIONED BY (Year int)
CLUSTERED BY (Month)
SORTED BY (DayofMonth) into 12 buckets
STORED AS ORC
TBLPROPERTIES("orc.bloom.filter.columns"="*")
;
{code}

(Taken from here: 
https://github.com/t3rmin4t0r/all-airlines-data/blob/master/ddl/orc.sql)

I profiled just the reduce phase and noticed something odd; the attached graph 
shows where time was spent during the reducer phase.

The problem seems to relate to 
https://github.com/apache/hive/blob/branch-2.0/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L903

/cc [~gopalv]
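The general shape of the problem, a configuration lookup performed once per record inside the reducer's hot loop rather than resolved once up front, can be sketched as below. Names are illustrative and this is not Hive's actual FileSinkOperator code:

```java
// Hedged sketch of a per-row configuration lookup vs. a cached value.
// A real Hadoop Configuration.get() additionally probes an overlay map,
// which is what showed up in the profile.
import java.util.HashMap;
import java.util.Map;

public class PerRowConfigLookup {
    static final Map<String, String> conf = new HashMap<>();
    static { conf.put("output.dir", "/tmp/out"); }

    // Slow shape: one config lookup per row.
    static int processSlow(int rows) {
        int n = 0;
        for (int i = 0; i < rows; i++) {
            String dir = conf.get("output.dir");  // repeated every row
            n += dir.length();
        }
        return n;
    }

    // Faster shape: resolve the value once, reuse it in the loop.
    static int processFast(int rows) {
        final String dir = conf.get("output.dir");  // hoisted out of the loop
        int n = 0;
        for (int i = 0; i < rows; i++) {
            n += dir.length();
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(processSlow(1000) == processFast(1000));  // true
    }
}
```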





[jira] [Created] (HIVE-12686) TxnHandler.checkLock(CheckLockRequest) perf improvements

2015-12-15 Thread Eugene Koifman (JIRA)
Eugene Koifman created HIVE-12686:
-

 Summary: TxnHandler.checkLock(CheckLockRequest) perf improvements
 Key: HIVE-12686
 URL: https://issues.apache.org/jira/browse/HIVE-12686
 Project: Hive
  Issue Type: Bug
  Components: Transactions
Affects Versions: 1.3.0
Reporter: Eugene Koifman
Assignee: Eugene Koifman


CheckLockRequest should include the txnid, since the caller should always know 
it (if there is a txn).
This would make the getTxnIdFromLockId() call unnecessary.

checkLock() is usually called much more often (especially at the beginning of 
the exponential back-off sequence), thus a lot of these heartbeats are overkill.

In fact, if we made the heartbeat in DbTxnManager start right after locks in 
the "W" state are inserted, the heartbeat in checkLock() would not be needed at 
all.
This would be the best solution, but we need to make sure that heartbeating is 
started appropriately in the Streaming API - currently it is not, and it 
requires the client to start heartbeating.
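A minimal sketch of the exponential back-off referred to above: early retries are frequent (which is why a heartbeat on every checkLock() call adds up), with the delay doubling up to a cap. Constants and names here are illustrative:

```java
// Illustrative exponential back-off schedule: delay doubles each attempt,
// capped at maxMs. Early attempts are the frequent ones.
public class BackoffDemo {
    static long[] delays(int attempts, long baseMs, long maxMs) {
        long[] out = new long[attempts];
        long d = baseMs;
        for (int i = 0; i < attempts; i++) {
            out[i] = d;
            d = Math.min(d * 2, maxMs);  // double, capped at maxMs
        }
        return out;
    }

    public static void main(String[] args) {
        for (long d : delays(6, 50, 1000)) {
            System.out.println(d);  // 50, 100, 200, 400, 800, 1000
        }
    }
}
```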

  





Review Request 41431: HIVE-12674 HS2 Tez session lifetime

2015-12-15 Thread Sergey Shelukhin

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41431/
---

Review request for hive, Siddharth Seth and Vikram Dixit Kumaraswamy.


Repository: hive-git


Description
---

see JIRA


Diffs
-

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 243f281 
  ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecDriver.java 971dac9 
  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezJobMonitor.java f6bc19c 
  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionPoolManager.java 
0d84340 
  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java e1a8041 
  ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java c066c7a 
  ql/src/test/org/apache/hadoop/hive/ql/exec/tez/SampleTezSessionState.java 
d55c9fe 
  ql/src/test/org/apache/hadoop/hive/ql/exec/tez/TestTezSessionPool.java 
11c0325 

Diff: https://reviews.apache.org/r/41431/diff/


Testing
---


Thanks,

Sergey Shelukhin



[jira] [Created] (HIVE-12687) LLAP Workdirs need to default to YARN local

2015-12-15 Thread Gopal V (JIRA)
Gopal V created HIVE-12687:
--

 Summary: LLAP Workdirs need to default to YARN local
 Key: HIVE-12687
 URL: https://issues.apache.org/jira/browse/HIVE-12687
 Project: Hive
  Issue Type: Bug
  Components: llap
Affects Versions: 2.0.0, 2.1.0
Reporter: Gopal V
Assignee: Gopal V


{code}
   LLAP_DAEMON_WORK_DIRS("hive.llap.daemon.work.dirs", ""
{code}

is a bad default & fails at startup if not overridden.

A better default would be to fall back onto YARN local dirs if this is not 
configured.
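The proposed fallback can be sketched as a simple resolution rule: use the configured LLAP work dirs if non-empty, otherwise fall back to the YARN local dirs instead of failing at startup. This is an illustrative sketch, not Hive's actual resolution code:

```java
// Hedged sketch: prefer the explicit LLAP setting, fall back to YARN's
// local dirs when it is unset or empty.
public class WorkDirFallback {
    static String resolveWorkDirs(String llapWorkDirs, String yarnLocalDirs) {
        if (llapWorkDirs != null && !llapWorkDirs.isEmpty()) {
            return llapWorkDirs;
        }
        return yarnLocalDirs;  // fall back instead of failing at startup
    }

    public static void main(String[] args) {
        System.out.println(resolveWorkDirs("", "/data/yarn/local"));
        System.out.println(resolveWorkDirs("/llap/work", "/data/yarn/local"));
    }
}
```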





Re: Review Request 40467: HIVE-12075 analyze for file metadata

2015-12-15 Thread Sergey Shelukhin

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40467/
---

(Updated Dec. 16, 2015, 4:12 a.m.)


Review request for hive, Alan Gates and Prasanth_J.


Repository: hive-git


Description
---

see jira


Diffs (updated)
-

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 243f281 
  itests/src/test/resources/testconfiguration.properties 1e7dce3 
  metastore/if/hive_metastore.thrift bb754f1 
  metastore/src/java/org/apache/hadoop/hive/metastore/FileFormatProxy.java 
PRE-CREATION 
  metastore/src/java/org/apache/hadoop/hive/metastore/FileMetadataHandler.java 
7c3525a 
  metastore/src/java/org/apache/hadoop/hive/metastore/FileMetadataManager.java 
PRE-CREATION 
  metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 
0940fd7 
  metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java 
c5e7a5f 
  metastore/src/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java 
aa96f77 
  metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java 
23068f8 
  metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java abfe2b8 
  
metastore/src/java/org/apache/hadoop/hive/metastore/PartitionExpressionProxy.java
 ed59829 
  metastore/src/java/org/apache/hadoop/hive/metastore/RawStore.java e118a3b 
  
metastore/src/java/org/apache/hadoop/hive/metastore/filemeta/FileMetadataHandler1.java
 PRE-CREATION 
  
metastore/src/java/org/apache/hadoop/hive/metastore/filemeta/OrcFileMetadataHandler.java
 14189da 
  metastore/src/java/org/apache/hadoop/hive/metastore/hbase/HBaseReadWrite.java 
287394e 
  metastore/src/java/org/apache/hadoop/hive/metastore/hbase/HBaseStore.java 
b9509ab 
  metastore/src/java/org/apache/hadoop/hive/metastore/hbase/MetadataStore.java 
PRE-CREATION 
  
metastore/src/test/org/apache/hadoop/hive/metastore/DummyRawStoreControlledCommit.java
 c1156b3 
  
metastore/src/test/org/apache/hadoop/hive/metastore/DummyRawStoreForJdoConnection.java
 bf20e99 
  
metastore/src/test/org/apache/hadoop/hive/metastore/MockPartitionExpressionForMetastore.java
 d72bf76 
  metastore/src/test/org/apache/hadoop/hive/metastore/TestObjectStore.java 
9089d1c 
  metastore/src/test/org/apache/hadoop/hive/metastore/hbase/MockUtils.java 
983129a 
  ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 4fb6c00 
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcFileFormatProxy.java 
PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java c682df2 
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/ppr/PartitionExpressionForMetastore.java
 7cddcc9 
  ql/src/java/org/apache/hadoop/hive/ql/parse/AnalyzeCommandUtils.java 
PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsSemanticAnalyzer.java 
832a5bc 
  ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java c407aae 
  ql/src/java/org/apache/hadoop/hive/ql/parse/HiveLexer.g 1c72b1c 
  ql/src/java/org/apache/hadoop/hive/ql/parse/HiveParser.g d5051ce 
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzerFactory.java 
0affe84 
  ql/src/java/org/apache/hadoop/hive/ql/plan/CacheMetadataDesc.java 
PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/plan/DDLWork.java a4c3db1 
  ql/src/java/org/apache/hadoop/hive/ql/plan/HiveOperation.java af7e43e 
  ql/src/test/queries/clientpositive/stats_filemetadata.q PRE-CREATION 
  ql/src/test/results/clientpositive/tez/stats_filemetadata.q.out PRE-CREATION 
  shims/common/src/main/java/org/apache/hadoop/hive/io/HdfsUtils.java 
PRE-CREATION 

Diff: https://reviews.apache.org/r/40467/diff/


Testing
---


Thanks,

Sergey Shelukhin



[jira] [Created] (HIVE-12689) Support multiple spark sessions in one Hive Session

2015-12-15 Thread Nemon Lou (JIRA)
Nemon Lou created HIVE-12689:


 Summary: Support multiple spark sessions in one Hive Session
 Key: HIVE-12689
 URL: https://issues.apache.org/jira/browse/HIVE-12689
 Project: Hive
  Issue Type: Improvement
  Components: Spark
Reporter: Nemon Lou


As discussed in HIVE-12538, in case one Hive connection is used concurrently, 
there should be more than one Spark session for that connection.
{quote}
 A hive session may "own" more than one spark session in case of asynchronous 
queries. If a spark session is live (used to run a spark job), that spark 
session will not be used to run the next job. Therefore, whenever a spark 
configuration change is detected in the Hive session, we need to mark all the 
live Spark sessions as outdated. When we get a session from the pool and the 
flag is set, we destroy it and get a new one.
{quote}
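The pooling behavior described in the quote can be sketched as follows: a configuration change marks all pooled sessions outdated, and an outdated session fetched from the pool is destroyed and replaced. All names here are illustrative, not Hive's Spark session code:

```java
// Hedged sketch of a session pool with an "outdated" flag.
import java.util.ArrayDeque;
import java.util.Deque;

public class SessionPool {
    static class Session {
        final int id;
        boolean outdated = false;
        Session(int id) { this.id = id; }
    }

    private final Deque<Session> pool = new ArrayDeque<>();
    private int nextId = 0;

    Session get() {
        Session s = pool.poll();
        if (s == null || s.outdated) {
            s = new Session(nextId++);  // destroy-and-replace when outdated
        }
        return s;
    }

    void release(Session s) { pool.push(s); }

    // Called when a Spark configuration change is detected in the Hive session.
    void markAllOutdated() {
        for (Session s : pool) s.outdated = true;
    }

    public static void main(String[] args) {
        SessionPool p = new SessionPool();
        Session a = p.get();      // fresh session, id 0
        p.release(a);
        p.markAllOutdated();      // config change detected
        Session b = p.get();      // a was outdated -> replaced by id 1
        System.out.println(a.id + " " + b.id);  // 0 1
    }
}
```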


