Re: Question regarding lock manager

2021-09-03 Thread Alan Gates
You do not need ZooKeeper to use ACID in Hive.  The first thing I would
check is that you have configured your system as described on this page:
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions.  Also,
make sure you have not set hive.lock.manager to zookeeper.

There are other features in Hive that can optionally use ZK, such as the
discovery service for HiveServer2; you'll want to make sure these connection
attempts aren't coming from there rather than from the transaction system.
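A quick way to confirm which managers a session is actually picking up is to
print the relevant properties from beeline or the Hive CLI.  This is only a
sketch of the usual ACID-related settings, not a substitute for the wiki page
above:

SET hive.support.concurrency;    -- should be true for ACID
SET hive.txn.manager;            -- should be org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
SET hive.lock.manager;           -- only consulted by the legacy lock managers; should not point at ZooKeeper here
SET hive.server2.support.dynamic.service.discovery;  -- if true, HS2 itself registers with ZooKeeper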

Alan.

On Thu, Sep 2, 2021 at 7:46 AM Antoine DUBOIS 
wrote:

>
> Hello,
> I'm trying to configure ACID hive in a kerberos environment with :
> Hadoop 3.1.4 deployed in HA considered working
> and now I'm trying to setup hive with remote metastore and ACID
> configuration.
> I may be misunderstanding what is written in the documentation
> https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions , but
> to me it clearly states that DbTxnManager does not need ZooKeeper to run
> properly.
> However, it seems it does in fact need a ZooKeeper running somewhere, as I
> see several attempts to connect to a local ZooKeeper instance:
> 2021-09-02T15:22:21,708 INFO [main-SendThread(localhost:2181)]
> client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism.
> 2021-09-02T15:22:21,708 DEBUG [main-SendThread(localhost:2181)]
> client.ZooKeeperSaslClient: creating sasl client: client=h***
> ;service=zookeeper;serviceHostname=localhost
> 2021-09-02T15:22:21,709 INFO [main-SendThread(localhost:2181)]
> zookeeper.ClientCnxn: Opening socket connection to server
> localhost/127.0.0.1:2181. Will attempt to SASL-authenticate using Login
> Context section 'HiveZooKeeperClient'
> 2021-09-02T15:22:21,710 WARN [main-SendThread(localhost:2181)]
> zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error,
> closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
>
> Could you please help me understand the documentation properly? Do I need a
> ZooKeeper instance or not when using DbTxnManager, and if so, why isn't it
> stated precisely and explicitly in the documentation?
> I hope you have the best of day.
>
> Antoine DUBOIS
>


[ANNOUNCE] Apache Hive 2.3.7 Released

2020-04-19 Thread Alan Gates
The Apache Hive team is proud to announce the release of Apache Hive
version 2.3.7.


The Apache Hive (TM) data warehouse software facilitates querying and
managing large datasets residing in distributed storage. Built on top
of Apache Hadoop (TM), it provides, among others:

* Tools to enable easy data extract/transform/load (ETL)
* A mechanism to impose structure on a variety of data formats
* Access to files stored either directly in Apache HDFS (TM) or in
other data storage systems such as Apache HBase (TM) or Amazon's S3
(TM).
* Query execution via Apache Hadoop MapReduce, Apache Tez and Apache
Spark frameworks.

For Hive release details and downloads, please visit:
https://hive.apache.org/downloads.html

Hive 2.3.7 Release Notes are available here:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12346056&styleName=Text&projectId=12310843

We would like to thank the many contributors who made this release possible.

Regards,

The Apache Hive Team


Re: metastore without hadoop

2020-04-10 Thread Alan Gates
It needs Hadoop libraries; it still uses some HDFS libraries for file
reading and password management.  It does not require a running, installed
Hadoop.  It does not ship with the required Hadoop libraries to avoid
version clashes when installed on a running Hadoop system.

Alan.

On Thu, Apr 9, 2020 at 6:35 PM Rafael Jaimes III 
wrote:

> I've heard conflicting things about what Hive Standalone Metastore
> requires.
> I untarred hive-standalone-metastore-3.0.0-bin.tar.gz and ran
> bin/start-metastore, which results in the following error:
> "Cannot find hadoop installation: $HADOOP_HOME or $HADOOP_PREFIX must be
> set or hadoop must be in the path"
> So does Hive standalone need Hadoop or is this a configuration error?
>
> Thanks in advance,
> Rafael
>


Re: If Hive Metastore is compatibility with MariaDB version 10.x.?

2020-01-17 Thread Alan Gates
Hive is tested against MariaDB 5.5, so I can't say whether it will work
against version 10.  You would need to do some testing with it to see.

Alan.

On Fri, Jan 17, 2020 at 4:29 AM Oleksiy S 
wrote:

> Hi all.
>
> Could you please help? Customer asked if Hive Metastore is compatible with
> MariaDB version 10.x. He is going to use 10.4.10-MariaDB MariaDB Server.
>
> --
> Oleksiy
>


Re: Locks with ACID: need some clarifications

2019-09-09 Thread Alan Gates
Not simultaneously.  In Hive 2 the first delete started will obtain a lock,
and the second will have to wait.  In Hive 3, the first one to commit will
win and the second will fail (at commit time).

Alan.

On Mon, Sep 9, 2019 at 10:55 AM David Morin 
wrote:

> Thanks Alan,
>
> When you say "you just can't have two simultaneous deletes in the same
> partition", does simultaneous mean within the same transaction?
> If I create 2 transactions for 2 deletes on the same table/partition, it
> works. Am I right ?
>
>
> Le lun. 9 sept. 2019 à 19:04, Alan Gates  a écrit :
>
>> In Hive 2 update and delete take what are called semi-shared locks
>> (meaning they allow shared locks through, while not allowing other
>> semi-shared locks), and insert and select take shared locks.  So you can
>> insert or select while deleting, you just can't have two simultaneous
>> deletes in the same partition.
>>
>> The reason insert can take a shared lock is because Hive does not enforce
>> uniqueness constraints, so there's no concept of overwriting an existing
>> row.  Multiple inserts can also proceed simultaneously.
>>
>> This changes in Hive 3, where update and delete also take shared locks
>> and a first committer wins strategy is employed instead.
>>
>> Alan.
>>
>> On Mon, Sep 9, 2019 at 8:29 AM David Morin 
>> wrote:
>>
>>> Hello,
>>>
>>> I use in production HDP 2.6.5 with Hive 2.1.0
>>> We use transactional tables and we try to ingest data in a streaming way
>>> (despite the fact we still use Hive 2)
>>> I've read some docs but I would like some clarifications concerning the
>>> use of Locks with transactional tables.
>>> Do we have to use locks during insert or delete ?
>>> Let's consider this pipeline:
>>> 1. create a transaction for delete and related lock (shared)
>>> 2. create delta directory + file with this new transaction (original
>>> transaction != current with original is the transaction used for the last
>>> insert)
>>> 3 Same steps 1 and 2 for Insert (except original transaction = current)
>>> 4. commit transactions
>>>
>>> Can we use Shared lock here ? Thus select queries can still be used
>>>
>>> Thanks
>>> David
>>>
>>>
>>>
>>>
>>>
>>>
>>>


Re: Locks with ACID: need some clarifications

2019-09-09 Thread Alan Gates
In Hive 2 update and delete take what are called semi-shared locks (meaning
they allow shared locks through, while not allowing other semi-shared
locks), and insert and select take shared locks.  So you can insert or
select while deleting, you just can't have two simultaneous deletes in the
same partition.

The reason insert can take a shared lock is because Hive does not enforce
uniqueness constraints, so there's no concept of overwriting an existing
row.  Multiple inserts can also proceed simultaneously.

This changes in Hive 3, where update and delete also take shared locks and
a first committer wins strategy is employed instead.
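If you want to see what the lock manager is actually holding while these
operations run, the state can be inspected from SQL; a minimal sketch (table
and partition names are placeholders):

SHOW LOCKS;                                        -- all locks currently known to the transaction manager
SHOW LOCKS my_table;                               -- locks on one table
SHOW LOCKS my_table PARTITION (ds='2019-09-09');   -- locks on one partition
SHOW TRANSACTIONS;                                 -- open and aborted transactions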

Alan.

On Mon, Sep 9, 2019 at 8:29 AM David Morin 
wrote:

> Hello,
>
> I use in production HDP 2.6.5 with Hive 2.1.0
> We use transactional tables and we try to ingest data in a streaming way
> (despite the fact we still use Hive 2)
> I've read some docs but I would like some clarifications concerning the
> use of Locks with transactional tables.
> Do we have to use locks during insert or delete ?
> Let's consider this pipeline:
> 1. create a transaction for delete and related lock (shared)
> 2. create delta directory + file with this new transaction (original
> transaction != current with original is the transaction used for the last
> insert)
> 3 Same steps 1 and 2 for Insert (except original transaction = current)
> 4. commit transactions
>
> Can we use Shared lock here ? Thus select queries can still be used
>
> Thanks
> David
>
>
>
>
>
>
>


[ANNOUNCE] Apache Hive 3.1.2 released

2019-08-27 Thread Alan Gates
The Apache Hive team is proud to announce the release of Apache Hive
version 3.1.2.

The Apache Hive (TM) data warehouse software facilitates querying and
managing large datasets residing in distributed storage. Built on top
of Apache Hadoop (TM), it provides, among others:

* Tools to enable easy data extract/transform/load (ETL)

* A mechanism to impose structure on a variety of data formats

* Access to files stored either directly in Apache HDFS (TM) or in other
  data storage systems such as Apache HBase (TM)

* Query execution via Apache Hadoop MapReduce, Apache Tez and Apache
Spark frameworks.

For Hive release details and downloads, please
visit:https://hive.apache.org/downloads.html

Hive 3.1.2 Release Notes are available
here:https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12344397&styleName=Html&projectId=12310843

We would like to thank the many contributors who made this release
possible.

Regards,

The Apache Hive Team


[ANNOUNCE] Apache Hive 2.3.6 Released

2019-08-23 Thread Alan Gates
The Apache Hive team is proud to announce the release of Apache Hive
version 2.3.6.

The Apache Hive (TM) data warehouse software facilitates querying and
managing large datasets residing in distributed storage. Built on top
of Apache Hadoop (TM), it provides, among others:
* Tools to enable easy data extract/transform/load (ETL)
* A mechanism to impose structure on a variety of data formats
* Access to files stored either directly in Apache HDFS (TM) or in other
  data storage systems such as Apache HBase (TM)
* Query execution via Apache Hadoop MapReduce, Apache Tez and Apache Spark
frameworks.

For Hive release details and downloads, please visit:
https://hive.apache.org/downloads.html

Hive 2.3.6 Release Notes are available here:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12345603&styleName=Text&projectId=12310843

We would like to thank the many contributors who made this release
possible.

Regards,

The Apache Hive Team


Re: Question on Hive metastore thrift uri

2019-06-25 Thread Alan Gates
It depends on how you configure the system.  If you are using HS2 you can
configure it to talk directly to the metastoredb (by providing it with the
JDBC connection information and setting the metastore thrift url to
localhost) or to talk through a metastore server instance (by not providing
the JDBC credentials to the metastore db and configuring the thrift url
with the address of the metastore server).

Alan.

On Mon, Jun 24, 2019 at 9:13 PM reetika agrawal <
agrawal.reetika...@gmail.com> wrote:

> Hi,
>
> I have a question: how does the Hive connection happen when we connect
> using the Hive metastore thrift URI?
> Does it go through hiveserver2 -> metastore -> metastore db, or does it
> connect directly to metastore -> metastore db? If someone could help me
> understand this, that would be great.
>
> --
> Thanks,
> Reetika Agrawal
>


Re: Restrict users from creating tables in default warehouse

2019-06-06 Thread Alan Gates
The easiest way to do this is through grant and revoke statements and/or
file permissions.  Hive has several authorization schemes (storage based
auth, sql standard auth, integration with Ranger and Sentry) added over
several releases.  Which version of Hive are you using and which, if any,
of these authorization schemes?

Alan.

On Thu, Jun 6, 2019 at 4:54 PM Mainak Ghosh  wrote:

> Hello,
>
> We are trying to restrict our customers from creating new tables in the
> default warehouse and encourage them to create their own warehouses for
> simpler maintenance. Can you suggest some ways we can achieve this?
>
> Thanks and Regards,
> Mainak


Re: Hive Insert and Select only specific columns ( not all columns ) - Partitioned table

2019-05-30 Thread Alan Gates
You need to provide a value for the deptno partition key.  You can't insert
into a partitioned table without providing a value for the partition
column.  You can either give it a static value:
insert into table emp_parquet partition (deptno = 'x') select empno, ename
from emp

or you can set it dynamically based on a value in emp (if such a value
exists)
insert into table emp_parquet partition (deptno) select empno, ename,
deptno from emp;
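A slightly fuller sketch of the dynamic-partition variant, with the session
settings dynamic partitioning usually needs.  Note that Hive matches the
select list to the target columns by position, so if emp_parquet also has a
sal column you have to account for it (here with a NULL); the types are
assumptions based on this thread:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE emp_parquet PARTITION (deptno)
SELECT empno,
       ename,
       CAST(NULL AS INT) AS sal,   -- cast the NULL to whatever type sal really is
       deptno                      -- the dynamic partition column goes last
FROM emp;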

Alan.

On Wed, May 29, 2019 at 9:16 PM Raviprasad T  wrote:

> Is it possible to Insert only specific columns in  Hive   ( Partitioned
> tables ) ?
>
> For Non partitioned tables, it is working as below.
>
> insert into emp (empno,sal) select empno,sal from emp_hist;   ---  This
> is working  ( the emp_hist table has empno, ename, sal, and job_desc columns,
> and I am able to insert only the empno and sal columns )
>
> -- Forwarded message -
> From: Raviprasad T 
> Date: Mon, May 27, 2019 at 11:00 PM
> Subject: Hive Insert and Select only specific columns ( not all columns )
> - Partitioned table
> To: 
>
>
> *emp   ( Hive emp table,  Non partitioned table )*
>
> empno
> ename
> sal
> deptno
>
>
> *emp_parquet*   ( Hive  emp_parquet table  Partitioned by deptno )
>
> empno
> ename
> sal
> Partition by deptno
>
> I have two tables:  emp  ( text file format )   and emp_parquet  (
> Parquet file format )
>
> Both have the same columns, and emp_parquet is the partitioned table.
>
> I want to select only *specific columns*   ( not all columns )  from the emp
> table  and insert into the emp_parquet ( partitioned ) table.
>
> *insert into table emp_parquet partition by (deptno)  select empno, ename
> from emp;*
>
> Please note :  I want to insert only empno, ename   and   NOT  sal column
> in  emp_parquet table.  ( Partitioned table ).
> I have tried this with a non-partitioned table, and it was working fine.
>
> Regards
> Ravi
>
>
>
>
> --
> --
> Regards,
> RAVI PRASAD. T
>
>
> --
> --
> Regards,
> RAVI PRASAD. T
>
>
>


Re: hcatalog and hiveserver2

2019-05-24 Thread Alan Gates
HCatalog was built as an interface to allow tools such as Pig and MapReduce
to access Hive tabular data, for both read and write.  In more recent
versions of Hive, HCatalog has not been updated to support the newest
features, such as reading or writing transactional data or, in Hive 3.x,
accessing managed tables (that is, tables that Hive owns).

HiveServer2 is an ODBC/JDBC server for Hive.  There is no relationship between
HiveServer2 and HCatalog.

Hive also has a metastore, a data catalog that tracks metadata for Hive
tables.  This can be run as a separate service, in which case it is often
referred to as HMS, or embedded into another system.  For example in the
past HiveServer2 was often configured to embed the metastore
functionality.  HCatalog communicates with the metastore to determine what
physical storage objects (files or objects in an object store) make up a
table or partition that the non-Hive user wants to interact with.
Traditionally Spark communicates directly with the Hive metastore (I
believe it can either embed the metastore functionality or communicate with
an external HMS, but I'm not sure) and then operates directly on the
underlying files or objects. This no longer works in Hive 3, and there are
other ways to connect the two, which I can go into if you're interested.

Alan.


On Fri, May 24, 2019 at 1:28 AM 崔苗(未来技术实验室) <0049003...@znv.com> wrote:

> Hi,
> We have some confusion about Hive:
> 1. What is the difference between HCatalog and HiveServer2? Does
> HiveServer2 rely on HCatalog?
> 2. Where do HCatalog and HiveServer2 sit in the overall Hive
> architecture?
> 3. How does Spark SQL read Hive tables: through HCatalog or HiveServer2?
>
> Thanks for any replies
>
> 0049003208
> 0049003...@znv.com
>
>


Re: Consuming delta from Hive tables

2019-05-20 Thread Alan Gates
On Sun, May 19, 2019 at 11:21 PM Bhargav Bipinchandra Naik (Seller
Platform-BLR)  wrote:

> Hi Alan,
>
>
> Are write_ids monotonically increasing?
>
They are assigned monotonically, but the transactions they are a part of
may commit at different times, so you can't use it as a low water mark.
That is, if you looked at the state of the table at time t1 and saw that
write_id1 and write_id3 had been committed, it does not mean that there
won't be a write_id2 the next time you look, as the transaction for
write_id2 could have started before the transaction for write_id3 but
finished after.


> Are write_ids accessible in the hive query?
>
For e.g.:
> select * from table_name where write_id > N;
>
No.  For full ACID (ORC) tables the write_id is part of a pseudo-column
struct called row__id (not to be confused with the row_id mentioned before,
sorry we overloaded the term).  For insert only ACID (Non-ORC tables) the
write id is inferred from the filename.  In both cases the metastore
doesn't know about these columns, and thus I believe will fail the query
saying "no such column".
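For what it's worth, the struct itself can be selected, even though a bare
write_id column cannot; a hedged sketch against a full ACID (ORC) table, with
a hypothetical table name and the field names I'd expect in Hive 3:

SELECT t.row__id.writeid, t.row__id.bucketid, t.row__id.rowid, t.*
FROM my_acid_table t;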

Alan.

>
> Basically I am trying to understand if I can use write_id to consume only
> updated rows.
> Store the maximum write_id(X) seen in the result and next time query for
> all rows with row_id greater than X.
>
> Thanks,
> Bhargav
>
> On Fri, May 17, 2019 at 10:37 PM Alan Gates  wrote:
>
>> Sorry, looks like you sent this earlier and I missed it.
>>
>> A couple of things.  One, write_id is per transaction per table.  So for
>> table T, all rows written in w1 will have the same write_id, though they
>> will each have their own monotonically increasing row_ids.  Row_ids are
>> scoped by a write_id, so if both w1 and w2 insert 100 rows, w1 would have
>> write_id 1, and row_ids 0-99 while w2's rows would have write_id 2 and
>> row_ids 0-99.
>>
>> Two, If w1 and w2 both attempted to update or delete (not insert) records
>> from the same partition of table T, then w1 would fail at commit time
>> because it would see that w2 had already committed and there's a possible
>> conflict.  This avoids lost updates and deleted records magically
>> reappearing.
>>
>> Alan.
>>
>> On Fri, May 17, 2019 at 4:44 AM Bhargav Bipinchandra Naik (Seller
>> Platform-BLR)  wrote:
>>
>>> Is the following scenario supported?
>>>
>>> *timestamp:* t1 < t2 < t3 < t4 < t5 < t6
>>>
>>> *w1 -* transaction which updates subset of rows in table T {start_time:
>>> t1, end_time: t5}
>>> *w2 -* transaction which updates subset of rows in table T {start_time:
>>> t2, end_time: t3}
>>> *r1 - *job which reads rows from table T {start_time: t4}
>>> *r2 - *job which reads rows from table T {start_time: t6}
>>>
>>> - Is the write_id strictly increasing number across rows?
>>> - Is the write_id a version number per row and not a global construct?
>>> - Will the subset of rows updated by w1 have write_ids greater than
>>> write_ids of row updated by w2?
>>>
>>> Say if job r1 consumed the data at t4 had maximum write_id 100.
>>> Will rows updated by job w1 (end_time: t5) always have write_id > 100?
>>>
>>> Basically I need some kind of checkpoint using which the next run of the
>>> read job can read only the data updated since the checkpoint.
>>>
>>> Thanks,
>>> -Bhargav
>>>
>>>
>>>
>>>
>>>


Re: Consuming delta from Hive tables

2019-05-17 Thread Alan Gates
Sorry, looks like you sent this earlier and I missed it.

A couple of things.  One, write_id is per transaction per table.  So for
table T, all rows written in w1 will have the same write_id, though they
will each have their own monotonically increasing row_ids.  Row_ids are
scoped by a write_id, so if both w1 and w2 insert 100 rows, w1 would have
write_id 1, and row_ids 0-99 while w2's rows would have write_id 2 and
row_ids 0-99.

Two, If w1 and w2 both attempted to update or delete (not insert) records
from the same partition of table T, then w1 would fail at commit time
because it would see that w2 had already committed and there's a possible
conflict.  This avoids lost updates and deleted records magically
reappearing.

Alan.

On Fri, May 17, 2019 at 4:44 AM Bhargav Bipinchandra Naik (Seller
Platform-BLR)  wrote:

> Is the following scenario supported?
>
> *timestamp:* t1 < t2 < t3 < t4 < t5 < t6
>
> *w1 -* transaction which updates subset of rows in table T {start_time:
> t1, end_time: t5}
> *w2 -* transaction which updates subset of rows in table T {start_time:
> t2, end_time: t3}
> *r1 - *job which reads rows from table T {start_time: t4}
> *r2 - *job which reads rows from table T {start_time: t6}
>
> - Is the write_id strictly increasing number across rows?
> - Is the write_id a version number per row and not a global construct?
> - Will the subset of rows updated by w1 have write_ids greater than
> write_ids of row updated by w2?
>
> Say if job r1 consumed the data at t4 had maximum write_id 100.
> Will rows updated by job w1 (end_time: t5) always have write_id > 100?
>
> Basically I need some kind of checkpoint using which the next run of the
> read job can read only the data updated since the checkpoint.
>
> Thanks,
> -Bhargav
>
>
>
>
>


Re: Any HIVE DDL statement takes minutes to execute

2019-05-16 Thread Alan Gates
I can think of two things that could take a long time in creating a table,
database operations or file system operations.  The perf timers inside the
metastore only measure the entire metadata operation, not the file part and
the db part, so it will be hard to tell where the time is being spent.
When a table is first created the metastore prints a debug message to the
logs that says "create_table" (you have to have logging set to DEBUG to
see this).  This will tell you when the metastore started processing the
create table.  Between creating the directory for the table and connecting
to the RDBMS to create the entry for it, the createtime for the table is
set.  A describe table extended should show you the create time of the
table (or you can directly query the TBLS table in the RDBMS to find it as
well).  Finally, when the metastore is done creating the table there is
another entry in the log that starts with "create_table".  All three of
these timestamps are generated on the same machine, so clock syncing won't
be an issue.  These three timestamps should give you an idea of whether the
majority of the time is being spent creating the directory or creating an
entry in the database.
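For reference, these are roughly the statements I'd use to pull out the
table's create time.  The first is ordinary HiveQL; the second assumes a
MySQL-backed metastore with the stock schema:

DESCRIBE EXTENDED mydb.mytable;   -- createTime appears in the detailed table information

-- run against the metastore RDBMS itself (MySQL syntax); CREATE_TIME is seconds since the epoch
SELECT TBL_NAME, FROM_UNIXTIME(CREATE_TIME) AS created_at
FROM TBLS
WHERE TBL_NAME = 'mytable';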

Which logs you need to look in to find the debug statements depends on
whether you have a separate Hive Metastore Thrift service running or you
have Hive Server2 directly communicating with the RDBMS.

Alan.

On Thu, May 16, 2019 at 1:42 AM Iulian Mongescu 
wrote:

> Hi Mich,
>
>
>
> First thank  you for taking the time to look over this problem. Now
> regarding the questions :
>
>
>
>1. I can confirm there are no locks on metastore DB ;
>2. About duration of the queries, in my previous mail I just gave some
>examples and I  can confirm that I run those queries on the metastore db
>server and also from the hive node that I’m using to test and the results
>are similar, almost instant response on all queries;
>3. And yes, this apply only on DDL statements and is constant problem,
>not a random delay;
>4. Regarding the network communication blocking, there is no firewall
>or a network performance issue between hive node and metastore db. As I
>said at the previous point, I run all queries also manually using mysql cmd
>client from the hive node and the response was almost instant;
>
>
>
> Thank you,
>
> Iulian
>
>
>
> *From:* Mich Talebzadeh 
> *Sent:* Thursday, May 16, 2019 11:20 AM
> *To:* user 
> *Subject:* Re: Any HIVE DDL statement takes minutes to execute
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> I don't know much about MySQL but assuming it has tools to see the
> activity in the back end, what locks are you seeing in the database itself
> plus the duration of time that the command is executed on RDBMS etc.
>
>
>
> Does this only apply to the DDL statements?
>
>
>
> It is either some locking/blocking in the back end or the network
> connection between your Hadoop and the RDBMS causing the issue
>
>
>
> I just tested DDL for external table in Hive through Oracle database and
> there was no issue.
>
>
>
> HTH
>
>
>
>
>
> On Thu, 16 May 2019 at 08:16, Iulian Mongescu 
> wrote:
>
> Hi Alan,
>
>
>
> I’m using MySQL (Mariadb) for the metastore and I was thinking on this
> possibility too but from all my tests on metastore database that I run,
> every query is almost instant.
>
> For example :
>
> SELECT * FROM `TBLS`  ->  Query took 0.0001 seconds.
>
> INSERT INTO `TBLS` ->  Query took 0.0020 seconds
>
> DELETE FROM `TBLS` -> Query took 0.0021 seconds
>
>
>
> Thank you,
>
> Iulian
>
>
>
> *From:* Alan Gates 
> *Sent:* Wednesday, May 15, 2019 9:51 PM
> *To:* user@hive.apache.org
> *Subject:* Re: Any HIVE DDL statement takes minutes to execute
>
>
>
> What are you using as the RDBMS for your metastore?  A first place I'd
> look is if the communications with the RDBMS are slow for some reason.
>
>
>
> Alan.
>
>
>
> On Wed, May 15, 2019 at 10:34 AM Iulian Mo

Re: Any HIVE DDL statement takes minutes to execute

2019-05-15 Thread Alan Gates
What are you using as the RDBMS for your metastore?  A first place I'd look
is if the communications with the RDBMS are slow for some reason.

Alan.

On Wed, May 15, 2019 at 10:34 AM Iulian Mongescu 
wrote:

> Hello,
>
>
>
> I'm working on a HDP-2.6.5.0 cluster with kerberos enabled and I have a
> problem with hive as any DDL statement that I run takes minutes to execute
> but any DML run in normal limits. I checked the logs but I didn’t find
> anything that seems related with this problem and I would appreciate any
> help to debug this issue.
>
>
>
> Please find below some examples of DDL & DML queries and their durations:
>
>
>
> 0: jdbc:hive2://hdpx03:1/> CREATE EXTERNAL TABLE IF NOT EXISTS agenti1
> (...) STORED AS ORC LOCATION
> '/staging/core/agenti/2019-03-18/29d52a54eecae3731b31a3d6ef45d012';
> No rows affected (184.191 seconds)
>
> 0: jdbc:hive2://hdpx03:1/> show tables;
> +-----------+
> | tab_name  |
> +-----------+
> | agenti1   |
> +-----------+
> 1 row selected (0.358 seconds)
>
> 0: jdbc:hive2://hdpx03:1/> select count(*) as total from agenti1 where 1;
> INFO : Tez session hasn't been created yet. Opening session
> INFO : Dag name: select count(*) as total from agenti1 wh...1(Stage-1)
> INFO : Status: Running (Executing on YARN cluster with App id
> application_1552674174918_0002)
>
> VERTICES       STATUS     TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> Map 1 ........ SUCCEEDED      1          1        0        0       0       0
> Reducer 2 .... SUCCEEDED      1          1        0        0       0       0
> VERTICES: 02/02 [==>>] 100% ELAPSED TIME: 5.48 s
>
> +--------+
> | total  |
> +--------+
> | 1960   |
> +--------+
> 1 row selected (15.853 seconds)
>
> 0: jdbc:hive2://hdpx03:1/> drop table agenti1;
> No rows affected (184.164 seconds)
>
> 0: jdbc:hive2://hdpx03:1/> CREATE EXTERNAL TABLE IF NOT EXISTS agenti1
> (...) STORED AS ORC LOCATION
> '/staging/core/agenti/2019-03-18/29d52a54eecae3731b31a3d6ef45d012';
> No rows affected (190.288 seconds)
>
>
>
> Thanks,
>
>
>
> Iulian
>
>
>


[ANNOUNCE] Apache Hive 2.3.5 Released

2019-05-15 Thread Alan Gates
The Apache Hive team is proud to announce the release of Apache Hive
version 2.3.5.

The Apache Hive (TM) data warehouse software facilitates querying and
managing large datasets residing in distributed storage. Built on top
of Apache Hadoop (TM), it provides, among others:

* Tools to enable easy data extract/transform/load (ETL)

* Interactive query over terabytes sized datasets

* A mechanism to impose structure on a variety of data formats

* Access to files stored either directly in Apache HDFS (TM) or in other
  data storage systems such as Apache HBase (TM)

* Query execution via Apache Hadoop MapReduce, Apache Tez and Apache Spark
frameworks.

For Hive release details and downloads, please visit:
https://hive.apache.org/downloads.html

Hive 2.3.5 Release Notes are available here:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12345394&styleName=Text&projectId=12310843

We would like to thank the many contributors who made this release
possible.

Regards,

The Apache Hive Team


Re: Consuming delta from Hive tables

2019-05-06 Thread Alan Gates
The other issue is an external system has no ability to control when the
compactor is run (it rewrites deltas into the base files and thus erases
intermediate states that would interest you).  The mapping of writeids
(table specific) to transaction ids (system wide) is also cleaned
intermittently, again erasing history.  And there's no way to get the
mapping from writeids to transaction ids from outside of Hive.
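You can at least observe (though not control) what the compactor and the
transaction manager are doing from SQL; a small sketch:

SHOW COMPACTIONS;    -- queued, running, and recently finished compactions per table/partition
SHOW TRANSACTIONS;   -- open and aborted transactions currently tracked by the metastore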

Alan.

On Mon, May 6, 2019 at 6:23 AM Bhargav Bipinchandra Naik (Seller
Platform-BLR)  wrote:

> We have a scenario where we want to consume only delta updates from Hive
> tables.
> - Multiple producers are updating data in Hive table
> - Multiple consumer reading data from the Hive table
>
> Consumption pattern:
> - Get all data that has been updated since last time I read.
>
> Is there any mechanism in Hive 3.0 which can enable above consumption
> pattern?
>
> I see there is a construct of row__id(writeid, bucketid, rowid) in ACID
> tables.
> - Can row__id be used in this scenario?
> - How is the "writeid" generated?
> - Is there some meta information which captures the time when the rows
> were actually visible for read?
>


Re: HS2: Permission denied for my own table?

2019-04-17 Thread Alan Gates
See
https://cwiki.apache.org/confluence/display/Hive/Setting+up+HiveServer2#SettingUpHiveServer2-Impersonation

Alan.

On Tue, Apr 16, 2019 at 10:03 PM Kaidi Zhao  wrote:

> Hello!
>
> Did I miss anything here or it is an known issue? Hive 1.2.1, hadoop
> 2.7.x, kerberos, impersonation.
>
> Using hive client, create a hive db and hive table. I can select from this
> table correctly.
> In hdfs, change the table folder's permission to be 711. In hive client, I
> can still select from the table.
> However, if using beeline client (which talks to HS2 I believe), it
> complains about can't read the table folder in hdfs, something like:
>
> Error: Error while compiling statement: FAILED: SemanticException Unable
> to fetch table fact_app_logs. java.security.AccessControlException:
> Permission denied: user=hive, access=READ,
> inode="/data/mydb.db/my_table":myuser:mygroup:drwxr-x--x
> at
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:307)
> at
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:220)
> at
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1752)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1736)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1710)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8220)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1932)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2218)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2214)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1760)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2212)
> (state=42000,code=4)
>
> Note, from the log, it says it tries to use user "hive" (instead of my own
> user "myuser") to read the table's folder (the folder is only readable by
> its owner - myuser)
> Again, using hive client I can read the table, but using beeline it can't.
> If I change the folder's permission to 755, then it works.
>
> Why beeline / HS2 needs to use "hive" to read the table's folder?
>
> Thanks in advance.
>
> Kaidi
>
>
>


Re: How to update Hive ACID tables in Flink

2019-03-12 Thread Alan Gates
That's the old (Hive 2) version of ACID.  In the newer version (Hive 3)
there's no update, just insert and delete (update is insert + delete).  If
you're working against Hive 2 what you have is what you want.  If you're
working against Hive 3 you'll need the newer stuff.

Alan.

On Tue, Mar 12, 2019 at 12:24 PM David Morin 
wrote:

> Thanks Alan.
> Yes, the problem in fact was that this streaming API does not handle
> update and delete.
> I've used native Orc files and the next step I've planned to do is the use
> of ACID support as described here: https://orc.apache.org/docs/acid.html
> The INSERT/UPDATE/DELETE seems to be implemented:
> OPERATION | SERIALIZATION
> INSERT    | 0
> UPDATE    | 1
> DELETE    | 2
> Do you think this approach is suitable ?
>
>
>
> Le mar. 12 mars 2019 à 19:30, Alan Gates  a écrit :
>
>> Have you looked at Hive's streaming ingest?
>> https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest
>> It is designed for this case, though it only handles insert (not update),
>> so if you need updates you'd have to do the merge as you are currently
>> doing.
>>
>> Alan.
>>
>> On Mon, Mar 11, 2019 at 2:09 PM David Morin 
>> wrote:
>>
>>> Hello,
>>>
>>> I've just implemented a pipeline based on Apache Flink to synchronize data 
>>> between MySQL and Hive (transactional + bucketized) onto HDP cluster. Flink 
>>> jobs run on Yarn.
>>> I've used Orc files but without ACID properties.
>>> Then, we've created external tables on these hdfs directories that contain
>>> these delta Orc files.
>>> Then, MERGE INTO queries are executed periodically to merge data into the
>>> Hive target table.
>>> It works pretty well but we want to avoid the use of these Merge queries.
>>> How can I update Orc files directly from my Flink job ?
>>>
>>> Thanks,
>>> David
>>>
>>>


Re: Read Hive ACID tables in Spark or Pig

2019-03-12 Thread Alan Gates
If you want to read those tables directly in something other than Hive,
yes, you need to get the valid writeid list for each table you're reading
from the metastore.  If you want to avoid merging data in, take a look at
Hive's streaming ingest, which allows you to ingest data into Hive without
merges, though it doesn't support update, only insert.
https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest

Alan.

On Mon, Mar 11, 2019 at 9:14 AM David Morin 
wrote:

> Hi,
>
> I've just implemented a pipeline to synchronize data between MySQL and
> Hive (transactional + bucketized) onto HDP cluster.
> I've used Orc files but without ACID properties.
> Then, we've created external tables on these hdfs directories that contain
> these delta Orc files.
> Then, MERGE INTO queries are executed periodically to merge data into the
> Hive target table.
> It works pretty well but we want to avoid the use of these Merge queries.
> It's not really clear at the moment. But thanks for your links. I'm going
> to delve into that point.
> To summarize, if I want to avoid these queries, I have to get the valid
> transaction for each table from Hive Metastore and, then, read all related
> files.
> Is it correct ?
>
> Thanks,
> David
>
>
> Le dim. 10 mars 2019 à 01:45, Nicolas Paris  a
> écrit :
>
>> Thanks Alan for the clarifications.
>>
>> Hive has made such improvements it has lost its old friends in the
>> process. Hope one day all the friends speak together again: pig, spark,
>> presto read/write ACID together.
>>
>> On Sat, Mar 09, 2019 at 02:23:48PM -0800, Alan Gates wrote:
>> > There's only been one significant change in ACID that requires different
>> > implementations.  In ACID v1 delta files contained inserts, updates, and
>> > deletes.  In ACID v2 delta files are split so that inserts are placed
>> in one
>> > file, deletes in another, and updates are an insert plus a delete.
>> This change
>> > was put into Hive 3, so you have to upgrade your ACID tables when
>> upgrading
>> > from Hive 2 to 3.
>> >
>> > You can see info on ACID v1 at
>> https://cwiki.apache.org/confluence/display/Hive
>> > /Hive+Transactions
>> >
>> > You can get a start understanding ACID v2 with
>> https://issues.apache.org/jira/
>> > browse/HIVE-14035  This has design documents.  I don't guarantee the
>> > implementation completely matches the design, but you can at least get
>> an idea
>> > of the intent and follow the JIRA stream from there to see what was
>> > implemented.
>> >
>> > Alan.
>> >
>> > On Sat, Mar 9, 2019 at 3:25 AM Nicolas Paris 
>> wrote:
>> >
>> > Hi,
>> >
>> > > The issue is that outside readers don't understand which records
>> in
>> > > the delta files are valid and which are not. Theoretically all
>> this
>> > > is possible, as outside clients could get the valid transaction
>> list
>> > > from the metastore and then read the files, but no one has done
>> this
>> > > work.
>> >
>> > I guess each hive version (1,2,3) differ in how they manage delta
>> files
>> > isn't ? This means pig or spark need to implement 3 different ways
>> of
>> > dealing with hive.
>> >
>> > Is there any documentation that would help a developper to implement
>> > those specific connectors ?
>> >
>> > Thanks
>> >
>> >
>> > On Wed, Mar 06, 2019 at 09:51:51AM -0800, Alan Gates wrote:
>> > > Pig is in the same place as Spark, that the tables need to be
>> compacted
>> > first.
>> > > The issue is that outside readers don't understand which records
>> in the
>> > delta
>> > > files are valid and which are not.
>> > >
>> > > Theoretically all this is possible, as outside clients could get
>> the
>> > valid
>> > > transaction list from the metastore and then read the files, but
>> no one
>> > has
>> > > done this work.
>> > >
>> > > Alan.
>> > >
>> > > On Wed, Mar 6, 2019 at 8:28 AM Abhishek Gupta <
>> abhila...@gmail.com>
>> > wrote:
>> > >
>> > > Hi,
>> > >
>> > > Does Hive ACID tables for Hive version 1.2 posses the
>> capability of
>> > being
>> > > read into Apache Pig using HCatLoader or Spark using
>> SQLContext.
>> > > For Spark, it seems it is only possible to read ACID tables
>> if the
>> > table is
>> > > fully compacted i.e no delta folders exist in any partition.
>> Details
>> > in the
>> > > following JIRA
>> > >
>> > > https://issues.apache.org/jira/browse/SPARK-15348
>> > >
>> > > However I wanted to know if it is supported at all in Apache
>> Pig to
>> > read
>> > > ACID tables in Hive
>> > >
>> >
>> > --
>> > nicolas
>> >
>>
>> --
>> nicolas
>>
>


Re: How to update Hive ACID tables in Flink

2019-03-12 Thread Alan Gates
Have you looked at Hive's streaming ingest?
https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest
It is designed for this case, though it only handles insert (not update),
so if you need updates you'd have to do the merge as you are currently
doing.
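If you do keep a merge step, the usual shape is a single MERGE against the
ACID target driven by the staged deltas.  A hedged sketch with hypothetical
table and column names, assuming an op flag marks deletes in the staging data:

MERGE INTO target_acid AS t
USING staging_deltas AS s
ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET col1 = s.col1, col2 = s.col2
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.col1, s.col2);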

Alan.

On Mon, Mar 11, 2019 at 2:09 PM David Morin 
wrote:

> Hello,
>
> I've just implemented a pipeline based on Apache Flink to synchronize data 
> between MySQL and Hive (transactional + bucketized) onto HDP cluster. Flink 
> jobs run on Yarn.
> I've used Orc files but without ACID properties.
> Then, we've created external tables on these hdfs directories that contain
> these delta Orc files.
> Then, MERGE INTO queries are executed periodically to merge data into the
> Hive target table.
> It works pretty well but we want to avoid the use of these Merge queries.
> How can I update Orc files directly from my Flink job ?
>
> Thanks,
> David
>
>


Re: Read Hive ACID tables in Spark or Pig

2019-03-09 Thread Alan Gates
There's only been one significant change in ACID that requires different
implementations.  In ACID v1 delta files contained inserts, updates, and
deletes.  In ACID v2 delta files are split so that inserts are placed in
one file, deletes in another, and updates are an insert plus a delete.
This change was put into Hive 3, so you have to upgrade your ACID tables
when upgrading from Hive 2 to 3.

You can see info on ACID v1 at
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions

You can get a start understanding ACID v2 with
https://issues.apache.org/jira/browse/HIVE-14035  This has design
documents.  I don't guarantee the implementation completely matches the
design, but you can at least get an idea of the intent and follow the JIRA
stream from there to see what was implemented.

Alan.

On Sat, Mar 9, 2019 at 3:25 AM Nicolas Paris 
wrote:

> Hi,
>
> > The issue is that outside readers don't understand which records in
> > the delta files are valid and which are not. Theoretically all this
> > is possible, as outside clients could get the valid transaction list
> > from the metastore and then read the files, but no one has done this
> > work.
>
> I guess each hive version (1,2,3) differ in how they manage delta files
> isn't ? This means pig or spark need to implement 3 different ways of
> dealing with hive.
>
> Is there any documentation that would help a developper to implement
> those specific connectors ?
>
> Thanks
>
>
> On Wed, Mar 06, 2019 at 09:51:51AM -0800, Alan Gates wrote:
> > Pig is in the same place as Spark, that the tables need to be compacted
> first.
> > The issue is that outside readers don't understand which records in the
> delta
> > files are valid and which are not.
> >
> > Theoretically all this is possible, as outside clients could get the
> valid
> > transaction list from the metastore and then read the files, but no one
> has
> > done this work.
> >
> > Alan.
> >
> > On Wed, Mar 6, 2019 at 8:28 AM Abhishek Gupta 
> wrote:
> >
> > Hi,
> >
> > Does Hive ACID tables for Hive version 1.2 posses the capability of
> being
> > read into Apache Pig using HCatLoader or Spark using SQLContext.
> > For Spark, it seems it is only possible to read ACID tables if the
> table is
> > fully compacted i.e no delta folders exist in any partition. Details
> in the
> > following JIRA
> >
> > https://issues.apache.org/jira/browse/SPARK-15348
> >
> > However I wanted to know if it is supported at all in Apache Pig to
> read
> > ACID tables in Hive
> >
>
> --
> nicolas
>


Re: Read Hive ACID tables in Spark or Pig

2019-03-06 Thread Alan Gates
Pig is in the same place as Spark, that the tables need to be compacted
first.  The issue is that outside readers don't understand which records in
the delta files are valid and which are not.

Theoretically all this is possible, as outside clients could get the valid
transaction list from the metastore and then read the files, but no one has
done this work.

Alan.

On Wed, Mar 6, 2019 at 8:28 AM Abhishek Gupta  wrote:

> Hi,
>
> Does Hive ACID tables for Hive version 1.2 posses the capability of being
> read into Apache Pig using HCatLoader or Spark using SQLContext.
> For Spark, it seems it is only possible to read ACID tables if the table
> is fully compacted i.e no delta folders exist in any partition. Details in
> the following JIRA
>
> https://issues.apache.org/jira/browse/SPARK-15348
>
> However I wanted to know if it is supported at all in Apache Pig to read
> ACID tables in Hive
>


Re: Standalone Metastore Question

2019-02-26 Thread Alan Gates
The standalone metastore released in 3.0 is the exact same metastore
released with Hive 3.0.  The only differences are in the install tool
'schematool' and the start and stop script.  Hive 3 is being used in
production a number of places.  I don't know if anyone is running the
metastore alone in production or not.

Alan.

On Tue, Feb 26, 2019 at 10:39 AM Abdoulaye Diallo 
wrote:

> Hi there,
>
> I am new to hive. My goal is to run the Standalone Metastore in the hope
> to integrate it with Spark/Iceberg without Hadoop/Hive.
> I downloaded the release from here
> (version
> 3.0.0) and successfully initialized a MySQL database and started the
> standalone-metastore process
> (org.apache.hadoop.hive.metastore.HiveMetaStore).
>
> Given how little documentation and examples I have found on the internet
> about this, I was wondering if this Standalone Metastore is production
> tested/ready and what experience people who run it in production have with
> it in terms of reliability and scalability.
>
> thanks
> --
> Abdoulaye Diallo
>


Re: Difference in performance of temp table vs subqueries

2019-01-24 Thread Alan Gates
That's a broad question and it depends on what you're doing.  Since temp
tables will materialize the intermediate result while subqueries will not,
I'd guess in most cases subqueries are faster.  But again, it depends on
what you're doing, and you'd need to benchmark your particular queries both
ways.
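As a concrete illustration of the two shapes being compared (table and column
names are made up):

-- subquery / CTE form: the intermediate result stays inside one query plan
WITH recent AS (
  SELECT * FROM events WHERE ds >= '2019-01-01'
)
SELECT user_id, COUNT(*) AS cnt FROM recent GROUP BY user_id;

-- temp table form: the intermediate result is materialized once, then read back
CREATE TEMPORARY TABLE recent_events AS
SELECT * FROM events WHERE ds >= '2019-01-01';

SELECT user_id, COUNT(*) AS cnt FROM recent_events GROUP BY user_id;

The temp table form tends to pay off only when the intermediate result is
reused by several later queries.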

Alan.

On Thu, Jan 24, 2019 at 2:08 AM Shivam Sharma <28shivamsha...@gmail.com>
wrote:

> Hi All,
>
> As per subject of mail, I want to know performance difference in both
> subquery and temp table.
>
> Thanks
>
>
> --
> Shivam Sharma
> Indian Institute Of Information Technology, Design and Manufacturing
> Jabalpur
> Email:- 28shivamsha...@gmail.com
> LinkedIn:-*https://www.linkedin.com/in/28shivamsharma
> *
>


Re: Hive Metastore Hook to to fire only on success

2018-10-05 Thread Alan Gates
Which version of Hive are you on and which hook are you seeing fire?  Based
on looking at the master code you should only see the commitCreateTable
hook call if the creation succeeds.

Alan.

On Thu, Oct 4, 2018 at 12:36 AM Daniel Haviv  wrote:

> Hi,
> I'm writing a HMS hook and I noticed that the hook fires no matter if the
> operation succeeded or not.
> For example, if a user creates an already existing table, the operation
> will fail but the hook will fire regardless.
>
> Is there a way to either validate that the operation succeeded or fire
> only upon success?
>
>
> TY.
> Daniel
>


Re: How to Grant All Privileges for All Databases except one in Hive SQL

2018-09-21 Thread Alan Gates
If I needed to set the permissions for every table in every database and
there were many, I'd write a shell script that first fetched all the
databases and tables (using show databases, use database, and show tables)
and then generated a "grant select on x" for each table.  Assuming you
don't want to grant every table to every user you'll probably need to
incorporate some filtering in your script.
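The statements such a script would enumerate and then emit look roughly like
this; database, table, and role names below are placeholders:

SHOW DATABASES;
SHOW TABLES IN sales_db;

-- one generated grant per table the role should be able to read
GRANT SELECT ON TABLE sales_db.orders TO ROLE analyst_role;
GRANT SELECT ON TABLE sales_db.customers TO ROLE analyst_role;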

Alan.

On Mon, Sep 17, 2018 at 11:31 AM Anup Tiwari  wrote:

> Hive doesn't have a "grant select on db.*" option, which is what I think
> you're looking for here.
>
> Yes i am looking something like this only and since it is not available,
> does that mean i have to go for each table ?
>
> I am asking because we have many DBs and a lot of tables within each DB so
> is there any other way ?
>
> Regards,
> Anup Tiwari
>
>
> On Mon, Sep 17, 2018 at 8:48 PM Alan Gates  wrote:
>
>> What you are seeing is correct behavior.  Select on the database means
>> the user can see objects in the database (ie, tables, views).  To see
>> contents of those objects you have to grant access on those objects.  Hive
>> doesn't have a "grant select on db.*" option, which is what I think you're
>> looking for here.
>>
>> Alan.
>>
>> On Mon, Sep 17, 2018 at 5:50 AM Anup Tiwari 
>> wrote:
>>
>>> Hi Alan,
>>>
>>> I have given select access of a database to a role which is attached to
>>> a user but after this also that user is not able to execute select
>>> statements on tables of that database. But if i provide access at table
>>> level then that is working. Can you please help me here ?
>>>
>>> Hive Version : 2.3.2
>>>
>>> Please find below steps :-
>>>
>>> 1. Added the configuration below to hive-site.xml
>>>
>>>   <property>
>>>     <name>hive.server2.enable.doAs</name>
>>>     <value>false</value>
>>>   </property>
>>>
>>>   <property>
>>>     <name>hive.users.in.admin.role</name>
>>>     <value>hadoop</value>
>>>   </property>
>>>
>>>   <property>
>>>     <name>hive.security.authorization.manager</name>
>>>     <value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory</value>
>>>   </property>
>>>
>>>   <property>
>>>     <name>hive.security.authorization.enabled</name>
>>>     <value>true</value>
>>>   </property>
>>>
>>>   <property>
>>>     <name>hive.security.authenticator.manager</name>
>>>     <value>org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator</value>
>>>   </property>
>>>
>>> 2. Restarted Hive Server2.
>>>
>>> 3. Logged in to hive shell with hadoop user and executed below command
>>> without any error :-
>>>
>>> set role admin;
>>> create role readonly;
>>> GRANT ROLE readonly TO USER `user2`;
>>> GRANT SELECT ON DATABASE anup TO ROLE readonly;
>>>
>>> 4. Logged in to hive shell with user2 and executed below commands :-
>>>
>>> select * from anup.t2 limit 5;
>>>
>>> *Error :-*
>>> Error: Error while compiling statement: FAILED:
>>> HiveAccessControlException Permission denied: Principal [name=mohan.b,
>>> type=USER] does not have following privileges for operation QUERY [[SELECT]
>>> on Object [type=TABLE_OR_VIEW, name=anup.t2]] (state=42000,code=4)
>>>
>>>
>>> show current roles;
>>> +-----------+
>>> | role      |
>>> +-----------+
>>> | public    |
>>> | readonly  |
>>> +-----------+
>>> 2 rows selected (0.085 seconds)
>>>
>>> SHOW GRANT ROLE `readonly` ON DATABASE anup;
>>>
>>> database        : anup
>>> table           :
>>> partition       :
>>> column          :
>>> principal_name  : readonly
>>> principal_type  : ROLE
>>> privilege       : SELECT
>>> grant_option    : false
>>> grant_time      : 1537187896000
>>> grantor         : hadoop
>>>
>>> Regards,
>>> Anup Tiwari
>>>
>>>
>>> On Fri, Sep 14, 2018 at 10:50 PM Alan Gates 
>>> wrote:
>>>
>>>> You can 

Re: Question about OVER clause

2018-09-21 Thread Alan Gates
This article might be helpful.  It's for SQL Server, but the semantics
should be similar.

https://www.sqlpassion.at/archive/2015/01/22/sql-server-windowing-functions-rows-vs-range/
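The short version, as a hedged Hive example (table and column names are made
up): ROWS frames are defined by physical position, RANGE frames by the value
of the ORDER BY expression, so duplicates behave differently.

SELECT
  amount,
  SUM(amount) OVER (ORDER BY amount
                    ROWS  BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_running,
  SUM(amount) OVER (ORDER BY amount
                    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_running
FROM sales;
-- rows_running grows row by row; range_running gives every row that shares the
-- same amount (its peers) the same running total, because the frame is bounded
-- by values rather than positions.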

Alan.

On Wed, Sep 19, 2018 at 6:47 AM 孙志禹  wrote:

> Dears,
>    What is the difference between *ROWS BETWEEN* and *RANGE BETWEEN* when
> using an OVER clause? I found it difficult to get an answer about this for
> Hive.
>    Hope there will be a more detailed help article about the OVER clause on
> the Confluence wiki.
>    Thanks!
>


Re: How to Grant All Privileges for All Databases except one in Hive SQL

2018-09-17 Thread Alan Gates
What you are seeing is correct behavior.  Select on the database means the
user can see objects in the database (ie, tables, views).  To see contents
of those objects you have to grant access on those objects.  Hive doesn't
have a "grant select on db.*" option, which is what I think you're looking
for here.

Alan.

On Mon, Sep 17, 2018 at 5:50 AM Anup Tiwari  wrote:

> Hi Alan,
>
> I have given select access of a database to a role which is attached to a
> user but after this also that user is not able to execute select statements
> on tables of that database. But if i provide access at table level then
> that is working. Can you please help me here ?
>
> Hive Version : 2.3.2
>
> Please find below steps :-
>
> 1. Added the configuration below to hive-site.xml
>
>   <property>
>     <name>hive.server2.enable.doAs</name>
>     <value>false</value>
>   </property>
>
>   <property>
>     <name>hive.users.in.admin.role</name>
>     <value>hadoop</value>
>   </property>
>
>   <property>
>     <name>hive.security.authorization.manager</name>
>     <value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory</value>
>   </property>
>
>   <property>
>     <name>hive.security.authorization.enabled</name>
>     <value>true</value>
>   </property>
>
>   <property>
>     <name>hive.security.authenticator.manager</name>
>     <value>org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator</value>
>   </property>
>
> 2. Restarted Hive Server2.
>
> 3. Logged in to hive shell with hadoop user and executed below command
> without any error :-
>
> set role admin;
> create role readonly;
> GRANT ROLE readonly TO USER `user2`;
> GRANT SELECT ON DATABASE anup TO ROLE readonly;
>
> 4. Logged in to hive shell with user2 and executed below commands :-
>
> select * from anup.t2 limit 5;
>
> *Error :-*
> Error: Error while compiling statement: FAILED: HiveAccessControlException
> Permission denied: Principal [name=mohan.b, type=USER] does not have
> following privileges for operation QUERY [[SELECT] on Object
> [type=TABLE_OR_VIEW, name=anup.t2]] (state=42000,code=4)
>
>
> show current roles;
> +-----------+
> | role      |
> +-----------+
> | public    |
> | readonly  |
> +-----------+
> 2 rows selected (0.085 seconds)
>
> SHOW GRANT ROLE `readonly` ON DATABASE anup;
>
> database        : anup
> table           :
> partition       :
> column          :
> principal_name  : readonly
> principal_type  : ROLE
> privilege       : SELECT
> grant_option    : false
> grant_time      : 1537187896000
> grantor         : hadoop
>
> Regards,
> Anup Tiwari
>
>
> On Fri, Sep 14, 2018 at 10:50 PM Alan Gates  wrote:
>
>> You can see a full list of what grant supports at
>> https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization#SQLStandardBasedHiveAuthorization-Grant
>>
>> There is no "grant x to user on all databases" or regex expressions for
>> database names.  So you'll have to do the databases one by one.
>>
>> External security managers such as Apache Ranger (and I think Apache
>> Sentry, but I'm not sure) can do blanket policies or default policies.
>> This has the added advantage that as new databases are created the policies
>> immediately apply.
>>
>> Alan.
>>
>> On Thu, Sep 13, 2018 at 10:37 PM Anup Tiwari 
>> wrote:
>>
>>> Hi,
>>>
>>> Can someone reply on this?
>>>
>>> On Tue, 11 Sep 2018 19:21 Anup Tiwari,  wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have similar requirement as mentioned in the link Link to question
>>>> <https://stackoverflow.com/questions/38199021/how-to-grant-all-privileges-for-all-databases-except-one-in-hive-sql>
>>>> .
>>>>
>>>> *Requirement :-*
>>>>
>>>> I know how to grant privileges on a database to a role in Hive SQL.
>>>> For example, GRANT ALL ON DATABASE analyst1 TO ROLE analyst_role;
>>>> But there are hundreds of databases on my system, it's almost
>>>> impossible to grant one by one.
>>>> Is it possible to grant all privileges for all databases ?
>>>> Also Is it possible to grant all privileges for all databases except
>>>> one database(ex: db.name = temp)?
>>>>
>>>>
>>>> Regards,
>>>> Anup Tiwari
>>>>
>>>


Re: How to Grant All Privileges for All Databases except one in Hive SQL

2018-09-14 Thread Alan Gates
You can see a full list of what grant supports at
https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization#SQLStandardBasedHiveAuthorization-Grant

There is no "grant x to user on all databases" or regex expressions for
database names.  So you'll have to do the databases one by one.

External security managers such as Apache Ranger (and I think Apache
Sentry, but I'm not sure) can do blanket policies or default policies.
This has the added advantage that as new databases are created the policies
immediately apply.

Alan.

On Thu, Sep 13, 2018 at 10:37 PM Anup Tiwari  wrote:

> Hi,
>
> Can someone reply on this?
>
> On Tue, 11 Sep 2018 19:21 Anup Tiwari,  wrote:
>
>> Hi All,
>>
>> I have a similar requirement as mentioned in this link: Link to question
>> <https://stackoverflow.com/questions/38199021/how-to-grant-all-privileges-for-all-databases-except-one-in-hive-sql>.
>>
>> *Requirement :-*
>>
>> I know how to grant privileges on a database to a role in Hive SQL.
>> For example, GRANT ALL ON DATABASE analyst1 TO ROLE analyst_role;
>> But there are hundreds of databases on my system, it's almost impossible
>> to grant one by one.
>> Is it possible to grant all privileges for all databases ?
>> Also Is it possible to grant all privileges for all databases except one
>> database(ex: db.name = temp)?
>>
>>
>> Regards,
>> Anup Tiwari
>>
>


Re: Hive Metada as a microservice

2018-07-05 Thread Alan Gates
In 3.0, you can download the metastore as a separate artifact, either
source or binary (e.g.
http://ftp.wayne.edu/apache/hive/hive-standalone-metastore-3.0.0/).  It
does not require any other parts of Hive beyond what's released in that
artifact.  I'm not sure if this meets your definition of a loosely coupled
microservice or not.

Alan.

On Thu, Jul 5, 2018 at 11:49 AM Mich Talebzadeh 
wrote:

> Hi,
>
> My understanding is that in later releases of Hive, the metadata will be a
> separate offerings. Will this be a type of microservice offering providing
> loose coupling to various other artefact?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: drop partitions

2018-06-18 Thread Alan Gates
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-DropPartitions
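
For example, a hedged sketch (table and partition column are made up;
comparators are allowed in the partition spec of a DROP per the page above,
though the exact forms accepted vary by version, so check it for combining
two bounds):

  ALTER TABLE my_table DROP IF EXISTS PARTITION (dt < '2018-06-01');

This drops every partition whose dt value sorts below the cutoff.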

Alan.

On Sat, Jun 16, 2018 at 8:03 PM Mahender Sarangam <
mahender.bigd...@outlook.com> wrote:

> Hi All,
>
> What is right syntax for dropping the partitions. Alter table drop if
> exists partition(date >'date1'),partition(date <'date2')  or Alter table
> drop if exists partition(date >'date1', date <'date2') ?
>
>
> Mahens
>
>


Re: Oracle 11g Hive 2.1 metastore backend

2018-06-06 Thread Alan Gates
We currently run our Oracle tests against 11g, but that is only for the 3.0
and beyond releases.  Given the error I am guessing this is a result of the
Oracle version plus the datanucleus version, which we changed between 2.1
and 2.3.

Alan.

On Wed, Jun 6, 2018 at 12:12 PM Mich Talebzadeh 
wrote:

> My Hive is 2.3.2
>
> my Oracle is 12.c
>
> Connected to:
> Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit
> Production
> With the Partitioning, OLAP, Advanced Analytics and Real Application
> Testing options
>
> and these are hive connections
>
> sessions as on JUN 06 2018 08:21 PM
>
> LOGIN   SID/serial# LOGGED IN S HOST   OS PID Client
> PID PROGRAM   MEM/KB  Logical I/O Physical I/O
> --- --- --- -- --
> -- ---   
> ACT INFO
> --- ---
> HIVEUSER46,3413906/06 07:29 rhes75 oracle/28441
> hduser/1234JDBC Thin Clien1,088   460
> N
> HIVEUSER325,856906/06 08:01 rhes75 oracle/28748
> hduser/1234JDBC Thin Clien1,088   440
> N
> HIVEUSER407,64925   06/06 07:29 rhes75 oracle/28437
> hduser/1234JDBC Thin Clien1,088   440
> N
>
>
> I have no issues
>
> Is this your issue?
>
> Caused by: MetaException(message:java.lang.ClassCastException:
> org.datanucleus.store.rdbms.mapping.datastore.ClobImpl cannot be cast to
> oracle.sql.CLOB)
>
> HTH,
>
> Mich
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 6 June 2018 at 05:10, Arjun kr  wrote:
>
>> Hi All,
>>
>> Is anyone using Oracle 11g configured as Hive 2.1 metastore backend? I'm
>> encountering below exception with Oracle 11g configured as Hive 2.1
>> metastore backend. Any help would be appreciated.
>>
>> 2018-05-23T13:05:03,219 DEBUG [main] transport.TSaslTransport: CLIENT:
>> reading data length: 211
>> 2018-05-23T13:05:03,220 DEBUG [main] transport.TSaslTransport: data
>> length after unwrap: 179
>> 2018-05-23T13:05:03,245 ERROR [main] exec.DDLTask:
>> org.apache.hadoop.hive.ql.metadata.HiveException:
>> MetaException(message:java.lang.ClassCastException:
>> org.datanucleus.store.rdbms.mapping.datastore.ClobImpl cannot be cast to
>> oracle.sql.CLOB)
>> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:842)
>> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:847)
>> at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3992)
>> at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:332)
>> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:197)
>> at
>> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
>> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2074)
>> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1745)
>> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1454)
>> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1172)
>> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1162)
>> at
>> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:234)
>> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:185)
>> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:401)
>> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:337)
>> at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:435)
>> at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:451)
>> at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:763)
>> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:729)
>> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:652)
>> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:647)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:498)
>> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
>> Caused by: MetaException(message:java.lang.ClassCastException:
>> org.datanucleus.store.rdbms.mapping.datastore.ClobImpl cannot be cast to
>> oracle.sql.CLOB)

Re: Is 'application' a reserved word?

2018-05-30 Thread Alan Gates
It is.  You can see the definitive list of keywords at
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/HiveLexer.g
(Note this is for the master branch, you can switch the branch around to
find the list for a particular release.)  It would be good to file a JIRA
on this so we fix the documentation.
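
As a workaround, quoting the identifier with backticks should work (a quick
sketch, assuming hive.support.quoted.identifiers is left at its default of
"column"):

  CREATE TABLE app (`application` STRING);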

Alan.

On Wed, May 30, 2018 at 7:48 AM Matt Burgess  wrote:

> I tried the following simple statement in beeline (Hive 3.0.0):
>
> create table app (application STRING);
>
> And got the following error:
>
> Error: Error while compiling statement: FAILED: ParseException line
> 1:18 cannot recognize input near 'application' 'STRING' ')' in column
> name or constraint (state=42000,code=4)
>
> I checked the Wiki [1] but didn't see 'application' on the list of
> reserved words. However if I change the column name to anything else
> (even 'applicatio') it works. Can someone confirm whether this is a
> reserved word?
>
> Thanks in advance,
> Matt
>
> [1]
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Keywords,Non-reservedKeywordsandReservedKeywords
>


Re: Combining hive tables as one query

2018-05-15 Thread Alan Gates
You are correct that Hive does not support "with recursive".  A few more
details of what you are trying to do would be helpful, since it's not clear
why you need the iteration provided by "with recursive".  If you really
need the iteration I don't think Hive can do what you want.
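
For the non-iterative case quoted below, a plain (non-recursive) WITH plus a
cross join gives the same single-row result; a sketch, assuming tb1, tb2 and
tb3 each return one value:

  WITH a AS (SELECT col FROM tb1),
       b AS (SELECT col FROM tb2),
       c AS (SELECT col FROM tb3)
  SELECT a.col AS a, b.col AS b, c.col AS c
  FROM a CROSS JOIN b CROSS JOIN c;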

Alan.

On Tue, May 15, 2018 at 9:34 AM, Sowjanya Kakarala 
wrote:

> OK, but in my use case joins/unions won't help.
> Here is an example of my use case from Postgres, which I have to do in a
> similar way for Hive.
> with recursive
> a as (select col from tb1),
> b as (select col from tb2),
> c as (select col from tb3)
>
> select a,b,c from a,b,c;
>
> which outputs a dataframe that I am writing to a CSV, and it looks like
> a    b    c
> 0.1  0.2  0.3
>
> where Hive does not support `with recursive` with the same functionality
> as in Postgres, and views are also not helping here.
>
> On Tue, May 15, 2018 at 11:19 AM, Alan Gates  wrote:
>
>> In general this is done using joins, as in all SQL engines.  A google
>> search on "intro to SQL joins" will suggest a number of resources, for
>> example https://www.essentialsql.com/get-ready-to-learn-sql-
>> 12-introduction-to-database-joins/
>>
>> Alan.
>>
>> On Tue, May 15, 2018 at 7:37 AM, Sowjanya Kakarala 
>> wrote:
>>
>>> Hi all,
>>>
>>> Is there a way in hive that different tables data, can be read as in a
>>> single query?
>>>
>>> example:
>>> (something like)
>>> select a,b from (select col1 from tbl1)a , (select col1 from tb2)b);
>>>
>>> output as :
>>> a  b
>>> 0.1  0.2
>>>
>>> Any help is appreciated.
>>>
>>> Thanks
>>> Sowjanya
>>>
>>
>>
>
>
> --
>
> Sowjanya Kakarala
>
> Infrastructure Software Engineer
>
>
>
> Agrible, Inc. | sowja...@agrible.com  | 217-848-1128
>
> 2021 S. First Street, Suite 201, Champaign, IL 61820
> <https://maps.google.com/?q=2021+S.+First+Street,+Suite+201,+Champaign,+IL+61820&entry=gmail&source=g>
>
>
>
> Agrible.com <http://agrible.com/> | facebook
> <https://www.facebook.com/Agrible> | youtube
> <https://www.youtube.com/c/AgribleInc_TheInsightToDecide> | twitter
> <https://twitter.com/Agribleinc>
>
> [image: Agrible_Logo_Email_Signature.jpg]
>


Re: Combining hive tables as one query

2018-05-15 Thread Alan Gates
In general this is done using joins, as in all SQL engines.  A google
search on "intro to SQL joins" will suggest a number of resources, for
example
https://www.essentialsql.com/get-ready-to-learn-sql-12-introduction-to-database-joins/

Alan.

On Tue, May 15, 2018 at 7:37 AM, Sowjanya Kakarala 
wrote:

> Hi all,
>
> Is there a way in hive that different tables data, can be read as in a
> single query?
>
> example:
> (something like)
> select a,b from (select col1 from tbl1)a , (select col1 from tb2)b);
>
> output as :
> a  b
> 0.1  0.2
>
> Any help is appreciated.
>
> Thanks
> Sowjanya
>


Re: Dynamic vs Static partition

2018-01-08 Thread Alan Gates
When doing dynamic partitioning, Hive needs to look at each record and
determine which partition to place it in.  In the case where all records go
to the same partition, it is more efficient to tell Hive up front, that is,
to use static partitioning.  So you can use dynamic partitioning for large
files; it will just take more processing power and time.
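
A sketch of the two forms (table and column names are made up):

  -- static: the target partition is named up front, no per-row routing
  INSERT OVERWRITE TABLE sales PARTITION (dt='2018-01-01')
  SELECT id, amount FROM staging WHERE dt='2018-01-01';

  -- dynamic: Hive reads dt from each row to pick the partition
  SET hive.exec.dynamic.partition=true;
  SET hive.exec.dynamic.partition.mode=nonstrict;
  INSERT OVERWRITE TABLE sales PARTITION (dt)
  SELECT id, amount, dt FROM staging;

Strict mode requires at least one static partition column in such an
insert, as a guard against accidentally creating a huge number of
partitions; nonstrict lifts that requirement.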

Alan.

On Tue, Jan 2, 2018 at 5:53 PM, Sachit Murarka 
wrote:

> Hi,
>
> I am unclear about Dynamic and static partition.
>
>1. Why static partition is good for loading large files and why can’t
>we use dynamic partition for the same?
>2. Why does dynamic partition take more time in loading data than
>static partitions?
>
>Also please explain when to use strict and nonstrict mode.
>
> Kind Regards,
> Sachit Murarka
>


Re: Creating Surrogate Keys in Hive

2017-11-21 Thread Alan Gates
It isn't possible to guarantee sequential keys because tasks run in
parallel.  You can write a UDF to assign a unique id or sequential ids
within a task.
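
If built-in functions are enough (rather than a custom UDF), two common
sketches; the table and column names are made up, and neither gives a
globally sequential key:

  -- unique but not sequential: one random UUID per row
  SELECT reflect('java.util.UUID', 'randomUUID') AS surrogate_key, t.*
  FROM my_table t;

  -- sequential within one query result, at the cost of a total ordering
  SELECT row_number() OVER (ORDER BY natural_key) AS surrogate_key, t.*
  FROM my_table t;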

Alan.

On Tue, Nov 21, 2017 at 3:53 AM, kishore kumar 
wrote:

> Hi,
>
> Could some one suggest how to create surrogate keys sequentially in hive ?
>
> --
> Thanks,
> Kishore.
>


Re: Options for connecting to Apache Hive

2017-11-10 Thread Alan Gates
There are ODBC drivers available for Hive, though they aren’t part of the
open source package and are not free.  Google can help you find them.  As
Elliot says, you can use the Thrift protocol, which is what the JDBC driver
uses.  You can find the thrift definition in the code at
https://github.com/apache/hive/blob/master/service-rpc/if/TCLIService.thrift


Alan.

On Fri, Nov 10, 2017 at 5:43 AM, Elliot West  wrote:

> Hi Jakob,
>
> Assuming that your Hive deployment is running HiveServer2, you could issue
> queries and obtain result sets via its Thrift API. Thrift has a broad set
> of language implementations, including C IIRC. I believe this is also the
> API used by Hive's JDBC connector, so it should be capable from a feature
> set perspective.
>
> Cheers - Elliot.
>
> On 10 November 2017 at 10:28, Jakob Egger  wrote:
>
>> Hi!
>>
>> I'm the developer of a database client tool, and I've received a request
>> to add support for querying Apache Hive.
>>
>> (My tool lets the user execute SQL queries, and it allows browsing tables
>> etc.)
>>
>> As a first step of evaluating this suggestion, I'm trying to find out if
>> there is a convenient way to connect to Hive.
>>
>> From reading the documentation, it seems that the preferred way to
>> connect seems to be using the JDBC driver. Since my app is not written in
>> Java, this is probably not the way to go. Apart from that, I didn't find
>> much on this topic in the docs.
>>
>> I have a few questions:
>>
>> 1) What ways are there to connect to Apache Hive?
>>
>> 2) Is there a C client library?
>>
>> 3) Is there any documentation on the wire protocol that Hive uses for
>> client / server communication?
>>
>> I'd appreciate if someone who knows more about the project could point me
>> in the right direction!
>>
>> Best regards,
>> Jakob
>
>
>


Re: HCatClient vs HiveMetaStoreClient (or IMetaStoreClient)

2017-11-10 Thread Alan Gates
HCatClient is useful if you are already using HCat.  If not, use
HiveMetaStoreClient.  It’s been kept much more up to date.

Alan.

On Fri, Nov 10, 2017 at 9:23 AM, Patel,Stephen 
wrote:

> From a cursory inspection, it seems that HCatClient provides a subset of
> the functionality that the HiveMetaStoreClient provides.
>
>
>
> Why might a consumer choose to use one interface over the other?
>
>
> CONFIDENTIALITY NOTICE This message and any included attachments are from
> Cerner Corporation and are intended only for the addressee. The information
> contained in this message is confidential and may constitute inside or
> non-public information under international, federal, or state securities
> laws. Unauthorized forwarding, printing, copying, distribution, or use of
> such information is strictly prohibited and may be unlawful. If you are not
> the addressee, please promptly delete this message and notify the sender of
> the delivery error by e-mail or you may call Cerner's corporate offices in
> Kansas City, Missouri, U.S.A at (+1) (816)221-1024 <(816)%20221-1024>.
>


Re: on master branch, hive code has some itests maven build error

2017-09-25 Thread Alan Gates
I would suggest removing -Pdist in the initial mvn command.  That should
only be used to build tarballs for distribution.  So your initial mvn
command should just be:
mvn clean install -DskipTests

Alan.

On Sat, Sep 23, 2017 at 3:55 AM, eric wong  wrote:

> I am trying to add some q file tests to Hive, following the Hive dev
> documents, but a build error occurs.
>
> Any suggestion would be welcome, thanks.
>
> command list:
> git fetch
> git checkout -b master-0923 origin/master
> mvn clean install -DskipTests -Pdist
> mvn test -Pitests -Dtest=TestCliDriver -Dqfile=union_different_types_three.q
> -Dtest.output.overwrite=true
>
>
> mvn error stacktrace:
>
> [INFO] /home/master/haihua/project/hive-official/itests/hive-
> unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java:
> Some input files use unchecked or unsafe operations.
> [INFO] /home/master/haihua/project/hive-official/itests/hive-
> unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java:
> Recompile with -Xlint:unchecked for details.
> [INFO] -
> [ERROR] COMPILATION ERROR :
> [INFO] -
> [ERROR] /home/master/haihua/project/hive-official/itests/hive-
> unit/src/test/java/org/apache/hive/beeline/hs2connection/
> BeelineWithHS2ConnectionFileTestBase.java:[145,12] cannot find symbol
>   symbol:   method cleanupLocalDir()
>   location: class org.apache.hive.jdbc.miniHS2.MiniHS2
> [ERROR] /home/master/haihua/project/hive-official/itests/hive-
> unit/src/test/java/org/apache/hive/beeline/hs2connection/
> BeelineWithHS2ConnectionFileTestBase.java:[151,12] cannot find symbol
>   symbol:   method cleanupLocalDir()
>   location: class org.apache.hive.jdbc.miniHS2.MiniHS2
> [ERROR] /home/master/haihua/project/hive-official/itests/hive-
> unit/src/test/java/org/apache/hive/beeline/hs2connection/
> BeelineWithHS2ConnectionFileTestBase.java:[169,52] cannot find symbol
>   symbol:   method cleanupLocalDirOnStartup(boolean)
>   location: class org.apache.hive.jdbc.miniHS2.MiniHS2.Builder
> [ERROR] /home/master/haihua/project/hive-official/itests/hive-
> unit/src/test/java/org/apache/hive/jdbc/TestXSRFFilter.java:[56,12]
> cannot find symbol
>   symbol:   method cleanupLocalDir()
>   location: class org.apache.hive.jdbc.miniHS2.MiniHS2
> [ERROR] /home/master/haihua/project/hive-official/itests/hive-
> unit/src/test/java/org/apache/hive/jdbc/TestXSRFFilter.java:[61,12]
> cannot find symbol
>   symbol:   method cleanupLocalDir()
>   location: class org.apache.hive.jdbc.miniHS2.MiniHS2
> [ERROR] /home/master/haihua/project/hive-official/itests/hive-
> unit/src/test/java/org/apache/hive/jdbc/TestXSRFFilter.java:[70,51]
> cannot find symbol
>   symbol:   method cleanupLocalDirOnStartup(boolean)
>   location: class org.apache.hive.jdbc.miniHS2.MiniHS2.Builder
> [ERROR] /home/master/haihua/project/hive-official/itests/hive-
> unit/src/test/java/org/apache/hive/jdbc/TestServiceDiscoveryWithMiniHS2.java:[55,12]
> cannot find symbol
>   symbol:   method cleanupLocalDir()
>   location: class org.apache.hive.jdbc.miniHS2.MiniHS2
> [ERROR] /home/master/haihua/project/hive-official/itests/hive-
> unit/src/test/java/org/apache/hive/jdbc/TestServiceDiscoveryWithMiniHS2.java:[72,12]
> cannot find symbol
>   symbol:   method cleanupLocalDir()
>   location: class org.apache.hive.jdbc.miniHS2.MiniHS2
> [ERROR] /home/master/haihua/project/hive-official/itests/hive-
> unit/src/test/java/org/apache/hive/jdbc/TestServiceDiscoveryWithMiniHS2.java:[77,55]
> cannot find symbol
>   symbol:   method cleanupLocalDirOnStartup(boolean)
>   location: class org.apache.hive.jdbc.miniHS2.MiniHS2.Builder
> [ERROR] /home/master/haihua/project/hive-official/itests/hive-
> unit/src/test/java/org/apache/hive/jdbc/TestSSL.java:[73,12] cannot find
> symbol
>   symbol:   method cleanupLocalDir()
>   location: class org.apache.hive.jdbc.miniHS2.MiniHS2
> [ERROR] /home/master/haihua/project/hive-official/itests/hive-
> unit/src/test/java/org/apache/hive/jdbc/TestSSL.java:[79,12] cannot find
> symbol
>   symbol:   method cleanupLocalDir()
>   location: class org.apache.hive.jdbc.miniHS2.MiniHS2
> [ERROR] /home/master/haihua/project/hive-official/itests/hive-
> unit/src/test/java/org/apache/hive/jdbc/TestSSL.java:[85,51] cannot find
> symbol
>   symbol:   method cleanupLocalDirOnStartup(boolean)
>   location: class org.apache.hive.jdbc.miniHS2.MiniHS2.Builder
> [ERROR] /home/master/haihua/project/hive-official/itests/hive-
> unit/src/test/java/org/apache/hive/jdbc/TestSSL.java:[441,73] cannot find
> symbol
>   symbol:   method cleanupLocalDirOnStartup(boolean)
>   location: class org.apache.hive.jdbc.miniHS2.MiniHS2.Builder
> [ERROR] /home/master/haihua/project/hive-official/itests/hive-
> unit/src/test/java/org/apache/hive/jdbc/TestSSL.java:[475,73] cannot find
> symbol
>   symbol:   method cleanupLocalDirOnStartup(boolean)
>   location: class org.apache.hive.jdbc.m

Re: Aug. 2017 Hive User Group Meeting

2017-08-22 Thread Alan Gates
The address is at the top of the text description, even though it isn’t in
the location field:

5470 Great America Parkway, Santa Clara, CA

Alan.

On Mon, Aug 21, 2017 at 5:50 PM, dan young  wrote:

> For us out of town folks, where is the location of this meetup? Says
> Hortonworks but do you have an address?
>
> Regards
>
> Dano
>
> On Mon, Aug 21, 2017, 1:33 PM Xuefu Zhang  wrote:
>
>> Dear Hive users and developers,
>>
>> As reminder, the next Hive User Group Meeting will occur this Thursday,
>> Aug. 24. The agenda is available on the event page (
>> https://www.meetup.com/Hive-User-Group-Meeting/events/242210487/).
>>
>> See you all there!
>>
>> Thanks,
>> Xuefu
>>
>> On Tue, Aug 1, 2017 at 7:18 PM, Xuefu Zhang  wrote:
>>
>>> Hi all,
>>>
>>> It's an honor to announce that Hive community is launching a Hive user
>>> group meeting in the bay area this month. The details can be found at
>>> https://www.meetup.com/Hive-User-Group-Meeting/events/242210487/.
>>>
>>> We are inviting talk proposals from Hive users as well as developers at
>>> this time. We currently have 5 openings.
>>>
>>> Please let me know if you have any questions or suggestions.
>>>
>>> Thanks,
>>> Xuefu
>>>
>>>
>>


Re: please help unsubscribing from mailing lists

2017-08-15 Thread Alan Gates
I think http://untroubled.org/ezmlm/manual/Unsubscribing.html#Unsubscribing
has what you need.

Alan.

On Tue, Aug 15, 2017 at 1:09 PM, Chris Drome  wrote:

> I am currently subscribed to all three Hive mailing lists (user, dev,
> commits) using cdr...@yahoo-inc.com.
>
> I'm trying to unsubscribe from these lists, but am no longer able to send
> emails from my old cdr...@yahoo-inc.com account.
>
> Is there anyone who can help me unsubscribe from these mailing lists?
>
> Much appreciated.
>
> chris
>


Re: FYI: Backports of Hive UDFs

2017-06-02 Thread Alan Gates
Rather than put that code in hive/contrib I was thinking that you could
just backport the Hive 2.2 UDFs into the same locations in Hive 1 branch.
That seems better than putting them into different locations on different
branches.

If you are willing to do the porting and post the patches (including
relevant unit tests so we know they work) I and other Hive committers can
review the patches and commit them to branch-1.

Alan.

On Thu, Jun 1, 2017 at 6:36 PM, Makoto Yui  wrote:

> That would be a help for existing Hive users.
> Welcome to put it into hive/contrib or something else.
>
> Minimum dependancies are hive 0.13.0 and hadoop 2.4.0.
> It'll work for any Hive environment, version 0.13.0 or later.
> https://github.com/myui/hive-udf-backports/blob/master/pom.xml#L49
>
> Thanks,
> Makoto
>
> --
> Makoto YUI 
> Research Engineer, Treasure Data, Inc.
> http://myui.github.io/
>
> 2017-06-02 2:24 GMT+09:00 Alan Gates :
> > I'm curious why these can't be backported inside Hive.  If someone is
> > willing to do the work to do the backport we can check them into the
> Hive 1
> > branch.
> >
> > On Thu, Jun 1, 2017 at 1:44 AM, Makoto Yui  wrote:
> >>
> >> Hi,
> >>
> >> I created a repository for backporting recent Hive UDFs (as of v2.2.0)
> >> to legacy Hive environment (v0.13.0 or later).
> >>
> >>https://github.com/myui/hive-udf-backports
> >>
> >> Hope this helps for those who are using old Hive env :-(
> >>
> >> FYI
> >>
> >> Makoto
> >>
> >> --
> >> Makoto YUI 
> >> Research Engineer, Treasure Data, Inc.
> >> http://myui.github.io/
> >
> >
>


Re: FYI: Backports of Hive UDFs

2017-06-01 Thread Alan Gates
I'm curious why these can't be backported inside Hive.  If someone is
willing to do the work to do the backport we can check them into the Hive 1
branch.

On Thu, Jun 1, 2017 at 1:44 AM, Makoto Yui  wrote:

> Hi,
>
> I created a repository for backporting recent Hive UDFs (as of v2.2.0)
> to legacy Hive environment (v0.13.0 or later).
>
>https://github.com/myui/hive-udf-backports
>
> Hope this helps for those who are using old Hive env :-(
>
> FYI
>
> Makoto
>
> --
> Makoto YUI 
> Research Engineer, Treasure Data, Inc.
> http://myui.github.io/
>


Re: Compaction - get compacted files

2017-04-13 Thread Alan Gates
Answers inline.

Alan.

> On Mar 29, 2017, at 03:08, Riccardo Iacomini  
> wrote:
> 
> Hello,
> I have some questions about the compaction process. I need to manually 
> trigger compaction operations on a standard partitioned orc table (not ACID), 
> and be able to get back the list of compacted files. I could achieve this via 
> HDFS, getting the directory listing and then triggering the compaction, but 
> will imply stopping the underlying processing to avoid new files to be added 
> in between. Here are some questions I could not answer myself from the 
> material I found online:
>   • Is the compaction executed as a MapReduce job?
Yes.

> 
>   • Is there a way to get back the list of compacted files?
No.  Note that even doing listing in HDFS will be somewhat confusing because 
production of the new delta or base file (depending on whether it's a minor or 
major compaction) is decoupled from removing the old delta and/or base files.  
This is because readers may still be using the old files, and the cleanup 
cannot be done until those readers have finished.

> 
>   • How can you customize the compaction criteria?
You can modify when Hive decides to initiate compaction and how many resources 
it allocates to compacting.  See 
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-NewConfigurationParametersforTransactions
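
For reference, the main knobs on that page look like this (the property
names are real; the values here are only illustrative and go in
hive-site.xml on the metastore side):

  hive.compactor.initiator.on=true
  hive.compactor.worker.threads=4
  hive.compactor.delta.num.threshold=10   (deltas before a minor compaction)
  hive.compactor.delta.pct.threshold=0.1  (delta/base ratio before a major compaction)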

Alan.

> Also, any link to documentation/material is really appreciated. 
> 
> Thank you all for your time.
> 
> Riccardo



Re: Adding a Hive Statement of SQL Conformance to the docs

2017-01-13 Thread Alan Gates
+1.  I think this will be great for existing and potential Hive users.

Alan.

> On Jan 13, 2017, at 9:09 AM, Carter Shanklin  wrote:
> 
> I get asked from time to time what Hive's level of SQL conformance is, and 
> it's difficult to provide a clean answer. Most SQL systems have some detailed 
> statement of SQL conformance to help answer this question.
> 
> For a year or so I've maintained a spreadsheet that tracks Hive's SQL 
> conformance, inspired by the Postgres SQL Conformance page. I've copied this 
> spreadsheet into a publicly viewable Google Spreadsheet here: 
> https://docs.google.com/spreadsheets/d/1VaAqkPXtjhT_oYniUW1I2xMmAFlxqp2UFkemL9-U14Q/edit#gid=0
> 
> I propose to add a static version of this document to the Hive Wiki, and to 
> version it with one static SQL Conformance page per Hive major release, 
> starting with Hive 2.1 and moving forward. So for example there would be one 
> page for Hive 2.1, one for Hive 2.2 when it is released, and so on.
> 
> At this point I don't guarantee the spreadsheet's complete accuracy. Getting 
> it into the public wiki with multiple editors should quickly eliminate any 
> errors.
> 
> Does anyone have comments, suggestions or objections?
> 
> Thanks,
> 



Re: HMS connections to meta db

2016-12-19 Thread Alan Gates
Do you mean the connection between the Hive client and the Hive metastore (if
you are using the command line), or the connection between the metastore server
code and the RDBMS?  The connection to the RDBMS uses JDBC connection pooling
to avoid making and tearing down many connections.  The connection from the
command line client to the Hive metastore uses Thrift, which I believe
by default builds a new TCP connection for each operation.

Alan.

> On Dec 13, 2016, at 10:07 PM, Huang Meilong  wrote:
> 
> Hi all,
> 
> Will HMS keep the connection to meta db when HMS is up? Or will HMS build 
> connection to meta db every time the query comes to HMS and release 
> connection to meta db when query finished?



Re: [ANNOUNCE] Apache Hive 2.1.1 Released

2016-12-08 Thread Alan Gates
Apache keeps just the latest version of each release on the mirrors.  You can 
find all Hive releases at https://archive.apache.org/dist/hive/ if you need 
2.1.0.

Alan.

> On Dec 8, 2016, at 14:40, Stephen Sprague  wrote:
> 
> out of curiosity any reason why release 2.1.0 disappeared from 
> apache.claz.org/hive ?   apologies if i missed the conversation about it.  
> thanks.
> 
> 
> 
> 
> On Thu, Dec 8, 2016 at 9:58 AM, Jesus Camacho Rodriguez  
> wrote:
> The Apache Hive team is proud to announce the release of Apache Hive
> version 2.1.1.
> 
> The Apache Hive (TM) data warehouse software facilitates querying and
> managing large datasets residing in distributed storage. Built on top
> of Apache Hadoop (TM), it provides, among others:
> 
> * Tools to enable easy data extract/transform/load (ETL)
> 
> * A mechanism to impose structure on a variety of data formats
> 
> * Access to files stored either directly in Apache HDFS (TM) or in other
>   data storage systems such as Apache HBase (TM)
> 
> * Query execution via Apache Hadoop MapReduce and Apache Tez frameworks.
> 
> For Hive release details and downloads, please visit:
> https://hive.apache.org/downloads.html
> 
> Hive 2.1.1 Release Notes are available here:
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310843&version=12335838
> 
> We would like to thank the many contributors who made this release
> possible.
> 
> Regards,
> 
> The Apache Hive Team
> 
> 
> 



Re: I delete my table in hive,but the file in HDFS not be deleted

2016-12-06 Thread Alan Gates
Is the table external or managed?  External tables do not remove their data 
when dropped; managed tables do.

Alan.

> On Dec 6, 2016, at 18:08, 446463...@qq.com wrote:
> 
> I meet a problem in hive.
> 
> I drop a table in hive and the table name ' user_info_20161206'
> ---
> hive> show tables;
> OK
> kylin_cal_dt
> kylin_category_groupings
> kylin_intermediate_dmp_cube_fb5904cf_a4d3_4815_802d_c31afe9119e9
> kylin_intermediate_test_cube_08677652_0f84_4322_a2a5_0a963723579e
> kylin_intermediate_test_cube_a37beebf_d7da_4956_8e25_d563dd834364
> kylin_intermediate_test_cube_aa9ee162_0d45_4ea6_853d_6df127799edf
> kylin_sales
> Time taken: 0.045 seconds, Fetched: 7 row(s)
> hive> 
> --
> but I find the user_info_20161206 file is exist in HDFS file
> 
> drwxrwxrwt   - hadoop hadoop  0 2016-09-26 11:50 
> /user/hive/warehouse/dm.db
> drwxrwxrwt   - hadoop hadoop  0 2016-11-21 15:39 
> /user/hive/warehouse/dw.db
> drwxrwxrwt   - hadoop hadoop  0 2016-11-03 12:44 
> /user/hive/warehouse/kylin_cal_dt
> drwxrwxrwt   - hadoop hadoop  0 2016-11-03 12:44 
> /user/hive/warehouse/kylin_category_groupings
> drwxrwxrwt   - hadoop hadoop  0 2016-11-03 12:44 
> /user/hive/warehouse/kylin_sales
> drwxrwxrwt   - hadoop hadoop  0 2016-09-26 11:50 
> /user/hive/warehouse/ods.db
> drwxrwxrwt   - hadoop hadoop  0 2016-11-30 17:53 
> /user/hive/warehouse/raw.db
> drwxrwxrwt   - hadoop hadoop  0 2016-09-26 11:50 
> /user/hive/warehouse/rpt.db
> drwxrwxrwt   - hadoop hadoop  0 2016-09-26 11:50 
> /user/hive/warehouse/temp.db
> drwxrwxrwt   - hadoop hadoop  0 2016-11-24 13:17 
> /user/hive/warehouse/test.db
> drwxrwxrwt   - hive   hadoop  0 2016-12-06 21:15 
> /user/hive/warehouse/user_info_20161206
> --
> I don't know why, when I drop the table in Hive, the same file is not deleted in HDFS.
> I tested this in my test environment and there it works.
>  Is it because the owner of the file is 'hive' and not 'hadoop'?
> 446463...@qq.com



Re: Compaction in hive

2016-12-06 Thread Alan Gates
What exactly do you mean by compaction?  Hive has a compactor that runs over 
ACID tables to handle the delta files[1], but I’m guessing you don’t mean that. 
 Are you wanting to concatenate files in existing tables?  The usual way to do 
that is alter table concatenate[2].  Or do you mean something else?
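
For example (table name and partition value are made up):

  -- non-ACID ORC or RCFile tables: merge small files in place
  ALTER TABLE events PARTITION (dt='20161206') CONCATENATE;

  -- ACID tables: ask the compactor to do it instead
  ALTER TABLE events_acid PARTITION (dt='20161206') COMPACT 'major';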

Alan.

1. see 
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-Compactor
2. see 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionConcatenate

> On Dec 6, 2016, at 07:03, Nishant Aggarwal  wrote:
> 
> Dear Hive Gurus,
> 
> I am looking to some practical solution on how to implement Compaction in 
> Hive. Hiveserver2 version 1.1.0.
> 
> We have some external Hive tables on which we  need to implement Compaction. 
> 
> Merging the map files is one option which is turned down since it is very CPU 
> intensive.
> 
> Need your help in order to implement Compaction, how to implement, what are 
> the pros and cons.
> 
> Also, is it mandatory to have bucketing to implement compaction?
> 
> Request you to please help.
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Thanks and Regards
> Nishant Aggarwal, PMP
> Cell No:- +91 99588 94305
> http://in.linkedin.com/pub/nishant-aggarwal/53/698/11b
> 



Re: Difference between MANAGED_TABLE and EXTERNAL_TABLE in org.apache.hadoop.hive.metastore.TableType

2016-12-01 Thread Alan Gates
Hive does not assume that it owns the data for an external table.  Thus when an 
external table is dropped, the data is not deleted.  People often use this as a 
way to load data into a directory in HDFS and then “cast” a table structure 
over it by creating an external table with that directory as its location.
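
A minimal sketch of that pattern (schema, format and path are made up):

  CREATE EXTERNAL TABLE raw_events (id BIGINT, payload STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/data/raw_events';

A later DROP TABLE raw_events removes only the metadata; the files stay
in HDFS.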

Alan.

> On Dec 1, 2016, at 06:15, Huang Meilong  wrote:
> 
> Hi all,
> 
> I found an enum TableType in package org.apache.hadoop.hive.metastore. What's 
> the difference between MANAGED_TABLE and EXTERNAL_TABLE?
> 
> Will the table be an EXTERNAL TABLE with setting table type EXTERNAL_TABLE 
> when creating table? 
> 
> I found the code to determine whether a table is an external table in 
> MetaStoreUtils.java
> https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java#L1425
> 
> I'm confused what is EXTERNAL_TABLE in TableType for?



Re: Problems with Hive Streaming. Compactions not working. Out of memory errors.

2016-11-29 Thread Alan Gates
I’m guessing that this is an issue in the metastore database where it is unable 
to read from the transaction tables due to the ingestion rate.  What version of 
Hive are you using?  What database are you storing the metadata in?

Alan.

> On Nov 29, 2016, at 00:05, Diego Fustes Villadóniga  wrote:
> 
> Hi all,
>  
> We are trying to use Hive streaming to ingest data in real time from Flink. 
> We send batches of data every 5 seconds to Hive. We are working version 
> 1.1.0-cdh5.8.2.
>  
> The ingestión works fine. However, compactions are not working, the log shows 
> this error:
>  
> Unable to select next element for compaction, ERROR: could not serialize 
> access due to concurrent update
>  
> In addition, when we run simple queries like SELECT COUNT(1) FROM events, we 
> are getting OutOfMemory errors, even though we have assigned 10GB to each 
> Mapper/Reducer. Seeing the logs, each map task tries to load
> all delta files, until it breaks, which does not make much sense to me.
>  
>  
> I think that we have followed all the steps described in the documentation, 
> so we are blocked in this point.
>  
> Could you help us?



Re: Adding a New Primitive Type in Hive

2016-11-21 Thread Alan Gates
A few questions:

1) What operators do you envision UUID supporting?  Are there UDFs specific to 
it?  Are there constraints on assuring its uniqueness?

2) A more general form of question 1, what about UUID is different from a 
string or decimal(20, 0) (either of which should be able to store a UUID) that 
requires defining a new type?

3) Is this mostly to make clear to users that this is UUID data and not a 
general string, or bigint, or whatever?

As Edward correctly points out adding types has implications on other users in 
the system who read and write your data.  I’m also worried about proliferating 
new types.  I’m wondering if we could approach this by supporting user defined 
types.

Full on UDTs are complex, but we could start with just the ability to take a 
Hive struct and define it as a UDT in the metadata, with definitions of how to 
convert this value to and from a string.  This would enable storage without 
changing every serde (as we’d store it as a string in the underlying file) and 
allow constant definitions in SQL (since we could convert from a string).  This 
would not enable any constraints or operators for the new type, but those could 
be added later if desired.
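
To make question 2 concrete, a sketch of the "store as a string and
validate on ingest" alternative (table names are made up; the regex is the
one from the quoted mail below):

  CREATE TABLE awesome (users STRING, id STRING);

  INSERT INTO TABLE awesome
  SELECT users, id
  FROM staging
  WHERE lower(id) RLIKE
    '^[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$';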

Alan.

> On Nov 19, 2016, at 13:11, Juan Delard de Rigoulières  
> wrote:
> 
> Hi,
> We'd like to extend Hive to support a new primitive type. For simplicity 
> sake, think of UUID. 
> (https://en.wikipedia.org/wiki/Universally_unique_identifier)
> UUIDs are string with a particular/simple structure - known regex matchable. 
> (/^[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i)
> We've looked into serde & udf but it doesn't seem elegant enough, so that 
> it's possible to write DDLs like:
> CREATE TABLE `awesome` {
>   users STRING,
>   id UUID
> };
> We are looking to validation of values on ingestion (INSERT); so in the 
> example, values for the second column will get validated as UUID records.
> Thanks in advance.
> 
> Juan
> 



Re: kylin log file is to large

2016-11-13 Thread Alan Gates
This question would be better asked on the kylin lists, since Hive doesn’t 
start either kylin.sh or diag.sh.

Alan.


> On Nov 14, 2016, at 06:29, 446463...@qq.com wrote:
> 
> Hi:
>  I run a kylin instance in my test environment. I find that the kylin log
> increases very fast, and my hard disk filled up,
>  so I killed the kylin.sh process. But it did not stop. I checked the machine
> and found that diag.sh was running. I never started this script.
> I killed it and it seems OK. Why?
> 
> 446463...@qq.com



Re: Hive metadata on Hbase

2016-10-24 Thread Alan Gates
Some thoughts on this:

First, there’s no plan to remove the option to use an RDBMS such as Oracle as 
your backend.  Hive’s RawStore interface is built such that various 
implementations of the metadata storage can easily coexist.  Obviously 
different users will make different choices about what metadata store makes 
sense for them.

As to why HBase:
1) We desperately need to get rid of the ORM layer.  It’s causing us 
performance problems, as evidenced by things like it taking several minutes to 
fetch all of the partition data for queries that span many partitions.  HBase 
is a way to achieve this, not the only way.  See in particular Yahoo’s work on 
optimizing Oracle access https://issues.apache.org/jira/browse/HIVE-14870  The 
question around this is whether we can optimize for Oracle, MySQL, Postgres, 
and SQLServer without creating a maintenance and testing nightmare for 
ourselves.  I’m skeptical, but others think it’s possible.  See comments on 
that JIRA.

2) We’d like to scale to much larger sizes, both in terms of data and access 
from nodes.  Not that we’re worried about the amount of metadata, but we’d like 
to be able to cache more stats, file splits, etc.  And we’d like to allow nodes 
in the cluster to contact the metastore, which we do not today since many 
RDBMSs don’t handle a thousand plus simultaneous connections well.  Obviously 
both data and connection scale can be met with high end commercial stores.  But 
saying that we have this great open source database but you have to pay for an 
expensive commercial license to make the metadata really work well is a 
non-starter.

3) By using tools within the Hadoop ecosystem like HBase we are helping to 
drive improvement in the system

To explain the HBase work a little more, it doesn’t use Phoenix, but works 
directly against HBase, with the help of a transaction manager (Omid).  In 
performance tests we’ve done so far it’s faster than Hive 1 with the ORM layer, 
but not yet to the 10x range that we’d like to see.  We haven’t yet done the 
work to put in co-processors and such that we expect would speed it up further.

Alan.

> On Oct 23, 2016, at 15:46, Mich Talebzadeh  wrote:
> 
> 
> A while back there was some notes on having Hive metastore on Hbase as 
> opposed to conventional RDBMSs
> 
> I am currently involved with some hefty work with Hbase and Phoenix for batch 
> ingestion of trade data. As long as you define your Hbase table through 
> Phoenix and with secondary Phoenix indexes on Hbase, the speed is impressive.
> 
> I am not sure how much having Hbase as Hive metastore is going to add to Hive 
> performance. We use Oracle 12c as Hive metastore and the Hive database/schema 
> is built on solid state disks. Never had any issues with lock and concurrency.
> 
> Therefore I am not sure what one is going to gain by having Hbase as the Hive 
> metastore? I trust that we can still use our existing schemas on Oracle.
> 
> HTH
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  



Re: Hive in-memory offerings in forthcoming releases

2016-10-10 Thread Alan Gates
Hive doesn’t usually publish long term roadmaps.  

I am not familiar with either SAP ASE or Oracle 12c so I can’t say whether Hive 
is headed in that direction or not.

We see LLAP as very important for speeding up Hive processing, especially in 
the cloud where fetches from blob storage are very expensive.  As an example, 
see how HDInsight on Microsoft’s Azure cloud is already using LLAP.  At the 
moment LLAP is read only, so an obvious next step here is adding write 
capabilities (see https://issues.apache.org/jira/browse/HIVE-14535 for some 
thoughts on how this might work).

I don’t know if this answers your question or not.

Alan.

> On Oct 8, 2016, at 12:48, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> Is there any documentation on Apache Hive proposed new releases which is 
> going to offer an in-memory database (IMDB) in the form of LLAP or built on 
> LLAP.
> 
> Love to see something like SAP ASE IMDB or Oracle 12c in-memory offerings 
> with Hive as well.
> 
> Regards,
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  



Re: Hive orc use case

2016-09-26 Thread Alan Gates
As long as there is a spare worker thread this should be picked up within a few 
seconds.  It’s true you can’t force it to happen immediately if other 
compactions are happening, but that’s by design so that compaction work doesn’t 
take too many resources.

Alan.

> On Sep 26, 2016, at 11:07, Mich Talebzadeh  wrote:
> 
> alter table payees compact 'minor';
> Compaction enqueued.
> OK
> 
> It queues compaction but there is no way I can force it to do compaction 
> immediately?
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 26 September 2016 at 18:54, Alan Gates  wrote:
> alter table compact forces a compaction.  See 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionCompact
> 
> Alan.
> 
> > On Sep 26, 2016, at 10:41, Mich Talebzadeh  
> > wrote:
> >
> > Can the temporary table be a solution to the original thread owner issue?
> >
> > Hive streaming for example from Flume to Hive is interesting but the issue 
> > is that one ends up with a fair bit of delta files due to transactional 
> > nature of ORC table and I know that Spark will not be able to open the 
> > table until compaction takes place which cannot be forced. I don't know 
> > where there is a way to enforce quick compaction..
> >
> > Thanks
> >
> > Dr Mich Talebzadeh
> >
> > LinkedIn  
> > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >
> > http://talebzadehmich.wordpress.com
> >
> > Disclaimer: Use it at your own risk. Any and all responsibility for any 
> > loss, damage or destruction of data or any other property which may arise 
> > from relying on this email's technical content is explicitly disclaimed. 
> > The author will in no case be liable for any monetary damages arising from 
> > such loss, damage or destruction.
> >
> >
> > On 26 September 2016 at 17:41, Alan Gates  wrote:
> > ORC does not store data row by row.  It decomposes the rows into columns, 
> > and then stores pointer to those columns, as well as a number of indices 
> > and statistics, in a footer of the file.  Due to the footer, in the simple 
> > case you cannot read the file before you close it or append to it.  We did 
> > address both of these issues to support Hive streaming, but it’s a low 
> > level interface.  If you want to take a look at how Hive streaming handles 
> > this you could use it as your guide.  The starting point for that is 
> > HiveEndPoint in org.apache.hive.hcatalog.streaming.
> >
> > Alan.
> >
> > > On Sep 26, 2016, at 01:18, Amey Barve  wrote:
> > >
> > > Hi All,
> > >
> > > I have an use case where I need to append either 1 or many rows to 
> > > orcFile as well as read 1 or many rows from it.
> > >
> > > I observed that I cannot read rows from OrcFile unless I close the 
> > > OrcFile's writer, is this correct?
> > >
> > > Why doesn't write actually flush the rows to the orcFile, is there any 
> > > alternative where I write the rows as well as read them without closing 
> > > the orcFile's writer ?
> > >
> > > Thanks and Regards,
> > > Amey
> >
> >
> 
> 



Re: Hive orc use case

2016-09-26 Thread Alan Gates
alter table compact forces a compaction.  See 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionCompact

Alan.

> On Sep 26, 2016, at 10:41, Mich Talebzadeh  wrote:
> 
> Can the temporary table be a solution to the original thread owner issue?
> 
> Hive streaming for example from Flume to Hive is interesting but the issue is 
> that one ends up with a fair bit of delta files due to transactional nature 
> of ORC table and I know that Spark will not be able to open the table until 
> compaction takes place which cannot be forced. I don't know where there is a 
> way to enforce quick compaction..
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 26 September 2016 at 17:41, Alan Gates  wrote:
> ORC does not store data row by row.  It decomposes the rows into columns, and 
> then stores pointer to those columns, as well as a number of indices and 
> statistics, in a footer of the file.  Due to the footer, in the simple case 
> you cannot read the file before you close it or append to it.  We did address 
> both of these issues to support Hive streaming, but it’s a low level 
> interface.  If you want to take a look at how Hive streaming handles this you 
> could use it as your guide.  The starting point for that is HiveEndPoint in 
> org.apache.hive.hcatalog.streaming.
> 
> Alan.
> 
> > On Sep 26, 2016, at 01:18, Amey Barve  wrote:
> >
> > Hi All,
> >
> > I have an use case where I need to append either 1 or many rows to orcFile 
> > as well as read 1 or many rows from it.
> >
> > I observed that I cannot read rows from OrcFile unless I close the 
> > OrcFile's writer, is this correct?
> >
> > Why doesn't write actually flush the rows to the orcFile, is there any 
> > alternative where I write the rows as well as read them without closing the 
> > orcFile's writer ?
> >
> > Thanks and Regards,
> > Amey
> 
> 



Re: Hive orc use case

2016-09-26 Thread Alan Gates
ORC does not store data row by row.  It decomposes the rows into columns, and 
then stores pointers to those columns, as well as a number of indices and
statistics, in a footer of the file.  Due to the footer, in the simple case you 
cannot read the file before you close it or append to it.  We did address both 
of these issues to support Hive streaming, but it’s a low level interface.  If 
you want to take a look at how Hive streaming handles this you could use it as 
your guide.  The starting point for that is HiveEndPoint in 
org.apache.hive.hcatalog.streaming.

Alan.

> On Sep 26, 2016, at 01:18, Amey Barve  wrote:
> 
> Hi All,
> 
> I have an use case where I need to append either 1 or many rows to orcFile as 
> well as read 1 or many rows from it.
> 
> I observed that I cannot read rows from OrcFile unless I close the OrcFile's 
> writer, is this correct?
> 
> Why doesn't write actually flush the rows to the orcFile, is there any 
> alternative where I write the rows as well as read them without closing the 
> orcFile's writer ?
> 
> Thanks and Regards,
> Amey 



Re: How can I force Hive to start compaction on a table immediately

2016-08-01 Thread Alan Gates
There’s no way to force immediate compaction.  If there are compaction workers 
in the metastore that aren’t busy they should pick that up immediately.  But 
there isn’t an ability to create a worker thread and start compacting.
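
To watch for that pickup, the request and the queue can be inspected from
SQL (using the payees table from the quoted mail below):

  ALTER TABLE payees COMPACT 'major';
  SHOW COMPACTIONS;   -- lists queued, working and cleaning compactions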

Alan.

> On Aug 1, 2016, at 14:50, Mich Talebzadeh  wrote:
> 
> 
> Rather than queuing it
> 
> hive> alter table payees COMPACT 'major';
> Compaction enqueued.
> OK
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  



Re: Hive compaction didn't launch

2016-07-28 Thread Alan Gates
But until those transactions are closed you don’t know that they won’t write to 
partition B.  After they write to A they may choose to write to B and then 
commit.  The compactor can not make any assumptions about what sessions with 
open transactions will do in the future.

Alan.

> On Jul 28, 2016, at 09:19, Igor Kuzmenko  wrote:
> 
> But this minOpenTxn value isn't from the delta I want to compact. minOpenTxn
> can point to a transaction in partition A while in partition B there are deltas
> ready for compaction. If minOpenTxn is less than the txnIds in partition B's
> deltas, compaction won't happen. So an open transaction in partition A blocks
> compaction in partition B. That seems wrong to me.
> 
> On Thu, Jul 28, 2016 at 7:06 PM, Alan Gates  wrote:
> Hive is doing the right thing there, as it cannot compact the deltas into a 
> base file while there are still open transactions in the delta.  Storm should 
> be committing on some frequency even if it doesn’t have enough data to commit.
> 
> Alan.
> 
> > On Jul 28, 2016, at 05:36, Igor Kuzmenko  wrote:
> >
> > I made some research on that issue.
> > The problem is in ValidCompactorTxnList::isTxnRangeValid method.
> >
> > Here's code:
> > @Override
> > public RangeResponse isTxnRangeValid(long minTxnId, long maxTxnId) {
> >   if (highWatermark < minTxnId) {
> > return RangeResponse.NONE;
> >   } else if (minOpenTxn < 0) {
> > return highWatermark >= maxTxnId ? RangeResponse.ALL : 
> > RangeResponse.NONE;
> >   } else {
> > return minOpenTxn > maxTxnId ? RangeResponse.ALL : RangeResponse.NONE;
> >   }
> > }
> >
> > In my case this method returned RangeResponse.NONE for most of the delta
> > files. With this value a delta file is not included in compaction.
> >
> > The last 'else' block compares minOpenTxn to maxTxnId and, if maxTxnId is
> > bigger, returns RangeResponse.NONE. That's a problem for me because I am using
> > the Storm Hive Bolt. Hive Bolt gets a transaction and maintains it open with a
> > heartbeat until there's data to commit.
> >
> > So if I get a transaction and maintain it open, all compactions will stop. Is
> > this incorrect Hive behavior, or should Storm close the transaction?
> >
> >
> >
> >
> > On Wed, Jul 27, 2016 at 8:46 PM, Igor Kuzmenko  wrote:
> > Thanks for reply, Alan. My guess with Storm was wrong. Today I get same 
> > behavior with running Storm topology.
> > Anyway, I'd like to know, how can I check that transaction batch was closed 
> > correctly?
> >
> > On Wed, Jul 27, 2016 at 8:09 PM, Alan Gates  wrote:
> > I don’t know the details of how the storm application that streams into 
> > Hive works, but this sounds like the transaction batches weren’t getting 
> > closed.  Compaction can’t happen until those batches are closed.  Do you 
> > know how you had storm configured?  Also, you might ask separately on the 
> > storm list to see if people have seen this issue before.
> >
> > Alan.
> >
> > > On Jul 27, 2016, at 03:31, Igor Kuzmenko  wrote:
> > >
> > > One more thing. I'm using Apache Storm to stream data in Hive. And when I 
> > > turned off Storm topology compactions started to work properly.
> > >
> > > On Tue, Jul 26, 2016 at 6:28 PM, Igor Kuzmenko  wrote:
> > > I'm using Hive 1.2.1 transactional table. Inserting data in it via Hive 
> > > Streaming API. After some time i expect compaction to start but it didn't 
> > > happen:
> > >
> > > Here's part of log, which shows that compactor initiator thread doesn't 
> > > see any delta files:
> > > 2016-07-26 18:06:52,459 INFO  [Thread-8]: compactor.Initiator 
> > > (Initiator.java:run(89)) - Checking to see if we should compact 
> > > default.data_aaa.dt=20160726
> > > 2016-07-26 18:06:52,496 DEBUG [Thread-8]: io.AcidUtils 
> > > (AcidUtils.java:getAcidState(432)) - in directory 
> > > hdfs://sorm-master01.msk.mts.ru:8020/apps/hive/warehouse/data_aaa/dt=20160726
> > >  base = null deltas = 0
> > > 2016-07-26 18:06:52,496 DEBUG [Thread-8]: compactor.Initiator 
> > > (Initiator.java:determineCompactionType(271)) - delta size: 0 base size: 
> > > 0 threshold: 0.1 will major compact: false
> > >
> > > But in that directory there's actually 23 files:
> > >
> > > hadoop fs -ls /apps/hive/warehouse/data_aaa/dt=20160726
> > > Found 23 items
> > > -rw-r--r--   3 storm hdfs  4 2016-07-26 17:20 
> > > /apps/hive/warehouse

Re: Hive compaction didn't launch

2016-07-28 Thread Alan Gates
Hive is doing the right thing there, as it cannot compact the deltas into a 
base file while there are still open transactions in the delta.  Storm should 
be committing on some frequency even if it doesn’t have enough data to commit.

Alan.

> On Jul 28, 2016, at 05:36, Igor Kuzmenko  wrote:
> 
> I made some research on that issue.
> The problem is in ValidCompactorTxnList::isTxnRangeValid method.
> 
> Here's code:
> @Override
> public RangeResponse isTxnRangeValid(long minTxnId, long maxTxnId) {
>   if (highWatermark < minTxnId) {
> return RangeResponse.NONE;
>   } else if (minOpenTxn < 0) {
> return highWatermark >= maxTxnId ? RangeResponse.ALL : RangeResponse.NONE;
>   } else {
> return minOpenTxn > maxTxnId ? RangeResponse.ALL : RangeResponse.NONE;
>   }
> }
> 
> In my case this method returned RangeResponse.NONE for most of the delta files.
> With this value a delta file is not included in compaction.
> 
> The last 'else' block compares minOpenTxn to maxTxnId and, if maxTxnId is bigger,
> returns RangeResponse.NONE. That's a problem for me because I am using the Storm
> Hive Bolt. Hive Bolt gets a transaction and maintains it open with a heartbeat
> until there's data to commit.
> 
> So if I get a transaction and maintain it open, all compactions will stop. Is
> this incorrect Hive behavior, or should Storm close the transaction?
> 
> 
> 
> 
> On Wed, Jul 27, 2016 at 8:46 PM, Igor Kuzmenko  wrote:
> Thanks for reply, Alan. My guess with Storm was wrong. Today I get same 
> behavior with running Storm topology. 
> Anyway, I'd like to know, how can I check that transaction batch was closed 
> correctly?
> 
> On Wed, Jul 27, 2016 at 8:09 PM, Alan Gates  wrote:
> I don’t know the details of how the storm application that streams into Hive 
> works, but this sounds like the transaction batches weren’t getting closed.  
> Compaction can’t happen until those batches are closed.  Do you know how you 
> had storm configured?  Also, you might ask separately on the storm list to 
> see if people have seen this issue before.
> 
> Alan.
> 
> > On Jul 27, 2016, at 03:31, Igor Kuzmenko  wrote:
> >
> > One more thing. I'm using Apache Storm to stream data into Hive. And when I 
> > turned off the Storm topology, compactions started to work properly.
> >
> > On Tue, Jul 26, 2016 at 6:28 PM, Igor Kuzmenko  wrote:
> > I'm using a Hive 1.2.1 transactional table and inserting data into it via the 
> > Hive Streaming API. After some time I expected compaction to start, but it 
> > didn't happen:
> >
> > Here's part of the log, which shows that the compactor initiator thread doesn't 
> > see any delta files:
> > 2016-07-26 18:06:52,459 INFO  [Thread-8]: compactor.Initiator 
> > (Initiator.java:run(89)) - Checking to see if we should compact 
> > default.data_aaa.dt=20160726
> > 2016-07-26 18:06:52,496 DEBUG [Thread-8]: io.AcidUtils 
> > (AcidUtils.java:getAcidState(432)) - in directory 
> > hdfs://sorm-master01.msk.mts.ru:8020/apps/hive/warehouse/data_aaa/dt=20160726
> >  base = null deltas = 0
> > 2016-07-26 18:06:52,496 DEBUG [Thread-8]: compactor.Initiator 
> > (Initiator.java:determineCompactionType(271)) - delta size: 0 base size: 0 
> > threshold: 0.1 will major compact: false
> >
> > But in that directory there's actually 23 files:
> >
> > hadoop fs -ls /apps/hive/warehouse/data_aaa/dt=20160726
> > Found 23 items
> > -rw-r--r--   3 storm hdfs  4 2016-07-26 17:20 
> > /apps/hive/warehouse/data_aaa/dt=20160726/_orc_acid_version
> > drwxrwxrwx   - storm hdfs  0 2016-07-26 17:22 
> > /apps/hive/warehouse/data_aaa/dt=20160726/delta_71741256_71741355
> > drwxrwxrwx   - storm hdfs  0 2016-07-26 17:23 
> > /apps/hive/warehouse/data_aaa/dt=20160726/delta_71762456_71762555
> > drwxrwxrwx   - storm hdfs  0 2016-07-26 17:25 
> > /apps/hive/warehouse/data_aaa/dt=20160726/delta_71787756_71787855
> > drwxrwxrwx   - storm hdfs  0 2016-07-26 17:26 
> > /apps/hive/warehouse/data_aaa/dt=20160726/delta_71795756_71795855
> > drwxrwxrwx   - storm hdfs  0 2016-07-26 17:27 
> > /apps/hive/warehouse/data_aaa/dt=20160726/delta_71804656_71804755
> > drwxrwxrwx   - storm hdfs  0 2016-07-26 17:29 
> > /apps/hive/warehouse/data_aaa/dt=20160726/delta_71828856_71828955
> > drwxrwxrwx   - storm hdfs  0 2016-07-26 17:30 
> > /apps/hive/warehouse/data_aaa/dt=20160726/delta_71846656_71846755
> > drwxrwxrwx   - storm hdfs  0 2016-07-26 17:32 
> > /apps/hive/warehouse/data_aaa/dt=20160726/delta_71850756_71850855
> > drwxrwxrwx   - storm hdfs

Re: Hive compaction didn't launch

2016-07-27 Thread Alan Gates
I don’t know the details of how the storm application that streams into Hive 
works, but this sounds like the transaction batches weren’t getting closed.  
Compaction can’t happen until those batches are closed.  Do you know how you 
had storm configured?  Also, you might ask separately on the storm list to see 
if people have seen this issue before.

Alan.

> On Jul 27, 2016, at 03:31, Igor Kuzmenko  wrote:
> 
> One more thing. I'm using Apache Storm to stream data into Hive. And when I 
> turned off the Storm topology, compactions started to work properly.
> 
> On Tue, Jul 26, 2016 at 6:28 PM, Igor Kuzmenko  wrote:
> I'm using a Hive 1.2.1 transactional table and inserting data into it via the 
> Hive Streaming API. After some time I expected compaction to start, but it 
> didn't happen:
> 
> Here's part of the log, which shows that the compactor initiator thread doesn't 
> see any delta files:
> 2016-07-26 18:06:52,459 INFO  [Thread-8]: compactor.Initiator 
> (Initiator.java:run(89)) - Checking to see if we should compact 
> default.data_aaa.dt=20160726
> 2016-07-26 18:06:52,496 DEBUG [Thread-8]: io.AcidUtils 
> (AcidUtils.java:getAcidState(432)) - in directory 
> hdfs://sorm-master01.msk.mts.ru:8020/apps/hive/warehouse/data_aaa/dt=20160726 
> base = null deltas = 0
> 2016-07-26 18:06:52,496 DEBUG [Thread-8]: compactor.Initiator 
> (Initiator.java:determineCompactionType(271)) - delta size: 0 base size: 0 
> threshold: 0.1 will major compact: false
> 
> But in that directory there's actually 23 files:
> 
> hadoop fs -ls /apps/hive/warehouse/data_aaa/dt=20160726
> Found 23 items
> -rw-r--r--   3 storm hdfs  4 2016-07-26 17:20 
> /apps/hive/warehouse/data_aaa/dt=20160726/_orc_acid_version
> drwxrwxrwx   - storm hdfs  0 2016-07-26 17:22 
> /apps/hive/warehouse/data_aaa/dt=20160726/delta_71741256_71741355
> drwxrwxrwx   - storm hdfs  0 2016-07-26 17:23 
> /apps/hive/warehouse/data_aaa/dt=20160726/delta_71762456_71762555
> drwxrwxrwx   - storm hdfs  0 2016-07-26 17:25 
> /apps/hive/warehouse/data_aaa/dt=20160726/delta_71787756_71787855
> drwxrwxrwx   - storm hdfs  0 2016-07-26 17:26 
> /apps/hive/warehouse/data_aaa/dt=20160726/delta_71795756_71795855
> drwxrwxrwx   - storm hdfs  0 2016-07-26 17:27 
> /apps/hive/warehouse/data_aaa/dt=20160726/delta_71804656_71804755
> drwxrwxrwx   - storm hdfs  0 2016-07-26 17:29 
> /apps/hive/warehouse/data_aaa/dt=20160726/delta_71828856_71828955
> drwxrwxrwx   - storm hdfs  0 2016-07-26 17:30 
> /apps/hive/warehouse/data_aaa/dt=20160726/delta_71846656_71846755
> drwxrwxrwx   - storm hdfs  0 2016-07-26 17:32 
> /apps/hive/warehouse/data_aaa/dt=20160726/delta_71850756_71850855
> drwxrwxrwx   - storm hdfs  0 2016-07-26 17:33 
> /apps/hive/warehouse/data_aaa/dt=20160726/delta_71867356_71867455
> drwxrwxrwx   - storm hdfs  0 2016-07-26 17:34 
> /apps/hive/warehouse/data_aaa/dt=20160726/delta_71891556_71891655
> drwxrwxrwx   - storm hdfs  0 2016-07-26 17:36 
> /apps/hive/warehouse/data_aaa/dt=20160726/delta_71904856_71904955
> drwxrwxrwx   - storm hdfs  0 2016-07-26 17:37 
> /apps/hive/warehouse/data_aaa/dt=20160726/delta_71907256_71907355
> drwxrwxrwx   - storm hdfs  0 2016-07-26 17:39 
> /apps/hive/warehouse/data_aaa/dt=20160726/delta_71918756_71918855
> drwxrwxrwx   - storm hdfs  0 2016-07-26 17:40 
> /apps/hive/warehouse/data_aaa/dt=20160726/delta_71947556_71947655
> drwxrwxrwx   - storm hdfs  0 2016-07-26 17:41 
> /apps/hive/warehouse/data_aaa/dt=20160726/delta_71960656_71960755
> drwxrwxrwx   - storm hdfs  0 2016-07-26 17:43 
> /apps/hive/warehouse/data_aaa/dt=20160726/delta_71963156_71963255
> drwxrwxrwx   - storm hdfs  0 2016-07-26 17:44 
> /apps/hive/warehouse/data_aaa/dt=20160726/delta_71964556_71964655
> drwxrwxrwx   - storm hdfs  0 2016-07-26 17:46 
> /apps/hive/warehouse/data_aaa/dt=20160726/delta_71987156_71987255
> drwxrwxrwx   - storm hdfs  0 2016-07-26 17:47 
> /apps/hive/warehouse/data_aaa/dt=20160726/delta_72015756_72015855
> drwxrwxrwx   - storm hdfs  0 2016-07-26 17:48 
> /apps/hive/warehouse/data_aaa/dt=20160726/delta_72021356_72021455
> drwxrwxrwx   - storm hdfs  0 2016-07-26 17:50 
> /apps/hive/warehouse/data_aaa/dt=20160726/delta_72048756_72048855
> drwxrwxrwx   - storm hdfs  0 2016-07-26 17:50 
> /apps/hive/warehouse/data_aaa/dt=20160726/delta_72070856_72070955
> 
> Full log here.
> 
> What could go wrong?
> 



Re: Want to be one contributor

2016-07-18 Thread Alan Gates
I believe the answer is yes, you need Cygwin to develop Hive on Windows.  Many 
of the Hadoop family of projects run on Windows natively, but require Cygwin 
for development.

Alan.

> On Jul 16, 2016, at 18:15, Alpesh Patel  wrote:
> 
> I am facing the below-mentioned issue while running the build. Do we really 
> need Cygwin on a Windows development machine? 
> 
> Kindly advise on this? 
> 
> 
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.7:run 
> (generate-version-annotation) on project hive-common: An Ant BuildException 
> has occured: Execute failed: java.io.IOException: Cannot run program "bash" 
> (in directory "F:\workspace\hive\common"): CreateProcess error=2, The system 
> cannot find the file specified
> [ERROR] around Ant part .. @ 
> 4:46 in F:\workspace\hive\common\target\antrun\build-main.xml
> [ERROR] -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the 
> command
> [ERROR]   mvn  -rf :hive-common​
> 
> 
> Rgds,
> Alpesh
> 
> On Thu, Jul 14, 2016 at 5:11 PM, Alan Gates  wrote:
> https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors
>  is a good place to start.
> 
> Welcome to Hive.
> 
> Alan.
> 
> > On Jul 14, 2016, at 16:01, Alpesh Patel  wrote:
> >
> > Hi Guys,
> >
> > I am part of this group since 1 year. Just an audience and now want to be 
> > contributor in Hive code base.
> >
> > Can you please guide me like how can i be contributor ? Is there any wiki 
> > which i can read for this ?
> >
> > Rgds,
> > Alpesh
> >
> 
> 



Re: Want to be one contributor

2016-07-14 Thread Alan Gates
https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors
 is a good place to start.

Welcome to Hive.

Alan.

> On Jul 14, 2016, at 16:01, Alpesh Patel  wrote:
> 
> Hi Guys, 
> 
> I am part of this group since 1 year. Just an audience and now want to be 
> contributor in Hive code base. 
> 
> Can you please guide me like how can i be contributor ? Is there any wiki 
> which i can read for this ?
> 
> Rgds,
> Alpesh
> 



Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Alan Gates

> On Jul 11, 2016, at 16:22, Mich Talebzadeh  wrote:
> 
> 
>   • If I add LLAP, will that be more efficient in terms of memory usage 
> compared to Hive or not? Will it keep the data in memory for reuse or not.
>   
Yes, this is exactly what LLAP does.  It keeps a cache of hot data (hot columns 
of hot partitions) and shares that across queries.  Unlike many MPP caches it 
will cache the same data on multiple nodes if it has more workers that want to 
access the data than can be run on a single node.

As a side note, it is considered bad form in Apache to send a message to two 
lists.  It causes a lot of background noise for people on the Spark list who 
probably aren’t interested in Hive performance.

Alan.




Re: Delete hive partition while executing query.

2016-06-07 Thread Alan Gates
utFormat.getSplits(HiveInputFormat.java:407)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:155)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:255)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:248)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:248)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:235)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.concurrent.ExecutionException: 
> java.io.FileNotFoundException: File 
> hdfs://jupiter.bss:8020/apps/hive/warehouse/mobile_connections/dt=20151124/msisdn_last_digit=2
>  does not exist.
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1036)
>   ... 15 more
> 
> So the second thread definitely waits until the first thread completes and then 
> drops the partition. Then, somehow, after the partition was dropped, the third 
> query completes and shows a result. The fourth query doesn't complete at all, 
> throwing an exception. 
> 
> 
> 
> 
> 
> 
> On Mon, Jun 6, 2016 at 8:30 PM, Alan Gates  wrote:
> Do you have the system configured to use the DbTxnManager?  See 
> https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-Configuration
>  for details on how to set this up.  The transaction manager is what manages 
> locking and makes sure that your queries don’t stomp each other.
> 
> Alan.
> 
> > On Jun 6, 2016, at 06:01, Igor Kuzmenko  wrote:
> >
> > Hello, I'm trying to find a safe way to delete partition with all data it 
> > includes.
> >
> > I'm using Hive 1.2.1, Hive JDBC driver 1.2.1 and perform simple test on 
> > transactional table:
> >
> > asyncExecute("Select count(distinct in_info_msisdn) from mobile_connections 
> > where dt=20151124 and msisdn_last_digit=2", 1);
> > Thread.sleep(3000);
> > asyncExecute("alter table mobile_connections drop if exists partition 
> > (dt=20151124, msisdn_last_digit=2) purge", 2);
> > Thread.sleep(3000);
> > asyncExecute("Select count(distinct in_info_msisdn) from mobile_connections 
> > where dt=20151124 and msisdn_last_digit=2", 3);
> > Thread.sleep(3000);
> > asyncExecute("Select count(distinct in_info_msisdn) from mobile_connections 
> > where dt=20151124 and msisdn_last_digit=2", 4);
> > (full code here)
> >
> > I create several threads, each executing a query asynchronously. The first 
> > queries the partition. The second drops the partition. The others are the same 
> > as the first. The first query takes about 10-15 seconds to complete, so the 
> > "alter table" query starts before the first query completes.
> > As a result i get:
> >   • First query - successfully completes
> >   • Second query - successfully completes
> >   • Third query - successfully completes
> >   • Fourth query - throw exception:
> > java.sql.SQLException: Error while processing statement: FAILED: Execution 
> > Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. 
> > Vertex failed, vertexName=Map 1, vertexId=vertex_1461923723503_0189_1_00, 
> > diagnostics=[Vertex vertex_1461923723503_0189_1_00 [Map 1] killed/failed 
> > due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: mobile_connections 
> > initializer failed, vertex=vertex_1461923723503_0189_1_00 [Map 1], 
> > java.lang.RuntimeException: serious problem
> >   at 
> > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1059)
> >   at 
> > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1086)
> >   at 
> > org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInpu

Re: Spark support for update/delete operations on Hive ORC transactional tables

2016-06-06 Thread Alan Gates
This JIRA https://issues.apache.org/jira/browse/HIVE-12366 moved the heartbeat 
logic from the engine to the client.  AFAIK this was the only issue preventing 
working with Spark as an engine.  That JIRA was released in 2.0.

I want to stress that to my knowledge no one has tested this combination of 
features, so there may be other problems.  But at least this issue has been 
resolved.

Alan.

> On Jun 2, 2016, at 01:54, Mich Talebzadeh  wrote:
> 
> 
> Hi,
> 
> Spark does not support transactions because as I understand there is a piece 
> in the execution side that needs to send heartbeats to Hive metastore saying 
> a transaction is still alive". That has not been implemented in Spark yet to 
> my knowledge."
> 
> Any idea on the timelines when we are going to have support for transactions 
> in Spark for Hive ORC tables. This will really be useful.
> 
> 
> Thanks,
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  



Re: Delete hive partition while executing query.

2016-06-06 Thread Alan Gates
Do you have the system configured to use the DbTxnManager?  See 
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-Configuration
 for details on how to set this up.  The transaction manager is what manages 
locking and makes sure that your queries don’t stomp each other.
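
As a quick illustration (this is just a sketch of the key settings; the wiki 
page above has the full list), the transactional setup needs at least:

hive.support.concurrency=true
hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.compactor.initiator.on=true (on the metastore)
hive.compactor.worker.threads=1 (or more, on the metastore)

Without this, no transactional lock manager is in place and concurrent readers 
and writers are not protected from each other.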

Alan.

> On Jun 6, 2016, at 06:01, Igor Kuzmenko  wrote:
> 
> Hello, I'm trying to find a safe way to delete partition with all data it 
> includes.
> 
> I'm using Hive 1.2.1, Hive JDBC driver 1.2.1 and perform simple test on 
> transactional table:
> 
> asyncExecute("Select count(distinct in_info_msisdn) from mobile_connections 
> where dt=20151124 and msisdn_last_digit=2", 1);
> Thread.sleep(3000);
> asyncExecute("alter table mobile_connections drop if exists partition 
> (dt=20151124, msisdn_last_digit=2) purge", 2);
> Thread.sleep(3000);
> asyncExecute("Select count(distinct in_info_msisdn) from mobile_connections 
> where dt=20151124 and msisdn_last_digit=2", 3);
> Thread.sleep(3000);
> asyncExecute("Select count(distinct in_info_msisdn) from mobile_connections 
> where dt=20151124 and msisdn_last_digit=2", 4);
> (full code here)
> 
> I create several threads, each executing a query asynchronously. The first 
> queries the partition. The second drops the partition. The others are the same 
> as the first. The first query takes about 10-15 seconds to complete, so the 
> "alter table" query starts before the first query completes.
> As a result i get:
>   • First query - successfully completes 
>   • Second query - successfully completes
>   • Third query - successfully completes
>   • Fourth query - throw exception:
> java.sql.SQLException: Error while processing statement: FAILED: Execution 
> Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex 
> failed, vertexName=Map 1, vertexId=vertex_1461923723503_0189_1_00, 
> diagnostics=[Vertex vertex_1461923723503_0189_1_00 [Map 1] killed/failed due 
> to:ROOT_INPUT_INIT_FAILURE, Vertex Input: mobile_connections initializer 
> failed, vertex=vertex_1461923723503_0189_1_00 [Map 1], 
> java.lang.RuntimeException: serious problem
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1059)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1086)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:305)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:407)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:155)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:255)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:248)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:248)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:235)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.concurrent.ExecutionException: 
> java.io.FileNotFoundException: File 
> hdfs://jupiter.bss:8020/apps/hive/warehouse/mobile_connections/dt=20151124/msisdn_last_digit=2
>  does not exist.
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1036)
>   ... 15 more
> Caused by: java.io.FileNotFoundException: File 
> hdfs://jupiter.bss:8020/apps/hive/warehouse/mobile_connections/dt=20151124/msisdn_last_digit=2
>  does not exist.
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:958)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:937)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:882)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:878)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:878)
>   at 
> org.apa

Re: Hive and using Pooled Connections

2016-05-25 Thread Alan Gates
It depends on how it’s configured.  In $HIVE_HOME/conf/hive-site.xml you can 
set the datanucleus.connectionPoolingType variable to BONECP or DBCP.  By 
default it should be using BONECP I believe.  (I think NONE is also a valid 
value, but that doesn’t yet work with ACID turned on.)
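
For example, a minimal sketch in hive-site.xml on the metastore side:

datanucleus.connectionPoolingType=BONECP

With pooling on, the metastore reuses a fixed pool of JDBC connections to the 
backing database rather than opening a new one per request.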

Alan.

> On May 25, 2016, at 13:03, Mich Talebzadeh  wrote:
> 
> 
> Hi,
> 
> 
> I am sure someone knows the answer to this question.
> 
> Does Hive 2.0 use connection pool to connect to its metastore? I see a lot of 
> open and closed connections to the metastore that may not be necessary.
> 
> A connection pool is a cache of database connection objects. Connection pools 
> promote the reuse of connection objects and reduce the number of times that 
> connection objects are created. Connection pools significantly improve 
> performance for database-intensive applications because creating connection 
> objects is costly both in terms of time and resources.
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  



Re: Hello, I have an issue about hcatalog

2016-05-18 Thread Alan Gates
This looks to me like a Hadoop issue rather than a Hive one.  It appears that you 
cannot connect to HDFS.  Have you tried connecting to HDFS outside of 
Hive/HCatalog?  

Alan.

> On May 18, 2016, at 04:24, Mark Memory  wrote:
> 
> hello guys, sorry to bother you.
> 
> I'm using hcatalog to write hive tables, but I don't know how to do with 
> namenode HA
> my code was copied from 
> https://github.com/apache/hive/blob/master/hcatalog/core/src/test/java/org/apache/hive/hcatalog/data/TestReaderWriter.java
> 
> below is my config:
> hiveConf.setVar(HiveConf.ConfVars.HADOOPBIN, "/opt/modules/hadoop/bin");
> 
> hiveConf.setVar(HiveConf.ConfVars.LOCALSCRATCHDIR, 
> "/opt/modules/hive/temp");
> 
> hiveConf.setVar(HiveConf.ConfVars.DOWNLOADED_RESOURCES_DIR, 
> "/opt/modules/hive/temp");
> 
> hiveConf.setBoolVar(HiveConf.ConfVars.HIVE_SUPPORT_CONCURRENCY, false);
> 
> hiveConf.setVar(HiveConf.ConfVars.METASTOREWAREHOUSE, "/warehouse");
> 
> hiveConf.setVar(HiveConf.ConfVars.METASTOREURIS, 
> "thrift://127.0.0.1:9083");
> 
> hiveConf.setVar(HiveConf.ConfVars.METASTORE_CONNECTION_DRIVER, 
> "com.mysql.jdbc.Driver");
> 
> hiveConf.setVar(HiveConf.ConfVars.METASTORECONNECTURLKEY, 
> "jdbc:mysql://192.168.5.29:3306/hive?createDatabaseIfNotExist=true");
> 
> hiveConf.setVar(HiveConf.ConfVars.METASTORE_CONNECTION_USER_NAME, "hive");
> 
> hiveConf.setVar(HiveConf.ConfVars.METASTOREPWD, "123456");
> 
> hiveConf.setVar(HiveConf.ConfVars.HIVEHISTORYFILELOC, 
> "/opt/modules/hive/temp");
> 
> and the error is:
> 
> Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: 
> cluster
> 
> at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
> 
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312)
> 
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
> 
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:665)
> 
> Is anyone can help me? thank you 
> 



Re: Hive SQL based authorization don't have support for group

2016-05-12 Thread Alan Gates
By group here I assume you mean a POSIX-style file group (from HDFS)?  No, there 
isn’t any connection right now.  We’d like to be able to pick up groups from 
HDFS and define those as roles in Hive, but we haven’t added that feature.

You’ll need to define a role that includes the members of that group.
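
For example (role and user names made up), something along these lines gives a 
set of users the same effect as a group grant:

create role testing_team;
grant role testing_team to user user1;
grant role testing_team to user user2;
grant select on sample to role testing_team;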

Alan.

> On May 12, 2016, at 02:01, kumar r  wrote:
> 
> Hi,
> 
> I have configured hive-1.2.1 with Hadoop-2.7.2. I have enabled SQL standard 
> based authorization.
> 
> I can give permission for user or roles. i cannot find any option for groups.
> 
> Is there any feature available for group permission on Hive SQL standard 
> based authorization?
> 
> When i am trying to execute below command getting exception,
> 
> grant select on sample to group Testing;
> 
> Exception
> 
> Error: Error while processing statement: FAILED: Execution Error, return 
> code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Invalid principal 
> type in principal Principal [name=Testing, type=GROUP] (state=08S01,code=1)
> 
> Stack overflow question: 
> http://stackoverflow.com/questions/37070419/hive-authorization-dont-have-support-for-group
> 
> Thanks,
> Kumar
> 



Re: Standard Deviation in Hive 2 is still incorrect

2016-04-19 Thread Alan Gates
Have you filed a JIRA ticket for this?  If not, please do, so we can track it 
and fix it.  Patches are welcome as well. :)

Alan.

> On Apr 4, 2016, at 15:27, Mich Talebzadeh  wrote:
> 
> 
> Hi,
> 
> I reported back in April 2015 that what Hive calls Standard Deviation 
> Function  STDDEV is a pointer to STDDEV_POP. This is incorrect and has not 
> been rectified in Hive 2
> 
> Both Oracle and Sybase point STDDEV to STDDEV_SAMP not STDDEV_POP. Also I did 
> tests with Spark 1.6 as well, and Spark correctly points STDDEV to STDDEV_SAMP.
> 
> The following query was used
> 
> SELECT
> 
> SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
>  AS MYSTDDEV,
> STDDEV(amount_sold) AS STDDEV,
> STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
> STDDEV_POP(amount_sold) AS STDDEV_POP
> fromsales;
> 
> The following is from running the above query on Hive where STDDEV -->  
> STDDEV_POP which is incorrect
> 
> 
> +--------------------+---------------------+--------------------+---------------------+
> |      mystddev      |       stddev        |    stddev_samp     |     stddev_pop      |
> +--------------------+---------------------+--------------------+---------------------+
> | 260.7270919450411  | 260.72704617040444  | 260.7270722861465  | 260.72704617040444  |
> +--------------------+---------------------+--------------------+---------------------+
> 
> The following is from Spark-sql where STDDEV -->  STDDEV_SAMP which is correct
> 
> +--------------------+---------------------+--------------------+---------------------+
> |      mystddev      |       stddev        |    stddev_samp     |     stddev_pop      |
> +--------------------+---------------------+--------------------+---------------------+
> | 260.7270919450411  | 260.7270722861637   | 260.7270722861637  | 260.72704617042166  |
> +--------------------+---------------------+--------------------+---------------------+
> 
> Hopefully The Hive one will be corrected.
> 
> Thanks
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  



Re: Hive footprint

2016-04-18 Thread Alan Gates

> On Apr 18, 2016, at 15:34, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> 
> If Hive had the ability (organic) to have local variable and stored procedure 
> support then it would be top notch Data Warehouse. Given its metastore, I 
> don't see any technical reason why it cannot support these constructs.
> 
Are you aware of the HPL/SQL module added in Hive 2.0?  If not see 
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=59690156  It’s 
not deeply integrated yet but it’s a step in the procedural direction. 

Alan.

Re: [VOTE] Bylaws change to allow some commits without review

2016-04-18 Thread Alan Gates
+1.

Alan.

> On Apr 18, 2016, at 06:53, Lars Francke  wrote:
> 
> Thanks for the votes so far. Can I get some more people interested please?
> 
> On Fri, Apr 15, 2016 at 7:35 PM, Jason Dere  wrote:
> ​+1
> From: Lefty Leverenz 
> Sent: Friday, April 15, 2016 1:36 AM
> To: user@hive.apache.org
> Subject: Re: [VOTE] Bylaws change to allow some commits without review
>  
> +1
> 
> Navis, you've just reactivated your PMC membership.  ;-)
> 
> A PMC member is considered emeritus by their own declaration or by not 
> contributing in any form to the project for over six months.
> 
> Actually your old patch for HIVE-9499 was committed in March and you added a 
> comment to HIVE-11752 in February, so you have been active recently.  And now 
> you can let it slide until October 
> 
> -- Lefty
> 
> On Thu, Apr 14, 2016 at 5:57 PM, Sushanth Sowmyan  wrote:
> +1
> 
> On Apr 13, 2016 17:20, "Prasanth Jayachandran" 
>  wrote:
> +1
> 
> Thanks
> Prasanth
> 
> 
> 
> 
> On Wed, Apr 13, 2016 at 5:14 PM -0700, "Navis Ryu"  wrote:
> 
> not sure I'm active PMC member but +1, anyway. 
> 
> On Thursday, April 14, 2016, Lars Francke wrote:
> Hi everyone,
> 
> we had a discussion on the dev@ list about allowing some forms of 
> contributions to be committed without a review.
> 
> The exact sentence I propose to add is: "Minor issues (e.g. typos, code style 
> issues, JavaDoc changes. At committer's discretion) can be committed after 
> soliciting feedback/review on the mailing list and not receiving feedback 
> within 2 days."
> 
> The proposed bylaws can also be seen here 
> 
> 
> This vote requires a 2/3 majority of all Active PMC members so I'd love to 
> get as many votes as possible. The vote will run for at least six days.
> 
> Thanks,
> Lars
> 
> 



Re: Automatic Update statistics on ORC tables in Hive

2016-03-28 Thread Alan Gates
I resolved that as Won’t Fix.  See the last comment on the JIRA for my 
rationale.

Alan.

> On Mar 28, 2016, at 03:53, Mich Talebzadeh  wrote:
> 
> Thanks. This does not seem to be implemented although the Jira says resolved. 
> It also mentions the timestamp of the last update stats. I do not see it yet.
> 
> Regards,
> 
> Mich
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
> On 28 March 2016 at 06:19, Gopal Vijayaraghavan  wrote:
> 
> > This might be a bit far fetched but is there any plan for background
> >ANALYZE STATISTICS to be performed  on ORC tables
> 
> 
> https://issues.apache.org/jira/browse/HIVE-12669
> 
> Cheers,
> Gopal
> 
> 
> 



Re: Spark SQL is not returning records for HIVE transactional tables on HDP

2016-03-14 Thread Alan Gates

> On Mar 14, 2016, at 10:31, Mich Talebzadeh  wrote:
> 
> That is an interesting point Alan.
> 
> Does this imply that Hive on Spark (Hive 2 encourages Spark or TEZ)  is going 
> to have an issue with transactional tables?
I was partially wrong.  In HIVE-12366 Wei refactored this so that the 
heartbeats are sent from the client rather than from the execution engine.  
This JIRA has been committed to master and branch-1, but after 1.2.  So Hive on 
Spark should work fine with transactional tables in Hive 2.x.  In 1.2 and 
earlier it will not.

Alan.




Re: Spark SQL is not returning records for HIVE transactional tables on HDP

2016-03-14 Thread Alan Gates
I don’t know why you’re seeing Hive on Spark sometimes work with transactional 
tables and sometimes not.  But note that in general it doesn’t work.  The Spark 
runtime in Hive does not send heartbeats to the transaction/lock manager, so it 
will time out any job that takes longer than the heartbeat interval (5 min by 
default).  

Alan.

> On Mar 12, 2016, at 00:24, @Sanjiv Singh  wrote:
> 
> Hi All,
> 
> I am facing this issue on an HDP setup, on which compaction is required (only 
> once) before Spark SQL can fetch records from transactional tables.
> On the other hand, the Apache setup doesn't require compaction even once.
> 
> Maybe something got triggered in the metastore after compaction, and Spark SQL 
> started recognizing the delta files.
>   
> Let me know if you need other details to get to the root cause.
> 
> Try this,
> 
> See complete scenario :
> 
> hive> create table default.foo(id int) clustered by (id) into 2 buckets 
> STORED AS ORC TBLPROPERTIES ('transactional'='true');
> hive> insert into default.foo values(10);
> 
> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong 
> because data is still in delta files
> 
> Now run major compaction:
> 
> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
> 
> scala> sqlContext.table("default.foo").count // Gives 1
> 
> hive> insert into foo values(20);
> 
> scala> sqlContext.table("default.foo").count // Gives 2 , no compaction 
> required.
> 
> 
> 
> 
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339



Re: Hive StreamingAPI leaves table in not consistent state

2016-03-11 Thread Alan Gates
I believe this is an issue in the Storm Hive bolt.  I don’t have an Apache JIRA 
on it, but if you ask on the Hortonworks lists we can connect you with the fix 
for the storm bolt.

Alan.

> On Mar 10, 2016, at 04:02, Igor Kuzmenko  wrote:
> 
> Hello, I'm using Hortonworks Data Platform 2.3.4 which includes Apache Hive 
> 1.2.1 and Apache Storm 0.10.
> I've built a Storm topology using the Hive Bolt, which ultimately uses the Hive 
> Streaming API to stream data into a Hive table.
> In Hive I've created transactional table:
> 
> CREATE EXTERNAL TABLE cdr1 (
>   
> )
> PARTITIONED BY (dt INT)
> CLUSTERED BY (telcoId) INTO 5 buckets
> STORED AS ORC
> LOCATION '/data/sorm3/cdr/cdr1'
> TBLPROPERTIES ("transactional"="true")
> 
> Hive settings:
> 
>   • hive.support.concurrency=true
>   • hive.enforce.bucketing=true
>   • hive.exec.dynamic.partition.mode=nonstrict
>   • hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
>   • hive.compactor.initiator.on=true
>   • hive.compactor.worker.threads=1
> 
> When I run my Storm topology it fails with an OutOfMemoryException. The Storm 
> exception doesn't bother me; it was just a test. But after the topology fails, 
> my Hive table is not consistent.
> A simple select from the table leads to an exception:
> 
> SELECT COUNT(*) FROM cdr1
> ERROR : Status: Failed
> ERROR : Vertex failed, vertexName=Map 1, 
> vertexId=vertex_1453891518300_0098_1_00, diagnostics=[Task failed, 
> taskId=task_1453891518300_0098_1_00_00, diagnostics=[TaskAttempt 0 
> failed, info=[Error: Failure while running task:java.lang.RuntimeException: 
> java.lang.RuntimeException: java.io.IOException: java.io.EOFException
> 
> Caused by: java.io.IOException: java.io.EOFException
>   at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
>   at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:251)
>   at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:193)
>   ... 19 more
> Caused by: java.io.EOFException
>   at java.io.DataInputStream.readFully(DataInputStream.java:197)
>   at 
> org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractMetaInfoFromFooter(ReaderImpl.java:370)
>   at 
> org.apache.hadoop.hive.ql.io.orc.ReaderImpl.(ReaderImpl.java:317)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:238)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.(OrcRawRecordMerger.java:460)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1269)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1151)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:249)
>   ... 20 more
> ]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 
> killedTasks:0, Vertex vertex_1453891518300_0098_1_00 [Map 1] killed/failed 
> due to:OWN_TASK_FAILURE]
> ERROR : Vertex killed, vertexName=Reducer 2, 
> vertexId=vertex_1453891518300_0098_1_01, diagnostics=[Vertex received Kill 
> while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, 
> failedTasks:0 killedTasks:1, Vertex vertex_1453891518300_0098_1_01 [Reducer 
> 2] killed/failed due to:OTHER_VERTEX_FAILURE]
> ERROR : DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 
> killedVertices:1
> 
> 
> Compaction fails with same exception:
> 
> 2016-03-10 13:20:54,550 WARN [main] org.apache.hadoop.mapred.YarnChild: 
> Exception running child : java.io.EOFException: Cannot seek after EOF
>   at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1488)
>   at 
> org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
>   at 
> org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractMetaInfoFromFooter(ReaderImpl.java:368)
>   at 
> org.apache.hadoop.hive.ql.io.orc.ReaderImpl.(ReaderImpl.java:317)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:238)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.(OrcRawRecordMerger.java:460)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRawReader(OrcInputFormat.java:1362)
>   at 
> org.apache.hadoop.hive.ql.txn.compactor.CompactorMR$CompactorMap.map(CompactorMR.java:565)
>   at 
> org.apache.hadoop.hive.ql.txn.compactor.CompactorMR$CompactorMap.map(CompactorMR.java:544)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:34

Re: Hive Context: Hive Metastore Client

2016-03-09 Thread Alan Gates
One way people have gotten around the lack of LDAP connectivity in HS2 has been 
to use Apache Knox.  That project’s goal is to provide a single login 
capability for Hadoop related projects so that users can tie their LDAP or 
Active Directory servers into Hadoop.

Alan.

> On Mar 8, 2016, at 16:00, Mich Talebzadeh  wrote:
> 
> The current scenario resembles a three-tier architecture but without the 
> security of the second tier. In a typical three-tier setup, users connecting to 
> the application server (read HiveServer2) are independently authenticated and, 
> if OK, the second tier creates new .NET-type or JDBC threads to connect to the 
> database, much like multi-threading. The problem, I believe, is that HiveServer2 
> does not have that concept of handling the individual logins yet. HiveServer2 
> should be able to handle LDAP logins as well. It is a useful layer to have.
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
> On 8 March 2016 at 23:28, Alex  wrote:
> Yes, when creating a Hive Context a Hive Metastore client should be created 
> with a user that the Spark application will talk to the *remote* Hive 
> Metastore with. We would like to add a custom authorization plugin to our 
> remote Hive Metastore to authorize the query requests that the spark 
> application is submitting which would also add authorization for any other 
> applications hitting the Hive Metastore. Furthermore we would like to extend 
> this so that we can submit "jobs" to our Spark application that will allow us 
> to run against the metastore as different users while leveraging the 
> abilities of our spark cluster. But as you mentioned only one login connects 
> to the Hive Metastore is shared among all HiveContext sessions.
> 
> Likely the authentication would have to be completed either through a secured 
> Hive Metastore (Kerberos) or by having the requests go through HiveServer2.
> 
> --Alex
> 
> 
> On 3/8/2016 3:13 PM, Mich Talebzadeh wrote:
>> Hi,
>> 
>> What do you mean by Hive Metastore Client? Are you referring to Hive server 
>> login much like beeline?
>> 
>> Spark uses hive-site.xml to get the details of Hive metastore and the login 
>> to the metastore which could be any database. Mine is Oracle and as far as I 
>> know even in  Hive 2, hive-site.xml has an entry for 
>> javax.jdo.option.ConnectionUserName that specifies username to use against 
>> metastore database. These are all multi-threaded JDBC connections to the 
>> database, the same login as shown below:
>> 
>> LOGINSID/serial# LOGGED IN S HOST   OS PID Client PID 
>> PROGRAM   MEM/KB  Logical I/O Physical I/O ACT
>>  --- --- -- -- -- 
>> ---    ---
>> INFO
>> ---
>> HIVEUSER 67,6160 08/03 08:11 rhes564oracle/20539   hduser/1234
>> JDBC Thin Clien1,017   370 N
>> HIVEUSER 89,6421 08/03 08:11 rhes564oracle/20541   hduser/1234
>> JDBC Thin Clien1,081  5280 N
>> HIVEUSER 112,561 08/03 10:45 rhes564oracle/24624   hduser/1234
>> JDBC Thin Clien  889   370 N
>> HIVEUSER 131,881108/03 08:11 rhes564oracle/20543   hduser/1234
>> JDBC Thin Clien1,017   370 N
>> HIVEUSER 47,3011408/03 10:45 rhes564oracle/24626   hduser/1234
>> JDBC Thin Clien1,017   370 N
>> HIVEUSER 170,895508/03 08:11 rhes564oracle/20545   hduser/1234
>> JDBC Thin Clien1,017  3230 N
>> 
>> As I understand what you are suggesting is that each Spark user uses 
>> different login to connect to Hive metastore. As of now there is only one 
>> login that connects to Hive metastore shared among all
>> 
>> 2016-03-08T23:08:01,890 INFO  [pool-5-thread-72]: HiveMetaStore.audit 
>> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser  ip=50.140.197.217  
>>  cmd=source:50.140.197.217 get_table : db=test tbl=t
>> 2016-03-08T23:18:10,432 INFO  [pool-5-thread-81]: HiveMetaStore.audit 
>> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser  ip=50.140.197.216  
>>  cmd=source:50.140.197.216 get_tables: db=asehadoop pat=.*
>> 
>> And this is an entry in Hive log when connection is made theough Zeppelin UI
>> 
>> 2016-03-08T23:20:13,546 INFO  [pool-5-thread-84]: metastore.HiveMetaStore 
>> (HiveMetaStore.java:newRawStore(499)) - 84: Opening raw store with 
>> implementation class:org.apache.hadoop.hive.metastore.ObjectStore
>> 2016-03-08T23:20:13,547 INFO  [pool-5-thread-84]: metastore.ObjectStore 
>> (ObjectStore.java:initialize(318)) - ObjectStore, initialize called
>> 2016-03-08T23:20:13,550 INFO  [pool-5-thread-84]: 
>> metastore.MetaStoreDirectSql (MetaStore

Re: How does Hive do authentication on UDF

2016-03-01 Thread Alan Gates
There are several Hive authorization schemes, but at the moment none of them 
restrict function use.  At some point we’d like to add that feature to SQL 
standard authorization (see 
https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization
 ) but no one has done it yet.

Alan.

> On Mar 1, 2016, at 02:01, Todd  wrote:
> 
> Hi ,
> 
> We are allowing users to write/use their own UDF in our Hive environment.When 
> they create the function against the db, then all the users that can use the 
> db will see(or use) the udf.
> I would ask how UDF authentication is done, can UDF be granted to some 
> specific users,so that other users can't see it?
>  
> Thanks a lot!
> 



Re: Hive-2.0.1 Release date

2016-02-29 Thread Alan Gates
HPL/SQL was released with Hive 2.0, but it hasn't been fully integrated, in the 
sense that there isn't a single parser that handles both HPL/SQL and Hive SQL.  
The work to integrate it is significant, and a new feature, so it definitely 
won't be done in a bug fix release but in a feature-bearing release (that is, 
2.x, not 2.0.x).  AFAIK 
no one is working on this at the moment.

Alan.

> On Feb 29, 2016, at 16:08, Mich Talebzadeh  wrote:
> 
> Thanks. I believe Alan Gates mentioned that HPLSQL is not yet integrated into 
> Hive 2.0. Maybe later?
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
> On 1 March 2016 at 00:05, Sergey Shelukhin  wrote:
> HPLSQL is available as part of Hive 2.0. I am not sure to which extent the 
> integration goes as I wasn’t involved in that work.
> As far as I understand HPLSQL and Hive on Spark are kind of orthogonal…
> 
> Hive 2.0.1 is purely a bug fix release for Hive 2.0; Hive 2.1 will be the 
> next feature release if some major feature is missing.
> 
> From: Mich Talebzadeh 
> Reply-To: "user@hive.apache.org" 
> Date: Monday, February 29, 2016 at 15:53
> To: "user@hive.apache.org" 
> Subject: Re: Hive-2.0.1 Release date
> 
> Hi Sergey,
> 
> Will HPLSQL be part of 2.0.1.release?
> 
> I am using 2.0 and found Hive on Spark to be much more stable.
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
> On 29 February 2016 at 23:46, Sergey Shelukhin  wrote:
> Hi. It will be released when some critical mass of bugfixes is accumulated. 
> We already found some issues that would be nice to fix, so it may be some 
> time in March. Is there a particular fix that interests you?
> 
> From: Oleksiy MapR 
> Reply-To: "user@hive.apache.org" 
> Date: Monday, February 29, 2016 at 00:43
> To: "user@hive.apache.org" 
> Subject: Hive-2.0.1 Release date
> 
> Hi all!
> 
> Are you planing to release Hive-2.0.1? If yes, when it probably may be?
> 
> Thanks,
> Oleksiy.
> 
> 



Re: Hive 2 performance

2016-02-25 Thread Alan Gates
HPLSQL is part of Hive, but it is not fully integrated into Hive itself yet.  
It is still an external module that handles the control flow while passing Hive 
SQL into Hive via JDBC.  We’d like to integrate it fully with Hive’s parser but 
we’re not there yet.

Alan.

> On Feb 25, 2016, at 14:26, Mich Talebzadeh 
>  wrote:
> 
> Hi Gopal,
> 
>  
> Is HPLSQL is integrated into Hive 2 as part of its SQL? 
> 
> Thanks,
> 
>  
> Mich
> 
> On 25/02/2016 10:38, Mich Talebzadeh wrote:
> 
>> Apologies the job on Spark using  Functional programming was run on a bigger 
>> table.
>> 
>> The correct timing is 42 seconds for Spark
>> 
>>  
>> On 25/02/2016 10:15, Mich Talebzadeh wrote:
>> 
>> Thanks Gopal, I made the following observations so far:
>> 
>> Using the old MR you get this message now which is fine
>> 
>> Hive-on-MR is deprecated in Hive 2 and may not be available in the future 
>> versions. Consider using a different execution engine (i.e. tez, spark) or 
>> using Hive 1.X releases.
>> 
>> use oraclehadoop;
>> --set hive.execution.engine=spark;
>> set hive.execution.engine=mr;
>> --
>> -- Get the total amount sold for each calendar month
>> --
>> 
>> select from_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss') AS 
>> StartTime;
>> 
>> CREATE TEMPORARY TABLE tmp AS
>> SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS 
>> TotalSales
>> --FROM smallsales s, times t, channels c
>> FROM smallsales s, times t, channels c
>> WHERE s.time_id = t.time_id
>> AND   s.channel_id = c.channel_id
>> GROUP BY t.calendar_month_desc, c.channel_desc
>> ;
>> 
>> select from_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss') AS 
>> FirstQuery;
>> SELECT calendar_month_desc AS MONTH, channel_desc AS CHANNEL, TotalSales
>> from tmp
>> ORDER BY MONTH, CHANNEL LIMIT 5
>> ;
>> select from_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss') AS 
>> SecondQuery;
>> SELECT channel_desc AS CHANNEL, MAX(TotalSales)  AS SALES
>> FROM tmp
>> GROUP BY channel_desc
>> order by SALES DESC LIMIT 5
>> ;
>> select from_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss') AS EndTime;
>> 
>> This batch returns results on MR in 2 min, 3 seconds
>> 
>> If I change my engine to Hive 2 on Spark 1.3.1. I get it back in 1 min, 9 sec
>> 
>>  
>> If I run that job on Spark 1.5.2 shell  against the same tables using 
>> Functional programming and Hive Context for tables
>> 
>> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> println ("\nStarted at"); HiveContext.sql("SELECT 
>> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss') 
>> ").collect.foreach(println)
>> HiveContext.sql("use oraclehadoop")
>> var s = 
>> HiveContext.table("sales").select("AMOUNT_SOLD","TIME_ID","CHANNEL_ID")
>> val c = HiveContext.table("channels").select("CHANNEL_ID","CHANNEL_DESC")
>> val t = HiveContext.table("times").select("TIME_ID","CALENDAR_MONTH_DESC")
>> println ("\ncreating data set at"); HiveContext.sql("SELECT 
>> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss') 
>> ").collect.foreach(println)
>> val rs = 
>> s.join(t,"time_id").join(c,"channel_id").groupBy("calendar_month_desc","channel_desc").agg(sum("amount_sold").as("TotalSales"))
>> println ("\nfirst query at"); HiveContext.sql("SELECT 
>> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss') 
>> ").collect.foreach(println)
>> val rs1 = 
>> rs.orderBy("calendar_month_desc","channel_desc").take(5).foreach(println)
>> println ("\nsecond query at"); HiveContext.sql("SELECT 
>> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss') 
>> ").collect.foreach(println)
>> val rs2 
>> =rs.groupBy("channel_desc").agg(max("TotalSales").as("SALES")).orderBy("SALES").sort(desc("SALES")).take(5).foreach(println)
>> println ("\nFinished at"); HiveContext.sql("SELECT 
>> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss') 
>> ").collect.foreach(println)
>> 
>> I get the job done in under 8  min. Ok this is not a benchmark for Spark but 
>> shows that Hive 2 has improved significantly IMO. I also had Hive on Spark 
>> 1.3.1 crashing on certain large tables(had to revert to MR) but no issues 
>> now.
>> 
>> HTH
>> 
>> On 25/02/2016 09:13, Gopal Vijayaraghavan wrote:
>> 
>> Correct hence the question as I have done some preliminary tests on Hive 2. 
>> I want to share insights with other people who have performed the same
>> If you have feedback on Hive-2.0, I'm all ears.
>> 
>> I'm building up 2.1 features & fixes, so now would be a good time to bring
>> stuff up.
>> 
>> Speed mostly depends on whether you're using Hive-2.0 with LLAP or not -
>> if you're using the old engines, the plans still get much better (even for
>> MR).
>> 
>> Tez does get some stuff out of it, like the new shuffle join vertex
>> manager (hive.optimize.dynamic.partition.hashjoin).
>> 
>> LLAP will still win that out for <10s queries, because it takes approx ~10
>> mins for all the auto-generated vectorized classes to get JIT'd into tight
>> SIMD loops.
>> 
>> For something like TPC-H Q1, you c

Re: Stroing boolean value in Hive table

2016-02-18 Thread Alan Gates
How the data is stored is up to the storage format (text, RCFile, ORC, etc.).  
Do you mean that in your text file you'd like booleans stored as 0 or 1?  You 
could use a CASE expression to convert them to integers, like:

select case _boolvar_ when true then 1 when false then 0 end from …

Alan.

> On Feb 18, 2016, at 04:18, mahender bigdata  
> wrote:
> 
> Hi,
> 
> How can we store Boolean value with 1 or 0 instead of storing true or false 
> string. we can make use of CAST function to convert boolean into 1 or 0. Is 
> there any built-in setting in hive, which enable and store hive Boolean 
> column values in 0 or 1 instead of true and false.
> 
> 
> 



Re: ORC format

2016-02-01 Thread Alan Gates
ORC does not currently expose a primary key to the user, though we have 
talked of having it do that.  As Mich says the indexing on ORC is 
oriented towards statistics that help the optimizer plan the query.  
This can be very important in split generation (determining which parts 
of the input will be read by which tasks) as well as on the fly input 
pruning (deciding not to read a section of the file because the stats 
show that no rows in that section will match a predicate).  Either of 
these can help joins.  But as there is not a user visible primary key 
there's no ability to rewrite the join as an index based join, which I 
think is what you were asking about in your original email.
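
What you can do is give that pruning more to work with; a small illustrative 
sketch (table and column names made up):

-- use ORC min/max statistics and bloom filters to skip stripes/row groups
set hive.optimize.index.filter=true;

create table sales_orc (customer_id bigint, amount double)
stored as orc
tblproperties ('orc.bloom.filter.columns'='customer_id');

A filter or join-key predicate on customer_id can then skip whole sections of 
the file based on the built-in ORC index and bloom filter, but it is still a 
scan-and-skip, not an index lookup.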


Alan.


Philip Lee 
February 1, 2016 at 7:27
Also,
when making ORC from CSV,
is an index made for every key on each column, or is a primary key made on 
the table?


If keys are made on each column in a table, accessing any column in 
some functions like filtering should be faster.





--

==

/*Hae Joon Lee*/


Now, in Germany,

M.S. Candidate, Interested in Distributed System, Iterative Processing

Dept. of Computer Science, Informatik in German, TUB

Technical University of Berlin


In Korea,

M.S. Candidate, Computer Architecture Laboratory

Dept. of Computer Science, KAIST


Rm# 4414 CS Dept. KAIST

373-1 Guseong-dong, Yuseong-gu, Daejon, South Korea (305-701)


Mobile) 49) 015-251-448-278 in Germany, no cellular in Korea

==

Philip Lee 
February 1, 2016 at 7:21
Hello,

I'm comparing the performance of some systems on ORC versus CSV files.
I read the ORC documentation on the Hive website, but I'm still curious 
about some things.


I know the ORC format is faster for filtering and reading because it has 
indexes.

Does it have an advantage when joining two ORC tables as well?

Could you explain it in detail?
When experimenting, it seems to have some advantage for joins in some 
respects, but I'm not quite sure what characteristic of ORC makes this 
happen compared to CSV.


Best,
Phil



Re: Indexes in Hive

2016-01-06 Thread Alan Gates
The issue with this is that HDFS lacks the ability to co-locate blocks.  
So if you break your columns into one file per column (the more 
traditional column route) you end up in a situation where 2/3 of the 
time only one of your columns is being locally read, which results in a 
significant performance penalty.  That's why ORC and Parquet and RCFile 
all use one file for their "columnar" stores.


Alan.


Mich Talebzadeh 
January 5, 2016 at 22:24
Hi,

Thinking loudly.

Ideally we should consider a totally columnar storage offering in which each
column of a table is stored as compressed values (I disregard for now how
ORC actually does this, but obviously it is not exactly a columnar
storage).


So each table can be considered as a loose federation of columnar storage
and each column is effectively an index?

As columns are far narrower than tables, each index block will be of much
higher density, and all operations like aggregates can be done directly on
the index rather than the table.

This type of table offering will be in true nature of data warehouse
storage. Of course row operations (get me all rows for this table) will be
slower but that is the trade-off that we need to consider.

Expecting users to write their own IndexHandler may be technically
interesting but commercially not viable, as Hive needs to be a product on its
own merit, not a development base. Writing your own storage attributes etc.
requires skills that will put off people seeing Hive as an attractive
proposition (requiring considerable investment in skill sets in order to
maintain Hive).

Thus my thinking on this is to offer true columnar storage in Hive to be a
proper data warehouse. In addition, the development tools can be made
available for those interested in tailoring their own specific Hive
solutions.


HTH



Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUr
V8Pw

Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.
pdf
Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 
15",

ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN:
978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale 
Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. 
It is
the responsibility of the recipient to ensure that this email is virus 
free,
therefore neither Peridale Ltd, its subsidiaries nor their employees 
accept

any responsibility.


-Original Message-
From: Gopal Vijayaraghavan [mailto:go...@hortonworks.com] On Behalf Of 
Gopal

Vijayaraghavan
Sent: 05 January 2016 23:55
To: user@hive.apache.org
Subject: Re: Is Hive Index officially not recommended?


now?

The builtin indexes - those that write data as smaller tables are only
useful in a pre-columnar world, where the indexes offer a huge reduction in
IO.

Part #1 of using hive indexes effectively is to write your own
HiveIndexHandler, with usesIndexTable=false;

And then write a IndexPredicateAnalyzer, which lets you map arbitrary
lookups into other range conditions.

Not coincidentally - we're adding a "ANALYZE TABLE ... CACHE METADATA"
which consolidates the "internal" index into an external store (HBase).

Some of the index data now lives in the HBase metastore, so that the
inclusion/exclusion of whole partitions can be done off the consolidated
index.

https://issues.apache.org/jira/browse/HIVE-11676


The experience from BI workloads run by customers is that in general, the
lookup to the right "slice" of data is more of a problem than the actual
aggregate.

And that for a workhorse data warehouse, this has to survive even if there's
a non-stop stream of updates into it.

Cheers,
Gopal



Re: Immutable data in Hive

2015-12-30 Thread Alan Gates
Traditionally data in Hive was write once (insert) read many.  You could 
append to tables and partitions, add new partitions, etc.  You could 
remove data by dropping tables or partitions.  But there was no updates 
of data or deletes of particular rows.  This was what was meant by 
immutable.  Hive was originally done this way because it was based on 
MapReduce and HDFS and these were the natural semantics given those 
underlying systems.


For many use cases (e.g. ETL) this is sufficient, and the vast majority 
of people still run Hive this way.


We added transactions and updates and deletes to Hive because some use 
cases require these features.  Hive is being used more and more as a 
data warehouse, and while updates and deletes are less common there they 
are still required (slow changing dimensions, fixing wrong data, 
deleting records for compliance, etc.)  Also streaming data into 
warehouses from transactional systems is a common use case.
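
For illustration, a minimal updatable table looks roughly like this (the
table and column names are made up, and it assumes the DbTxnManager setup
described on the Transactions wiki):

CREATE TABLE dim_customer (id INT, name STRING, state STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

UPDATE dim_customer SET state = 'CA' WHERE id = 42;   -- fix wrong data
DELETE FROM dim_customer WHERE id = 7;                -- e.g. a compliance delete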


Alan.


Ashok Kumar 
December 29, 2015 at 14:59
Hi,

Can someone please clarify what  "immutable data" in Hive means?

I have been told that data in Hive is/should be immutable, but in that 
case why do we need transactional tables in Hive that allow updates to data?


thanks and greetings






Re: Loop if table is not empty

2015-12-28 Thread Alan Gates
Have you looked at the new procedural HPL/SQL available in recent Hive?  
If you are using an older version of Hive you can check out hplsql.org, 
which allows you to install it separately.
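
A rough sketch of the loop in HPL/SQL might look like this (the exact INCLUDE
and loop syntax should be checked against hplsql.org for your version;
mytable is the table you want to test):

DECLARE cnt INT;
SELECT COUNT(*) INTO cnt FROM mytable;
WHILE cnt > 0 LOOP
  INCLUDE rgm.hql;                        -- re-run the script
  SELECT COUNT(*) INTO cnt FROM mytable;  -- re-check the stop condition
END LOOP;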


Alan.


Thomas Achache 
December 28, 2015 at 2:30

Hi everyone,

I am running a series of queries in a file named rgm.hql. At the end 
of the execution, if a specific table is not empty, I'd like to run 
rgm.hql all over again (and again if the table is still not empty). By 
design it won't be an infinite loop because the table will be empty 
eventually, but it is not possible to predict the number of iterations 
required. I'm thinking about a solution along the lines of:


IF (SELECT COUNT(*) FROM mytable)>0 THEN source rgm.hql;

I really have no clue how to implement that within hive. It could be 
done outside hive but I'd like to find a purely hive-based solution.


Thanks for your help!

Thomas



Re: Attempt to do update or delete using transaction manager that does not support these operations. (state=42000,code=10294)

2015-12-22 Thread Alan Gates
Correct.  What doesn't work in Spark is actually the transactions, 
because there's a piece on the execution side that needs to send 
heartbeats to the metastore saying a transaction is still alive.  That 
hasn't been implemented for Spark.  It's very simple and could be done 
(see the ql.exec.Heartbeater use in ql.exec.tez.TezJobMonitor for an example 
of how it would work).  AFAIK everything else would work just fine.


Alan.


Mich Talebzadeh <mailto:m...@peridale.co.uk>
December 22, 2015 at 13:45

Thanks for the feedback Alan

It seems that one can do INSERTS with Hive on Spark but no updates or 
deletes. Is this correct?


Cheers,

Mich Talebzadeh



*From:*Alan Gates [mailto:alanfga...@gmail.com]
*Sent:* 22 December 2015 20:39
*To:* user@hive.apache.org
*Subject:* Re: Attempt to do update or delete using transaction 
manager that does not support these operations. (state=42000,code=10294)


Also note that transactions only work with MR or Tez as the backend.  
The required work to have them work with Spark hasn't been done.


Alan.


Alan Gates <mailto:alanfga...@gmail.com>
December 22, 2015 at 12:38
Also note that transactions only work with MR or Tez as the backend.  
The required work to have them work with Spark hasn't been done.


Alan.

Mich Talebzadeh <mailto:m...@peridale.co.uk>
December 22, 2015 at 9:14

Thanks Elliot,

Sounds like that table was created as create table tt as select * from 
t. Although the original table t was created as transactional shown 
below, the table tt is not!


0: jdbc:hive2://rhes564:10010/default> show create table t;
+-+--+
|   createtab_stmt|
+-+--+
| CREATE TABLE `t`(   |
|   `owner` varchar(30),  |
|   `object_name` varchar(30),|
|   `subobject_name` varchar(30), |
|   `object_id` bigint,   |
|   `data_object_id` bigint,  |
|   `object_type` varchar(19),|
|   `created` timestamp,  |
|   `last_ddl_time` timestamp,|
|   `timestamp2` varchar(19), |
|   `status` varchar(7),  |
|   `temporary2` varchar(1),  |
|   `generated` varchar(1),   |
|   `secondary` varchar(1),   |
|   `namespace` bigint,   |
|   `edition_name` varchar(30),   |
|   `padding1` varchar(4000), |
|   `padding2` varchar(3500), |
|   `attribute` varchar(32),  |
|   `op_type` int,|
|   `op_time` timestamp,  |
|   `new_col` varchar(30))|
| CLUSTERED BY (  |
|   object_id)|
| INTO 256 BUCKETS|
| ROW FORMAT SERDE|
|   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'   |
| STORED AS INPUTFORMAT

Re: Attempt to do update or delete using transaction manager that does not support these operations. (state=42000,code=10294)

2015-12-22 Thread Alan Gates
Also note that transactions only work with MR or Tez as the backend.  
The required work to have them work with Spark hasn't been done.


Alan.


Mich Talebzadeh 
December 22, 2015 at 9:43

Dropped and created table tt as follows:

drop table if exists tt;

create table tt (
owner            varchar(30)
,object_name     varchar(30)
,subobject_name  varchar(30)
,object_id       bigint
,data_object_id  bigint
,object_type     varchar(19)
,created         timestamp
,last_ddl_time   timestamp
,timestamp2      varchar(19)
,status          varchar(7)
,temporary2      varchar(1)
,generated       varchar(1)
,secondary       varchar(1)
,namespace       bigint
,edition_name    varchar(30)
,padding1        varchar(4000)
,padding2        varchar(3500)
,attribute       varchar(32)
,op_type         int
,op_time         timestamp
)
CLUSTERED BY (object_id) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="SNAPPY",
"transactional"="true",
"orc.create.index"="true",
"orc.bloom.filter.columns"="object_id",
"orc.bloom.filter.fpp"="0.05",
"orc.stripe.size"="268435456",
"orc.row.index.stride"="1" )
;

INSERT INTO TABLE tt
SELECT
  owner
, object_name
, subobject_name
, object_id
, data_object_id
, object_type
, cast(created AS timestamp)
, cast(last_ddl_time AS timestamp)
, timestamp2
, status
, temporary2
, generated
, secondary
, namespace
, edition_name
, padding1
, padding2
, attribute
, op_type
, op_time
FROM t
;

exit;

And tried delete again and got the same error!

delete from tt where exists(select 1 from t where tt.object_id = 
t.object_id);


Error: Error while compiling statement: FAILED: SemanticException 
[Error 10294]: Attempt to do update or delete using transaction 
manager that does not support these operations. (state=42000,code=10294)
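
For reference, that error usually means the session is still on the default
DummyTxnManager; the Hive Transactions wiki calls for roughly these settings
(shown here as SET statements, though they normally go into hive-site.xml):

SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;                 -- needed on Hive 1.x
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- and on the metastore side:
--   hive.compactor.initiator.on=true
--   hive.compactor.worker.threads=1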


Mich Talebzadeh




Re: Difference between ORC and RC files

2015-12-21 Thread Alan Gates

ORC offers a number of features not available in RC files:
* Better encoding of data.  Integer values are run length encoded.  
Strings and dates are stored in a dictionary (and the resulting pointers 
then run length encoded).
* Internal indexes and statistics on the data.  This allows for more 
efficient reading of the data as well as skipping of sections of the 
data not relevant to a given query.  These indexes can also be used by 
the Hive optimizer to help plan query execution.
* Predicate push down for some predicates.  For example, in the query 
"select * from user where state = 'ca'", ORC could look at a collection 
of rows and use the indexes to see that no rows in that group have that 
value, and thus skip the group altogether.
* Tight integration with Hive's vectorized execution, which produces 
much faster processing of rows
* Support for new ACID features in Hive (transactional insert, update, 
and delete).
* It has a much faster read time than RCFile and compresses much more 
efficiently.


Whether ORC is the best format for what you're doing depends on the data 
you're storing and how you are querying it.  If you are storing data 
where you know the schema and you are doing analytic type queries it's 
the best choice (in fairness, some would dispute this and choose 
Parquet, though much of what I said above about ORC vs RC applies to 
Parquet as well).  If you are doing queries that select the whole row 
each time columnar formats like ORC won't be your friend.  Also, if you 
are storing self structured data such as JSON or Avro you may find text 
or Avro storage to be a better format.
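
As a rough example of moving an existing table into ORC and letting Hive use
ORC's internal indexes (the table names and compression codec here are just
illustrative):

CREATE TABLE sales_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB')
AS SELECT * FROM sales_rc;

SET hive.optimize.index.filter=true;               -- allow predicate pushdown into ORC
SELECT count(*) FROM sales_orc WHERE state = 'ca'; -- row groups that can't match are skipped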


Alan.




Ashok Kumar 
December 21, 2015 at 9:45
Hi Gurus,

I am trying to understand the advantages that ORC file format offers 
over RC.


I have read the existing documents but I still don't seem to grasp the 
main differences.


Can someone explain to me as a user where ORC scores when compared to 
RC. What I'd like to know is mainly the performance. I am also aware 
that ORC does some smart compression as well.


Finally, is the ORC file format the best choice in Hive?

Thank you




Re: Hive partition load

2015-12-17 Thread Alan Gates

Yes, you can load different partitions simultaneously.
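
For example, something like the following run from two separate sessions at
the same time is fine (the table, column, and partition names here are
purely illustrative):

INSERT INTO TABLE test PARTITION (p='p1') SELECT id, val FROM staging WHERE src_part = 'p1';
INSERT INTO TABLE test PARTITION (p='p2') SELECT id, val FROM staging WHERE src_part = 'p2';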

Alan.


Suyog Parlikar 
December 17, 2015 at 5:02

Hello everyone,

Can we load different partitions of a hive table simultaneously?

Are there any locking issues in that? If yes, what are they?

Please find below example for more details.

Consider I have a hive table test with two partition p1 and p2.

I want to load the data into partition p1 and p2 at the same time.

Awaiting your reply.

Thanks,
Suyog



Re: How to register permanent function during hive thrift server is running

2015-12-03 Thread Alan Gates

No restart of the thrift service should be required.
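
Something along these lines should then be usable from JDBC sessions without
restarting HiveServer2 (the jar location and class name below are placeholders):

CREATE FUNCTION mydb.my_upper AS 'com.example.udf.MyUpper'
USING JAR 'hdfs:///apps/hive/udfs/my-udfs.jar';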

Alan.


Todd 
December 3, 2015 at 3:12
Hi,
I am using Hive 0.14.0 and have the hive thrift server running. While it is 
running, I would use "create function" to add a permanent function. 
Does hive support this **without restarting** the hive thrift 
server, that is, after creating the function, will I be able to use the 
function when I connect to the hive server with jdbc?


Thanks.


Re: [VOTE] Hive 2.0 release plan

2015-11-30 Thread Alan Gates
Hive 2.0 will not be 100% backwards compatible with 1.x.  The following 
JIRA link shows JIRAs already committed to 2.0 that break compatibility:

https://issues.apache.org/jira/issues/?jql=project%20%3D%20HIVE%20AND%20fixVersion%20%3D%202.0.0%20AND%20%22Hadoop%20Flags%22%20%3D%20%22Incompatible%20change%22

HIVE-12429 is not yet committed but may also make the list.

In summary, the biggest changes are that Hadoop 1.x is no longer 
supported, MapReduce as an engine is deprecated (though still supported 
for now), and HIVE-12429 proposes to switch the standard authorization 
model to SQL Standard Auth instead of the current default.


The goal from the beginning was for 2.0 to be allowed to break 
compatibility where necessary while branch-1 and subsequent 1.x releases 
would maintain backwards compatibility with the 1.x line.


Alan.


John Omernik 
November 30, 2015 at 9:25
Agreed, any plans for Hive 1.3?  Will Hive 2.0 be a breaking release 
for those running 1.x?






Wangwenli 
November 15, 2015 at 17:07
Good News, *Any release plan for hive 1.3*  ???


Wangwenli
Gopal Vijayaraghavan 
November 13, 2015 at 22:21
(+user@)

+1.

Cheers,
Gopal

On 11/13/15, 5:54 PM, "Lefty Leverenz"  wrote:


The Hive bylaws require this to be submitted on the user@hive mailing list
(even though users don't get to vote).  See Release Plan in Actions.

-- Lefty

...

On Fri, Nov 13, 2015 at 1:38 PM, Sergey Shelukhin <ser...@hortonworks.com> wrote:


Hi.
With no strong objections on the DISCUSS thread, some issues raised and
addressed, and a reminder from Carl about the bylaws for the release
process, I propose we release the first version of Hive 2 (2.0), and
nominate myself as release manager.
The goal is to have the first release of Hive with an aggressive set of new
features, some of which are ready to use and some of which are at an
experimental stage and will be developed in future Hive 2 releases, in line
with the Hive-1-Hive-2 split discussion.
If the vote passes, the timeline to create a branch should be around the end
of next week (to minimize merging in the wake of the release), and the
timeline to release would be around the end of November, depending on the
issues found during the RC cutting process, as usual.

Please vote:
+1 proceed with the release plan
+-0 don't care
-1 don't proceed with the release plan, for such and such reasons

The vote will run for 3 days.





Lefty Leverenz 
November 13, 2015 at 17:54
The Hive bylaws require this to be submitted on the user@hive mailing list
(even though users don't get to vote). See Release Plan in Actions.

-- Lefty


Thejas Nair 
November 13, 2015 at 16:33
+1

On Fri, Nov 13, 2015 at 2:26 PM, Vaibhav Gumashta


Re: Query performance correlated to increase in delta files?

2015-11-20 Thread Alan Gates
Are you running the compactor as part of your metastore?  It occasionally 
compacts the delta files in order to reduce read time.  See 
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions for 
details.
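
While you check that, you can also look at and drive compactions by hand; a
sketch (the table and partition names are placeholders):

SHOW COMPACTIONS;                                                      -- what the compactor has queued or finished
ALTER TABLE my_orc_table PARTITION (ds='2015-11-19') COMPACT 'major';  -- request a manual major compaction
-- the automatic compactor is enabled on the metastore with
-- hive.compactor.initiator.on=true and hive.compactor.worker.threads > 0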


Alan.


Sai Gopalakrishnan 
November 19, 2015 at 21:17

Hello fellow developer,

Greetings!

I am using Hive for querying transactional data. I transfer data from 
RDBMS to Hive using Sqoop and prefer the ORC format for speed and its 
ACID properties. I found out that Sqoop has no support for reflecting 
the updated and deleted records in RDBMS and hence I am inserting 
those modified records into the HDFS and updating/deleting the Hive 
tables to reflect the changes. Every update/delete in the Hive table 
results in creation of new delta files. I noticed a considerable drop 
in speed over a period of time. I realize that lookups tend to take 
more time with growing files. Is there any way to overcome this issue? 
INSERT OVERWRITE the table is costly, I deal with about 1TB data, and 
it keeps growing every day.


Kindly reply with a suitable solution at the earliest.

Thanks & Regards,

Saisubramaniam Gopalakrishnan

Aspire Systems (India) Pvt. Ltd.





Re: ORC tables loading

2015-11-17 Thread Alan Gates
The reads and writes both happen in parallel, so as more nodes are 
available for read and write, at least in this case, the time stays 
roughly the same.


Alan.


James Pirz 
November 16, 2015 at 21:23
Hi,

I am using Hive 1.2 with ORC tables on Hadoop 2.6 on a cluster.
I load data into an ORC table by reading the data from an external 
table on raw text files and using insert statement:


INSERT into TABLE myorctab SELECT * FROM mytxttab;

I ran a simple scale-up test to find out how the loading time 
increases as I double the size of data and nodes. I realized that the 
total time remains more or less the same (scales properly).


I am just wondering why this is happening, as naively I think if I 
make the number of partitions and size of data double, the time should 
also be roughly double as the system needs to partition twice amount 
of data as it was doing before among twice number of partitions. Am I 
missing something here ?


Thnx


Re: hive locking doubt

2015-11-16 Thread Alan Gates
You are correct that DbTxnManager does not support the explicit locking 
of tables.  Instead it obtains locks based on SQL statements that are 
being executed.


If you use the DummyTxnManager (the default) and set concurrency to true 
and  the lock manager to ZooKeeperHiveLockManager then your locks should 
go away when the process dies (ZooKeeper handles this for you).  
Obviously you have to have a ZooKeeper service set up to use this.
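
The ZooKeeper-based setup looks roughly like this (normally in hive-site.xml;
the quorum hosts below are placeholders):

SET hive.support.concurrency=true;
SET hive.lock.manager=org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager;
SET hive.zookeeper.quorum=zk1:2181,zk2:2181,zk3:2181;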


Alan.


Shushant Arora 
November 16, 2015 at 10:47
Hi

I have a doubt on hive locking mechanism.
I have 0.13 deployed on my cluster.
When I create explicit lock using
lock table tablename partition(partitionname) exclusive. It acquires 
lock as expected.


I have a requirement to release the lock if the hive connection with 
the process that created the lock dies. How to achieve this? In the current 
situation the lock is released only explicitly by calling unlock table 
tablename partition(). The requirement is to handle the process which 
acquires a lock and gets killed for any reason before calling 
unlock.


While using the TxnMgr org.apache.hadoop.hive.ql.lockmgr.DbTxnManager for 
handling transaction timeout, it didn't allow explicit locking and 
threw the below exception:


FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask. Current transaction manager 
does not support explicit lock requests.  Transaction manager:   
org.apache.hadoop.hive.ql.lockmgr.DbTxnManager


Re: clarification please

2015-10-29 Thread Alan Gates




Ashok Kumar 
October 28, 2015 at 22:43
hi gurus,

kindly clarify the following please

  * Hive currently does not support indexes or indexes are not used in
the query

Mostly true.  There is a create index, but Hive does not use the 
resulting index by default.  Some storage formats (ORC, Parquet I think) 
have their own indices they use internally to speed access.


  * The lowest granularity for concurrency is partition. If table is
partitioned, then partition will be lucked in DML operation

lucked =locked?  I'm not sure what you intended here.  If you mean 
locked, then it depends.  By default Hive doesn't use locking.  You can 
set it up to do locking via ZooKeeper or as part of Hive transactions.  
They have different locking models.  See 
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions and 
https://cwiki.apache.org/confluence/display/Hive/Locking for more 
information.


You can sub-partition using buckets, but for most queries partition is 
the lowest level of granularity.  Hive does a lot of work to optimize 
only reading relevant partitions for a query.


  * What is the best file format to store Hive table in HDFS? Is this
ORC or Avro that allow being split and support block compression?

It depends on what you want to do.  ORC and Parquet do better for 
traditional data warehousing type queries because they are columnar 
formats and have lots of optimization built in for fast access, pushing 
filter down into the storage level etc. People like Avro and other self 
describing formats when their data brings its own structure.  We very 
frequently see pipelines where people dump Avro, text, etc. into Hive 
and then ETL it into ORC.


  * Text/CSV files. By default if file type is not specified at
creation time, Hive will default to text file?

Out of the box yes, but you can change that in your Hive installation by 
setting hive.default.fileformat in your hive-site.xml.
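
For example (the table is just illustrative):

SET hive.default.fileformat=ORC;                      -- or set it once in hive-site.xml
CREATE TABLE t_orc_by_default (id INT, name STRING);  -- no STORED AS clause, so ORC is used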


Alan.



Thanks

