Re: Concurrency support of Apache Hive for streaming data ingest at 7K RPS into multiple tables

Eugene Koifman Thu, 25 Aug 2016 11:21:55 -0700

Hive streaming API (which is what Storm uses) inserts multiple evens to a table 
per transaction.  It has been designed for this but not quite ready for prime 
time in 0.14.


Hive 1.3 has these metastore issues fixed as well as many others.     
HIVE-11948<https://issues.apache.org/jira/browse/HIVE-11948>, 
HIVE-13013<https://issues.apache.org/jira/browse/HIVE-13013> include the bulk 
of the improvements regarding the metastore but not all.  (HDP 2.5 will include 
all of them)
Back porting individual patches may prove difficult and error prone.  Using HDP 
2.5 would be much safer.
(If you are using Storm fro HDP 2.2 - it also has had a number of important 
fixes since 2.2 in the module that uses Hive Streaming API)

It looks like you have (roughly) 100k events per transaction.  How many 
transactions do you have per batch?

The only thing that can help in 0.14 is to reduce the load on the metastore.
You can do that by adjusting your ingest process to reduce the number of 
concurrent (Hive) transactions by making each transaction larger (most 
effective) and by making transaction batches larger.

Streaming API requires a heartbeat (also a metastore call) to be sent which 
storm does.  The frequency is controlled by 
hive.txn.timeout<https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.txn.timeout>
 .  you may want to set it to a larger value but make sure it's set to the same 
value for the metastore process and in storm topology.

If you are writing to partitioned tables, you may consider building a "shuffle" 
process in the Storm app so that all events for a given partition end up on the 
same bolt instance.  This would reduce the number of writers to Hive (and thus 
concurrent transactions) but of course may create unbalanced workload.


Hive's transactional tables do support concurrent inserts from multiple 
clients.  To avoid serialization errors you'd have to use 1.3/2.1.x

thanks,
Eugene

From: Jörn Franke <[email protected]<mailto:[email protected]>>
Reply-To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Date: Wednesday, August 24, 2016 at 2:27 PM
To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Subject: Re: Concurrency support of Apache Hive for streaming data ingest at 7K 
RPS into multiple tables

This is also a good option.

With respect to Hive transactional tables: I do to think they have been 
designed for massive inserts of single items. On the other hand you would not 
insert a lot of events using single inserts in a relational database. Same 
restrictions apply, it is not the use case you want to implement.


On 24 Aug 2016, at 13:55, Kit Menke 
<[email protected]<mailto:[email protected]>> wrote:


Joel,
Another option which you have is to use the Storm HDFS bolt to stream data into 
Hive external tables. The external tables then get loaded into ORC history 
tables for long term storage. We use this in a HDP cluster with similar load so 
I know it works. :)

I'm with Jörn on this one. My impression of hive transactions is that it is a 
new feature not totally ready for production.
Thanks,
Kit

On Aug 24, 2016 3:07 AM, "Joel Victor" 
<[email protected]<mailto:[email protected]>> wrote:
@Jörn: If I understood correctly even later versions of Hive won't be able to 
handle these kinds of workloads?

On Wed, Aug 24, 2016 at 1:26 PM, Jörn Franke 
<[email protected]<mailto:[email protected]>> wrote:
I think Hive especially these old versions have not been designed for this. Why 
not store them in Hbase and run a oozie job regularly that puts them all into 
Hive /Orc or parquet in a bulk job?

On 24 Aug 2016, at 09:35, Joel Victor 
<[email protected]<mailto:[email protected]>> wrote:

Currently I am using Apache Hive 0.14 that ships with HDP 2.2. We are trying 
perform streaming ingestion with it.
We are using the Storm Hive bolt and we have 7 tables in which we are trying to 
insert. The RPS (requests per second) of our bolts ranges from 7000 to 5000 and 
our commit policies are configured accordingly i.e 100k events or 15 seconds.

We see that there are many commitTxn exceptions due to serialization errors in 
the metastore (we are using PostgreSQL 9.5 as metastore)
The serialization errors will cause the topology to start lagging in terms of 
events processed as it will try to reprocess the batches that have failed.

I have already backported this 
HIVE-10500<https://issues.apache.org/jira/browse/HIVE-10500> to 0.14 and there 
isn't much improvement.
I went through most of the JIRA's about transaction and I found the following 
HIVE-11948<https://issues.apache.org/jira/browse/HIVE-11948>, 
HIVE-13013<https://issues.apache.org/jira/browse/HIVE-13013>. I would like to 
backport them to 0.14.
Going through the patches gives me an impression that I need to mostly update 
the queries and transaction levels.
Do these patches also require me to update the schema in the metastore? Please 
also let me know if there are any other patches that I missed.

I would also like to know whether Apache Hive can handle inserts to the 
same/different tables concurrently from multiple clients in 1.2.1 or later 
versions without many serialization errors in Hive metastore?

-Joel

Re: Concurrency support of Apache Hive for streaming data ingest at 7K RPS into multiple tables

Reply via email to