Hive streaming API (which is what Storm uses) inserts multiple evens to a table per transaction. It has been designed for this but not quite ready for prime time in 0.14.
Hive 1.3 has these metastore issues fixed as well as many others. HIVE-11948<https://issues.apache.org/jira/browse/HIVE-11948>, HIVE-13013<https://issues.apache.org/jira/browse/HIVE-13013> include the bulk of the improvements regarding the metastore but not all. (HDP 2.5 will include all of them) Back porting individual patches may prove difficult and error prone. Using HDP 2.5 would be much safer. (If you are using Storm fro HDP 2.2 - it also has had a number of important fixes since 2.2 in the module that uses Hive Streaming API) It looks like you have (roughly) 100k events per transaction. How many transactions do you have per batch? The only thing that can help in 0.14 is to reduce the load on the metastore. You can do that by adjusting your ingest process to reduce the number of concurrent (Hive) transactions by making each transaction larger (most effective) and by making transaction batches larger. Streaming API requires a heartbeat (also a metastore call) to be sent which storm does. The frequency is controlled by hive.txn.timeout<https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.txn.timeout> . you may want to set it to a larger value but make sure it's set to the same value for the metastore process and in storm topology. If you are writing to partitioned tables, you may consider building a "shuffle" process in the Storm app so that all events for a given partition end up on the same bolt instance. This would reduce the number of writers to Hive (and thus concurrent transactions) but of course may create unbalanced workload. Hive's transactional tables do support concurrent inserts from multiple clients. To avoid serialization errors you'd have to use 1.3/2.1.x thanks, Eugene From: Jörn Franke <[email protected]<mailto:[email protected]>> Reply-To: "[email protected]<mailto:[email protected]>" <[email protected]<mailto:[email protected]>> Date: Wednesday, August 24, 2016 at 2:27 PM To: "[email protected]<mailto:[email protected]>" <[email protected]<mailto:[email protected]>> Subject: Re: Concurrency support of Apache Hive for streaming data ingest at 7K RPS into multiple tables This is also a good option. With respect to Hive transactional tables: I do to think they have been designed for massive inserts of single items. On the other hand you would not insert a lot of events using single inserts in a relational database. Same restrictions apply, it is not the use case you want to implement. On 24 Aug 2016, at 13:55, Kit Menke <[email protected]<mailto:[email protected]>> wrote: Joel, Another option which you have is to use the Storm HDFS bolt to stream data into Hive external tables. The external tables then get loaded into ORC history tables for long term storage. We use this in a HDP cluster with similar load so I know it works. :) I'm with Jörn on this one. My impression of hive transactions is that it is a new feature not totally ready for production. Thanks, Kit On Aug 24, 2016 3:07 AM, "Joel Victor" <[email protected]<mailto:[email protected]>> wrote: @Jörn: If I understood correctly even later versions of Hive won't be able to handle these kinds of workloads? On Wed, Aug 24, 2016 at 1:26 PM, Jörn Franke <[email protected]<mailto:[email protected]>> wrote: I think Hive especially these old versions have not been designed for this. Why not store them in Hbase and run a oozie job regularly that puts them all into Hive /Orc or parquet in a bulk job? On 24 Aug 2016, at 09:35, Joel Victor <[email protected]<mailto:[email protected]>> wrote: Currently I am using Apache Hive 0.14 that ships with HDP 2.2. We are trying perform streaming ingestion with it. We are using the Storm Hive bolt and we have 7 tables in which we are trying to insert. The RPS (requests per second) of our bolts ranges from 7000 to 5000 and our commit policies are configured accordingly i.e 100k events or 15 seconds. We see that there are many commitTxn exceptions due to serialization errors in the metastore (we are using PostgreSQL 9.5 as metastore) The serialization errors will cause the topology to start lagging in terms of events processed as it will try to reprocess the batches that have failed. I have already backported this HIVE-10500<https://issues.apache.org/jira/browse/HIVE-10500> to 0.14 and there isn't much improvement. I went through most of the JIRA's about transaction and I found the following HIVE-11948<https://issues.apache.org/jira/browse/HIVE-11948>, HIVE-13013<https://issues.apache.org/jira/browse/HIVE-13013>. I would like to backport them to 0.14. Going through the patches gives me an impression that I need to mostly update the queries and transaction levels. Do these patches also require me to update the schema in the metastore? Please also let me know if there are any other patches that I missed. I would also like to know whether Apache Hive can handle inserts to the same/different tables concurrently from multiple clients in 1.2.1 or later versions without many serialization errors in Hive metastore? -Joel
