Re: Concurrency support of Apache Hive for streaming data ingest at 7K RPS into multiple tables

Jörn Franke Wed, 24 Aug 2016 14:27:26 -0700

This is also a good option.

With respect to Hive transactional tables: I do not think they have been 
designed for massive inserts of single items. On the other hand, you would not 
insert a lot of events using single inserts in a relational database either. 
The same restrictions apply; it is not a use case you want to implement.



> On 24 Aug 2016, at 13:55, Kit Menke <[email protected]> wrote:
> 
> Joel,
> Another option which you have is to use the Storm HDFS bolt to stream data 
> into Hive external tables. The external tables then get loaded into ORC 
> history tables for long term storage. We use this in an HDP cluster with 
> similar load so I know it works. :)
> 
> I'm with Jörn on this one. My impression of Hive transactions is that they 
> are a new feature not totally ready for production.
> Thanks,
> Kit
> 
> 
>> On Aug 24, 2016 3:07 AM, "Joel Victor" <[email protected]> wrote:
>> @Jörn: If I understood correctly even later versions of Hive won't be able 
>> to handle these kinds of workloads?
>> 
>>> On Wed, Aug 24, 2016 at 1:26 PM, Jörn Franke <[email protected]> wrote:
>>> I think Hive, especially these old versions, has not been designed for 
>>> this. Why not store them in HBase and run an Oozie job regularly that puts 
>>> them all into Hive (ORC or Parquet) in a bulk job?
>>> 
>>>> On 24 Aug 2016, at 09:35, Joel Victor <[email protected]> wrote:
>>>> 
>>>> Currently I am using Apache Hive 0.14 that ships with HDP 2.2. We are 
>>>> trying to perform streaming ingestion with it.
>>>> We are using the Storm Hive bolt and we have 7 tables into which we are 
>>>> trying to insert. The RPS (requests per second) of our bolts ranges from 
>>>> 5000 to 7000 and our commit policies are configured accordingly, i.e. 100k 
>>>> events or 15 seconds.
>>>> 
>>>> We see that there are many commitTxn exceptions due to serialization 
>>>> errors in the metastore (we are using PostgreSQL 9.5 as the metastore 
>>>> database). The serialization errors cause the topology to start lagging 
>>>> in terms of events processed, as it tries to reprocess the batches that 
>>>> have failed.
>>>> 
>>>> I have already backported HIVE-10500 to 0.14 and there isn't much 
>>>> improvement.
>>>> I went through most of the JIRAs about transactions and I found the 
>>>> following: HIVE-11948 and HIVE-13013. I would like to backport them to 0.14.
>>>> Going through the patches gives me the impression that I mostly need to 
>>>> update the queries and transaction isolation levels.
>>>> Do these patches also require me to update the schema in the metastore? 
>>>> Please also let me know if there are any other patches that I missed.
>>>> 
>>>> I would also like to know whether Apache Hive 1.2.1 or later can handle 
>>>> concurrent inserts to the same or different tables from multiple clients 
>>>> without many serialization errors in the Hive metastore.
>>>> 
>>>> -Joel
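
For reference, the commit policy described in the original question (flush 
after 100k events or 15 seconds, whichever comes first) can be sketched 
roughly as below. This is a minimal illustration in Python, not the Storm Hive 
bolt's actual implementation; the class name CountOrTimeBatcher, the flush 
callback, and the tick method are hypothetical names chosen for the example.

```python
import time


class CountOrTimeBatcher:
    """Buffer events; flush when the batch is full or its oldest event is too old."""

    def __init__(self, flush, max_events=100_000, max_age_s=15.0,
                 clock=time.monotonic):
        self._flush = flush            # hypothetical callback that commits one batch
        self._max_events = max_events  # e.g. 100k events, as in the question
        self._max_age_s = max_age_s    # e.g. 15 seconds, as in the question
        self._clock = clock            # injectable so the policy can be tested
        self._buf = []
        self._first_ts = None          # arrival time of the oldest buffered event

    def add(self, event):
        if not self._buf:
            self._first_ts = self._clock()
        self._buf.append(event)
        self._maybe_flush()

    def tick(self):
        """Call periodically so a quiet stream still flushes on the time limit."""
        self._maybe_flush()

    def _maybe_flush(self):
        if not self._buf:
            return
        full = len(self._buf) >= self._max_events
        stale = self._clock() - self._first_ts >= self._max_age_s
        if full or stale:
            batch, self._buf = self._buf, []
            self._first_ts = None
            # If flush raises, the batch is NOT re-buffered here; the caller
            # is assumed to handle replay of the failed batch.
            self._flush(batch)
```

Each flush corresponds to one commit, so larger batches mean fewer 
transactions hitting the metastore per second.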
