That is good to know.
We are using Hive 0.9. Right now the biggest table contains 2 years of data, and 
we partition it by hour because the data volume is big.
So right now it has 2*365*24, around 17,000+ partitions. So far we haven't seen 
too many problems yet, but I do have some concerns about it.
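
For reference, a minimal sketch of that hourly layout (table and column names 
below are placeholders, not our real schema):

    -- hypothetical hourly-partitioned external table;
    -- 2 years * 365 days * 24 hours ~= 17,520 partitions
    CREATE EXTERNAL TABLE events (
      event_id  STRING,
      payload   STRING
    )
    PARTITIONED BY (dt STRING, hr STRING)
    LOCATION '/data/events';
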
We are using IBM BigInsight, which uses Derby as the Hive metastore rather than 
MySQL, which is where most of my experience is.
Yong

From: norbert.bur...@gmail.com
Date: Thu, 27 Feb 2014 07:57:05 -0500
Subject: Re: Metastore performance on HDFS-backed table with 15000+ partitions
To: user@hive.apache.org

Thanks everyone for the feedback.  Just to follow up in case someone else runs 
into this: I can confirm that the local client works around the OOMEs, but it's 
still very slow.
It does seem like we were hitting some combination of HIVE-4051 and HIVE-5158.  
We'll try reducing partition count first, and then switch to 0.12.0 if that 
doesn't improve things significantly.


FWIW - http://www.slideshare.net/oom65/optimize-hivequeriespptx also has 
some good rules of thumb.



Norbert

On Sat, Feb 22, 2014 at 1:27 PM, Stephen Sprague <sprag...@gmail.com> wrote:


Yeah. That traceback pretty much spells it out - it's metastore related, and 
that's where the partitions are stored.

I'm with the others on this. HiveServer2 is still a little janky on memory 
management.  I bounce mine once a day at midnight just to play it safe (and 
because I can).

Again, for me, I use the Hive local client for production jobs and the remote 
client for ad hoc stuff.

You may wish to confirm that the local Hive client has no problem with your query.
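
For example, from the local hive shell you could run the query itself, or just 
its plan, to see whether compilation alone blows up (the table name below is a 
placeholder):

    -- in the local hive CLI; 'events' stands in for your real table
    EXPLAIN SELECT * FROM events LIMIT 10;
    SELECT * FROM events LIMIT 10;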

Other than that, you can either increase the heap size on the HS2 process and hope 
for the best, and/or file a bug report.

Bottom line: HiveServer2 isn't production bulletproof just yet, IMHO. Others 
may disagree.

Regards,
Stephen.

On Sat, Feb 22, 2014 at 9:50 AM, Norbert Burger <norbert.bur...@gmail.com> 
wrote:

Thanks all for the quick feedback.
I'm a bit surprised to learn that 15k partitions is considered too many, but we can 
work around it.  I guess I'm also curious why the query planner needs to know 
about all partitions even for simple select/limit queries, where the query might 
target only a single partition.
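
For instance (partition column names below are made up), I'd expect a query that 
pins down a single partition, such as:

    SELECT * FROM events WHERE dt='2014-02-21' AND hr='03' LIMIT 10;

to only need metadata for that one partition, and a bare select/limit to be 
satisfiable from a single partition as well.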

Here's the client-side OOME with HADOOP_HEAPSIZE=2048:
https://gist.githubusercontent.com/nburger/3286d2052060e2efe161/raw/dc30231086803c1d33b9137b5844d2d0e20e350d/gistfile1.txt

This was from a CDH 4.3.0 client hitting HiveServer2.  Any idea what's consuming 
the heap?
Norbert

On Sat, Feb 22, 2014 at 10:32 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

Don't make tables with that many partitions. It is an anti-pattern. I have 
tables with 2,000 partitions a day and that is really too many. Hive needs to 
load that information into memory to plan the query.
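
For example (names are illustrative only), partitioning by day and keeping the 
hour as a regular column cuts the partition count by 24x:

    -- hypothetical daily-partitioned layout; hr becomes an ordinary column
    CREATE EXTERNAL TABLE events_daily (
      event_id  STRING,
      hr        STRING,
      payload   STRING
    )
    PARTITIONED BY (dt STRING)
    LOCATION '/data/events_daily';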

On Saturday, February 22, 2014, Terje Marthinussen <tmarthinus...@gmail.com> 
wrote:

> The query optimizer in Hive is awful on memory consumption. 15k partitions sounds 
> a bit early for it to fail, though.
>
> What is your heap size?
>
> Regards,
> Terje
>
>> On 22 Feb 2014, at 12:05, Norbert Burger <norbert.bur...@gmail.com> wrote:

>>
>> Hi folks,
>>
>> We are running CDH 4.3.0 Hive (0.10.0+121) with a MySQL metastore.
>>
>> In Hive, we have an external table backed by HDFS which has a 3-level 
>> partitioning scheme that currently has 15000+ partitions.

>>
>> Within the last day or so, queries against this table have started failing.  
>> A simple query which shouldn't take very long at all (select * from ... 
>> limit 10) fails after several minutes with a client OOME.  I get the same 
>> outcome on count(*) queries (which I thought wouldn't send any data back to 
>> the client).  Increasing heap on both client and server JVMs (via 
>> HADOOP_HEAPSIZE) doesn't have any impact.

>>
>> We were only able to work around the client OOMEs by reducing the number of 
>> partitions in the table.
>>
>> Looking at the MySQL query log, my thought is that the Hive client is quite 
>> busy making requests for partitions that don't contribute to the query.  
>> Has anyone else had a similar experience with tables this size?

>>
>> Thanks,
>> Norbert
>

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than 
usual.
