Re: Metastore performance on HDFS-backed table with 15000+ partitions

2014-02-27 Thread Norbert Burger
Thanks everyone for the feedback.  Just to follow up in case someone else
runs into this: I can confirm that local client works around the OOMEs, but
it's still very slow.

It does seem like we were hitting some combination of HIVE-4051 and
HIVE-5158.  We'll try reducing partition count first, and then switch to
0.12.0 if that doesn't improve things significantly.

Fwiw - http://www.slideshare.net/oom65/optimize-hivequeriespptx also has
some good rules-of-thumb.
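
For anyone following along, here is a rough sketch of the reduce-partition-count
route: roll the fine-grained partitions up into a coarser layout with a
dynamic-partition insert. The table and column names (events_fine, events_daily,
y/m/d, dt) and the max-dynamic-partitions values are hypothetical - adjust them
to the real schema and volumes.

-- Hypothetical sketch: consolidate a fine-grained layout into daily partitions.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=20000;
SET hive.exec.max.dynamic.partitions.pernode=20000;

-- Coarser target table: one partition per day instead of one per leaf value.
CREATE EXTERNAL TABLE events_daily (
  event_id STRING,
  payload  STRING
)
PARTITIONED BY (dt STRING)
LOCATION '/data/events_daily';

-- Dynamic-partition insert: the last SELECT column (dt) picks the target partition.
INSERT OVERWRITE TABLE events_daily PARTITION (dt)
SELECT event_id,
       payload,
       concat(y, '-', m, '-', d) AS dt
FROM events_fine;

Since the planner loads partition metadata to plan a query (as noted below),
cutting the partition count directly cuts what the client or HS2 has to hold
in memory.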

Norbert


On Sat, Feb 22, 2014 at 1:27 PM, Stephen Sprague sprag...@gmail.com wrote:

 Yeah. That traceback pretty much spells it out - it's metastore-related, and
 that's where the partitions are stored.

 I'm with the others on this. HiveServer2 is still a little janky on
 memory management.  I bounce mine once a day at midnight just to play it
 safe (and because I can).

 Again, for me, I use the Hive local client for production jobs and the remote
 client for ad-hoc stuff.

 You may wish to confirm the local Hive client has no problem with your
 query.

 Other than that, you can either increase the heap size on the HS2 process and
 hope for the best, and/or file a bug report.

 Bottom line: HiveServer2 isn't production bulletproof just yet, IMHO.
 Others may disagree.

 Regards,
 Stephen.



 On Sat, Feb 22, 2014 at 9:50 AM, Norbert Burger 
 norbert.bur...@gmail.com wrote:

 Thanks all for the quick feedback.

 I'm a bit surprised to learn 15k is considered too much, but we can work
 around it.  I guess I'm also curious why the query planner needs to know
 about all partitions even in the case of simple select/limit queries, where
 the query might target only a single partition.
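
 One way to see which partitions a statement actually pulls into its plan is
 EXPLAIN EXTENDED, which lists the partition paths the plan will read; a
 predicate on the partition columns should shrink that list (though, per
 HIVE-4051/HIVE-5158 above, the metastore calls can still be heavier than
 expected). A hypothetical sketch - the table and partition column names are
 made up:

 -- Hypothetical table (events) and partition columns (y, m, d).
 EXPLAIN EXTENDED
 SELECT *
 FROM events
 WHERE y = '2014' AND m = '02' AND d = '21'
 LIMIT 10;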

 Here's the client-side OOME with HADOOP_HEAPSIZE=2048:


 https://gist.githubusercontent.com/nburger/3286d2052060e2efe161/raw/dc30231086803c1d33b9137b5844d2d0e20e350d/gistfile1.txt

 This was from a CDH 4.3.0 client hitting HiveServer2.  Any idea what's
 consuming the heap?

 Norbert


 On Sat, Feb 22, 2014 at 10:32 AM, Edward Capriolo 
 edlinuxg...@gmail.com wrote:

 Don't make tables with that many partitions. It is an anti-pattern. I
 have tables with 2000 partitions a day and that is really too many. Hive
 needs to load that information into memory to plan the query.


 On Saturday, February 22, 2014, Terje Marthinussen 
 tmarthinus...@gmail.com wrote:
  The query optimizer in Hive is awful on memory consumption. 15k partitions
 sounds a bit early for it to fail, though...
 
  What is your heap size?
 
  Regards,
  Terje
 
  On 22 Feb 2014, at 12:05, Norbert Burger norbert.bur...@gmail.com
 wrote:
 
  Hi folks,
 
  We are running CDH 4.3.0 Hive (0.10.0+121) with a MySQL metastore.
 
  In Hive, we have an external table backed by HDFS which has a 3-level
 partitioning scheme that currently has 15000+ partitions.
 
  Within the last day or so, queries against this table have started
 failing.  A simple query which shouldn't take very long at all (select *
 from ... limit 10) fails after several minutes with a client OOME.  I get
 the same outcome on count(*) queries (which I thought wouldn't send any
 data back to the client).  Increasing heap on both client and server JVMs
 (via HADOOP_HEAPSIZE) doesn't have any impact.
 
  We were only able to work around the client OOMEs by reducing the
 number of partitions in the table.
 
  Looking at the MySQL query log, my thought is that the Hive client is
 quite busy making requests for partitions that don't contribute to the
 query.  Has anyone else had a similar experience with tables this size?
 
  Thanks,
  Norbert
 

 --
 Sorry this was sent from mobile. Will do less grammar and spell check
 than usual.






RE: Metastore performance on HDFS-backed table with 15000+ partitions

2014-02-27 Thread java8964
That is good to know.
We are using Hive 0.9. Right now the biggest table contains 2 years of data, and
we partition it by hour, as the data volume is big.
So right now it has 2*365*24, around 17000+ partitions. So far we haven't seen
too many problems yet, but I do have some concerns about it.
We are using IBM BigInsights, which uses Derby as the Hive metastore, rather than
MySQL, which is where most of my experience is.
Yong
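
If that partition count ever does become a problem, one option on an external
table is to retire old partitions from the metastore: for an EXTERNAL table,
DROP PARTITION removes only the metadata and leaves the HDFS files in place.
A hypothetical sketch - the table and partition column names are made up:

-- Hypothetical table/columns (web_logs, dt, hr). On an EXTERNAL table this
-- removes metastore entries only; the underlying HDFS data stays put.
ALTER TABLE web_logs DROP PARTITION (dt='2012-01-01', hr='00');
ALTER TABLE web_logs DROP PARTITION (dt='2012-01-01', hr='01');
-- ...and so on (scripted) over the range being retired, then verify:
SHOW PARTITIONS web_logs;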

From: norbert.bur...@gmail.com
Date: Thu, 27 Feb 2014 07:57:05 -0500
Subject: Re: Metastore performance on HDFS-backed table with 15000+ partitions
To: user@hive.apache.org

Thanks everyone for the feedback.  Just to follow up in case someone else runs
into this: I can confirm that local client works around the OOMEs, but it's
still very slow.
It does seem like we were hitting some combination of HIVE-4051 and HIVE-5158.
We'll try reducing partition count first, and then switch to 0.12.0 if that
doesn't improve things significantly.

Fwiw - http://www.slideshare.net/oom65/optimize-hivequeriespptx also has
some good rules-of-thumb.

Norbert


Re: Metastore performance on HDFS-backed table with 15000+ partitions

2014-02-22 Thread Terje Marthinussen
The query optimizer in Hive is awful on memory consumption. 15k partitions sounds a
bit early for it to fail, though...

What is your heap size?

Regards,
Terje

 On 22 Feb 2014, at 12:05, Norbert Burger norbert.bur...@gmail.com wrote:
 
 Hi folks,
 
 We are running CDH 4.3.0 Hive (0.10.0+121) with a MySQL metastore.
 
 In Hive, we have an external table backed by HDFS which has a 3-level 
 partitioning scheme that currently has 15000+ partitions.
 
 Within the last day or so, queries against this table have started failing.  
 A simple query which shouldn't take very long at all (select * from ... limit 
 10) fails after several minutes with a client OOME.  I get the same outcome 
 on count(*) queries (which I thought wouldn't send any data back to the 
 client).  Increasing heap on both client and server JVMs (via 
 HADOOP_HEAPSIZE) doesn't have any impact.
 
 We were only able to work around the client OOMEs by reducing the number of 
 partitions in the table.
 
 Looking at the MySQL query log, my thought is that the Hive client is quite
 busy making requests for partitions that don't contribute to the query.
 Has anyone else had a similar experience with tables this size?
 
 Thanks,
 Norbert


Re: Metastore performance on HDFS-backed table with 15000+ partitions

2014-02-22 Thread Edward Capriolo
Don't make tables with that many partitions. It is an anti-pattern. I have
tables with 2000 partitions a day and that is really too many. Hive needs to
load that information into memory to plan the query.
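
One related guard rail, sketched here with hypothetical table and partition
column names (events, y, m): hive.mapred.mode=strict makes Hive reject queries
against a partitioned table that do not filter on a partition column, which
keeps an accidental full-table query from dragging every partition into the
plan.

-- Reject queries on partitioned tables that lack a partition-column predicate.
SET hive.mapred.mode=strict;

-- Allowed: the WHERE clause names partition columns, so only matching
-- partitions need to be considered by the planner.
SELECT count(*)
FROM events
WHERE y = '2014' AND m = '02';

-- Rejected under strict mode (no partition predicate):
-- SELECT count(*) FROM events;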

On Saturday, February 22, 2014, Terje Marthinussen tmarthinus...@gmail.com
wrote:
 The query optimizer in Hive is awful on memory consumption. 15k partitions
sounds a bit early for it to fail, though...

 What is your heap size?

 Regards,
 Terje

 On 22 Feb 2014, at 12:05, Norbert Burger norbert.bur...@gmail.com
wrote:

 Hi folks,

 We are running CDH 4.3.0 Hive (0.10.0+121) with a MySQL metastore.

 In Hive, we have an external table backed by HDFS which has a 3-level
partitioning scheme that currently has 15000+ partitions.

 Within the last day or so, queries against this table have started
failing.  A simple query which shouldn't take very long at all (select *
from ... limit 10) fails after several minutes with a client OOME.  I get
the same outcome on count(*) queries (which I thought wouldn't send any
data back to the client).  Increasing heap on both client and server JVMs
(via HADOOP_HEAPSIZE) doesn't have any impact.

 We were only able to work around the client OOMEs by reducing the number
of partitions in the table.

 Looking at the MySQL query log, my thought is that the Hive client is
quite busy making requests for partitions that don't contribute to the
query.  Has anyone else had a similar experience with tables this size?

 Thanks,
 Norbert


-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: Metastore performance on HDFS-backed table with 15000+ partitions

2014-02-22 Thread Stephen Sprague
Yeah. That traceback pretty much spells it out - it's metastore-related, and
that's where the partitions are stored.

I'm with the others on this. HiveServer2 is still a little janky on memory
management.  I bounce mine once a day at midnight just to play it safe (and
because I can).

Again, for me, I use the Hive local client for production jobs and the remote
client for ad-hoc stuff.

You may wish to confirm the local Hive client has no problem with your
query.

Other than that, you can either increase the heap size on the HS2 process and
hope for the best, and/or file a bug report.

Bottom line: HiveServer2 isn't production bulletproof just yet, IMHO.
Others may disagree.

Regards,
Stephen.



On Sat, Feb 22, 2014 at 9:50 AM, Norbert Burger norbert.bur...@gmail.com wrote:

 Thanks all for the quick feedback.

 I'm a bit surprised to learn 15k is considered too much, but we can work
 around it.  I guess I'm also curious why the query planner needs to know
 about all partitions even in the case of simple select/limit queries, where
 the query might target only a single partition.

 Here's the client-side OOME with HADOOP_HEAPSIZE=2048:


 https://gist.githubusercontent.com/nburger/3286d2052060e2efe161/raw/dc30231086803c1d33b9137b5844d2d0e20e350d/gistfile1.txt

 This was from a CDH 4.3.0 client hitting HiveServer2.  Any idea what's
 consuming the heap?

 Norbert


 On Sat, Feb 22, 2014 at 10:32 AM, Edward Capriolo 
 edlinuxg...@gmail.com wrote:

 Don't make tables with that many partitions. It is an anti-pattern. I have
 tables with 2000 partitions a day and that is really too many. Hive needs to
 load that information into memory to plan the query.


 On Saturday, February 22, 2014, Terje Marthinussen 
 tmarthinus...@gmail.com wrote:
  The query optimizer in Hive is awful on memory consumption. 15k partitions
 sounds a bit early for it to fail, though...
 
  What is your heap size?
 
  Regards,
  Terje
 
  On 22 Feb 2014, at 12:05, Norbert Burger norbert.bur...@gmail.com
 wrote:
 
  Hi folks,
 
  We are running CDH 4.3.0 Hive (0.10.0+121) with a MySQL metastore.
 
  In Hive, we have an external table backed by HDFS which has a 3-level
 partitioning scheme that currently has 15000+ partitions.
 
  Within the last day or so, queries against this table have started
 failing.  A simple query which shouldn't take very long at all (select *
 from ... limit 10) fails after several minutes with a client OOME.  I get
 the same outcome on count(*) queries (which I thought wouldn't send any
 data back to the client).  Increasing heap on both client and server JVMs
 (via HADOOP_HEAPSIZE) doesn't have any impact.
 
  We were only able to work around the client OOMEs by reducing the
 number of partitions in the table.
 
  Looking at the MySQL query log, my thought is that the Hive client is
 quite busy making requests for partitions that don't contribute to the
 query.  Has anyone else had a similar experience with tables this size?
 
  Thanks,
  Norbert
 

 --
 Sorry this was sent from mobile. Will do less grammar and spell check
 than usual.





Re: Metastore performance on HDFS-backed table with 15000+ partitions

2014-02-21 Thread Stephen Sprague
Most interesting.  We had an issue recently with querying a table with 15K
columns and running out of heap storage, but not 15K partitions.

15K partitions shouldn't be causing a problem, in my humble estimation.
Maybe a million, but not 15K. :)

So is there a traceback we can look at? Or is it not heap but real memory?

And is this the local Hive client, or HiveServer?

Thanks,
Stephen.



On Fri, Feb 21, 2014 at 7:05 PM, Norbert Burger norbert.bur...@gmail.com wrote:

 Hi folks,

 We are running CDH 4.3.0 Hive (0.10.0+121) with a MySQL metastore.

 In Hive, we have an external table backed by HDFS which has a 3-level
 partitioning scheme that currently has 15000+ partitions.

 Within the last day or so, queries against this table have started
 failing.  A simple query which shouldn't take very long at all (select *
 from ... limit 10) fails after several minutes with a client OOME.  I get
 the same outcome on count(*) queries (which I thought wouldn't send any
 data back to the client).  Increasing heap on both client and server JVMs
 (via HADOOP_HEAPSIZE) doesn't have any impact.

 We were only able to work around the client OOMEs by reducing the number
 of partitions in the table.

 Looking at the MySQL query log, my thought is that the Hive client is quite
 busy making requests for partitions that don't contribute to the query.
 Has anyone else had a similar experience with tables this size?

 Thanks,
 Norbert