RE: Partition performance

Peter Marron Tue, 02 Jul 2013 02:36:10 -0700

Hi Owen,

I’m curious about this advice about partitioning. Is there some fundamental 
reason why Hive
is slow when the number of partitions is 10,000 rather than 1,000? And the 
improvements
that you mention are they going to be in version 12? Is there a JIRA raised so 
that I can track them?
(It’s not currently a problem for me but I can see that I am going to need to 
be able to explain the situation.)

Warm regards,

Z

From: Owen O'Malley [mailto:omal...@apache.org]
Sent: 05 April 2013 00:26
To: user@hive.apache.org
Subject: Re: Partition performance

See slide #9 from my Optimizing Hive Queries talk 
http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we will 
improve it, but for now you are much better off with 1,000 partitions than 
10,000.

-- Owen

On Thu, Apr 4, 2013 at 4:21 PM, Ramki Palle 
<ramki.pa...@gmail.com<mailto:ramki.pa...@gmail.com>> wrote:
Is it possible for you to send the explain plan of these two queries?
Regards,
Ramki.

On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian 
<sanjay.subraman...@wizecommerce.com<mailto:sanjay.subraman...@wizecommerce.com>>
 wrote:
The slow down is most possibly due to large number of partitions.
I believe the Hive book authors tell us to be cautious with large number of 
partitions :-)  and I abide by that.

Users
Please add your points of view and experiences

Thanks
sanjay

From: Ian <liu...@yahoo.com<mailto:liu...@yahoo.com>>
Reply-To: "user@hive.apache.org<mailto:user@hive.apache.org>" 
<user@hive.apache.org<mailto:user@hive.apache.org>>, Ian 
<liu...@yahoo.com<mailto:liu...@yahoo.com>>
Date: Thursday, April 4, 2013 4:01 PM
To: "user@hive.apache.org<mailto:user@hive.apache.org>" 
<user@hive.apache.org<mailto:user@hive.apache.org>>
Subject: Partition performance

Hi,

I created 3 years of hourly log files (totally 26280 files), and use External 
Table with partition to query. I tried two partition methods.

1). Log files are stored as /test1/2013/04/02/16/000000_0 (A directory per 
hour). Use date and hour as partition keys. Add 3 years of directories to the 
table partitions. So there are 26280 partitions.
        CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt string, 
hr int);
        ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION 
'/test1/2013/04/02/16';

2). Log files are stored as /test2/2013/04/02/16_000000_0 (A directory per day, 
24 files in each directory). Use date as partition key. Add 3 years of 
directories to the table partitions. So there are 1095 partitions.
        CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt string);
        ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION 
'/test2/2013/04/02';

When doing a simple query like
    SELECT * FROM  test1/test2  WHERE  dt >= '2013-02-01' and dt <= '2013-02-14'
Using approach #1 takes 320 seconds, but #2 only takes 70 seconds.

I'm wondering why there is a big performance difference between these two? 
These two approaches have the same number of files, only the directory 
structure is different. So Hive is going to load the same amount of files. Why 
does the number of partitions have such big impact? Does that mean #2 is a 
better partition strategy?

Thanks.

CONFIDENTIALITY NOTICE
======================
This email message and any attachments are for the exclusive use of the 
intended recipient(s) and may contain confidential and privileged information. 
Any unauthorized review, use, disclosure or distribution is prohibited. If you 
are not the intended recipient, please contact the sender by reply email and 
destroy all copies of the original message along with any attachments, from 
your computer system. If you are the intended recipient, please be advised that 
the content of this message is subject to access, review and disclosure by the 
sender's Email System Administrator.

RE: Partition performance

Reply via email to