Re: Partition performance

Owen O'Malley Thu, 04 Apr 2013 16:25:42 -0700

See slide #9 from my Optimizing Hive Queries talk
http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we
will improve it, but for now you are much better off with 1,000 partitions
than 10,000.


-- Owen


On Thu, Apr 4, 2013 at 4:21 PM, Ramki Palle <ramki.pa...@gmail.com> wrote:

> Is it possible for you to send the explain plan of these two queries?
>
> Regards,
> Ramki.
>
>
> On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian <
> sanjay.subraman...@wizecommerce.com> wrote:
>
>>  The slow down is most possibly due to large number of partitions.
>> I believe the Hive book authors tell us to be cautious with large number
>> of partitions :-)  and I abide by that.
>>
>>  Users
>> Please add your points of view and experiences
>>
>>  Thanks
>> sanjay
>>
>>   From: Ian <liu...@yahoo.com>
>> Reply-To: "user@hive.apache.org" <user@hive.apache.org>, Ian <
>> liu...@yahoo.com>
>> Date: Thursday, April 4, 2013 4:01 PM
>> To: "user@hive.apache.org" <user@hive.apache.org>
>> Subject: Partition performance
>>
>>   Hi,
>>
>> I created 3 years of hourly log files (totally 26280 files), and use
>> External Table with partition to query. I tried two partition methods.
>>
>> 1). Log files are stored as /test1/2013/04/02/16/000000_0 (A directory
>> per hour). Use date and hour as partition keys. Add 3 years of directories
>> to the table partitions. So there are 26280 partitions.
>>         CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt
>> string, hr int);
>>         ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16)
>> LOCATION '/test1/2013/04/02/16';
>>
>> 2). Log files are stored as /test2/2013/04/02/16_000000_0 (A directory
>> per day, 24 files in each directory). Use date as partition key. Add 3
>> years of directories to the table partitions. So there are 1095 partitions.
>>          CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt
>> string);
>>         ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION
>> '/test2/2013/04/02';
>>
>> When doing a simple query like
>>     SELECT * FROM  test1/test2  WHERE  dt >= '2013-02-01' and dt <=
>> '2013-02-14'
>>  Using approach #1 takes 320 seconds, but #2 only takes 70 seconds.
>>
>> I'm wondering why there is a big performance difference between these
>> two? These two approaches have the same number of files, only the directory
>> structure is different. So Hive is going to load the same amount of files.
>> Why does the number of partitions have such big impact? Does that mean #2
>> is a better partition strategy?
>>
>> Thanks.
>>
>>
>>
>> CONFIDENTIALITY NOTICE
>> ======================
>> This email message and any attachments are for the exclusive use of the
>> intended recipient(s) and may contain confidential and privileged
>> information. Any unauthorized review, use, disclosure or distribution is
>> prohibited. If you are not the intended recipient, please contact the
>> sender by reply email and destroy all copies of the original message along
>> with any attachments, from your computer system. If you are the intended
>> recipient, please be advised that the content of this message is subject to
>> access, review and disclosure by the sender's Email System Administrator.
>>
>
>

Re: Partition performance

Reply via email to