See slide #9 from my Optimizing Hive Queries talk http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we will improve it, but for now you are much better off with 1,000 partitions than 10,000.
-- Owen On Thu, Apr 4, 2013 at 4:21 PM, Ramki Palle <ramki.pa...@gmail.com> wrote: > Is it possible for you to send the explain plan of these two queries? > > Regards, > Ramki. > > > On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian < > sanjay.subraman...@wizecommerce.com> wrote: > >> The slow down is most possibly due to large number of partitions. >> I believe the Hive book authors tell us to be cautious with large number >> of partitions :-) and I abide by that. >> >> Users >> Please add your points of view and experiences >> >> Thanks >> sanjay >> >> From: Ian <liu...@yahoo.com> >> Reply-To: "user@hive.apache.org" <user@hive.apache.org>, Ian < >> liu...@yahoo.com> >> Date: Thursday, April 4, 2013 4:01 PM >> To: "user@hive.apache.org" <user@hive.apache.org> >> Subject: Partition performance >> >> Hi, >> >> I created 3 years of hourly log files (totally 26280 files), and use >> External Table with partition to query. I tried two partition methods. >> >> 1). Log files are stored as /test1/2013/04/02/16/000000_0 (A directory >> per hour). Use date and hour as partition keys. Add 3 years of directories >> to the table partitions. So there are 26280 partitions. >> CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt >> string, hr int); >> ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) >> LOCATION '/test1/2013/04/02/16'; >> >> 2). Log files are stored as /test2/2013/04/02/16_000000_0 (A directory >> per day, 24 files in each directory). Use date as partition key. Add 3 >> years of directories to the table partitions. So there are 1095 partitions. >> CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt >> string); >> ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION >> '/test2/2013/04/02'; >> >> When doing a simple query like >> SELECT * FROM test1/test2 WHERE dt >= '2013-02-01' and dt <= >> '2013-02-14' >> Using approach #1 takes 320 seconds, but #2 only takes 70 seconds. >> >> I'm wondering why there is a big performance difference between these >> two? These two approaches have the same number of files, only the directory >> structure is different. So Hive is going to load the same amount of files. >> Why does the number of partitions have such big impact? Does that mean #2 >> is a better partition strategy? >> >> Thanks. >> >> >> >> CONFIDENTIALITY NOTICE >> ====================== >> This email message and any attachments are for the exclusive use of the >> intended recipient(s) and may contain confidential and privileged >> information. Any unauthorized review, use, disclosure or distribution is >> prohibited. If you are not the intended recipient, please contact the >> sender by reply email and destroy all copies of the original message along >> with any attachments, from your computer system. If you are the intended >> recipient, please be advised that the content of this message is subject to >> access, review and disclosure by the sender's Email System Administrator. >> > >