How big were the files in each case in your experiment? Having lots of
small files will add Hadoop overhead.

Also, it would be useful to know the execution times of the map and reduce
tasks. The rule of thumb is that under 20 seconds each, or so, you're
paying a significant of the execution time in startup and shutdown overhead.
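To put a rough number on that rule of thumb (the 5-second fixed per-task cost below is an assumed figure for illustration, not a measured value; actual JVM startup/teardown cost varies by cluster):

```python
# Sketch: fraction of a task's wall time lost to fixed per-task
# overhead (JVM startup/teardown, scheduling). The 5 s figure is
# an assumption, not a measurement.
def overhead_fraction(task_seconds, fixed_overhead_seconds=5.0):
    return fixed_overhead_seconds / (task_seconds + fixed_overhead_seconds)

# A 15 s task spends 25% of its wall time in overhead;
# a 120 s task spends only 4%.
```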

Of course, another factor is the number of tasks your cluster can run in
parallel. Scanning 20K partitions with a 1K MapReduce slot capacity over
the cluster will obviously take ~20 passes vs. ~1 pass for 1K partitions.
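Back-of-the-envelope, using the numbers above (one map task per partition is assumed here for simplicity):

```python
import math

# Number of scheduling "waves" (passes) a job needs when the task
# count exceeds the cluster's concurrent slot capacity.
def waves(num_tasks, num_slots):
    return math.ceil(num_tasks / num_slots)

# 20K partitions (one map task each) on 1K slots -> 20 waves,
# versus a single wave for 1K partitions.
```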

dean

On Tue, Jul 2, 2013 at 4:34 AM, Peter Marron <
peter.mar...@trilliumsoftware.com> wrote:

>  ...
>
> From: Ian <liu...@yahoo.com>
> Reply-To: "user@hive.apache.org" <user@hive.apache.org>, Ian <
> liu...@yahoo.com>
> Date: Thursday, April 4, 2013 4:01 PM
> To: "user@hive.apache.org" <user@hive.apache.org>
> Subject: Partition performance
>
>
> Hi,
>
> I created 3 years of hourly log files (26280 files in total) and query them
> through an external table with partitions. I tried two partitioning
> methods.
>
>
> 1). Log files are stored as /test1/2013/04/02/16/000000_0 (a directory per
> hour). Use date and hour as partition keys. Add 3 years of directories to
> the table partitions. So there are 26280 partitions.
>
>         CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt
> string, hr int);
>
>         ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION
> '/test1/2013/04/02/16';
>
>
> 2). Log files are stored as /test2/2013/04/02/16_000000_0 (a directory per
> day, 24 files in each directory). Use date as the partition key. Add 3
> years of directories to the table partitions. So there are 1095 partitions.
>
>         CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt
> string);
>
>         ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION
> '/test2/2013/04/02';
>
>
> When doing a simple query like
>
>     SELECT * FROM test1/test2 WHERE dt >= '2013-02-01' AND dt <=
> '2013-02-14'
>
> approach #1 takes 320 seconds, but #2 takes only 70 seconds.
>
>
> I'm wondering why there is such a big performance difference between these
> two. Both approaches have the same number of files; only the directory
> structure is different, so Hive is going to load the same amount of data.
> Why does the number of partitions have such a big impact? Does that mean #2
> is a better partitioning strategy?
>
> Thanks.
>



-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com
