Fwd: Files Per Partition Causing Slowness

John Omernik Tue, 02 Dec 2014 12:03:50 -0800

---------- Forwarded message ----------
From: John Omernik <j...@omernik.com>
Date: Tue, Dec 2, 2014 at 1:58 PM
Subject: Re: Files Per Partition Causing Slowness
To: user@hive.apache.org



Thank you Edward, I knew the number of partitions mattered, but I
didn't think 1000 would be too much. However, I didn't realize the
number of files per partition was also a factor prior to job submission.
I am looking at reducing some of those now too.

Out of curiosity, if I have a per-day partition for three years of
data, how would I set up bucketing to keep my partition count lower? I am
struggling to find a way to approach this problem.
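
One way to frame the question is to count how many partitions each
granularity would produce over three years. The following is a minimal
sketch, assuming a hypothetical date range (the table's real dates are
not given in this thread) and a hypothetical helper named
partition_counts. It only does the arithmetic; it does not show how Hive
would bucket rows inside a coarser partition.

```python
from datetime import date, timedelta


def partition_counts(start, end):
    """Count distinct partition keys over [start, end] at three granularities."""
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    return {
        "daily": len(days),
        "monthly": len({(d.year, d.month) for d in days}),
        "yearly": len({d.year for d in days}),
    }


# Hypothetical three-year range, chosen only for illustration.
counts = partition_counts(date(2012, 1, 1), date(2014, 12, 31))
print(counts)
```

Under that assumed range, a monthly key would stay below the
100-partition guideline quoted below, while a daily key would not.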


Thanks!

On Tue, Dec 2, 2014 at 12:28 PM, John Omernik <j...@omernik.com> wrote:
>
> Thank you Edward, I knew the number of partitions mattered, and knew I was 
> getting high; however, I didn't realize the number of files per partition was 
> also a factor prior to job submission.
>
> Thanks!
>
> John
>
> On Tue, Dec 2, 2014 at 11:35 AM, Edward Capriolo <edlinuxg...@gmail.com> 
> wrote:
>>
>> This is discussed in the Programming Hive book. The more files, the longer it 
>> takes the job tracker to plan the job. The more tasks, the more things the 
>> job tracker has to track. The more partitions, the more metastore lookups are 
>> required. All of these things limit throughput. I do not like tables with 
>> more than 100 partitions; above that I would switch to bucketing or some 
>> other mechanism (application-level partitioning).
>>
>> On Tue, Dec 2, 2014 at 12:25 PM, John Omernik <j...@omernik.com> wrote:
>>>
>>> I am running Hive 0.12 in production. I have a table that has 1100 
>>> partitions (flat, no multi-level partitions), and in those partitions some 
>>> have a small number of files (5-10) and others have quite a few files (up 
>>> to 120). The total table size is not "huge", around 285 GB.
>>>
>>> While this is not terrible to my eyes, when I try to run a query on lots of 
>>> partitions, say all 1100, the time from query start to the time the query is 
>>> submitted to the jobtracker is horribly slow. For example, it can take up 
>>> to 3.5 minutes just to get to the point where the job is seen in the job 
>>> tracker.
>>> Is the number of files here what's hurting me? Is there some sort of per-file 
>>> enumeration going on under the hood in Hive? I ran Hive with debug 
>>> mode on and saw lots of file calls for each individual file. I am 
>>> curious whether others out there who have similar tables see the same: 
>>> would a query like that take a horribly long time for you as well? Is 
>>> this "normal" or am I seeing issues here?
>>>
>>>
>>
>
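
The figures in the original question (1100 partitions, 5 to 120 files
each, about 285 GB, up to 3.5 minutes before submission) can be turned
into rough bounds. This is a back-of-the-envelope sketch only: the real
distribution of files per partition is not given in the thread, so the
two extremes below are assumptions that bracket the true values, and
1 GB is taken as 1024 MB.

```python
# Figures quoted in the thread.
PARTITIONS = 1100
MIN_FILES, MAX_FILES = 5, 120      # files per partition, low and high end
TABLE_GB = 285
PLANNING_SECONDS = 3.5 * 60        # time before the job reaches the job tracker

# Bounds on the total number of files, assuming every partition sits at
# one extreme or the other (the real total lies somewhere in between).
low_files = PARTITIONS * MIN_FILES
high_files = PARTITIONS * MAX_FILES

# Corresponding bounds on the average file size in MB.
table_mb = TABLE_GB * 1024
avg_mb_if_few_files = table_mb / low_files
avg_mb_if_many_files = table_mb / high_files

# Pre-submission time spread evenly over the partitions.
sec_per_partition = PLANNING_SECONDS / PARTITIONS

print(f"total files: {low_files} to {high_files}")
print(f"average file size: {avg_mb_if_many_files:.1f} to {avg_mb_if_few_files:.1f} MB")
print(f"pre-submission time per partition: {sec_per_partition:.2f} s")
```

Even at the low end this is thousands of files to enumerate, and the
average file is at most a few tens of MB, which fits the reply above
that file count, not just partition count, affects planning time.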
