RE: Files Per Partition Causing Slowness

Mike Roberts Tue, 02 Dec 2014 12:05:23 -0800

unsubscribe

-----Original Message-----
From: John Omernik [mailto:j...@omernik.com] 
Sent: Tuesday, December 2, 2014 1:01 PM
To: user@hive.apache.org
Subject: Fwd: Files Per Partition Causing Slowness


---------- Forwarded message ----------
From: John Omernik <j...@omernik.com>
Date: Tue, Dec 2, 2014 at 1:58 PM
Subject: Re: Files Per Partition Causing Slowness
To: user@hive.apache.org


Thank you Edward, I knew the number of partitions mattered, but I didn't think 
1000 would be too much. However, I didn't realize the number of files per 
partition was also a factor prior to job submission.
I am looking at reducing some of those now too.

Out of curiosity, if I have a per-day partition for three years of data, how 
would I set up bucketing to keep my partition count lower? I am struggling to 
find a way to approach this problem.
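
For scale, here is a rough sketch of the arithmetic behind the question. It assumes a hypothetical date range (2012-01-01 through 2014-12-31) standing in for "three years of data", and it assumes a coarser partition key (month or year) would replace the daily one; neither detail comes from the actual table. It only counts distinct partition values and compares them against the roughly-100-partition guideline mentioned earlier in the thread. It does not touch Hive at all.

```python
from datetime import date, timedelta

def partition_counts(start, end):
    """Count distinct partition values at day, month, and year granularity
    for every calendar day in the inclusive range [start, end]."""
    days, months, years = set(), set(), set()
    d = start
    while d <= end:
        days.add(d)
        months.add((d.year, d.month))
        years.add(d.year)
        d += timedelta(days=1)
    return {"day": len(days), "month": len(months), "year": len(years)}

# Hypothetical three-year window; substitute the real range of the table.
counts = partition_counts(date(2012, 1, 1), date(2014, 12, 31))

GUIDELINE = 100  # rule of thumb quoted in the thread, not a hard Hive limit
for granularity, n in counts.items():
    status = "over" if n > GUIDELINE else "within"
    print(f"{granularity}: {n} partitions ({status} the ~{GUIDELINE} guideline)")
```

Under these assumptions, daily partitioning lands near the 1100 partitions described in the original question, while monthly partitioning stays well under the guideline; whether the day is then kept as an ordinary column or handled through bucketing is a separate design decision.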


Thanks!

On Tue, Dec 2, 2014 at 12:28 PM, John Omernik <j...@omernik.com> wrote:
>
> Thank you Edward, I knew the number of partitions mattered, and knew I was 
> getting high; however, I didn't realize the number of files per partition was 
> also a factor prior to job submission.
>
> Thanks!
>
> John
>
> On Tue, Dec 2, 2014 at 11:35 AM, Edward Capriolo <edlinuxg...@gmail.com> 
> wrote:
>>
>> This is discussed in the Programming Hive book. The more files, the 
>> longer it takes the job tracker to plan the job. The more tasks, the 
>> more things the job tracker has to track. The more partitions, the 
>> more metastore lookups are required. All of these things limit 
>> throughput. I do not like tables with more than 100 partitions; above 
>> that I would switch to bucketing or some other mechanism 
>> (application-level partitioning).
>>
>> On Tue, Dec 2, 2014 at 12:25 PM, John Omernik <j...@omernik.com> wrote:
>>>
>>> I am running Hive 0.12 in production. I have a table that has 1100 
>>> partitions (flat, no multi-level partitions), and in those partitions some 
>>> have a small number of files (5-10) and others have quite a few files (up 
>>> to 120). The total table size is not "huge", around 285 GB.
>>>
>>> While this is not terrible to my eyes, when I try to run a query on lots of 
>>> partitions, say all 1100, the time from query start to the time the query is 
>>> submitted to the job tracker is horribly slow. For example, it can take up 
>>> to 3.5 minutes just to get to the point where the job is seen in the job 
>>> tracker.
>>> Is the number of files here what's hurting me? Is there some sort of 
>>> per-file enumeration going on under the hood in Hive? I ran Hive with debug 
>>> mode on and saw lots of file calls for each individual file. I am curious 
>>> whether others out there have similar tables: would a query like that take 
>>> a horribly long time for you as well? Is this "normal", or am I seeing 
>>> issues here?
>>>
>>>
>>
>
