---------- Forwarded message ----------
From: John Omernik <j...@omernik.com>
Date: Tue, Dec 2, 2014 at 1:58 PM
Subject: Re: Files Per Partition Causing Slowness
To: user@hive.apache.org
Thank you Edward, I knew the number of partitions mattered, but I didn't think 1000 would be too much. However, I didn't realize the number of files per partition was also a factor prior to job submission. I am looking at reducing some of those now too.

Out of curiosity, if I have a per-day partition for three years of data, how would I set up bucketing to keep my partition count lower? I am struggling to find a way to approach this problem.

Thanks!

On Tue, Dec 2, 2014 at 12:28 PM, John Omernik <j...@omernik.com> wrote:
>
> Thank you Edward, I knew the number of partitions mattered, and knew I was
> getting high; however, I didn't realize the number of files per partition was
> also a factor prior to job submission.
>
> Thanks!
>
> John
>
> On Tue, Dec 2, 2014 at 11:35 AM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>>
>> This is discussed in the Programming Hive book. The more files, the longer it
>> takes the job tracker to plan the job. The more tasks, the more things the
>> job tracker has to track. The more partitions, the more metastore lookups are
>> required. All of these things limit throughput. I do not like tables with
>> more than 100 partitions; above that I would switch to bucketing or some
>> other mechanism (application-level partitioning).
>>
>> On Tue, Dec 2, 2014 at 12:25 PM, John Omernik <j...@omernik.com> wrote:
>>>
>>> I am running Hive 0.12 in production. I have a table that has 1100
>>> partitions (flat, no multi-level partitions), and in those partitions some
>>> have a small number of files (5-10) and others have quite a few files (up
>>> to 120). The total table size is not "huge", around 285 GB.
>>>
>>> While this is not terrible to my eyes, when I try to run a query on lots of
>>> partitions, say all 1100, the time from query start to the time the query is
>>> submitted to the jobtracker is horribly slow. For example, it can take up
>>> to 3.5 minutes just to get to the point where the job is seen in the job
>>> tracker.
>>>
>>> Is the number of files here what's hurting me? Is there some sort of
>>> per-file enumeration going on under the hood in Hive? I ran Hive with debug
>>> mode on and saw lots of file calls for each individual file... I guess I am
>>> curious: for others out there who may have similar tables, would a query
>>> like that take a horribly long time for you as well? Is this "normal" or am
>>> I seeing issues here?
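
For a rough sense of scale, the figures quoted in this thread can be turned into back-of-the-envelope bounds. The sketch below is only arithmetic on the numbers John reported (1100 partitions, 5 to 120 files each, about 285 GB). The thread does not give the real distribution of files, so the true count lies somewhere between the two bounds. The monthly partitioning at the end is a hypothetical alternative used purely for comparison with Edward's 100-partition rule of thumb; nobody in the thread proposed it.

```python
# Figures reported in the thread.
PARTITIONS = 1100               # flat daily partitions
FILES_MIN, FILES_MAX = 5, 120   # files per partition (smallest and largest reported)
TABLE_GB = 285                  # approximate total table size


def file_count_bounds(partitions, files_min, files_max):
    """Return (lowest, highest) possible total file count for the table."""
    return partitions * files_min, partitions * files_max


def avg_file_mb(table_gb, file_count):
    """Average file size in MB if the table were spread evenly over file_count files."""
    return table_gb * 1024 / file_count


low, high = file_count_bounds(PARTITIONS, FILES_MIN, FILES_MAX)

# Partition counts for three years of data (leap days ignored).
daily_partitions = 3 * 365
# ASSUMPTION: monthly partitioning is a hypothetical alternative, not from the thread.
monthly_partitions = 3 * 12

print(f"total files: between {low} and {high}")
print(f"average file size: between {avg_file_mb(TABLE_GB, high):.1f} MB "
      f"and {avg_file_mb(TABLE_GB, low):.1f} MB")
print(f"daily partitions over three years: {daily_partitions}")
print(f"monthly partitions over three years (hypothetical): {monthly_partitions}")
```

Even at the low end, a query over every partition has thousands of files to look at before the job can be submitted, which fits the per-file calls John saw in debug mode.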