This is discussed in the Programming Hive book. The more files there are,
the longer it takes the JobTracker to plan the job; the more tasks, the
more things the JobTracker has to track; and the more partitions, the more
metastore lookups are required. All of these things limit throughput. I do
not like tables with more than 100 partitions; above that I would switch to
bucketing or some other mechanism such as application-level partitioning
(rough sketch below).
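To make the bucketing suggestion concrete, here is a minimal sketch. The
table and column names (events_bucketed, event_day, events_staging) are
made up for illustration, and the bucket count is arbitrary; the point is
that what would otherwise be 1100 date partitions collapses into a single
metastore object whose data is hash-distributed across a fixed number of
buckets:

    -- hypothetical table; event_day was the old partition column,
    -- now kept as an ordinary column and used as the bucketing key
    CREATE TABLE events_bucketed (
      event_time TIMESTAMP,
      event_day  STRING,
      payload    STRING
    )
    CLUSTERED BY (event_day) INTO 64 BUCKETS
    STORED AS ORC;

    -- on Hive 0.x, make inserts honor the declared bucket count
    SET hive.enforce.bucketing = true;

    INSERT INTO TABLE events_bucketed
    SELECT event_time, event_day, payload
    FROM events_staging;

The trade-off is that you give up partition pruning on event_day, so this
pays off when planning overhead dominates query time; in exchange you get
one metastore lookup instead of one per partition, plus efficient sampling
(TABLESAMPLE) and bucketed map joins.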

On Tue, Dec 2, 2014 at 12:25 PM, John Omernik <j...@omernik.com> wrote:

> I am running Hive 0.12 in production. I have a table that has 1100
> partitions (flat, no multi-level partitions); some partitions contain a
> small number of files (5-10) and others quite a few (up to 120). The
> total table size is not huge, around 285 GB.
>
> While this does not look terrible to my eyes, when I run a query over
> lots of partitions, say all 1100, the time from query start to the job
> being submitted to the JobTracker is horribly slow; it can take up to 3.5
> minutes before the job even appears in the JobTracker.
> Is the number of files what is hurting me here? Is there some sort of
> per-file enumeration going on under the hood in Hive? I ran Hive in debug
> mode and saw lots of file calls for each individual file. I am curious
> whether others with similar tables see the same thing: would a query like
> that take this long for you as well? Is this "normal", or am I seeing an
> issue here?
