Files Per Partition Causing Slowness

John Omernik Tue, 02 Dec 2014 09:27:02 -0800

I am running Hive 0.12 in production. I have a table that has 1100
partitions (flat, no multi-level partitions). Some of those partitions
have a small number of files (5-10) and others have quite a few (up
to 120). The total table size is not "huge", around 285 GB.


While this does not look terrible to my eyes, when I run a query over lots
of partitions, say all 1100, it takes a horribly long time to get from query
start to the point where the query is submitted to the jobtracker. For
example, it can take up to 3.5 minutes just for the job to show up in the
jobtracker.

Is the number of files what's hurting me here? Is there some sort of
per-file enumeration going on under the hood in Hive? I ran Hive with debug
mode on and saw lots of file calls for each individual file. I am curious
whether others out there with similar tables see a query like that take a
horribly long time as well. Is this "normal", or am I seeing issues here?
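
For a rough sense of scale, here is a back-of-the-envelope sketch in Python that uses only the numbers above. It assumes, and I have not confirmed this, that the whole 3.5 minutes is spent on per-file calls, and that every partition sits at either the low end or the high end of the file counts. The names are only illustrative.

```python
# Back-of-the-envelope estimate using only the figures quoted in this post.
# Assumption (unconfirmed): all of the pre-submission time is per-file overhead.

PARTITIONS = 1100
FILES_PER_PARTITION_LOW = 5     # low end of the "small" partitions
FILES_PER_PARTITION_HIGH = 120  # largest partitions
PLANNING_SECONDS = 3.5 * 60     # time before the job shows up in the jobtracker
TABLE_SIZE_GB = 285


def estimate(partitions, files_per_partition, planning_seconds, table_size_gb):
    """Return (total files, implied ms per file, average file size in MB)."""
    total_files = partitions * files_per_partition
    ms_per_file = planning_seconds / total_files * 1000
    avg_file_mb = table_size_gb * 1024 / total_files
    return total_files, ms_per_file, avg_file_mb


low_files, ms_if_few, mb_if_few = estimate(
    PARTITIONS, FILES_PER_PARTITION_LOW, PLANNING_SECONDS, TABLE_SIZE_GB)
high_files, ms_if_many, mb_if_many = estimate(
    PARTITIONS, FILES_PER_PARTITION_HIGH, PLANNING_SECONDS, TABLE_SIZE_GB)

print(f"files in table: between {low_files} and {high_files}")
print(f"implied cost per file: {ms_if_many:.1f} ms to {ms_if_few:.1f} ms")
print(f"average file size: {mb_if_many:.1f} MB to {mb_if_few:.1f} MB")
```

So the table holds somewhere between 5,500 and 132,000 files, and if the delay really is per-file overhead, it works out to roughly 1.6 ms to 38 ms per file. That would be consistent with something enumerating every file, but it does not prove it.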
