I am running Hive 0.12 in production. I have a table that has 1100 partitions (flat, no multi-level partitions). Some of those partitions have a small number of files (5-10) and others have quite a few (up to 120). The total table size is not "huge", around 285 GB.
While this is not terrible to my eyes, when I run a query over lots of partitions, say all 1100, the time from query start to the point where the query is submitted to the JobTracker is horribly slow. For example, it can take up to 3.5 minutes just for the job to show up in the JobTracker. Is the number of files here what's hurting me? Is there some sort of per-file enumeration going on under the hood in Hive? I ran Hive in debug mode and saw lots of file calls for each individual file. I am curious whether others out there with similar tables see the same thing: would a query like that take a horribly long time for you as well? Is this "normal", or am I seeing issues here?
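
For anyone who wants to compare against their own table, here is a minimal sketch of one way the per-partition file counts could be tallied. It assumes the input is the text of a recursive HDFS listing of the table directory, with the usual 8 whitespace-separated columns and no spaces in paths. The function name `files_per_partition` and the path `/warehouse/t` are made up for illustration; this is not anything Hive itself provides.

```python
from collections import Counter


def files_per_partition(ls_lines, table_path):
    """Count data files under each first-level partition directory.

    ls_lines: lines of a recursive HDFS listing (assumed 8-column format).
    table_path: the table's root directory (hypothetical, e.g. /warehouse/t).
    """
    prefix = table_path.rstrip("/") + "/"
    counts = Counter()
    for line in ls_lines:
        fields = line.split()
        # Skip summary rows such as "Found N items" and anything malformed.
        if len(fields) < 8:
            continue
        perms, path = fields[0], fields[-1]
        # Directory rows start with 'd'; only regular files are counted.
        if perms.startswith("d") or not path.startswith(prefix):
            continue
        relative = path[len(prefix):]
        # Ignore files sitting directly in the table root (no partition dir).
        if "/" not in relative:
            continue
        partition = relative.split("/", 1)[0]
        counts[partition] += 1
    return counts
```

Summing the counts gives the total number of files a full-table query has to touch, and `counts.most_common(10)` shows which partitions hold the most files.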