Hello All,

We have a Hive table with 330 columns, partitioned by date and hour. We have
5 years' worth of data in the table, and each hourly partition holds around
800 MB. That comes to 43,800 partitions in total, with one file per partition.
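For reference, the layout is roughly like this (an illustrative sketch only;
the real table and column names are different):

  -- sketch: table/column names here are placeholders, not the real ones
  CREATE TABLE events (
    col1 STRING,
    col2 BIGINT
    -- ... the real table has ~330 columns
  )
  PARTITIONED BY (dt STRING, hr STRING)
  STORED AS TEXTFILE;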

When we run select count(*) over the whole table, Hive takes forever just to
submit the job; I waited 20 minutes and then killed it. If I restrict the
query to a single month's partitions, it still takes a little while to submit,
but at least Hive is able to get the work done.
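To be concrete, these are roughly the two queries (table and partition column
names are placeholders):

  -- full-table count: Hive has to enumerate all ~43,800 partitions first
  SELECT COUNT(*) FROM events;

  -- restricted to one month: only ~720 hourly partitions are involved,
  -- and the job submits much faster
  SELECT COUNT(*) FROM events
  WHERE dt >= '2014-01-01' AND dt < '2014-02-01';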

Questions:
1) First of all, why is Hive not even able to submit the job? Is it taking
forever to fetch the list of partitions from the metastore? Retrieving 43K
records should not be a big deal at all, should it?
2) What are my options for improving this situation? One thing I can think of
is switching from hourly partitions to daily partitions. What would the ideal
partitioning strategy be?
3) If we have one partition per day with 24 files under it (i.e. fewer
partitions but the same number of files), will that improve anything, or will
I run into the same issue?
4) Are there any special input formats or tricks to handle this?
5) When I tried to insert into a different table by selecting a whole day's
data, Hive generated 164 mappers in a map-only job, and hence created many
output files. How can I force Hive to create one output file instead of many?
Setting mapred.reduce.tasks=1 does not even generate a reduce task. What can
I do to achieve this? (A sketch of what I tried is below.)
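Roughly what I ran (table and column names are placeholders for the real ones):

  -- attempt: copy one day's data into another table, hoping for a single output file
  SET mapred.reduce.tasks=1;   -- has no effect, the job stays map-only

  INSERT OVERWRITE TABLE events_daily
  SELECT *
  FROM events
  WHERE dt = '2014-01-01';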


-RK
