Sorry about the delayed response, RK.

Here is what I think:
1) First of all, why is Hive not able to even submit the job? Is it taking 
forever to query the list of partitions from the metastore? Fetching 43K 
records should not be a big deal at all?? 

--> Hive is probably taking a long time to figure out which partitions it needs 
to query. I experienced the same problem when I had a lot of partitions (with 
relatively small files). I went back to having fewer partitions with larger 
files, and that fixed the problem. Finding the balance between how many 
partitions you want and how big you want each partition to be is tricky, but, 
in general, it's better to have fewer partitions. You also want to be aware of 
the small-files problem. It has been discussed in many places. Some links are:
http://blog.rapleaf.com/dev/2008/11/20/give-me-liberty-or-give-me-death-but-dont-give-me-small-files/
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
http://arunxjacob.blogspot.com/2011/04/hdfs-file-size-vs-allocation-other.html

2) So, in order to improve my situation, what are my options? I can think of 
changing the partitioning strategy to daily partitions instead of hourly. What 
would be the ideal partitioning strategy? 

--> I would say that's a good step forward.
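
For what it's worth, here's a rough sketch of what the switch could look like 
(the table and column names below are made up; your real schema has 330 columns):

```sql
-- Hypothetical daily-partitioned copy of the table
CREATE TABLE events_daily (
  event_id STRING,
  payload  STRING
)
PARTITIONED BY (dt STRING);

-- Backfill from the hourly table using dynamic partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE events_daily PARTITION (dt)
SELECT event_id, payload, dt
FROM events_hourly;
```

That takes you from 43,800 partitions down to about 1,825 (5 years x 365 days).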

3) If we have one partition per day and 24 files under it (i.e. fewer 
partitions but the same number of files), will it improve anything, or will I 
have the same issue? 

--> You probably wouldn't have the same issue; if you still do, it wouldn't be 
as bad. Since the number of partitions has been reduced by a factor of 24, 
Hive doesn't have to go through as many partitions. However, queries that look 
for data in a particular hour of a given day would be slower, now that hour is 
no longer a partition key.
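
To make that concrete (hypothetical names again): with only a daily partition 
key, the hour becomes a regular column, so Hive still prunes down to a single 
day but then has to scan all 24 files in it:

```sql
-- dt is the partition key, so Hive reads only the 2012-02-17 partition,
-- but the hour predicate is evaluated against every row in that day's files.
SELECT COUNT(*)
FROM events_daily
WHERE dt = '2012-02-17'
  AND hr = 14;
```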

4) Are there any special input formats or tricks to handle this? 

--> This is a separate question. What format, SerDe, and compression you use 
for your data is part of the design but isn't necessarily linked to the 
problem in question.

5) When I tried to insert into a different table by selecting a whole day's 
data, Hive generated 164 mappers in a map-only job, hence creating many output 
files. How can I force Hive to create one output file instead of many? Setting 
mapred.reduce.tasks=1 does not even generate reduce tasks. What can I do to 
achieve this? 

--> mapred.reduce.tasks wouldn't help because the job is map-only and has no 
reduce tasks. You should look into the hive.merge.* properties; setting them in 
your hive-site.xml would do the trick. You can refer to this template 
(https://svn.apache.org/repos/asf/hive/trunk/conf/hive-default.xml.template) to 
see what properties exist. 
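
For example, something along these lines should merge the small outputs (the 
sizes below are just illustrative defaults; tune them for your data):

```sql
SET hive.merge.mapfiles=true;                 -- merge small files after map-only jobs
SET hive.merge.mapredfiles=true;              -- merge small files after map-reduce jobs
SET hive.merge.size.per.task=256000000;       -- target size of merged files, in bytes
SET hive.merge.smallfiles.avgsize=16000000;   -- merge when avg output file size is below this
```

If you truly need exactly one output file, you can also force a reduce phase 
(e.g. by adding a DISTRIBUTE BY to the query) together with 
mapred.reduce.tasks=1, but the merge properties are usually the cleaner fix.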

Good luck!
Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation 

www: oanda.com www: fxtrade.com 
e: mgro...@oanda.com 

"Best Trading Platform" - World Finance's Forex Awards 2009. 
"The One to Watch" - Treasury Today's Adam Smith Awards 2009. 


----- Original Message -----
From: "rk vishu" <talk2had...@gmail.com>
To: cdh-u...@cloudera.org, common-user@hadoop.apache.org, u...@hive.apache.org
Sent: Saturday, February 18, 2012 4:39:48 AM
Subject: Re: better partitioning strategy in hive

Hello All, 

We have a hive table partitioned by date and hour (330 columns). We have 5 
years' worth of data for the table. Each hourly partition has around 800MB. 
So, in total, 43,800 partitions with one file per partition. 

When we run select count(*) from the table, Hive takes forever to submit the 
job. I waited for 20 min and killed it. If I run it for a month's data, it 
takes a little time to submit the job, but at least Hive is able to get the 
work done. 

Questions: 
1) First of all, why is Hive not able to even submit the job? Is it taking 
forever to query the list of partitions from the metastore? Fetching 43K 
records should not be a big deal at all?? 
2) So, in order to improve my situation, what are my options? I can think of 
changing the partitioning strategy to daily partitions instead of hourly. What 
would be the ideal partitioning strategy? 
3) If we have one partition per day and 24 files under it (i.e. fewer 
partitions but the same number of files), will it improve anything, or will I 
have the same issue? 
4) Are there any special input formats or tricks to handle this? 
5) When I tried to insert into a different table by selecting a whole day's 
data, Hive generated 164 mappers in a map-only job, hence creating many output 
files. How can I force Hive to create one output file instead of many? Setting 
mapred.reduce.tasks=1 does not even generate reduce tasks. What can I do to 
achieve this? 


-RK 





