Hi,
In your case the total file size isn't the main factor reducing performance; the number of files is.
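
To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It uses the figures from your message (2624 files, 713 MB) and makes two assumptions of mine: the HDFS block size is the Hadoop 0.20 default of 64 MB, and the input format does not combine small files, so every file gets at least one map task. The function names are only illustrative.

```python
import math


def average_file_kb(num_files, total_mb):
    """Average size of one input file, in KB."""
    return total_mb * 1024.0 / num_files


def estimate_map_tasks(num_files, total_mb, block_mb=64):
    """Return (map tasks with many small files, map tasks after merging).

    Assumption: each file yields at least one input split, and a large
    file yields one split per HDFS block (block_mb, 64 MB assumed here).
    """
    merged = int(math.ceil(float(total_mb) / block_mb))
    unmerged = max(num_files, merged)
    return unmerged, merged


if __name__ == "__main__":
    unmerged, merged = estimate_map_tasks(2624, 713)
    print("average file size: %.0f KB" % average_file_kb(2624, 713))
    print("map tasks, 2624 small files: %d" % unmerged)
    print("map tasks, one merged file:  %d" % merged)
```

Under those assumptions the job starts thousands of map tasks that each read well under 1 MB, so task startup overhead probably dominates the actual work.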

To test this, try merging those 2000+ files into one (or a few) big files, then upload the result to HDFS and test Hive performance again (it should be noticeably better). If this works, you should think about merging the files either before or after loading them into HDFS.
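
A minimal sketch of the "merge before loading" option, in Python. It assumes the CSV files sit in one local directory, have no header row, and share the same columns; the function name, directory, and file pattern are hypothetical.

```python
import glob
import os


def merge_csv_files(src_dir, dest_path, pattern="*.csv"):
    """Concatenate every file in src_dir matching pattern into dest_path.

    Returns the number of files merged. Assumes header-less files with the
    same column layout. dest_path should live outside src_dir so it is not
    picked up as an input on a later run.
    """
    paths = sorted(glob.glob(os.path.join(src_dir, pattern)))
    with open(dest_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                data = src.read()  # the files are small, so read each whole
            out.write(data)
            # Keep the last row of one file from running into the first
            # row of the next when a file lacks a trailing newline.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
    return len(paths)
```

The merged file can then be loaded with a single 'LOAD DATA' statement instead of one per file. For files that are already in HDFS, `hadoop fs -getmerge` should be able to pull a directory down into one local file, which could then be loaded again as a single file.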

The second issue is the count itself. Try to observe how your jobs use mappers and reducers: in my experience, simple count() jobs can get stuck on one reducer (the one that does all the counting) for a long time. I have not resolved this issue, but it was not significant in my case. Setting mapred.reduce.tasks=xyz doesn't change that behavior, but, for example, using GROUP BY with COUNT runs much faster.

I hope this helps.
--
Wojciech Langiewicz

On 06.12.2011 12:00, Savant, Keshav wrote:
Hi All,

My setup is:
hadoop-0.20.203.0
hive-0.7.1

I have a 5-node cluster in total: 4 data nodes and 1 namenode (which is
also acting as the secondary namenode). On the namenode I have set up hive
with HiveDerbyServerMode to support multiple hive server connections.

I have loaded plain text CSV files into HDFS using 'LOAD DATA' hive
query statements. The total number of files is 2624 and their combined
size is only 713 MB, which is very small from a Hadoop perspective, since
Hadoop can handle TBs of data very easily.

The problem is that when I run a simple count query (i.e. select count(*)
from a_table), it takes far too long to execute.

For instance, it takes almost 17 minutes to execute the said query when the
table has 950,000 rows. I understand that this is far too long for a query
over such a small amount of data.

This is only a dev environment; in the production environment the number
of files and their combined size will move into millions and GBs
respectively.

After analyzing the logs on all the datanodes and the namenode/secondary
namenode, I did not find any errors in them.

I have also tried setting mapred.reduce.tasks to a fixed number, but the
number of reducers always remains 1, while the number of maps is determined
by hive alone.

Any suggestions on what I am doing wrong, or how I can improve the
performance of hive queries? Any suggestion or pointer is highly
appreciated.

Keshav

