Hi Wojciech Langiewicz/Paul Mackles,

 

I tried your suggestion and it worked: performance has improved many
times over. Here are the results from my testing after implementing your
suggestion:

 

Number of files on HDFS            | File size                 | Select count(*) time (seconds) | Select count(*) result
-----------------------------------+---------------------------+--------------------------------+-----------------------
1 (created from 2624 CSVs)         | 708.8 MB                  | 66.258                         | 3,567,922
3 (each created from 2624 CSVs)    | 708.8 MB * 3              | 119.92                         | 10,703,766
3 (each created from 2624 CSVs) +  | 708.8 MB * 3 +            | 153.306                        | 14,271,688
14 (each created from almost       | 708.8 MB (combined size   |                                |
200 CSVs)                          | of the 14 files, ranging  |                                |
                                   | from 48 MB to 68 MB each) |                                |
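
As a rough sanity check on the figures above, throughput can be derived as rows counted divided by elapsed seconds. This is only a minimal Python sketch: the names `results` and `rows_per_second` are illustrative and not part of Hive, and the file count of 17 is simply 3 + 14 from the last test.

```python
# (files on HDFS, seconds for count(*), rows counted), copied from the results above
results = [
    (1, 66.258, 3_567_922),
    (3, 119.92, 10_703_766),
    (17, 153.306, 14_271_688),  # 3 large files + 14 smaller files
]


def rows_per_second(seconds, rows):
    """Return the average number of rows counted per second of query time."""
    return rows / seconds


for files, seconds, rows in results:
    print(f"{files:>2} files: {rows_per_second(seconds, rows):,.0f} rows/s")
```

Throughput rises as the data volume grows, which would be consistent with a fixed per-job startup cost being spread over more rows, although three data points are not enough to confirm that.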

 

Thanks a lot for your help.

 

Kind Regards,

Keshav C Savant

 

From: Paul Mackles [mailto:pmack...@adobe.com] 
Sent: Tuesday, December 06, 2011 8:14 PM
To: user@hive.apache.org
Subject: RE: Hive query taking too much time

 

How much time is it spending in the map and reduce phases, respectively?
The large number of files could be creating a lot of mappers, which
creates a lot of overhead. What happens if you merge the 2624 files into
a smaller number, like 24 or 48? That should speed up the map phase
significantly.
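
One way to do such a merge before loading is to concatenate the CSVs on the local filesystem. The following is a minimal Python sketch, not a tested production tool: the function name `merge_csvs`, the directory arguments, and the `merged_NNN.csv` output naming are all hypothetical, and it assumes the CSVs have no header rows and share the same column layout.

```python
from pathlib import Path
from typing import List


def merge_csvs(src_dir: Path, dst_dir: Path, n_out: int) -> List[Path]:
    """Concatenate every *.csv in src_dir into at most n_out files in dst_dir."""
    sources = sorted(src_dir.glob("*.csv"))
    dst_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for i in range(n_out):
        # Round-robin assignment keeps the merged files roughly equal in size.
        chunk = sources[i::n_out]
        if not chunk:
            continue
        out_path = dst_dir / f"merged_{i:03d}.csv"
        with out_path.open("wb") as out:
            for src in chunk:
                data = src.read_bytes()
                out.write(data)
                # Make sure the last row of one file does not run into
                # the first row of the next.
                if data and not data.endswith(b"\n"):
                    out.write(b"\n")
        outputs.append(out_path)
    return outputs
```

For example, `merge_csvs(Path("csv_in"), Path("csv_merged"), 24)` would produce up to 24 files that can then be loaded with 'LOAD DATA' as before. Use a destination directory different from the source so a second run does not pick up the merged files as input.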

 

From: Savant, Keshav [mailto:keshav.c.sav...@fisglobal.com] 
Sent: Tuesday, December 06, 2011 6:01 AM
To: user@hive.apache.org
Subject: Hive query taking too much time

 

Hi All,

 

My setup is 

hadoop-0.20.203.0

hive-0.7.1

 

I have a 5-node cluster in total: 4 data nodes and 1 namenode (which is
also acting as the secondary namenode). On the namenode I have set up
Hive with HiveDerbyServerMode to support multiple Hive server connections.

 

I have loaded plain-text CSV files into HDFS using Hive 'LOAD DATA'
statements. The total number of files is 2624 and their combined size
is only 713 MB, which is very small from a Hadoop perspective, given
that Hadoop can handle TBs of data very easily.

 

The problem is that when I run a simple count query (i.e. select count(*)
from a_table), it takes far too long to execute.

 

For instance, it takes almost 17 minutes to execute this query when the
table has 950,000 rows. I understand that this is far too long for a
query over such a small amount of data.

This is only a dev environment; in the production environment the number
of files and their combined size will run into millions and GBs
respectively.

 

On analyzing the logs on all the datanodes and on the namenode/secondary
namenode, I did not find any errors.

 

I have also tried setting mapred.reduce.tasks to a fixed number, but the
number of reducers always remains 1, while the number of maps is
determined by Hive alone.

 

Any suggestions as to what I am doing wrong, or how I can improve the
performance of Hive queries? Any suggestion or pointer is highly
appreciated.

 

Keshav

_____________
The information contained in this message is proprietary and/or
confidential. If you are not the intended recipient, please: (i) delete
the message and all copies; (ii) do not disclose, distribute or use the
message in any manner; and (iii) notify the sender immediately. In
addition, please be aware that any message addressed to our domain is
subject to archiving and review by persons other than the intended
recipient. Thank you.

