How about a simple Pig script with a load and a store statement? Set the maximum
number of reducers to, say, 20 or 30; that way you will have only 20-30 output
files. Then put these files in the Hive directory. Make sure the delimiters
match between Hive and Pig.
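
One way to sanity-check the delimiter before moving files into the Hive directory is to confirm that every line splits into the same number of fields. The following is only a minimal sketch: the function name count_field_layouts is hypothetical, and it assumes plain delimited text with no quoted fields that contain the delimiter.

```shell
# count_field_layouts DELIM FILE...   (hypothetical helper, not part of Hive or Pig)
# Prints how many distinct per-line field counts occur across the given files
# when each line is split on DELIM. A result of 1 means every line has the
# same number of fields. Assumes no quoted fields containing DELIM; an empty
# line counts as 0 fields.
count_field_layouts() {
    delim=$1
    shift
    awk -F"$delim" '{ print NF }' "$@" | sort -u | wc -l | tr -d ' '
}

# Example (paths are placeholders):
#   count_field_layouts , /path/to/output/part-*
```

A result greater than 1 suggests that some lines split differently from others, which is worth investigating before the files are loaded.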

 From: Vikas Srivastava <>
Sent: Tuesday, December 6, 2011 10:00 PM
Subject: Re: Hive query taking too much time

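Hey, if all the files have the same columns, you can easily merge them with a
shell script:

# append every file in $list to one combined CSV
for file in $list; do
    cat "$file" >> new_file.csv
done
# load the combined file once, after the loop has finished
hive -e "load data local inpath 'new_file.csv' into table $table"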

It will merge all the files into a single file, which you can then upload in the same 
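
If a single merged file is not wanted, the same idea extends to a fixed number of output files, such as the 24 or 48 suggested in the quoted message below. This is a rough sketch rather than a tested recipe for any particular cluster: the function name merge_into_buckets and the part-NN.csv file names are hypothetical, and it assumes every input file has the same columns, carries no header row, and ends with a newline.

```shell
# merge_into_buckets OUT_DIR N FILE...   (hypothetical helper)
# Appends the input files round-robin into N output files named
# OUT_DIR/part-00.csv, OUT_DIR/part-01.csv, and so on.
# Assumes identical columns, no header rows, and a trailing newline per file.
merge_into_buckets() {
    out_dir=$1
    buckets=$2
    shift 2
    mkdir -p "$out_dir" || return 1
    i=0
    for file in "$@"; do
        # Target bucket cycles through 0, 1, ..., N-1, 0, 1, ...
        bucket=$(printf '%s/part-%02d.csv' "$out_dir" $((i % buckets)))
        cat "$file" >> "$bucket" || return 1
        i=$((i + 1))
    done
}

# Example (paths and $table are placeholders):
#   merge_into_buckets merged 24 /path/to/csv/*.csv
#   for f in merged/part-*.csv; do
#       hive -e "load data local inpath '$f' into table $table"
#   done
```

Round-robin assignment keeps the number of source files per bucket even, though not necessarily the byte sizes; for inputs of similar size that is usually close enough.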

On Tue, Dec 6, 2011 at 8:16 PM, Mohit Gupta <> wrote:

Hi Paul,
>I am having the same problem. Do you know any efficient way of merging the 
>On Tue, Dec 6, 2011 at 8:14 PM, Paul Mackles <> wrote:
>How much time is it spending in the map and reduce phases, respectively? The
>large number of files could be creating a lot of mappers, which in turn creates
>a lot of overhead. What happens if you merge the 2624 files into a smaller
>number, like 24 or 48? That should speed up the map phase significantly.
>>From:Savant, Keshav [] 
>>Sent: Tuesday, December 06, 2011 6:01 AM
>>Subject: Hive query taking too much time
>>Hi All,
>>My setup is as follows.
>>I have a 5-node cluster: 4 data nodes and 1 namenode (which also acts as the
>>secondary namenode). On the namenode I have set up Hive in
>>HiveDerbyServerMode to support multiple Hive server connections.
>>I have loaded plain-text CSV files into HDFS using Hive ‘LOAD DATA’
>>statements. The total number of files is 2624 and their combined size is only
>>713 MB, which is very small from a Hadoop perspective, since Hadoop can handle
>>TBs of data very easily.
>>The problem is that when I run a simple count query (i.e. select count(*) from
>>a_table), it takes too long to execute.
>>For instance, it takes almost 17 minutes to execute this query when the
>>table has 950,000 rows. I understand that this is far too long for a query
>>over such a small amount of data.
>>This is only a dev environment; in the production environment the number of
>>files and their combined size will run into millions and GBs respectively.
>>On analyzing the logs on all the datanodes and the namenode/secondary
>>namenode, I do not find any errors.
>>I have also tried setting mapred.reduce.tasks to a fixed number, but the
>>number of reducers always remains 1, while the number of maps is determined
>>by Hive alone.
>>Any suggestions as to what I am doing wrong, or how I can improve the
>>performance of Hive queries? Any suggestion or pointer is highly appreciated.
>Best Regards,
>Mohit Gupta
>Software Engineer at Vdopia Inc.

With Regards
Vikas Srivastava

DWH & Analytics Team
Mob:+91 9560885900
One97 | Let's get talking !
