How about a simple Pig script with a load and a store statement? Set the max number 
of reducers to, say, 20 or 30; that way you will only have 20-30 files as output. 
Then put these files in the Hive directory. Make sure the delimiters in Hive 
and Pig match.
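
Something along these lines might work (paths and the comma delimiter below are 
placeholders; since a plain load/store is map-only, the ORDER BY ... PARALLEL here 
is just one way to force a reduce stage that caps the number of output files):

cat > merge.pig <<'EOF'
-- placeholder paths; match the delimiter to the Hive table definition
raw     = LOAD '/user/data/csv_input' USING PigStorage(',');
ordered = ORDER raw BY $0 PARALLEL 24;
STORE ordered INTO '/user/data/csv_merged' USING PigStorage(',');
EOF
pig merge.pig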
 
-Ayon
See My Photos on Flickr
Also check out my Blog for answers to commonly asked questions.



________________________________
 From: Vikas Srivastava <vikas.srivast...@one97.net>
To: user@hive.apache.org 
Sent: Tuesday, December 6, 2011 10:00 PM
Subject: Re: Hive query taking too much time
 

Hey, if all the files have the same columns, then you can easily merge them with a 
shell script:

table=yourtable
# append every CSV in the current directory into one file
for file in *.csv
do
  cat "$file" >> new_file.csv
done
# load the merged file (not the loop variable) into Hive
hive -e "load data local inpath 'new_file.csv' into table $table"

It will merge all the files into a single file, which you can then load with a 
single query.
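
An alternative sketch, if the CSVs are already sitting in HDFS (the directory, local 
file name, and table name below are placeholders): hadoop fs -getmerge concatenates 
everything into one local file without the shell loop.

# placeholders: adjust the HDFS directory, local file name, and table name
hadoop fs -getmerge /user/data/csv_input merged.csv
hive -e "load data local inpath 'merged.csv' into table yourtable"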


On Tue, Dec 6, 2011 at 8:16 PM, Mohit Gupta <success.mohit.gu...@gmail.com> 
wrote:

Hi Paul,
>I am having the same problem. Do you know any efficient way of merging the 
>files?
>
>
>-Mohit
>
>
>
>On Tue, Dec 6, 2011 at 8:14 PM, Paul Mackles <pmack...@adobe.com> wrote:
>
>How much time is it spending in the map and reduce phases, respectively? The large 
>number of files could be creating a lot of mappers, which creates a lot of 
>overhead. What happens if you merge the 2624 files into a smaller number, like 
>24 or 48? That should speed up the map phase significantly.
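>
>One setting that may be worth trying before physically merging anything (assuming 
>your Hive 0.7 build ships CombineHiveInputFormat; the split size below is just a 
>starting point) is to let Hive pack many small files into each map task:
>
>hive -e "
>set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
>set mapred.max.split.size=268435456;
>select count(*) from a_table;"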
>> 
>>From: Savant, Keshav [mailto:keshav.c.sav...@fisglobal.com] 
>>Sent: Tuesday, December 06, 2011 6:01 AM
>>To: user@hive.apache.org
>>Subject: Hive query taking too much time
>> 
>>Hi All,
>> 
>>My setup is 
>>hadoop-0.20.203.0
>>hive-0.7.1
>> 
>>I have a 5-node cluster: 4 datanodes and 1 namenode (which also acts as the 
>>secondary namenode). On the namenode I have set up Hive with 
>>HiveDerbyServerMode to support multiple Hive server connections.
>> 
>>I have inserted plain-text CSV files into HDFS using ‘LOAD DATA’ Hive statements. 
>>The total number of files is 2624 and their combined size is only 713 MB, which 
>>is very small from a Hadoop perspective, since Hadoop can handle TBs of data 
>>very easily.
>> 
>>The problem is that when I run a simple count query (i.e. select count(*) from 
>>a_table), it takes too much time to execute.
>> 
>>For instance, it takes almost 17 minutes to execute the said query when the 
>>table has 950,000 rows; I understand that is far too long for a query over 
>>such a small amount of data. 
>>This is only a dev environment; in production, the number of files and their 
>>combined size will grow into the millions and GBs respectively.
>> 
>>On analyzing the logs on all the datanodes and the namenode/secondary namenode, I 
>>do not find any errors.
>> 
>>I have also tried setting mapred.reduce.tasks to a fixed number, but the number 
>>of reducers always remains 1, while the number of mappers is determined by Hive.
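>> 
>>For reference, a sketch of that kind of setting (the value 8 here is arbitrary):
>> 
>>hive -e "set mapred.reduce.tasks=8; select count(*) from a_table;"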
>> 
>>Any suggestions on what I am doing wrong, or on how I can improve the performance 
>>of Hive queries? Any suggestion or pointer is highly appreciated. 
>> 
>>Keshav
>
>
>
>-- 
>Best Regards,
>
>Mohit Gupta
>Software Engineer at Vdopia Inc.
>
>
>


-- 
With Regards
Vikas Srivastava

DWH & Analytics Team
Mob:+91 9560885900
One97 | Let's get talking !
