How about a simple Pig script with a load and a store statement? Set the maximum number of reducers to, say, 20 or 30; that way you will only have 20-30 files as output. Then put these files in the Hive directory. Make sure the delimiters match between Hive and Pig.

-Ayon

________________________________
From: Vikas Srivastava <vikas.srivast...@one97.net>
To: user@hive.apache.org
Sent: Tuesday, December 6, 2011 10:00 PM
Subject: Re: Hive query taking too much time

Hey, if all the files have the same columns, you can easily merge them with a shell script:

    # run this in a directory that does not already contain new_file.csv
    table=yourtable
    for file in *.csv
    do
        cat "$file" >> new_file.csv
    done
    hive -e "load data local inpath 'new_file.csv' into table $table"

It will merge all the files into a single file, which you can then load with the same command.

On Tue, Dec 6, 2011 at 8:16 PM, Mohit Gupta <success.mohit.gu...@gmail.com> wrote:

>Hi Paul,
>I am having the same problem. Do you know any efficient way of merging the
>files?
>
>-Mohit
>
>On Tue, Dec 6, 2011 at 8:14 PM, Paul Mackles <pmack...@adobe.com> wrote:
>
>How much time is it spending in the map and reduce phases, respectively? The
>large number of files could be creating a lot of mappers, which creates a lot
>of overhead. What happens if you merge the 2624 files into a smaller number
>like 24 or 48? That should speed up the mapper phase significantly.
>>
>>From: Savant, Keshav [mailto:keshav.c.sav...@fisglobal.com]
>>Sent: Tuesday, December 06, 2011 6:01 AM
>>To: user@hive.apache.org
>>Subject: Hive query taking too much time
>>
>>Hi All,
>>
>>My setup is
>>hadoop-0.20.203.0
>>hive-0.7.1
>>
>>I have a 5-node cluster in total: 4 data nodes and 1 namenode (which is also
>>acting as the secondary namenode). On the namenode I have set up Hive with
>>HiveDerbyServerMode to support multiple Hive server connections.
>>
>>I have inserted plain-text CSV files into HDFS using ‘LOAD DATA’ Hive query
>>statements. The total number of files is 2624 and their combined size is
>>only 713 MB, which is very small from the perspective of Hadoop, which can
>>handle TBs of data very easily.
>>
>>The problem is, when I run a simple count query (i.e. select count(*) from
>>a_table), it takes too much time to execute.
>>
>>For instance, it takes almost 17 minutes to execute this query when the
>>table has 950,000 rows. I understand that this is far too long for a query
>>over such a small amount of data.
>>This is only a dev environment; in the production environment the number of
>>files and their combined size will move into millions and GBs respectively.
>>
>>On analyzing the logs on all the datanodes and the namenode/secondary
>>namenode, I do not find any errors in them.
>>
>>I have also tried setting mapred.reduce.tasks to a fixed number, but the
>>number of reducers always remains 1, while the number of maps is determined
>>by Hive alone.
>>
>>Any suggestion as to what I am doing wrong, or how I can improve the
>>performance of Hive queries? Any suggestion or pointer is highly
>>appreciated.
>>
>>Keshav
>
>--
>Best Regards,
>
>Mohit Gupta
>Software Engineer at Vdopia Inc.

--
With Regards
Vikas Srivastava
DWH & Analytics Team
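
The merge discussed in the thread (2624 small CSV files down to a smaller number such as 24 or 48) could be sketched roughly as below. This is only an illustration under some assumptions: the function name merge_csv_buckets and its three arguments are made up for the example, all files are assumed to share the same columns and to carry no header row, and the files are assumed to be on the local disk before loading.

```shell
#!/usr/bin/env bash
# Sketch: concatenate many small CSV files into a fixed number of larger ones.
# merge_csv_buckets and its arguments are hypothetical names, not from the thread.

merge_csv_buckets() {
    local src_dir=$1 out_dir=$2 buckets=$3
    local i=0 f out
    mkdir -p "$out_dir"
    # Glob expansion is sorted, so the file-to-bucket assignment is deterministic.
    for f in "$src_dir"/*.csv; do
        [ -e "$f" ] || continue      # no match: the glob pattern stays literal
        # Round-robin: the i-th file goes to bucket (i mod buckets).
        out="$out_dir/merged_$(( i % buckets )).csv"
        cat "$f" >> "$out"
        # If the file does not end with a newline, add one so rows never fuse.
        if [ -n "$(tail -c 1 "$f")" ]; then
            echo >> "$out"
        fi
        i=$(( i + 1 ))
    done
    # Print how many input files were merged.
    echo "$i"
}
```

A possible use would be to call merge_csv_buckets ./csv ./merged 24 and then load each file under ./merged with the same "load data local inpath" statement shown earlier in the thread. Round-robin assignment keeps the merged files close in size only when the inputs are similar in size, which seems plausible here (713 MB over 2624 files) but is an assumption.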