You can also take a look at https://issues.apache.org/jira/browse/HIVE-74
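
On option 2 from the quoted thread (merging by size), here is a rough sketch of how the batching could look in a shell script. The function name merge_by_size, its arguments, and the batch_N.csv naming are made up for illustration, and the threshold is whatever X you settle on. It only does the local concatenation; loading into Hive is left as a commented-out loop reusing the `hive -e "load data local inpath ..."` command from the thread.

```shell
#!/bin/sh
# Sketch: concatenate small CSV files into batches of roughly max_bytes each.
# merge_by_size and the batch_N.csv naming are hypothetical, not part of Hive.
merge_by_size() {
    max_bytes=$1          # start a new batch once this many bytes are merged
    out_dir=$2            # should be a different directory from the inputs
    shift 2

    batch=0
    size=0
    for f in "$@"; do
        [ -f "$f" ] || continue                 # skip unmatched globs, directories
        cat "$f" >> "$out_dir/batch_$batch.csv"
        size=$((size + $(wc -c < "$f")))
        if [ "$size" -ge "$max_bytes" ]; then
            batch=$((batch + 1))                # threshold reached: next batch
            size=0
        fi
    done
}

# Example use (assumes an incoming/ directory and a table called yourtable):
#   mkdir -p merged
#   merge_by_size 67108864 merged incoming/*.csv
#   for m in merged/batch_*.csv; do
#       hive -e "load data local inpath '$m' into table yourtable"
#   done
```

Each batch ends up at or slightly above the threshold, except the last one, which may be smaller. A cron-driven version would also need to move or delete the input files once they are merged, so they are not picked up again on the next run.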

On Wed, Dec 7, 2011 at 9:05 PM, Savant, Keshav <keshav.c.sav...@fisglobal.com> wrote:

> You are right, Wojciech Langiewicz. We did the same thing and I posted
> my result yesterday. Now we are planning to do this with a shell script,
> because our environment is dynamic and files keep coming in. We will
> schedule the shell script as a cron job.
>
> A question on this: we are planning to merge files using one of the
> following approaches:
> 1. Based on file count: once the file count reaches X files, merge
>    them and insert the result into HDFS.
> 2. Based on merged file size: once the merged file size exceeds X
>    bytes, insert it into HDFS.
>
> I think option 2 is better, because that way all merged files will be
> roughly the same size. What do you suggest?
>
> Kind Regards,
> Keshav C Savant
>
>
> -----Original Message-----
> From: Wojciech Langiewicz [mailto:wlangiew...@gmail.com]
> Sent: Wednesday, December 07, 2011 8:15 PM
> To: user@hive.apache.org
> Subject: Re: Hive query taking too much time
>
> Hi,
> In this case it's much easier and faster to merge all files using this
> command:
>
> cat *.csv > output.csv
> hive -e "load data local inpath 'output.csv' into table $table"
>
> On 07.12.2011 07:00, Vikas Srivastava wrote:
> > hey, if you have the same columns in all the files then you can easily
> > merge them with a shell script:
> >
> > table=yourtable
> > for file in *.csv
> > do
> >   cat "$file" >> new_file.csv
> > done
> > hive -e "load data local inpath 'new_file.csv' into table $table"
> >
> > It will merge all the files into a single file, which you can then
> > load with the same command.
> >
> > On Tue, Dec 6, 2011 at 8:16 PM, Mohit Gupta
> > <success.mohit.gu...@gmail.com> wrote:
> >
> >> Hi Paul,
> >> I am having the same problem. Do you know any efficient way of
> >> merging the files?
> >>
> >> -Mohit
> >>
> >>
> >> On Tue, Dec 6, 2011 at 8:14 PM, Paul Mackles <pmack...@adobe.com> wrote:
> >>
> >>> How much time is it spending in the map/reduce phases, respectively?
> >>> The large number of files could be creating a lot of mappers, which
> >>> creates a lot of overhead. What happens if you merge the 2624 files
> >>> into a smaller number like 24 or 48? That should speed up the mapper
> >>> phase significantly.
> >>>
> >>> From: Savant, Keshav [mailto:keshav.c.sav...@fisglobal.com]
> >>> Sent: Tuesday, December 06, 2011 6:01 AM
> >>> To: user@hive.apache.org
> >>> Subject: Hive query taking too much time
> >>>
> >>> Hi All,
> >>>
> >>> My setup is
> >>>
> >>> hadoop-0.20.203.0
> >>> hive-0.7.1
> >>>
> >>> I have a 5-node cluster: 4 data nodes and 1 namenode (which is also
> >>> acting as the secondary namenode). On the namenode I have set up
> >>> Hive with HiveDerbyServerMode to support multiple Hive server
> >>> connections.
> >>>
> >>> I have loaded plain-text CSV files into HDFS using Hive 'LOAD DATA'
> >>> statements. The total number of files is 2624 and their combined
> >>> size is only 713 MB, which is very small from a Hadoop perspective,
> >>> since Hadoop can handle TBs of data very easily.
> >>>
> >>> The problem is that when I run a simple count query (i.e. select
> >>> count(*) from a_table), it takes too much time to execute.
> >>>
> >>> For instance, it takes almost 17 minutes to execute that query when
> >>> the table has 950,000 rows. I understand that this is far too long
> >>> for a query over such a small amount of data.
> >>>
> >>> This is only a dev environment; in the production environment the
> >>> number of files and their combined size will grow into the millions
> >>> and GBs respectively.
> >>>
> >>> On analyzing the logs on all the datanodes and the namenode/secondary
> >>> namenode, I do not find any errors in them.
> >>>
> >>> I have also tried setting mapred.reduce.tasks to a fixed number, but
> >>> the number of reducers always remains 1, while the number of maps is
> >>> determined by Hive itself.
> >>>
> >>> Any suggestions on what I am doing wrong, or how I can improve the
> >>> performance of Hive queries? Any suggestion or pointer is highly
> >>> appreciated.
> >>>
> >>> Keshav
> >>>
> >>> _____________
> >>> The information contained in this message is proprietary and/or
> >>> confidential. If you are not the intended recipient, please: (i)
> >>> delete the message and all copies; (ii) do not disclose, distribute
> >>> or use the message in any manner; and (iii) notify the sender
> >>> immediately. In addition, please be aware that any message addressed
> >>> to our domain is subject to archiving and review by persons other
> >>> than the intended recipient. Thank you.
> >>>
> >>
> >> --
> >> Best Regards,
> >>
> >> Mohit Gupta
> >> Software Engineer at Vdopia Inc.
> >>
> >
>

--
"...:::Aniket:::... Quetzalco@tl"