You can also take a look at:
https://issues.apache.org/jira/browse/HIVE-74

On Wed, Dec 7, 2011 at 9:05 PM, Savant, Keshav <
keshav.c.sav...@fisglobal.com> wrote:

> You are right, Wojciech Langiewicz. We did the same thing and I posted my
> results yesterday. Now we are planning to do this with a shell script,
> because our environment is dynamic and files keep arriving. We will
> schedule the shell script as a cron job.
>
> A question on this: we are planning to merge files using one of the
> following approaches:
> 1. Based on file count: once the file count reaches X files, merge
> them and insert the result into HDFS.
> 2. Based on merged file size: once the merged file size exceeds X
> bytes, insert it into HDFS.
>
> I think option 2 is better because that way all merged files will be
> roughly the same size. What do you suggest?
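
A rough sketch of option 2 (merging by size), in case it helps. The function name merge_by_size, its arguments, and the batch_N.csv naming are made up for illustration, and it assumes the CSV files have no header row and all share the same columns:

```shell
#!/bin/sh
# Sketch only: merge small CSV files into batches of roughly equal size.
# merge_by_size and the batch_N.csv naming are hypothetical.

# Usage: merge_by_size SRC_DIR OUT_DIR MAX_BYTES
merge_by_size() {
    src_dir=$1
    out_dir=$2
    max_bytes=$3
    batch=0
    mkdir -p "$out_dir" || return 1
    for f in "$src_dir"/*.csv; do
        [ -f "$f" ] || continue               # the glob matched nothing
        out="$out_dir/batch_$batch.csv"
        cat "$f" >> "$out"
        # Arithmetic expansion strips any padding that wc adds.
        size=$(($(wc -c < "$out")))
        # Once the current batch reaches the threshold, start a new one.
        if [ "$size" -ge "$max_bytes" ]; then
            batch=$((batch + 1))
        fi
    done
}
```

Each batch ends up at least MAX_BYTES long, except possibly the last one. In a cron job you would probably load only the full batches with the same "load data local inpath" command used elsewhere in this thread, keep the partial batch for the next run, and move the source files away once they have been merged so they are not picked up twice.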
>
> Kind Regards,
> Keshav C Savant
>
>
> -----Original Message-----
> From: Wojciech Langiewicz [mailto:wlangiew...@gmail.com]
> Sent: Wednesday, December 07, 2011 8:15 PM
> To: user@hive.apache.org
> Subject: Re: Hive query taking too much time
>
> Hi,
> In this case it's much easier and faster to merge all files using this
> command:
>
> cat *.csv > output.csv
> hive -e "load data local inpath 'output.csv' into table $table"
>
> On 07.12.2011 07:00, Vikas Srivastava wrote:
> > Hey, if all the files have the same columns, you can easily merge
> > them with a shell script:
> >
> > table=yourtable
> > rm -f new_file.csv    # start clean so the glob below cannot match it
> > for file in *.csv
> > do
> >     cat "$file" >> new_file.csv
> > done
> > # Load the merged file rather than the last input file.
> > hive -e "load data local inpath 'new_file.csv' into table $table"
> >
> > This merges all the files into a single file, which you can then load
> > with the same command.
> >
> > On Tue, Dec 6, 2011 at 8:16 PM, Mohit Gupta
> > <success.mohit.gu...@gmail.com> wrote:
> >
> >> Hi Paul,
> >> I am having the same problem. Do you know any efficient way of
> >> merging the files?
> >>
> >> -Mohit
> >>
> >>
> >> On Tue, Dec 6, 2011 at 8:14 PM, Paul Mackles <pmack...@adobe.com> wrote:
> >>
> >>> How much time is it spending in the map and reduce phases, respectively?
> >>> The large number of files could be creating a lot of mappers, which
> >>> adds a lot of overhead. What happens if you merge the 2624 files
> >>> into a smaller number, like 24 or 48? That should speed up the map
> >>> phase significantly.
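
For what it is worth, merging into a fixed number of files can be sketched like this. The function name merge_into_buckets, its arguments, and the part_N.csv naming are hypothetical, and the sketch assumes the CSV files have no header row. Files are spread round-robin, so the outputs are only about the same size if the inputs are:

```shell
#!/bin/sh
# Sketch only: spread many small CSV files across a fixed number of
# merged files. merge_into_buckets and part_N.csv are hypothetical names.

# Usage: merge_into_buckets SRC_DIR OUT_DIR NUM_BUCKETS
merge_into_buckets() {
    src_dir=$1
    out_dir=$2
    num_buckets=$3
    i=0
    mkdir -p "$out_dir" || return 1
    for f in "$src_dir"/*.csv; do
        [ -f "$f" ] || continue               # the glob matched nothing
        # Round-robin: file number i goes to bucket i mod NUM_BUCKETS.
        bucket=$((i % num_buckets))
        cat "$f" >> "$out_dir/part_$bucket.csv"
        i=$((i + 1))
    done
}
```

Calling it with 24 as the last argument would turn the 2624 files into 24 merged files, which can then be loaded into the table in place of the originals.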
> >>>
> >>> From: Savant, Keshav [mailto:keshav.c.sav...@fisglobal.com]
> >>> Sent: Tuesday, December 06, 2011 6:01 AM
> >>> To: user@hive.apache.org
> >>> Subject: Hive query taking too much time
> >>>
> >>> Hi All,
> >>>
> >>> My setup is:
> >>>
> >>> hadoop-0.20.203.0
> >>> hive-0.7.1
> >>>
> >>> I have a 5-node cluster: 4 data nodes and 1 namenode (which also
> >>> acts as the secondary namenode). On the namenode I have set up
> >>> Hive with HiveDerbyServerMode to support multiple Hive server
> >>> connections.
> >>>
> >>> I have loaded plain-text CSV files into HDFS using Hive 'LOAD DATA'
> >>> statements. The total number of files is 2624 and their combined
> >>> size is only 713 MB, which is very small from a Hadoop perspective,
> >>> since Hadoop can handle TBs of data very easily.
> >>>
> >>> The problem is that when I run a simple count query (i.e. select
> >>> count(*) from a_table), it takes too much time to execute.
> >>>
> >>> For instance, it takes almost 17 minutes to execute this query when
> >>> the table has 950,000 rows. I understand that this is far too long
> >>> for a query over such a small amount of data.
> >>>
> >>> This is only a dev environment; in the production environment the
> >>> number of files and their combined size will run into millions and
> >>> GBs respectively.
> >>>
> >>> On analyzing the logs on all the datanodes and the namenode/secondary
> >>> namenode, I do not find any errors in them.
> >>>
> >>> I have also tried setting mapred.reduce.tasks to a fixed number, but
> >>> the number of reducers always remains 1, while the number of maps is
> >>> determined solely by Hive.
> >>>
> >>> Any suggestions on what I am doing wrong, or how I can improve the
> >>> performance of Hive queries? Any suggestion or pointer is highly
> >>> appreciated.
> >>>
> >>> Keshav
> >>>
> >>> _____________
> >>> The information contained in this message is proprietary and/or
> >>> confidential. If you are not the intended recipient, please: (i)
> >>> delete the message and all copies; (ii) do not disclose, distribute
> >>> or use the message in any manner; and (iii) notify the sender
> >>> immediately. In addition, please be aware that any message addressed
> >>> to our domain is subject to archiving and review by persons other
> >>> than the intended recipient. Thank you.
> >>>
> >>
> >>
> >>
> >> --
> >> Best Regards,
> >>
> >> Mohit Gupta
> >> Software Engineer at Vdopia Inc.
> >>
> >>
> >>
> >
> >
>



-- 
"...:::Aniket:::... Quetzalco@tl"
