Re: Hive query taking too much time

2011-12-08 Thread Aniket Mokashi
You can also take a look at:
https://issues.apache.org/jira/browse/HIVE-74

On Wed, Dec 7, 2011 at 9:05 PM, Savant, Keshav 
keshav.c.sav...@fisglobal.com wrote:

 You are right, Wojciech Langiewicz. We did the same thing, and I posted
 my results yesterday. We now plan to automate this with a shell script,
 because our environment is dynamic and files keep arriving. We will
 schedule the script as a cron job.

 A question on this: we plan to merge files using one of the following
 approaches:
 1. Based on file count: once the number of files reaches X, merge them
 and insert the result into HDFS.
 2. Based on merged file size: once the merged file size exceeds X
 bytes, insert it into HDFS.

 I think option 2 is better, because then all merged files will be
 roughly the same size. What do you suggest?

 Kind Regards,
 Keshav C Savant


 -Original Message-
 From: Wojciech Langiewicz [mailto:wlangiew...@gmail.com]
 Sent: Wednesday, December 07, 2011 8:15 PM
 To: user@hive.apache.org
 Subject: Re: Hive query taking too much time

 Hi,
 In this case it's much easier and faster to merge all the files with
 these commands:

 cat *.csv > output.csv
 hive -e "load data local inpath 'output.csv' into table $table"

 On 07.12.2011 07:00, Vikas Srivastava wrote:
  Hey, if all the files have the same columns, you can easily merge
  them with a shell script:

  table=yourtable
  # append every CSV in the current directory to one merged file
  for file in *.csv
  do
  cat "$file" >> new_file.csv
  done
  # load the merged file (not the last loop variable) into the table
  hive -e "load data local inpath 'new_file.csv' into table $table"

  This merges all the files into a single file, which you can then load
  with a single query.
 
  On Tue, Dec 6, 2011 at 8:16 PM, Mohit Gupta
  success.mohit.gu...@gmail.com wrote:

  Hi Paul,
  I am having the same problem. Do you know any efficient way of
  merging the files?

  -Mohit


  On Tue, Dec 6, 2011 at 8:14 PM, Paul Mackles pmack...@adobe.com
  wrote:

  How much time is it spending in the map/reduce phases, respectively?
  The large number of files could be creating a lot of mappers, which
  creates a lot of overhead. What happens if you merge the 2624 files
  into a smaller number, like 24 or 48? That should speed up the mapper
  phase significantly.
 
  From: Savant, Keshav [mailto:keshav.c.sav...@fisglobal.com]
  Sent: Tuesday, December 06, 2011 6:01 AM
  To: user@hive.apache.org
  Subject: Hive query taking too much time

  Hi All,

  My setup is:

  hadoop-0.20.203.0
  hive-0.7.1

  I have a 5-node cluster: 4 data nodes and 1 namenode (which also acts
  as the secondary namenode). On the namenode I have set up Hive with
  HiveDerbyServerMode to support multiple Hive server connections.

  I have loaded plain-text CSV files into HDFS using Hive 'LOAD DATA'
  statements. There are 2624 files in total, and their combined size is
  only 713 MB, which is very small from the perspective of Hadoop, which
  can handle TBs of data easily.

  The problem is that when I run a simple count query (i.e. select
  count(*) from a_table), it takes too long to execute.

  For instance, it takes almost 17 minutes to execute this query when
  the table has 950,000 rows. I understand that this is far too long for
  a query over such a small amount of data.

  This is only a dev environment; in the production environment the
  number of files and their combined size will grow into the millions
  and GBs respectively.

  I analyzed the logs on all the datanodes and on the namenode/secondary
  namenode and did not find any errors in them.

  I have also tried setting mapred.reduce.tasks to a fixed number, but
  the number of reducers always remains 1, while the number of mappers
  is determined by Hive alone.

  Any suggestions as to what I am doing wrong, or how I can improve the
  performance of Hive queries? Any suggestion or pointer is highly
  appreciated.

  Keshav
 
  _
  The information contained in this message is proprietary and/or
  confidential. If you are not the intended recipient, please: (i)
  delete the message and all copies; (ii) do not disclose, distribute
  or use the message in any manner; and (iii) notify the sender
  immediately. In addition, please be aware that any message addressed
  to our domain is subject to archiving and review by persons other
  than the intended recipient. Thank you.
 
 
 
 
  --
  Best Regards,
 
  Mohit Gupta
  Software Engineer at Vdopia Inc.
 
 
 
 
 


RE: Hive query taking too much time

2011-12-07 Thread Savant, Keshav
Hi Wojciech Langiewicz/Paul Mackles,

 

I tried your suggestion and it worked. Performance has improved many
times over; here are the results from my testing after implementing your
suggestion:

 

Number of files on HDFS          | File size                      | select count(*)    | select count(*)
                                 |                                | time taken (s)     | result
---------------------------------+--------------------------------+--------------------+----------------
1 (created from 2624 CSVs)       | 708.8 MB                       | 66.258             | 3,567,922
3 (each created from 2624 CSVs)  | 708.8 MB * 3                   | 119.92             | 10,703,766
3 (each created from 2624 CSVs)  | 708.8 MB * 3                   | 153.306            | 14,271,688
+ 14 (each created from almost   | + 708.8 MB (combined size of   |                    |
200 CSVs)                        | 14 files, 48 MB to 68 MB each) |                    |

 

Thanks a lot for your help.

 

Kind Regards,

Keshav C Savant
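
One way to read the table above is as rows counted per second. The sketch below is plain arithmetic on the figures reported in this thread; the helper name rows_per_sec is made up for illustration, and "almost 17 minutes" from the original post is taken as 1020 seconds, which is an approximation.

```shell
#!/bin/sh
# rows_per_sec ROWS SECONDS
# Prints ROWS / SECONDS rounded to the nearest whole number.
# (Hypothetical helper for illustration; not part of Hive or Hadoop.)
rows_per_sec() {
    awk -v rows="$1" -v secs="$2" 'BEGIN { printf "%.0f\n", rows / secs }'
}

# Before merging: 950,000 rows in almost 17 minutes (assumed ~1020 s).
rows_per_sec 950000 1020

# After merging: figures from the results table, thousands separators removed.
rows_per_sec 3567922 66.258     # 1 merged file
rows_per_sec 10703766 119.92    # 3 merged files
rows_per_sec 14271688 153.306   # 3 + 14 merged files
```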

 



Re: Hive query taking too much time

2011-12-07 Thread Wojciech Langiewicz

Hi,
In this case it's much easier and faster to merge all the files with
these commands:

cat *.csv > output.csv
hive -e "load data local inpath 'output.csv' into table $table"
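
The one-file merge above can be generalized to a fixed number of output files, such as the 24 or 48 suggested elsewhere in this thread. What follows is a minimal sketch rather than a tested production script: the function name merge_into_buckets, the directory layout, and the merged_N.csv naming are assumptions made for illustration. It also assumes the CSVs have no header rows, that each file ends with a newline, and that the output directory starts out empty.

```shell
#!/bin/sh
# merge_into_buckets SRC_DIR OUT_DIR N
# Concatenates SRC_DIR/*.csv into at most N files in OUT_DIR, assigning
# input files to output files round-robin. Hypothetical helper.
merge_into_buckets() {
    src_dir=$1
    out_dir=$2
    n=$3
    mkdir -p "$out_dir" || return 1
    i=0
    for f in "$src_dir"/*.csv; do
        [ -e "$f" ] || continue              # the glob matched nothing
        bucket=$(( i % n ))
        cat "$f" >> "$out_dir/merged_$bucket.csv"
        i=$(( i + 1 ))
    done
}

# Example (paths and table name are placeholders):
#   merge_into_buckets ./incoming ./merged 24
#   for m in ./merged/merged_*.csv; do
#       hive -e "load data local inpath '$m' into table yourtable"
#   done
```

Writing the merged files to a separate directory keeps them out of the *.csv glob, so a second run does not pick up its own output as input.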


RE: Hive query taking too much time

2011-12-07 Thread Savant, Keshav
You are right, Wojciech Langiewicz. We did the same thing, and I posted
my results yesterday. We now plan to automate this with a shell script,
because our environment is dynamic and files keep arriving. We will
schedule the script as a cron job.

A question on this: we plan to merge files using one of the following
approaches:
1. Based on file count: once the number of files reaches X, merge them
and insert the result into HDFS.
2. Based on merged file size: once the merged file size exceeds X
bytes, insert it into HDFS.

I think option 2 is better, because then all merged files will be
roughly the same size. What do you suggest?

Kind Regards,
Keshav C Savant
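
Option 2 could look roughly like the sketch below. It is an illustration under stated assumptions, not the script that was actually deployed: the function name merge_by_size, the batch_N.csv naming, and the directory arguments are hypothetical, and the CSVs are assumed to have no header rows. The sketch only performs the merge; loading each finished batch would still use a hive -e "load data local inpath ..." command like the ones quoted in this thread.

```shell
#!/bin/sh
# merge_by_size SRC_DIR OUT_DIR MAX_BYTES
# Appends SRC_DIR/*.csv to the current batch file in OUT_DIR. Once the
# batch has reached MAX_BYTES it is considered full and the next input
# file starts a new batch. Hypothetical helper for illustration.
merge_by_size() {
    src_dir=$1
    out_dir=$2
    max_bytes=$3
    mkdir -p "$out_dir" || return 1
    batch=0
    size=0
    for f in "$src_dir"/*.csv; do
        [ -e "$f" ] || continue              # the glob matched nothing
        cat "$f" >> "$out_dir/batch_$batch.csv"
        size=$(( size + $(wc -c < "$f") ))   # bytes accumulated in this batch
        if [ "$size" -ge "$max_bytes" ]; then
            batch=$(( batch + 1 ))
            size=0
        fi
    done
}

# Example (paths are placeholders; 67108864 bytes = 64 MB):
#   merge_by_size ./incoming ./merged 67108864
```

The last batch may end up smaller than the threshold; in a cron setup it would normally be left until later files fill it. The sketch does not remove or archive the source CSVs, so running it twice on the same input would append the same rows again.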




Re: Hive query taking too much time

2011-12-06 Thread Wojciech Langiewicz

Hi,
In your case the total file size isn't the main factor reducing
performance; the number of files is.


To test this, try merging those 2000+ files into one (or a few) big
files, then upload the result to HDFS and test Hive performance (it
should definitely be better). If this works, you should think about
merging those files before or after loading them into HDFS.


The second issue is counts. Try to observe how your jobs use mappers and
reducers; in my experience, simple count() jobs can be stuck on one
reducer (the one that does all the counting) for a long time. I have not
resolved this issue, but it was not significant in my case.
set mapred.reduce.tasks=xyz doesn't change that behavior, but, for
example, using GROUP BY with COUNT works much faster.


I hope this helps.
--
Wojciech Langiewicz
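
A grouped count of the kind mentioned above might be issued as in the following sketch. The column name some_col, the function name count_by_group, and the HIVE_CMD override are assumptions for illustration; only a_table and mapred.reduce.tasks come from the thread. Note that this returns one count per group rather than a single total, and whether it actually spreads work across several reducers depends on the data and the cluster.

```shell
#!/bin/sh
# HIVE_CMD allows the hive binary to be replaced, e.g. for a dry run.
HIVE_CMD=${HIVE_CMD:-hive}

# count_by_group TABLE COLUMN [REDUCERS]
# Runs a per-group count with an explicit reducer count (default 4).
# Hypothetical helper; TABLE and COLUMN are inserted into the query as-is.
count_by_group() {
    table=$1
    column=$2
    reducers=${3:-4}
    "$HIVE_CMD" -e "set mapred.reduce.tasks=$reducers; select $column, count(*) from $table group by $column;"
}

# Example (some_col is a placeholder column name):
#   count_by_group a_table some_col 8
```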


Re: Hive query taking too much time

2011-12-06 Thread Mohit Gupta
Hi Paul,
I am having the same problem. Do you know any efficient way of merging the
files?

-Mohit





-- 
Best Regards,

Mohit Gupta
Software Engineer at Vdopia Inc.


Re: Hive query taking too much time

2011-12-06 Thread Ayon Sinha
How about a simple Pig script with a load and a store statement? Set the max
number of reducers to, say, 20 or 30; that way you will have only 20-30 files
as output. Then put these files in the Hive directory. Make sure the delimiters
match between Hive and Pig.
 
-Ayon



