Re: Getting Slow Query Performance!

2013-03-12 Thread bejoy_ks
Hi

Since you are on a pseudo-distributed / single-node environment, the Hadoop
MapReduce parallelism is limited.

You probably have just a few map slots, and map tasks may be queued up
waiting for others to complete. On a larger cluster your job should be faster.

As a side note, certain SQL queries that utilize indexing would be faster in a
conventional SQL server than in Hive.
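
As an illustration of that point (the index below is hypothetical; nothing in this
thread says one exists), this is the kind of access path a relational engine can
use that Hive, in this setup, cannot:

-- MySQL/SQL Server style: an index on the filtered columns lets the optimizer
-- answer the poster's AMOUNT > 90 predicate (quoted below) with an index range scan.
CREATE INDEX idx_salary_amount ON SALARY (AMOUNT, EMPLOYEE_ID);

-- A relational engine can serve much of this from that index;
-- Hive here runs it as a full MapReduce scan of the SALARY table.
SELECT EMPLOYEE_ID, AMOUNT FROM SALARY WHERE AMOUNT > 90;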

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Gobinda Paul 
Date: Tue, 12 Mar 2013 15:09:31 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Getting Slow Query Performance!






I used Sqoop to import 30 GB of data (two tables: employee, approx. 21 GB, and
salary, approx. 9 GB) into Hadoop (single node) via Hive.
I ran a sample query like:
SELECT EMPLOYEE.ID, EMPLOYEE.NAME, EMPLOYEE.DEPT, SALARY.AMOUNT
FROM EMPLOYEE JOIN SALARY
WHERE EMPLOYEE.ID = SALARY.EMPLOYEE_ID AND SALARY.AMOUNT > 90;
In Hive it takes about 15 minutes, whereas MySQL takes about 4.5 minutes to
execute that query.
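
A hedged sketch of the more conventional Hive form of that join (table and column
names are taken from the query above; behaviour varies by Hive version): putting
the equi-join key in the ON clause and leaving only the residual filter in WHERE
generally avoids building a larger intermediate result first.

SELECT e.id, e.name, e.dept, s.amount
FROM employee e
JOIN salary s ON (e.id = s.employee_id)   -- join key in the ON clause
WHERE s.amount > 90;                      -- residual filter stays in WHERE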
CPU: Pentium(R) Dual-Core E5700 @ 3.00GHz
RAM: 2GB
HDD: 500GB

Here is my hive-site.xml conf.


<configuration>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>
  <property>
    <name>hive.hwi.listen.host</name>
    <value>0.0.0.0</value>
    <description>This is the host address the Hive Web Interface will listen on</description>
  </property>
  <property>
    <name>hive.hwi.listen.port</name>
    <value></value>
    <description>This is the port the Hive Web Interface will listen on</description>
  </property>
  <property>
    <name>hive.hwi.war.file</name>
    <value>/lib/hive-hwi-0.9.0.war</value>
    <description>This is the WAR file with the jsp content for Hive Web Interface</description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>-1</value>
    <description>The default number of reduce tasks per job. Typically set to a prime
      close to the number of available hosts. Ignored when mapred.job.tracker is "local".
      Hadoop set this to 1 by default, whereas hive uses -1 as its default value.
      By setting this property to -1, Hive will automatically figure out what should be
      the number of reducers.</description>
  </property>
  <property>
    <name>hive.exec.reducers.bytes.per.reducer</name>
    <value>10</value>
    <description>size per reducer. The default is 1G, i.e. if the input size is 10G,
      it will use 10 reducers.</description>
  </property>
  <property>
    <name>hive.exec.reducers.max</name>
    <value>999</value>
    <description>max number of reducers will be used. If the one specified in the
      configuration parameter mapred.reduce.tasks is negative, hive will use this one
      as the max number of reducers when automatically determine number of reducers.</description>
  </property>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/hive-${user.name}</value>
    <description>Scratch space for Hive jobs</description>
  </property>
  <property>
    <name>hive.metastore.local</name>
    <value>true</value>
  </property>

</configuration>



Any IDEA ?? 
  




RE: Getting Slow Query Performance!

2013-03-12 Thread Bennie Schut
Well, it's probably worth knowing that 30 GB is really hitting rock bottom when
you talk about big data. Hadoop is linearly scalable, so going to 3 or 4 similar
machines could probably get you below the MySQL time, but it's hardly a fair
comparison.
For setting it up, I would suggest reading the Hadoop docs:
http://hadoop.apache.org/docs/current/
These hardware specs give you an idea why yours is an unusual case:
http://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/

To give you some hints: each node needs to be configured for how many resources
it is allowed to take. This is a balance between several parameters:
mapred.tasktracker.map.tasks.maximum, mapred.tasktracker.reduce.tasks.maximum,
mapred.child.java.opts
There are tons more configuration options, but this is where you start. Different
hardware and different jobs require different configurations, so try it out; a
sketch of the per-job side follows below.
Since you are extremely tight on RAM, you probably want to reduce memory usage
on most processes like the namenode/jobtracker/Hive, and on each node drop the
memory requirements for the tasktracker/datanode.
Also, don't put your nodes on 100 Mb links; they are almost always too slow.
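
A minimal sketch of the per-job side of this tuning, issued from the Hive CLI;
the numbers are illustrative placeholders for a 2 GB machine, not values taken
from this thread, and the tasktracker slot maxima mentioned above are daemon
settings that belong in mapred-site.xml on each node rather than in a Hive session:

-- Per-job overrides in a Hive (0.9 / MR1) session; values are examples only.
SET mapred.child.java.opts=-Xmx400m;  -- heap for each map/reduce task JVM
SET mapred.reduce.tasks=2;            -- fix the reducer count for this job

SELECT COUNT(*) FROM employee;        -- queries submitted now use the settings above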

Bennie.

From: Gobinda Paul [mailto:gobi...@live.com]
Sent: Tuesday, March 12, 2013 11:01 AM
To: user@hive.apache.org
Subject: RE: Getting Slow Query Performance!


Thanks for your reply. I am new to Hadoop and Hive. My goal is to process big
data using Hadoop; this is my university project (data mining). I need to show
that Hadoop is better than MySQL for big-data (30-100 GB+) processing, and I
know Hadoop can do that. To do so, can you please suggest how many nodes are
required to show the performance difference, and what type of configuration is
required for each node?


From: bsc...@ebuddy.com
To: user@hive.apache.org
CC: gobi...@live.com
Date: Tue, 12 Mar 2013 10:40:33 +0100
Subject: RE: Getting Slow Query Performance!
Generally, a single Hadoop machine will perform worse than a single MySQL
machine. People normally use Hadoop when they have so much data that it won't
really fit on a single machine and would require specialized hardware (stuff
like SANs) to run.
30 GB of data really isn't that much, and 2 GB of RAM is really not what Hadoop
is designed to work on. It really likes to have lots of memory.
I also don't see the Hadoop configuration files, so perhaps you only have 1
mapper and 1 reducer. But this is not a typical use case, so I doubt you'll see
snappy performance after tweaking the configs.




RE: Getting Slow Query Performance!

2013-03-12 Thread Gobinda Paul

Thanks for your reply. I am new to Hadoop and Hive. My goal is to process big
data using Hadoop; this is my university project (data mining). I need to show
that Hadoop is better than MySQL for big-data (30-100 GB+) processing, and I
know Hadoop can do that. To do so, can you please suggest how many nodes are
required to show the performance difference, and what type of configuration is
required for each node?


From: bsc...@ebuddy.com
To: user@hive.apache.org
CC: gobi...@live.com
Date: Tue, 12 Mar 2013 10:40:33 +0100
Subject: RE: Getting Slow Query Performance!

Generally, a single Hadoop machine will perform worse than a single MySQL
machine. People normally use Hadoop when they have so much data that it won't
really fit on a single machine and would require specialized hardware (stuff
like SANs) to run.
30 GB of data really isn't that much, and 2 GB of RAM is really not what Hadoop
is designed to work on. It really likes to have lots of memory.
I also don't see the Hadoop configuration files, so perhaps you only have 1
mapper and 1 reducer. But this is not a typical use case, so I doubt you'll see
snappy performance after tweaking the configs.
  

RE: Getting Slow Query Performance!

2013-03-12 Thread Bennie Schut
Generally, a single Hadoop machine will perform worse than a single MySQL
machine. People normally use Hadoop when they have so much data that it won't
really fit on a single machine and would require specialized hardware (stuff
like SANs) to run.
30 GB of data really isn't that much, and 2 GB of RAM is really not what Hadoop
is designed to work on. It really likes to have lots of memory.
I also don't see the Hadoop configuration files, so perhaps you only have 1
mapper and 1 reducer. But this is not a typical use case, so I doubt you'll see
snappy performance after tweaking the configs.

From: Gobinda Paul [mailto:gobi...@live.com]
Sent: Tuesday, March 12, 2013 10:10 AM
To: user@hive.apache.org
Subject: Getting Slow Query Performance!



I used Sqoop to import 30 GB of data (two tables: employee, approx. 21 GB, and
salary, approx. 9 GB) into Hadoop (single node) via Hive.

I ran a sample query like:
SELECT EMPLOYEE.ID, EMPLOYEE.NAME, EMPLOYEE.DEPT, SALARY.AMOUNT
FROM EMPLOYEE JOIN SALARY
WHERE EMPLOYEE.ID = SALARY.EMPLOYEE_ID AND SALARY.AMOUNT > 90;

In Hive it takes about 15 minutes, whereas MySQL takes about 4.5 minutes to
execute that query.

CPU: Pentium(R) Dual-Core  CPU  E5700  @ 3.00GHz
RAM:  2GB
HDD: 500GB


Here is my hive-site.xml conf.






  
<configuration>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>
  <property>
    <name>hive.hwi.listen.host</name>
    <value>0.0.0.0</value>
    <description>This is the host address the Hive Web Interface will listen on</description>
  </property>
  <property>
    <name>hive.hwi.listen.port</name>
    <value></value>
    <description>This is the port the Hive Web Interface will listen on</description>
  </property>
  <property>
    <name>hive.hwi.war.file</name>
    <value>/lib/hive-hwi-0.9.0.war</value>
    <description>This is the WAR file with the jsp content for Hive Web Interface</description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>-1</value>
    <description>The default number of reduce tasks per job. Typically set to a prime
      close to the number of available hosts. Ignored when mapred.job.tracker is "local".
      Hadoop set this to 1 by default, whereas hive uses -1 as its default value.
      By setting this property to -1, Hive will automatically figure out what should be
      the number of reducers.</description>
  </property>
  <property>
    <name>hive.exec.reducers.bytes.per.reducer</name>
    <value>10</value>
    <description>size per reducer. The default is 1G, i.e. if the input size is 10G,
      it will use 10 reducers.</description>
  </property>
  <property>
    <name>hive.exec.reducers.max</name>
    <value>999</value>
    <description>max number of reducers will be used. If the one specified in the
      configuration parameter mapred.reduce.tasks is negative, hive will use this one
      as the max number of reducers when automatically determine number of reducers.</description>
  </property>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/hive-${user.name}</value>
    <description>Scratch space for Hive jobs</description>
  </property>
  <property>
    <name>hive.metastore.local</name>
    <value>true</value>
  </property>

</configuration>




Any IDEA ??
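
As a hedged aside on the two reducer settings in the configuration above (this is
how Hive 0.9 on MapReduce estimates the reducer count when mapred.reduce.tasks is
-1; the override value at the end is an example, not something taken from this
thread):

-- reducers = min(ceil(total input bytes / hive.exec.reducers.bytes.per.reducer),
--                hive.exec.reducers.max)
-- With the 1 GB default and roughly 30 GB of input that works out to about 30
-- reducers; a tiny bytes-per-reducer value instead pushes the estimate straight
-- to the 999 cap. It can be overridden per session:
SET hive.exec.reducers.bytes.per.reducer=1000000000;  -- example: the 1 GB default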


Getting Slow Query Performance!

2013-03-12 Thread Gobinda Paul





I used Sqoop to import 30 GB of data (two tables: employee, approx. 21 GB, and
salary, approx. 9 GB) into Hadoop (single node) via Hive.
I ran a sample query like:
SELECT EMPLOYEE.ID, EMPLOYEE.NAME, EMPLOYEE.DEPT, SALARY.AMOUNT
FROM EMPLOYEE JOIN SALARY
WHERE EMPLOYEE.ID = SALARY.EMPLOYEE_ID AND SALARY.AMOUNT > 90;
In Hive it takes about 15 minutes, whereas MySQL takes about 4.5 minutes to
execute that query.
CPU: Pentium(R) Dual-Core E5700 @ 3.00GHz
RAM: 2GB
HDD: 500GB

Here is my hive-site.xml conf.


<configuration>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>
  <property>
    <name>hive.hwi.listen.host</name>
    <value>0.0.0.0</value>
    <description>This is the host address the Hive Web Interface will listen on</description>
  </property>
  <property>
    <name>hive.hwi.listen.port</name>
    <value></value>
    <description>This is the port the Hive Web Interface will listen on</description>
  </property>
  <property>
    <name>hive.hwi.war.file</name>
    <value>/lib/hive-hwi-0.9.0.war</value>
    <description>This is the WAR file with the jsp content for Hive Web Interface</description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>-1</value>
    <description>The default number of reduce tasks per job. Typically set to a prime
      close to the number of available hosts. Ignored when mapred.job.tracker is "local".
      Hadoop set this to 1 by default, whereas hive uses -1 as its default value.
      By setting this property to -1, Hive will automatically figure out what should be
      the number of reducers.</description>
  </property>
  <property>
    <name>hive.exec.reducers.bytes.per.reducer</name>
    <value>10</value>
    <description>size per reducer. The default is 1G, i.e. if the input size is 10G,
      it will use 10 reducers.</description>
  </property>
  <property>
    <name>hive.exec.reducers.max</name>
    <value>999</value>
    <description>max number of reducers will be used. If the one specified in the
      configuration parameter mapred.reduce.tasks is negative, hive will use this one
      as the max number of reducers when automatically determine number of reducers.</description>
  </property>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/hive-${user.name}</value>
    <description>Scratch space for Hive jobs</description>
  </property>
  <property>
    <name>hive.metastore.local</name>
    <value>true</value>
  </property>

</configuration>



Any IDEA ??