Re: Getting Slow Query Performance!
Hi

Since you are on a pseudo-distributed / single-node environment, Hadoop MapReduce parallelism is limited. You might have only a few map slots, and map tasks may be queued waiting for others to complete. On a larger cluster your job should be faster.

As a side note, certain SQL queries that utilize indexing will be faster in an SQL server than in Hive.

Regards
Bejoy KS

Sent from remote device, please excuse typos.

-----Original Message-----
From: Gobinda Paul
Date: Tue, 12 Mar 2013 15:09:31
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Getting Slow Query Performance!

I used Sqoop to import 30 GB of data (two tables: employee, approx. 21 GB, and salary, approx. 9 GB) into Hadoop (single node) via Hive. I ran a sample query like:

SELECT EMPLOYEE.ID, EMPLOYEE.NAME, EMPLOYEE.DEPT, SALARY.AMOUNT
FROM EMPLOYEE JOIN SALARY
WHERE EMPLOYEE.ID = SALARY.EMPLOYEE_ID AND SALARY.AMOUNT > 90;

In Hive it takes 15 min (approx.), whereas MySQL takes 4.5 min (approx.) to execute that query.

CPU: Pentium(R) Dual-Core CPU E5700 @ 3.00GHz
RAM: 2GB
HDD: 500GB

Here is my hive-site.xml conf:

javax.jdo.option.ConnectionURL = jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true
javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName = root
javax.jdo.option.ConnectionPassword = 123456
hive.hwi.listen.host = 0.0.0.0 (the host address the Hive Web Interface will listen on)
hive.hwi.listen.port (the port the Hive Web Interface will listen on)
hive.hwi.war.file = /lib/hive-hwi-0.9.0.war (the WAR file with the JSP content for the Hive Web Interface)
mapred.reduce.tasks = -1 (the default number of reduce tasks per job; typically set to a prime close to the number of available hosts; ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default; with -1, Hive automatically figures out the number of reducers.)
hive.exec.reducers.bytes.per.reducer = 10 (size per reducer; the default is 1G, i.e. if the input size is 10G, it will use 10 reducers)
hive.exec.reducers.max = 999 (max number of reducers; if mapred.reduce.tasks is negative, Hive uses this as the maximum when automatically determining the number of reducers)
hive.exec.scratchdir = /tmp/hive-${user.name} (scratch space for Hive jobs)
hive.metastore.local = true

Any idea??
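As a sketch of the mapred.reduce.tasks = -1 behaviour described above: when that property is -1, Hive estimates the reducer count from the total input size and the per-reducer byte target, capped by hive.exec.reducers.max. The session below shows how you could inspect and tune this from the Hive CLI; the 256 MB value is purely illustrative, not a recommendation:

```sql
-- Show the current auto-reducer settings (Hive CLI):
set hive.exec.reducers.bytes.per.reducer;
set hive.exec.reducers.max;

-- Roughly: reducers = min(hive.exec.reducers.max,
--                         ceil(input_bytes / hive.exec.reducers.bytes.per.reducer))
-- e.g. about 30 GB of input with a 1 GB per-reducer target yields ~30 reducers.

-- A smaller per-reducer target raises reduce-side parallelism (illustrative value):
set hive.exec.reducers.bytes.per.reducer=256000000;
```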
RE: Getting Slow Query Performance!
Well, it's probably worth knowing that 30 GB is really hitting rock bottom when you talk about big data. Hadoop is linearly scalable, so going to 3 or 4 similar machines could probably get you below the MySQL time, but it's hardly a fair comparison.

For setting it up I would suggest reading the Hadoop docs: http://hadoop.apache.org/docs/current/

These hardware specs give you an idea why yours is an unusual case: http://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/

To give you some hints: each node needs to be configured for how many resources it is allowed to take. This is a balance between several parameters: mapred.tasktracker.map.tasks.maximum, mapred.tasktracker.reduce.tasks.maximum, and mapred.child.java.opts. There are tons more configuration options, but this is where you start. Different hardware and different jobs require different configurations, so try it out. Since you are extremely tight on RAM, you probably want to reduce memory usage on most processes (namenode/jobtracker/Hive), and on each node drop the memory requirements for the tasktracker/datanode. Also, don't put your nodes on 100 Mb links; they are almost always too slow.

Bennie.

From: Gobinda Paul [mailto:gobi...@live.com]
Sent: Tuesday, March 12, 2013 11:01 AM
To: user@hive.apache.org
Subject: RE: Getting Slow Query Performance!

Thanks for your reply. I am new to Hadoop and Hive. My goal is to process big data using Hadoop; this is my university project (data mining). I need to show that Hadoop is better than MySQL in the case of big-data (30-100GB+) processing, and I know Hadoop does that. To do so, can you please suggest how many nodes are required to show the performance difference, and what type of configuration is required for each node?

From: bsc...@ebuddy.com
To: user@hive.apache.org
CC: gobi...@live.com
Date: Tue, 12 Mar 2013 10:40:33 +0100
Subject: RE: Getting Slow Query Performance!
Generally a single Hadoop machine will perform worse than a single MySQL machine. People normally use Hadoop when they have so much data that it won't really fit on a single machine and would require specialized hardware (stuff like SANs) to run. 30 GB of data really isn't that much, and 2 GB of RAM is really not what Hadoop is designed to work with; it really likes to have lots of memory. I also don't see the Hadoop configuration files, so perhaps you only have 1 mapper and 1 reducer. But this is not a typical use case, so I doubt you'll see snappy performance even after tweaking the configs.
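To make the slot and memory parameters mentioned above concrete, here is a minimal mapred-site.xml sketch for a low-memory node. The values are illustrative assumptions for a machine with 2 GB of RAM, not recommendations; tune them for your hardware:

```xml
<!-- mapred-site.xml fragment (illustrative values for a 2 GB RAM node) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value> <!-- concurrent map slots on this tasktracker -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value> <!-- concurrent reduce slots -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx256m</value> <!-- heap for each spawned task JVM -->
</property>
```

With 2 map slots + 1 reduce slot at 256 MB each, task JVMs stay under ~768 MB, leaving room for the datanode and tasktracker daemons on a tight machine.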
RE: Getting Slow Query Performance!
Thanks for your reply. I am new to Hadoop and Hive. My goal is to process big data using Hadoop; this is my university project (data mining). I need to show that Hadoop is better than MySQL in the case of big-data (30-100GB+) processing, and I know Hadoop does that. To do so, can you please suggest how many nodes are required to show the performance difference, and what type of configuration is required for each node?

From: bsc...@ebuddy.com
To: user@hive.apache.org
CC: gobi...@live.com
Date: Tue, 12 Mar 2013 10:40:33 +0100
Subject: RE: Getting Slow Query Performance!

Generally a single Hadoop machine will perform worse than a single MySQL machine. People normally use Hadoop when they have so much data that it won't really fit on a single machine and would require specialized hardware (stuff like SANs) to run. 30 GB of data really isn't that much, and 2 GB of RAM is really not what Hadoop is designed to work with; it really likes to have lots of memory. I also don't see the Hadoop configuration files, so perhaps you only have 1 mapper and 1 reducer. But this is not a typical use case, so I doubt you'll see snappy performance even after tweaking the configs.
RE: Getting Slow Query Performance!
Generally a single Hadoop machine will perform worse than a single MySQL machine. People normally use Hadoop when they have so much data that it won't really fit on a single machine and would require specialized hardware (stuff like SANs) to run. 30 GB of data really isn't that much, and 2 GB of RAM is really not what Hadoop is designed to work with; it really likes to have lots of memory. I also don't see the Hadoop configuration files, so perhaps you only have 1 mapper and 1 reducer. But this is not a typical use case, so I doubt you'll see snappy performance even after tweaking the configs.

From: Gobinda Paul [mailto:gobi...@live.com]
Sent: Tuesday, March 12, 2013 10:10 AM
To: user@hive.apache.org
Subject: Getting Slow Query Performance!

I used Sqoop to import 30 GB of data (two tables: employee, approx. 21 GB, and salary, approx. 9 GB) into Hadoop (single node) via Hive. I ran a sample query like:

SELECT EMPLOYEE.ID, EMPLOYEE.NAME, EMPLOYEE.DEPT, SALARY.AMOUNT
FROM EMPLOYEE JOIN SALARY
WHERE EMPLOYEE.ID = SALARY.EMPLOYEE_ID AND SALARY.AMOUNT > 90;

In Hive it takes 15 min (approx.), whereas MySQL takes 4.5 min (approx.) to execute that query.

CPU: Pentium(R) Dual-Core CPU E5700 @ 3.00GHz
RAM: 2GB
HDD: 500GB

Here is my hive-site.xml conf:

javax.jdo.option.ConnectionURL = jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true
javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName = root
javax.jdo.option.ConnectionPassword = 123456
hive.hwi.listen.host = 0.0.0.0 (the host address the Hive Web Interface will listen on)
hive.hwi.listen.port (the port the Hive Web Interface will listen on)
hive.hwi.war.file = /lib/hive-hwi-0.9.0.war (the WAR file with the JSP content for the Hive Web Interface)
mapred.reduce.tasks = -1 (the default number of reduce tasks per job; typically set to a prime close to the number of available hosts; ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default; with -1, Hive automatically figures out the number of reducers.)
hive.exec.reducers.bytes.per.reducer = 10 (size per reducer; the default is 1G, i.e. if the input size is 10G, it will use 10 reducers)
hive.exec.reducers.max = 999 (max number of reducers; if mapred.reduce.tasks is negative, Hive uses this as the maximum when automatically determining the number of reducers)
hive.exec.scratchdir = /tmp/hive-${user.name} (scratch space for Hive jobs)
hive.metastore.local = true

Any idea??
Getting Slow Query Performance!
I used Sqoop to import 30 GB of data (two tables: employee, approx. 21 GB, and salary, approx. 9 GB) into Hadoop (single node) via Hive. I ran a sample query like:

SELECT EMPLOYEE.ID, EMPLOYEE.NAME, EMPLOYEE.DEPT, SALARY.AMOUNT
FROM EMPLOYEE JOIN SALARY
WHERE EMPLOYEE.ID = SALARY.EMPLOYEE_ID AND SALARY.AMOUNT > 90;

In Hive it takes 15 min (approx.), whereas MySQL takes 4.5 min (approx.) to execute that query.

CPU: Pentium(R) Dual-Core CPU E5700 @ 3.00GHz
RAM: 2GB
HDD: 500GB

Here is my hive-site.xml conf:

javax.jdo.option.ConnectionURL = jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true
javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName = root
javax.jdo.option.ConnectionPassword = 123456
hive.hwi.listen.host = 0.0.0.0 (the host address the Hive Web Interface will listen on)
hive.hwi.listen.port (the port the Hive Web Interface will listen on)
hive.hwi.war.file = /lib/hive-hwi-0.9.0.war (the WAR file with the JSP content for the Hive Web Interface)
mapred.reduce.tasks = -1 (the default number of reduce tasks per job; typically set to a prime close to the number of available hosts; ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default; with -1, Hive automatically figures out the number of reducers.)
hive.exec.reducers.bytes.per.reducer = 10 (size per reducer; the default is 1G, i.e. if the input size is 10G, it will use 10 reducers)
hive.exec.reducers.max = 999 (max number of reducers; if mapred.reduce.tasks is negative, Hive uses this as the maximum when automatically determining the number of reducers)
hive.exec.scratchdir = /tmp/hive-${user.name} (scratch space for Hive jobs)
hive.metastore.local = true

Any idea??
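One thing worth checking in the query itself: in Hive, a JOIN whose join condition appears only in the WHERE clause can be planned as a cross product that is filtered afterwards, which is far more expensive than an equi-join on the key. A sketch of the same query with an explicit ON clause (same tables and columns as above); prefixing it with EXPLAIN lets you compare the plans Hive produces for the two forms:

```sql
-- Equi-join with the join key moved into the ON clause:
EXPLAIN
SELECT EMPLOYEE.ID, EMPLOYEE.NAME, EMPLOYEE.DEPT, SALARY.AMOUNT
FROM EMPLOYEE
JOIN SALARY ON (EMPLOYEE.ID = SALARY.EMPLOYEE_ID)
WHERE SALARY.AMOUNT > 90;
```

If the ON-clause plan shows a join keyed on EMPLOYEE.ID while the WHERE-only form does not, that difference alone could account for a large part of the gap against MySQL.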