use the request column in apache access.log as the source of the Hadoop table
Hi All, I'm facing a problem and need your help. I would like to use the request column in the apache access.log as the source of a Hadoop table. I was able to insert the entire log into a table, but I would like to insert a specific request into a specific table. The question is: is this possible without an additional script? If so, how? The following example should demonstrate what we are looking for:

1. Suppose we have the following log line:
   XXX.16.3.221 - - [22/Nov/2010:23:57:09 -0800] GET /includes/Entity1.ent?ClientID=1189272&DayOfWeek=2&Sent=OK&WeekStart=31%2000:00:00 HTTP/1.1 200 1150 - -

2. And the following corresponding table:
   CREATE TABLE Entity1 (Id INT, DayOfWeek INT, Sent STRING, WeekStart INT)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
   STORED AS TEXTFILE;

3. The following query: select * from Entity1 - should return: 1189272,2,OK,31

Some further questions:
1. Have you done something like this before?
2. Suppose the request string was encoded with base64 - is there a way to decode it, or do we need a python script for that?
3. One last question: can you give an example of how you use python here, i.e. what do you use it for?

Thanks in advance, Liad.
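One possible approach (a sketch only, untested; it assumes the raw log has already been loaded into a staging table access_log with the request line in a column named request): use Hive's built-in regexp_extract to pull the individual parameters out of the request string and insert them into Entity1, so no external script is needed for the plain-text case.

INSERT OVERWRITE TABLE Entity1
SELECT
  CAST(regexp_extract(request, 'ClientID=([0-9]+)', 1) AS INT),    -- Id
  CAST(regexp_extract(request, 'DayOfWeek=([0-9]+)', 1) AS INT),   -- DayOfWeek
  regexp_extract(request, 'Sent=([A-Za-z]+)', 1),                  -- Sent
  CAST(regexp_extract(request, 'WeekStart=([0-9]+)', 1) AS INT)    -- WeekStart
FROM access_log
WHERE request LIKE '%/includes/Entity1.ent%';   -- route only Entity1 requests to this table

For a base64-encoded request string, one option is to decode it before loading, or to stream the rows through a small script with Hive's TRANSFORM ... USING clause; whether a built-in decode function is available depends on the Hive version.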
Example of automatic insertion process from apache access.log to a hadoop table using hive
Hi, 1. Can someone provide an example of an automatic insertion process from apache access.log into a hadoop table using hive? 2. Can someone explain whether there is a way to point a table directly at a directory that acts as its data source (e.g. I copy a file into the directory, and when I run a select, hive automatically refers to the directory and searches all the files in it)? Thanks, Liad.
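A sketch of the usual way to do the second point with an external table (untested; the HDFS path and the single-column layout are assumptions): the table's LOCATION points at a directory, and every file copied into that directory is visible to subsequent SELECTs without any further insertion step.

CREATE EXTERNAL TABLE raw_access_log (line STRING)
STORED AS TEXTFILE
LOCATION '/user/hive/access_logs';

-- drop new log files into the directory as they arrive, e.g.:
-- hadoop fs -put access.log.2010-11-22 /user/hive/access_logs/

SELECT count(*) FROM raw_access_log;   -- sees all files currently in the directory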
Re: Getting CheckSumException too often
On 22/11/10 11:02, Hari Sreekumar wrote: Hi, What could be the possible reasons for getting too many checksum exceptions? I am getting these kind of exceptions quite frequently, and the whole job fails in the end:

org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_8186355706212889850:of:/tmp/Webevent_07_05_2010.dat at 4075520
 at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
 at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
 at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
 at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
 at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1158)
 at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1718)
 at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1770)
 at java.io.DataInputStream.read(DataInputStream.java:83)
 at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
 at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
 at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
 at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)

looks like a warning sign of disk failure - are there other disk health checks you could run?
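A few checks that might help confirm or rule out a failing disk on the datanodes (illustrative only; device names are assumptions and will differ per machine):

# SMART health summary and reallocated/pending sector counts
sudo smartctl -H /dev/sda
sudo smartctl -A /dev/sda | egrep -i 'reallocated|pending|uncorrect'
# kernel-level I/O errors
dmesg | grep -i error
# read-only surface scan of the data partition (slow)
sudo badblocks -sv /dev/sda1

If one disk keeps producing checksum errors, another option is to run fsck on its filesystem while unmounted, or simply remove it from dfs.data.dir and let HDFS re-replicate.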
MapReduce program unable to find custom Mapper.
I am trying to run a sample application, but I am getting the following error:

10/11/23 07:37:17 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=30
10/11/23 07:37:17 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
Created Directory!!!
File added in HDFS!!!
10/11/23 07:37:20 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/11/23 07:37:20 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
10/11/23 07:37:20 INFO input.FileInputFormat: Total input paths to process : 1
10/11/23 07:37:21 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
10/11/23 07:37:21 INFO mapreduce.JobSubmitter: number of splits:3
10/11/23 07:37:21 INFO mapreduce.JobSubmitter: adding the following namenodes' delegation tokens:null
10/11/23 07:37:21 INFO mapreduce.Job: Running job: job_201011230702_0006
10/11/23 07:37:22 INFO mapreduce.Job: map 0% reduce 0%
10/11/23 07:37:38 INFO mapreduce.Job: Task Id : attempt_201011230702_0006_m_01_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: HDFSClientTest$TokenizerMapper
 at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1128)
 at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:167)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:612)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
 at org.apache.hadoop.mapred.Child.main(Child.java:211)
Caused by: java.lang.ClassNotFoundException: HDFSClientTest$TokenizerMapper
 at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:307)

I have created a jar file which includes the HDFSClientTest$TokenizerMapper.class file, but I am still getting this error. I am using hadoop-0.21.0. -- Regards, Yogesh Patil.
Re: MapReduce program unable to find custom Mapper.
The warning (WARN) lines at the top of the output pretty much explain the answer :)
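To spell that out: the "No job jar file set" warning means the job was submitted without a jar, so the task JVMs on the TaskTrackers cannot load HDFSClientTest$TokenizerMapper even though it is in your local jar. A sketch of the usual fix in the driver (assuming HDFSClientTest is the driver class; the rest of the job setup is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
Job job = new Job(conf, "HDFSClientTest");
// Tell Hadoop which jar contains the user classes so it gets shipped
// to the task nodes; without this, tasks hit ClassNotFoundException.
job.setJarByClass(HDFSClientTest.class);
job.setMapperClass(HDFSClientTest.TokenizerMapper.class);

Then launch it with something like "hadoop jar myjob.jar HDFSClientTest <in> <out>" so that setJarByClass can locate the jar on the classpath.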
Is there a single command to start the whole cluster in CDH3 ?
I set up the cluster configuration in masters, slaves, core-site.xml, hdfs-site.xml and mapred-site.xml and copied them to all the machines. Then I log in to one of the machines and use the following to start the cluster: for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done I expected this command to SSH to all the other machines (based on the masters and slaves files) and start the corresponding daemons, but it is obviously not doing that in my setup. Am I missing something in my setup? Also, where do I specify where the Secondary Name Node runs? Rgds, Ricky
Re: Is there a single command to start the whole cluster in CDH3 ?
Hi Ricky, Which hadoop version are you using? I am using the hadoop-0.20.2 apache version, and I generally just run the $HADOOP_HOME/bin/start-dfs.sh and start-mapred.sh scripts on my master node. If passwordless ssh is configured, these scripts will start the required services on each node. You shouldn't have to start the services on each node individually. The secondary namenode is specified in the conf/masters file. The node where you call the start-*.sh script becomes the namenode (for start-dfs) or jobtracker (for start-mapred). The node mentioned in the masters file becomes the secondary namenode, and the datanodes and tasktrackers are the nodes mentioned in the slaves file. HTH, Hari On Tue, Nov 23, 2010 at 11:43 PM, Ricky Ho rickyphyl...@yahoo.com wrote: I set up the cluster configuration in masters, slaves, core-site.xml, hdfs-site.xml and mapred-site.xml and copied them to all the machines. Then I log in to one of the machines and use the following to start the cluster: for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done I expected this command to SSH to all the other machines (based on the masters and slaves files) and start the corresponding daemons, but it is obviously not doing that in my setup. Am I missing something in my setup? Also, where do I specify where the Secondary Name Node runs? Rgds, Ricky
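For example, with hypothetical hostnames, the two topology files on the node where you run the start scripts might look like this (that node itself becomes the namenode/jobtracker):

# conf/masters - host that will run the secondary namenode
node2

# conf/slaves - hosts that will run a datanode and a tasktracker
node2
node3
node4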
Re: Not a host:port pair: local
On 11/19/2010 10:07 PM, Harsh J wrote: How are you starting your JobTracker by the way? With bin/start-mapred.sh (from the Hadoop installation). --Skye
Re: Is there a single command to start the whole cluster in CDH3 ?
Thanks for pointing me to the right command. I am using the CDH3 distribution. I found that no matter what I put in the masters file, it always starts the NameNode on the machine where I issue the start-all.sh command, and always starts a SecondaryNameNode on all the other machines. Any clue? Rgds, Ricky -Original Message- From: Hari Sreekumar [mailto:hsreeku...@clickable.com] Sent: Tuesday, November 23, 2010 10:25 AM To: common-user@hadoop.apache.org Subject: Re: Is there a single command to start the whole cluster in CDH3 ? Hi Ricky, Which hadoop version are you using? I am using the hadoop-0.20.2 apache version, and I generally just run the $HADOOP_HOME/bin/start-dfs.sh and start-mapred.sh scripts on my master node. If passwordless ssh is configured, these scripts will start the required services on each node. You shouldn't have to start the services on each node individually. The secondary namenode is specified in the conf/masters file. The node where you call the start-*.sh script becomes the namenode (for start-dfs) or jobtracker (for start-mapred). The node mentioned in the masters file becomes the secondary namenode, and the datanodes and tasktrackers are the nodes mentioned in the slaves file. HTH, Hari
Config
We are currently modifying the configuration of our hadoop grid (250 machines). The machines are homogeneous; the specs are: dual quad-core CPU, 18 GB RAM, 8 x 1 TB drives. Currently we have set this up: 8 reduce slots at 800 MB, 8 map slots at 800 MB, and we raised io.sort.mb to 256 MB. We see a lot of spilling on both maps and reduces, and I am wondering what other configs I should be looking into. Thanks
Starting Hadoop on OS X fails, nohup issue
I am trying to get Hadoop 0.21.0 running on OS X 10.6.5 in pseudo-distributed mode. I downloaded and extracted the tarball, and I followed the instructions on editing core-site.xml, hdfs-site.xml, and mapred-site.xml. I also set JAVA_HOME in hadoop-env.sh as well as in my .profile. When attempting to start, I am getting the following error: localhost: nohup: can't detach from console: Inappropriate ioctl for device For what it's worth, I also tried Hadoop 0.20.0 from Cloudera and am having the exact same issue with nohup. If I remove nohup from the hadoop-daemon.sh script, it seems to start OK.
How to debug (log4j.properties),
I am trying to debug my map/reduce (Hadoop) app with the help of logging, but when I do a grep -r in $HADOOP_HOME/logs/* there is no line with debug info found. I need your help - what am I doing wrong? Thanks in advance, Tali

In my class I put:
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
LOG.warn(==);
System.out.println();

Here is my log4j.properties:
log4j.rootLogger=WARN, stdout, logfile
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.RollingFileAppender
log4j.appender.logfile.File=app-debug.log
#log4j.appender.logfile.MaxFileSize=512KB
# Keep three backup files.
log4j.appender.logfile.MaxBackupIndex=3
# Pattern to output: date priority [category] - message
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=MYLINE %d %p [%c] - %m%n
log4j.logger.org.apache.hadoop.mapred.TaskTracker=DEBUG
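A couple of notes that may explain the empty grep (hedged, since the snippet above is partial): the logger itself has to be declared, for example something like

// typical commons-logging declaration inside the mapper/reducer class
// (MyMapper is a placeholder name)
private static final Log LOG = LogFactory.getLog(MyMapper.class);

and, for code that runs inside map/reduce tasks, the output usually ends up in the per-attempt task logs under logs/userlogs/ on the tasktracker nodes (one directory per task attempt, also viewable through the JobTracker web UI), rather than in the daemon .log files you may be grepping on the client machine.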
Re: How to debug (log4j.properties),
Line like this log4j.logger.org.apache.hadoop=DEBUG works for 0.20.* and for 0.21+. Therefore it should work for all others :) So, are you trying to see your program's debug or from Hadoop ? -- Cos On Tue, Nov 23, 2010 at 05:59PM, Tali K wrote: I am trying to debug my map/reduce (Hadoop) app with help of the logging. When I do grep -r in $HADOOP_HOME/logs/* There is no line with debug info found. I need your help. What am I doing wrong? Thanks in advance, Tali In my class I put : import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; LOG.warn(==); System.out.println(); _ Here is my Log4j.properties: log4j.rootLogger=WARN, stdout, logfile log4j.appender.stdout=org.apache.log4j.ConsoleAppender log4j.appender.stdout.layout=org.apache.log4j.PatternLayout log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n log4j.appender.logfile=org.apache.log4j.RollingFileAppender log4j.appender.logfile.File=app-debug.log #log4j.appender.logfile.MaxFileSize=512KB # Keep three backup files. log4j.appender.logfile.MaxBackupIndex=3 # Pattern to output: date priority [category] - message log4j.appender.logfile.layout=org.apache.log4j.PatternLayout log4j.appender.logfile.layout.ConversionPattern=MYLINE %d %p [%c] - %m%n log4j.logger.org.apache.hadoop.mapred.TaskTracker=DEBUG
RE: How to debug (log4j.properties),
Thanks, It worked! So, are you trying to see your program's debug or from Hadoop ? I am printing some values from my Mapper. Date: Tue, 23 Nov 2010 18:26:28 -0800 From: c...@apache.org To: common-user@hadoop.apache.org Subject: Re: How to debug (log4j.properties), Line like this log4j.logger.org.apache.hadoop=DEBUG works for 0.20.* and for 0.21+. Therefore it should work for all others :) So, are you trying to see your program's debug or from Hadoop ? -- Cos
Re: Is there a single command to start the whole cluster in CDH3 ?
Hi Ricky, Yes, that's how it is meant to be. The machine where you run start-dfs.sh will become the namenode, and the machine which you specify in your masters file becomes the secondary namenode. Hari On Wed, Nov 24, 2010 at 2:13 AM, Ricky Ho rickyphyl...@yahoo.com wrote: Thanks for pointing me to the right command. I am using the CDH3 distribution. I found that no matter what I put in the masters file, it always starts the NameNode on the machine where I issue the start-all.sh command, and always starts a SecondaryNameNode on all the other machines. Any clue? Rgds, Ricky
Re: Config
Hi William, I think the most relevant config parameter to try is io.sort.factor, which affects how spills are merged on both the map and reduce side. The default value of this parameter is 10; try enlarging it to 100 or more. If spilling on the reduce side is still frequent, you could try raising mapred.job.shuffle.input.buffer.percent along with mapred.child.java.opts, which may reduce disk spilling in the shuffle phase. The default value of mapred.job.shuffle.input.buffer.percent is 0.7, and mapred.child.java.opts is -Xmx200m by default. Note that increasing these values will also increase memory usage, so we need to make sure memory won't become the system bottleneck. Hope this helps. On 24 November 2010 04:58, William wtheisin...@gmail.com wrote: We are currently modifying the configuration of our hadoop grid (250 machines). The machines are homogeneous; the specs are: dual quad-core CPU, 18 GB RAM, 8 x 1 TB drives. Currently we have set this up: 8 reduce slots at 800 MB, 8 map slots at 800 MB, and we raised io.sort.mb to 256 MB. We see a lot of spilling on both maps and reduces, and I am wondering what other configs I should be looking into. Thanks -- Best Regards, Li Yu
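For reference, a sketch of how those knobs might look in mapred-site.xml (the values are illustrative starting points, not recommendations; mapred.child.java.opts usually needs to match your 800 MB slot sizing):

<property>
  <name>io.sort.factor</name>
  <value>100</value>      <!-- merge more spill streams per pass -->
</property>
<property>
  <name>io.sort.mb</name>
  <value>256</value>      <!-- map-side sort buffer, in MB -->
</property>
<property>
  <name>mapred.job.shuffle.input.buffer.percent</name>
  <value>0.70</value>     <!-- fraction of reduce heap used for shuffle buffers -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx800m</value> <!-- per-task heap -->
</property>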
Re: Is there a single command to start the whole cluster in CDH3 ?
Hi Hari, When I try to start the hadoop daemons with /usr/lib/hadoop# bin/start-dfs.sh on the name node, it gives this error: "May not run daemons as root. Please specify HADOOP_NAMENODE_USER" (same for the other daemons). But when I start it using /etc/init.d/hadoop-0.20-namenode start, it starts successfully. What is the reason behind that? On Wed, Nov 24, 2010 at 10:04 AM, Hari Sreekumar hsreeku...@clickable.com wrote: Hi Ricky, Yes, that's how it is meant to be. The machine where you run start-dfs.sh will become the namenode, and the machine which you specify in your masters file becomes the secondary namenode. Hari -- -Thanks and Regards, Rahul Patodi Associate Software Engineer, Impetus Infotech (India) Private Limited, www.impetus.com Mob:09907074413
Re: Is there a single command to start the whole cluster in CDH3 ?
Hi Ricky, For installing CDH3 you can refer to this tutorial: http://cloudera-tutorial.blogspot.com/2010/11/running-cloudera-in-distributed-mode.html All the steps in this tutorial are well tested. (In case of any query, please leave a comment.) On Wed, Nov 24, 2010 at 11:48 AM, rahul patodi patodira...@gmail.com wrote: Hi Hari, When I try to start the hadoop daemons with /usr/lib/hadoop# bin/start-dfs.sh on the name node, it gives this error: "May not run daemons as root. Please specify HADOOP_NAMENODE_USER" (same for the other daemons). But when I start it using /etc/init.d/hadoop-0.20-namenode start, it starts successfully. What is the reason behind that? -- -Thanks and Regards, Rahul Patodi Associate Software Engineer, Impetus Infotech (India) Private Limited, www.impetus.com Mob:09907074413
Re: Is there a single command to start the whole cluster in CDH3 ?
Hi Rahul, I am not sure about CDH, but I have created a separate hadoop user to run my ASF hadoop version, and it works fine. Maybe you can also try creating a new hadoop user and making that user the owner of the hadoop root directory. HTH, Hari On Wed, Nov 24, 2010 at 11:51 AM, rahul patodi patodira...@gmail.com wrote: Hi Ricky, For installing CDH3 you can refer to this tutorial: http://cloudera-tutorial.blogspot.com/2010/11/running-cloudera-in-distributed-mode.html All the steps in this tutorial are well tested. (In case of any query, please leave a comment.) On Wed, Nov 24, 2010 at 11:48 AM, rahul patodi patodira...@gmail.com wrote: Hi Hari, When I try to start the hadoop daemons with /usr/lib/hadoop# bin/start-dfs.sh on the name node, it gives this error: "May not run daemons as root. Please specify HADOOP_NAMENODE_USER" (same for the other daemons). But when I start it using /etc/init.d/hadoop-0.20-namenode start, it starts successfully. What is the reason behind that?
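Related to the error itself: if I remember the CDH start scripts correctly, they refuse to launch daemons as root unless a per-daemon user variable is exported, so an alternative to a single hadoop user is something like the following in conf/hadoop-env.sh on the node where you run the start scripts (the variable names beyond HADOOP_NAMENODE_USER and the hdfs/mapred split are assumptions based on that error message; please verify against your CDH3 scripts):

# run HDFS daemons as the hdfs user and MapReduce daemons as the mapred user
export HADOOP_NAMENODE_USER=hdfs
export HADOOP_SECONDARYNAMENODE_USER=hdfs
export HADOOP_DATANODE_USER=hdfs
export HADOOP_JOBTRACKER_USER=mapred
export HADOOP_TASKTRACKER_USER=mapred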
Re: Is there a single command to start the whole cluster in CDH3 ?
Hi everyone, Since this question is CDH-specific, it's better to ask on the cdh-user mailing list: https://groups.google.com/a/cloudera.org/group/cdh-user/topics?pli=1 Thanks -Todd On Wed, Nov 24, 2010 at 1:26 AM, Hari Sreekumar hsreeku...@clickable.com wrote: Hi Rahul, I am not sure about CDH, but I have created a separate hadoop user to run my ASF hadoop version, and it works fine. Maybe you can also try creating a new hadoop user and making that user the owner of the hadoop root directory. HTH, Hari -- Todd Lipcon Software Engineer, Cloudera
is HDFS-788 resolved?
Hi there, Is https://issues.apache.org/jira/browse/HDFS-788 resolved? What actually happens if the smaller partition of some datanodes gets full while writing a block? Is it possible that the datanodes are recognized as dead, causing a replication storm among some hundreds of machines? Thanks, Manhee