use the request column in apache access.log as the source of the Hadoop table

2010-11-23 Thread liad livnat
Hi All

I'm facing a problem and need your help.

*I would like to use the request column in apache access.log as the source
of the Hadoop table.*





I was able to insert the entire log into a table, but I would like to insert
a *specific request into a specific table*. *The question is*: is this possible
without an additional script? If so, how?

The following example should demonstrate what we are looking for:



1.   Suppose we have the following log file

a.   XXX.16.3.221 - - [22/Nov/2010:23:57:09 -0800] "GET
/includes/Entity1.ent?ClientID=1189272&DayOfWeek=2&Sent=OK&WeekStart=31%2000:00:00
HTTP/1.1" 200 1150 "-" "-"

2.   And the following appropriate table

CREATE TABLE Entity1(

Id INT,

DayOfWeek INT,

Sent STRING,

WeekStart INT)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

3.   The following query: select * from Entity1 should return:
1189272,2,OK,31
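
One script-free way to get there, sketched under the assumption that the raw
log lines are first loaded into a one-column staging table (raw_access_log and
the regular expressions below are illustrative, not part of the original setup):

-- staging table holding the raw access.log lines (hypothetical name)
CREATE TABLE raw_access_log (line STRING)
ROW FORMAT DELIMITED LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- project the query-string parameters straight into Entity1 using the
-- built-in regexp_extract; the patterns assume the request format shown above
INSERT OVERWRITE TABLE Entity1
SELECT
  CAST(regexp_extract(line, 'ClientID=([0-9]+)', 1) AS INT),
  CAST(regexp_extract(line, 'DayOfWeek=([0-9]+)', 1) AS INT),
  regexp_extract(line, 'Sent=([A-Za-z]+)', 1),
  CAST(regexp_extract(line, 'WeekStart=([0-9]+)', 1) AS INT)
FROM raw_access_log
WHERE line LIKE '%/includes/Entity1.ent%';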



1.   Have you done something like this before?

2.   Suppose the request string were encoded with base64: is there a
way to decode it, or do we need to use a Python script for that?

3.   One last question: can you give an example of how you use Python,
i.e. what do you use it for?
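
On questions 2 and 3: older Hive releases have no built-in base64 decoder, so
the usual route is a small user script plugged in through TRANSFORM. A hedged
sketch (decode_b64.py is a hypothetical script name):

ADD FILE decode_b64.py;

SELECT TRANSFORM (line)
  USING 'python decode_b64.py'
  AS (decoded_line STRING)
FROM raw_access_log;

and decode_b64.py could be as small as:

import base64
import sys

# read tab-separated rows from stdin, decode the first column, write to stdout
for row in sys.stdin:
    cols = row.rstrip('\n').split('\t')
    cols[0] = base64.b64decode(cols[0])
    print '\t'.join(cols)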



Thanks in advance,

Liad.


Example of automatic insertion process from apache access.log to hadoop table using hive

2010-11-23 Thread liad livnat
Hi,
1. Can someone provide me with an example of an automatic insertion process
from an apache access.log into a Hadoop table using Hive?
2. Can someone explain whether there is a way to point a Hadoop table directly
at a directory that serves as its data source (e.g. after copying a file into
the directory, a SELECT in Hive would automatically refer to the directory and
search all the files in it)? A sketch of both is below.
Thanks,
Liad.
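
A hedged sketch covering both questions above (paths and table names are
illustrative): an EXTERNAL table pointed at a directory makes Hive read every
file that lands in that directory, and LOAD DATA (or a plain hadoop fs -put
into the same directory) is how new log files can be brought in, e.g. from a
cron job:

-- question 2: point the table at a directory; any file copied into
-- /user/hive/access_logs is seen by subsequent SELECTs
CREATE EXTERNAL TABLE access_log_raw (line STRING)
ROW FORMAT DELIMITED LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/hive/access_logs';

-- question 1: periodically (e.g. from cron) push a new log file in
LOAD DATA INPATH '/tmp/access.log.2010-11-23' INTO TABLE access_log_raw;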


Re: Getting CheckSumException too often

2010-11-23 Thread Steve Loughran

On 22/11/10 11:02, Hari Sreekumar wrote:

Hi,

   What could be the possible reasons for getting so many checksum
exceptions? I am getting this kind of exception quite frequently, and the
whole job fails in the end:

org.apache.hadoop.fs.ChecksumException: Checksum error:
/blk_8186355706212889850:of:/tmp/Webevent_07_05_2010.dat at 4075520
at 
org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
at 
org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
at 
org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1158)
at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1718)
at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1770)
at java.io.DataInputStream.read(DataInputStream.java:83)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
at 
org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
at 
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
at 
org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)


Looks like a warning sign of disk failure - are there other disk health
checks you could run?
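
A few host-level checks one might run on the affected datanode, as a hedged
sketch (device names are illustrative and tool availability varies by distro):

sudo smartctl -H /dev/sda           # overall SMART health verdict
sudo smartctl -l error /dev/sda     # errors the drive itself has logged
dmesg | grep -iE 'ata|i/o error'    # kernel-level I/O errors since boot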


MapReduce program unable to find custom Mapper.

2010-11-23 Thread Patil Yogesh

I am trying to run a sample application, but I am getting the following error.

10/11/23 07:37:17 INFO security.Groups: Group mapping
impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
cacheTimeout=30
10/11/23 07:37:17 WARN conf.Configuration: mapred.task.id is deprecated.
Instead, use mapreduce.task.attempt.id
Created Directory!!!
File added in HDFS!!!
10/11/23 07:37:20 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.
10/11/23 07:37:20 WARN mapreduce.JobSubmitter: No job jar file set.  User
classes may not be found. See Job or Job#setJar(String).
10/11/23 07:37:20 INFO input.FileInputFormat: Total input paths to process :
1
10/11/23 07:37:21 WARN conf.Configuration: mapred.map.tasks is deprecated.
Instead, use mapreduce.job.maps
10/11/23 07:37:21 INFO mapreduce.JobSubmitter: number of splits:3
10/11/23 07:37:21 INFO mapreduce.JobSubmitter: adding the following
namenodes' delegation tokens:null
10/11/23 07:37:21 INFO mapreduce.Job: Running job: job_201011230702_0006
10/11/23 07:37:22 INFO mapreduce.Job:  map 0% reduce 0%
10/11/23 07:37:38 INFO mapreduce.Job: Task Id :
attempt_201011230702_0006_m_01_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException:
HDFSClientTest$TokenizerMapper
at
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1128)
at
org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:167)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:612)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
at org.apache.hadoop.mapred.Child.main(Child.java:211)
Caused by: java.lang.ClassNotFoundException: HDFSClientTest$TokenizerMapper
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)


I have created a jar file which includes the
HDFSClientTest$TokenizerMapper.class file, but I am still getting this
error.

I am using hadoop-0.21.0

--
Regards,
Yogesh Patil.



Re: MapReduce program unable to find custom Mapper.

2010-11-23 Thread Harsh J
The WARN line near the top of your output pretty much explains it: no job jar
file was set, so your user classes cannot be found on the task nodes :)
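
A minimal sketch of the driver-side fix that warning points at, assuming the
new (org.apache.hadoop.mapreduce) API; the mapper body and job name here are
illustrative stand-ins, not the original code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HDFSClientTest {

    // stand-in for the poster's nested mapper; default (identity) map is enough here
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "token count");
        // The crucial line: ship the jar that contains this class (and its
        // nested TokenizerMapper) so the task JVMs can load the mapper.
        job.setJarByClass(HDFSClientTest.class);
        job.setMapperClass(TokenizerMapper.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}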


Is there a single command to start the whole cluster in CDH3 ?

2010-11-23 Thread Ricky Ho
I set up the cluster configuration in masters, slaves, core-site.xml,
hdfs-site.xml and mapred-site.xml and copied them to all the machines.

Then I log in to one of the machines and use the following to start the cluster.
for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done

I expected this command to SSH to all the other machines (based on the masters
and slaves files) and start the corresponding daemons, but it is obviously not
doing that in my setup.

Am I missing something in my setup?

Also, where do I specify where the Secondary NameNode is run?

Rgds,
Ricky



  


Re: Is there a single command to start the whole cluster in CDH3 ?

2010-11-23 Thread Hari Sreekumar
Hi Ricky,

 Which hadoop version are you using? I am using the Apache hadoop-0.20.2
version, and I generally just run the $HADOOP_HOME/bin/start-dfs.sh and
start-mapred.sh scripts on my master node. If passwordless ssh is configured,
these scripts will start the required services on each node. You shouldn't
have to start the services on each node individually. The secondary namenode
is specified in the conf/masters file. The node where you call the
start-*.sh script becomes the namenode (for start-dfs) or jobtracker (for
start-mapred). The node mentioned in the masters file becomes the secondary
namenode, and the datanodes and tasktrackers are the nodes mentioned in the
slaves file.

HTH,
Hari
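
A compressed sketch of that flow for an Apache tarball install (hostnames are
placeholders; CDH package installs use the /etc/init.d services instead):

# on the machine that should become namenode and jobtracker, as the hadoop user
echo snn-host.example.com > $HADOOP_HOME/conf/masters     # secondary namenode
printf 'dn1.example.com\ndn2.example.com\n' > $HADOOP_HOME/conf/slaves
$HADOOP_HOME/bin/start-dfs.sh     # namenode here, datanodes from the slaves file
$HADOOP_HOME/bin/start-mapred.sh  # jobtracker here, tasktrackers from the slaves file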

On Tue, Nov 23, 2010 at 11:43 PM, Ricky Ho rickyphyl...@yahoo.com wrote:

 I setup the cluster configuration in masters, slaves, core-site.xml,
 hdfs-site.xml, mapred-site.xml and copy to all the machines.

 And I login to one of the machines and use the following to start the
 cluster.
 for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done

 I expect this command will SSH to all the other machines (based on the
 master
 and slaves files) to start the corresponding daemons, but obviously it is
 not
 doing that in my setup.

 Am I missing something in my setup ?

 Also, where do I specify where the Secondary Name Node is run.

 Rgds,
 Ricky







Re: Not a host:port pair: local

2010-11-23 Thread Skye Berghel

On 11/19/2010 10:07 PM, Harsh J wrote:

How are you starting your JobTracker by the way?


With bin/start-mapred.sh (from the Hadoop installation).

--Skye


Re: Is there a single command to start the whole cluster in CDH3 ?

2010-11-23 Thread Ricky Ho
Thanks for pointing me to the right command.  I am using the CDH3 distribution.
I figured out that no matter what I put in the masters file, it always starts the
NameNode on the machine where I issue the start-all.sh command, and always
starts a SecondaryNameNode on all the other machines.  Any clue?


Rgds,
Ricky

-Original Message-
From: Hari Sreekumar [mailto:hsreeku...@clickable.com] 
Sent: Tuesday, November 23, 2010 10:25 AM
To: common-user@hadoop.apache.org
Subject: Re: Is there a single command to start the whole cluster in CDH3 ?
 
Hi Ricky,
 
 Which hadoop version are you using? I am using hadoop-0.20.2 apache
version, and I generally just run the $HADOOP_HOME/bin/start-dfs.sh and
start-mapred.sh script on my master node. If passwordless ssh is configured,
this script will start the required services on each node. You shouldn't
have to start the services on each node individually. The secondary namenode
is specified in the conf/masters file. The node where you call the
start-*.sh script becomes the namenode(for start-dfs) or jobtracker(for
start-mapred). The node mentioned in the masters file becomes the 2ndary
namenode, and the datanodes and tasktrackers are the nodes which are
mentioned in the slaves file.
 
HTH,
Hari
 
On Tue, Nov 23, 2010 at 11:43 PM, Ricky Ho rickyphyl...@yahoo.com wrote:
 
 I setup the cluster configuration in masters, slaves, core-site.xml,
 hdfs-site.xml, mapred-site.xml and copy to all the machines.
 
 And I login to one of the machines and use the following to start the
 cluster.
 for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done
 
 I expect this command will SSH to all the other machines (based on the
 master
 and slaves files) to start the corresponding daemons, but obviously it is
 not
 doing that in my setup.
 
 Am I missing something in my setup ?
 
 Also, where do I specify where the Secondary Name Node is run.
 
 Rgds,
 Ricky
 
 
 
 
 


  


Config

2010-11-23 Thread William
We are currently modifying the configuration of our hadoop grid (250
machines).  The machines are homogeneous and the specs are:

dual quad-core CPUs, 18 GB RAM, 8 x 1 TB drives

Currently we have set this up as follows:

8 reduce slots at 800 MB
8 map slots at 800 MB

raised our io.sort.mb to 256 MB

We see a lot of spilling on both maps and reduces, and I am wondering what
other configs I should be looking into.

Thanks


Starting Hadoop on OS X fails, nohup issue

2010-11-23 Thread Bryan Keller
I am trying to get Hadoop 0.21.0 running on OS X 10.6.5 in pseudo-distributed 
mode. I downloaded and extracted the tarball, and I followed the instructions 
on editing core-site.xml, hdfs-site.xml, and mapred-site.xml. I also set 
JAVA_HOME in hadoop-env.sh as well as in my .profile. When attempting to start, 
I am getting the following error:

localhost: nohup: can't detach from console: Inappropriate ioctl for device

For what it's worth, I also tried Hadoop 0.20.0 from Cloudera and am having the 
exact same issue with nohup. If I remove nohup from the hadoop-daemon.sh 
script, it seems to start OK.

How to debug (log4j.properties),

2010-11-23 Thread Tali K




I am trying to debug my map/reduce (Hadoop) app with the help of logging.
When I do a grep -r in $HADOOP_HOME/logs/*, no line with my debug info is
found.
I need your help.  What am I doing wrong?
Thanks in advance,
Tali

In my class I put :

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;


LOG.warn("==");
System.out.println();
_


Here is my Log4j.properties:

log4j.rootLogger=WARN, stdout, logfile

log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n

log4j.appender.logfile=org.apache.log4j.RollingFileAppender
log4j.appender.logfile.File=app-debug.log


#log4j.appender.logfile.MaxFileSize=512KB
# Keep three backup files.
log4j.appender.logfile.MaxBackupIndex=3
# Pattern to output: date priority [category] - message
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=MYLINE %d %p [%c] - %m%n
log4j.logger.org.apache.hadoop.mapred.TaskTracker=DEBUG
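
A minimal sketch of a mapper wired up with commons-logging (class and message
names are illustrative). Note that with the stock task-side log4j setup, output
from code running inside map/reduce tasks generally ends up under
logs/userlogs/<attempt-id>/ on the node that ran the task (stdout for
System.out, syslog for the commons-logging calls), not in the top-level *.log
files:

import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // commons-logging logger; messages go to the task attempt's syslog file
    private static final Log LOG = LogFactory.getLog(MyMapper.class);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        LOG.warn("map called with key " + key);   // visible at WARN level
        context.write(new Text(value.toString()), new LongWritable(1));
    }
}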

  

Re: How to debug (log4j.properties),

2010-11-23 Thread Konstantin Boudnik
A line like this:
  log4j.logger.org.apache.hadoop=DEBUG

works for 0.20.* and for 0.21+, so it should work for all the others :)

So, are you trying to see debug output from your program or from Hadoop?

--
  Cos

On Tue, Nov 23, 2010 at 05:59PM, Tali K wrote:
 
 
 
 
 I am trying to debug my map/reduce (Hadoop)  app with help of the logging.
 When I do grep -r in $HADOOP_HOME/logs/* 
 
 There is no line with debug info found.
 I need your help.  What am I doing wrong?
 Thanks in advance,
 Tali
 
 In my class I put :
 
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;
 
 
 LOG.warn(==);
 System.out.println();
 _
 
 
 Here is my Log4j.properties:
 
 log4j.rootLogger=WARN, stdout, logfile
 
 log4j.appender.stdout=org.apache.log4j.ConsoleAppender
 log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
 log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
 
 log4j.appender.logfile=org.apache.log4j.RollingFileAppender
 log4j.appender.logfile.File=app-debug.log
 
 
 #log4j.appender.logfile.MaxFileSize=512KB
 # Keep three backup files.
 log4j.appender.logfile.MaxBackupIndex=3
 # Pattern to output: date priority [category] - message
 log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
 log4j.appender.logfile.layout.ConversionPattern=MYLINE %d %p [%c] - %m%n
 log4j.logger.org.apache.hadoop.mapred.TaskTracker=DEBUG
 
 


RE: How to debug (log4j.properties),

2010-11-23 Thread Tali K

Thanks,

It worked!
 
 So, are you trying to see your program's debug or from Hadoop ?
 
I am printing some values from my Mapper. 

 Date: Tue, 23 Nov 2010 18:26:28 -0800
 From: c...@apache.org
 To: common-user@hadoop.apache.org
 Subject: Re: How to debug  (log4j.properties),
 
 Line like this 
   log4j.logger.org.apache.hadoop=DEBUG
 
 works for 0.20.* and for 0.21+. Therefore it should work for all others :)
 
 So, are you trying to see your program's debug or from Hadoop ?
 
 --
   Cos
 
 On Tue, Nov 23, 2010 at 05:59PM, Tali K wrote:
  
  
  
  
  I am trying to debug my map/reduce (Hadoop)  app with help of the logging.
  When I do grep -r in $HADOOP_HOME/logs/* 
  
  There is no line with debug info found.
  I need your help.  What am I doing wrong?
  Thanks in advance,
  Tali
  
  In my class I put :
  
  import org.apache.commons.logging.Log;
  import org.apache.commons.logging.LogFactory;
  
  
  LOG.warn(==);
  System.out.println();
  _
  
  
  Here is my Log4j.properties:
  
  log4j.rootLogger=WARN, stdout, logfile
  
  log4j.appender.stdout=org.apache.log4j.ConsoleAppender
  log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
  log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
  
  log4j.appender.logfile=org.apache.log4j.RollingFileAppender
  log4j.appender.logfile.File=app-debug.log
  
  
  #log4j.appender.logfile.MaxFileSize=512KB
  # Keep three backup files.
  log4j.appender.logfile.MaxBackupIndex=3
  # Pattern to output: date priority [category] - message
  log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
  log4j.appender.logfile.layout.ConversionPattern=MYLINE %d %p [%c] - %m%n
  log4j.logger.org.apache.hadoop.mapred.TaskTracker=DEBUG
  

  

Re: Is there a single command to start the whole cluster in CDH3 ?

2010-11-23 Thread Hari Sreekumar
Hi Ricky,

  Yes, that's how it is meant to be. The machine where you run
start-dfs.sh will become the namenode, and the machine which you specify in
your masters file becomes the secondary namenode.

Hari

On Wed, Nov 24, 2010 at 2:13 AM, Ricky Ho rickyphyl...@yahoo.com wrote:

 Thanks for pointing me to the right command.  I am using the CDH3
 distribution.
 I figure out no matter what I put in the masters file, it always start the
 NamedNode at the machine where I issue the start-all.sh command.  And
 always
 start a SecondaryNamedNode in all other machines.  Any clue ?


 Rgds,
 Ricky

 -Original Message-
 From: Hari Sreekumar [mailto:hsreeku...@clickable.com]
 Sent: Tuesday, November 23, 2010 10:25 AM
 To: common-user@hadoop.apache.org
 Subject: Re: Is there a single command to start the whole cluster in CDH3 ?

 Hi Ricky,

 Which hadoop version are you using? I am using hadoop-0.20.2 apache
 version, and I generally just run the $HADOOP_HOME/bin/start-dfs.sh and
 start-mapred.sh script on my master node. If passwordless ssh is
 configured,
 this script will start the required services on each node. You shouldn't
 have to start the services on each node individually. The secondary
 namenode
 is specified in the conf/masters file. The node where you call the
 start-*.sh script becomes the namenode(for start-dfs) or jobtracker(for
 start-mapred). The node mentioned in the masters file becomes the 2ndary
 namenode, and the datanodes and tasktrackers are the nodes which are
 mentioned in the slaves file.

 HTH,
 Hari

 On Tue, Nov 23, 2010 at 11:43 PM, Ricky Ho rickyphyl...@yahoo.com wrote:

  I setup the cluster configuration in masters, slaves,
 core-site.xml,
  hdfs-site.xml, mapred-site.xml and copy to all the machines.
 
  And I login to one of the machines and use the following to start the
  cluster.
  for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done
 
  I expect this command will SSH to all the other machines (based on the
  master
  and slaves files) to start the corresponding daemons, but obviously it
 is
  not
  doing that in my setup.
 
  Am I missing something in my setup ?
 
  Also, where do I specify where the Secondary Name Node is run.
 
  Rgds,
  Ricky
 
 
 
 
 






Re: Config

2010-11-23 Thread Yu Li
Hi William,

I think the most relevant config parameter to try is io.sort.factor, which
affects the number of disk spills and merges on both the map and reduce side.
The default value of this parameter is 10; try raising it to 100 or more.

If spilling on the reduce side is still frequent, you could try raising
mapred.job.shuffle.input.buffer.percent along with
mapred.child.java.opts, which may reduce the number of disk spills in the
shuffle phase. The default value of mapred.job.shuffle.input.buffer.percent
is 0.7, with mapred.child.java.opts set to -Xmx200m by default.

Note that increasing these values will also increase memory usage, so
we need to make sure memory won't become the system bottleneck.

Hope this could help.
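
A hedged sketch of what those knobs look like in mapred-site.xml; the values
are illustrative starting points for experimentation, not tuned recommendations
for this cluster:

<property>
  <name>io.sort.mb</name>
  <value>256</value>
</property>
<property>
  <name>io.sort.factor</name>
  <value>100</value>
</property>
<property>
  <name>mapred.job.shuffle.input.buffer.percent</name>
  <value>0.70</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx800m</value>
</property>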

On 24 November 2010 04:58, William wtheisin...@gmail.com wrote:

 We are currently modifying the configuration of our hadoop grid (250
 machines).  The machines are homogeneous and the specs are

 dual quad core cpu 18Gb ram 8x1tb drives

 currently we have set this up  -

 8 reduce slots at 800mb
 8 map slots at 800mb

 raised our io.sort.mb to 256mb

 we see a lot of spilling on both maps and reduces and I am wondering what
 other configs I should be looking into

 Thanks




-- 
Best Regards,
Li Yu


Re: Is there a single command to start the whole cluster in CDH3 ?

2010-11-23 Thread rahul patodi
hi Hari,
when i try to start the hadoop daemons with bin/start-dfs.sh from
/usr/lib/hadoop on the name node, it gives this error: "May not run daemons
as root. Please specify HADOOP_NAMENODE_USER" (and the same for the other
daemons),
but when i try to start it using /etc/init.d/hadoop-0.20-namenode start,
it starts successfully.

What's the reason behind that?

On Wed, Nov 24, 2010 at 10:04 AM, Hari Sreekumar
hsreeku...@clickable.comwrote:

 Hi Ricky,

  Yes, that's how it is meant to be. The machine where you run
 start-dfs.sh will become the namenode, and the machine whihc you specify in
 you masters file becomes the secondary namenode.

 Hari

 On Wed, Nov 24, 2010 at 2:13 AM, Ricky Ho rickyphyl...@yahoo.com wrote:

  Thanks for pointing me to the right command.  I am using the CDH3
  distribution.
  I figure out no matter what I put in the masters file, it always start
 the
  NamedNode at the machine where I issue the start-all.sh command.  And
  always
  start a SecondaryNamedNode in all other machines.  Any clue ?
 
 
  Rgds,
  Ricky
 
  -Original Message-
  From: Hari Sreekumar [mailto:hsreeku...@clickable.com]
  Sent: Tuesday, November 23, 2010 10:25 AM
  To: common-user@hadoop.apache.org
  Subject: Re: Is there a single command to start the whole cluster in CDH3
 ?
 
  Hi Ricky,
 
  Which hadoop version are you using? I am using hadoop-0.20.2
 apache
  version, and I generally just run the $HADOOP_HOME/bin/start-dfs.sh and
  start-mapred.sh script on my master node. If passwordless ssh is
  configured,
  this script will start the required services on each node. You shouldn't
  have to start the services on each node individually. The secondary
  namenode
  is specified in the conf/masters file. The node where you call the
  start-*.sh script becomes the namenode(for start-dfs) or jobtracker(for
  start-mapred). The node mentioned in the masters file becomes the 2ndary
  namenode, and the datanodes and tasktrackers are the nodes which are
  mentioned in the slaves file.
 
  HTH,
  Hari
 
  On Tue, Nov 23, 2010 at 11:43 PM, Ricky Ho rickyphyl...@yahoo.com
 wrote:
 
   I setup the cluster configuration in masters, slaves,
  core-site.xml,
   hdfs-site.xml, mapred-site.xml and copy to all the machines.
  
   And I login to one of the machines and use the following to start the
   cluster.
   for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done
  
   I expect this command will SSH to all the other machines (based on the
   master
   and slaves files) to start the corresponding daemons, but obviously
 it
  is
   not
   doing that in my setup.
  
   Am I missing something in my setup ?
  
   Also, where do I specify where the Secondary Name Node is run.
  
   Rgds,
   Ricky
  
  
  
  
  
 
 
 
 




-- 
-Thanks and Regards,
Rahul Patodi
Associate Software Engineer,
Impetus Infotech (India) Private Limited,
www.impetus.com
Mob:09907074413


Re: Is there a single command to start the whole cluster in CDH3 ?

2010-11-23 Thread rahul patodi
hi Ricky,
for installing CDH3 you can refer to this tutorial:
http://cloudera-tutorial.blogspot.com/2010/11/running-cloudera-in-distributed-mode.html
All the steps in this tutorial are well tested. (In case of any query, please
leave a comment.)


On Wed, Nov 24, 2010 at 11:48 AM, rahul patodi patodira...@gmail.comwrote:

 hi Hary,
 when i try to start hadoop daemons by /usr/lib/hadoop# bin/start-dfs.sh on
 name node it is giving this error:*May not run daemons as root. Please
 specify HADOOP_NAMENODE_USER(*same for other daemons*)*
 but when i try to start it using */etc/init.d/hadoop-0.20-namenode start
 *it* *gets start successfully* **
 *
 *whats the reason behind that?
 *

 On Wed, Nov 24, 2010 at 10:04 AM, Hari Sreekumar hsreeku...@clickable.com
  wrote:

 Hi Ricky,

  Yes, that's how it is meant to be. The machine where you run
 start-dfs.sh will become the namenode, and the machine whihc you specify
 in
 you masters file becomes the secondary namenode.

 Hari

 On Wed, Nov 24, 2010 at 2:13 AM, Ricky Ho rickyphyl...@yahoo.com wrote:

  Thanks for pointing me to the right command.  I am using the CDH3
  distribution.
  I figure out no matter what I put in the masters file, it always start
 the
  NamedNode at the machine where I issue the start-all.sh command.  And
  always
  start a SecondaryNamedNode in all other machines.  Any clue ?
 
 
  Rgds,
  Ricky
 
  -Original Message-
  From: Hari Sreekumar [mailto:hsreeku...@clickable.com]
  Sent: Tuesday, November 23, 2010 10:25 AM
  To: common-user@hadoop.apache.org
  Subject: Re: Is there a single command to start the whole cluster in
 CDH3 ?
 
  Hi Ricky,
 
  Which hadoop version are you using? I am using hadoop-0.20.2
 apache
  version, and I generally just run the $HADOOP_HOME/bin/start-dfs.sh and
  start-mapred.sh script on my master node. If passwordless ssh is
  configured,
  this script will start the required services on each node. You shouldn't
  have to start the services on each node individually. The secondary
  namenode
  is specified in the conf/masters file. The node where you call the
  start-*.sh script becomes the namenode(for start-dfs) or jobtracker(for
  start-mapred). The node mentioned in the masters file becomes the 2ndary
  namenode, and the datanodes and tasktrackers are the nodes which are
  mentioned in the slaves file.
 
  HTH,
  Hari
 
  On Tue, Nov 23, 2010 at 11:43 PM, Ricky Ho rickyphyl...@yahoo.com
 wrote:
 
   I setup the cluster configuration in masters, slaves,
  core-site.xml,
   hdfs-site.xml, mapred-site.xml and copy to all the machines.
  
   And I login to one of the machines and use the following to start the
   cluster.
   for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done
  
   I expect this command will SSH to all the other machines (based on the
   master
   and slaves files) to start the corresponding daemons, but obviously
 it
  is
   not
   doing that in my setup.
  
   Am I missing something in my setup ?
  
   Also, where do I specify where the Secondary Name Node is run.
  
   Rgds,
   Ricky
  
  
  
  
  
 
 
 
 




 --
 -Thanks and Regards,
 Rahul Patodi
 Associate Software Engineer,
 Impetus Infotech (India) Private Limited,
 www.impetus.com
 Mob:09907074413




-- 
-Thanks and Regards,
Rahul Patodi
Associate Software Engineer,
Impetus Infotech (India) Private Limited,
www.impetus.com
Mob:09907074413


Re: Is there a single command to start the whole cluster in CDH3 ?

2010-11-23 Thread Hari Sreekumar
Hi Rahul,

  I am not sure about CDH, but I have created a separate hadoop
user to run my ASF hadoop version, and it works fine. Maybe you can also try
creating a new hadoop user and making it the owner of the hadoop root directory.

HTH,
Hari
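
A rough sketch of that suggestion (user name and path are illustrative; the CDH
packages normally create and run the daemons under their own users, so this
applies more to a tarball-style install):

sudo useradd -m hadoop                        # dedicated non-root user
sudo chown -R hadoop:hadoop /usr/lib/hadoop   # make it own the install directory
sudo -u hadoop /usr/lib/hadoop/bin/start-dfs.sh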

On Wed, Nov 24, 2010 at 11:51 AM, rahul patodi patodira...@gmail.comwrote:

 hi Ricky,
 for installing CDH3 you can refer this tutorial:

 http://cloudera-tutorial.blogspot.com/2010/11/running-cloudera-in-distributed-mode.html
 all the steps in this tutorial are well tested.(*in case of any query
 please
 leave a comment*)


 On Wed, Nov 24, 2010 at 11:48 AM, rahul patodi patodira...@gmail.com
 wrote:

  hi Hary,
  when i try to start hadoop daemons by /usr/lib/hadoop# bin/start-dfs.sh
 on
  name node it is giving this error:*May not run daemons as root. Please
  specify HADOOP_NAMENODE_USER(*same for other daemons*)*
  but when i try to start it using */etc/init.d/hadoop-0.20-namenode
 start
  *it* *gets start successfully* **
  *
  *whats the reason behind that?
  *
 
  On Wed, Nov 24, 2010 at 10:04 AM, Hari Sreekumar 
 hsreeku...@clickable.com
   wrote:
 
  Hi Ricky,
 
   Yes, that's how it is meant to be. The machine where you run
  start-dfs.sh will become the namenode, and the machine whihc you specify
  in
  you masters file becomes the secondary namenode.
 
  Hari
 
  On Wed, Nov 24, 2010 at 2:13 AM, Ricky Ho rickyphyl...@yahoo.com
 wrote:
 
   Thanks for pointing me to the right command.  I am using the CDH3
   distribution.
   I figure out no matter what I put in the masters file, it always start
  the
   NamedNode at the machine where I issue the start-all.sh command.
  And
   always
   start a SecondaryNamedNode in all other machines.  Any clue ?
  
  
   Rgds,
   Ricky
  
   -Original Message-
   From: Hari Sreekumar [mailto:hsreeku...@clickable.com]
   Sent: Tuesday, November 23, 2010 10:25 AM
   To: common-user@hadoop.apache.org
   Subject: Re: Is there a single command to start the whole cluster in
  CDH3 ?
  
   Hi Ricky,
  
   Which hadoop version are you using? I am using hadoop-0.20.2
  apache
   version, and I generally just run the $HADOOP_HOME/bin/start-dfs.sh
 and
   start-mapred.sh script on my master node. If passwordless ssh is
   configured,
   this script will start the required services on each node. You
 shouldn't
   have to start the services on each node individually. The secondary
   namenode
   is specified in the conf/masters file. The node where you call the
   start-*.sh script becomes the namenode(for start-dfs) or
 jobtracker(for
   start-mapred). The node mentioned in the masters file becomes the
 2ndary
   namenode, and the datanodes and tasktrackers are the nodes which are
   mentioned in the slaves file.
  
   HTH,
   Hari
  
   On Tue, Nov 23, 2010 at 11:43 PM, Ricky Ho rickyphyl...@yahoo.com
  wrote:
  
I setup the cluster configuration in masters, slaves,
   core-site.xml,
hdfs-site.xml, mapred-site.xml and copy to all the machines.
   
And I login to one of the machines and use the following to start
 the
cluster.
for service in /etc/init.d/hadoop-0.20-*; do sudo $service start;
 done
   
I expect this command will SSH to all the other machines (based on
 the
master
and slaves files) to start the corresponding daemons, but
 obviously
  it
   is
not
doing that in my setup.
   
Am I missing something in my setup ?
   
Also, where do I specify where the Secondary Name Node is run.
   
Rgds,
Ricky
   
   
   
   
   
  
  
  
  
 
 
 
 
  --
  -Thanks and Regards,
  Rahul Patodi
  Associate Software Engineer,
  Impetus Infotech (India) Private Limited,
  www.impetus.com
  Mob:09907074413
 
 


 --
 -Thanks and Regards,
 Rahul Patodi
 Associate Software Engineer,
 Impetus Infotech (India) Private Limited,
 www.impetus.com
 Mob:09907074413



Re: Is there a single command to start the whole cluster in CDH3 ?

2010-11-23 Thread Todd Lipcon
Hi everyone,

Since this question is CDH-specific, it's better to ask on the cdh-user
mailing list:
https://groups.google.com/a/cloudera.org/group/cdh-user/topics?pli=1

Thanks
-Todd

On Wed, Nov 24, 2010 at 1:26 AM, Hari Sreekumar hsreeku...@clickable.comwrote:

 Hi Raul,

  I am not sure about CDH, but I have created a separate hadoop
 user to run my ASF hadoop version, and it works fine. Maybe you can also
 try
 creating a new hadoop user, make hadoop the owner of hadoop root directory.

 HTH,
 Hari

 On Wed, Nov 24, 2010 at 11:51 AM, rahul patodi patodira...@gmail.com
 wrote:

  hi Ricky,
  for installing CDH3 you can refer this tutorial:
 
 
 http://cloudera-tutorial.blogspot.com/2010/11/running-cloudera-in-distributed-mode.html
  all the steps in this tutorial are well tested.(*in case of any query
  please
  leave a comment*)
 
 
  On Wed, Nov 24, 2010 at 11:48 AM, rahul patodi patodira...@gmail.com
  wrote:
 
   hi Hary,
   when i try to start hadoop daemons by /usr/lib/hadoop# bin/start-dfs.sh
  on
   name node it is giving this error:*May not run daemons as root. Please
   specify HADOOP_NAMENODE_USER(*same for other daemons*)*
   but when i try to start it using */etc/init.d/hadoop-0.20-namenode
  start
   *it* *gets start successfully* **
   *
   *whats the reason behind that?
   *
  
   On Wed, Nov 24, 2010 at 10:04 AM, Hari Sreekumar 
  hsreeku...@clickable.com
wrote:
  
   Hi Ricky,
  
Yes, that's how it is meant to be. The machine where you run
   start-dfs.sh will become the namenode, and the machine whihc you
 specify
   in
   you masters file becomes the secondary namenode.
  
   Hari
  
   On Wed, Nov 24, 2010 at 2:13 AM, Ricky Ho rickyphyl...@yahoo.com
  wrote:
  
Thanks for pointing me to the right command.  I am using the CDH3
distribution.
I figure out no matter what I put in the masters file, it always
 start
   the
NamedNode at the machine where I issue the start-all.sh command.
   And
always
start a SecondaryNamedNode in all other machines.  Any clue ?
   
   
Rgds,
Ricky
   
-Original Message-
From: Hari Sreekumar [mailto:hsreeku...@clickable.com]
Sent: Tuesday, November 23, 2010 10:25 AM
To: common-user@hadoop.apache.org
Subject: Re: Is there a single command to start the whole cluster in
   CDH3 ?
   
Hi Ricky,
   
Which hadoop version are you using? I am using hadoop-0.20.2
   apache
version, and I generally just run the $HADOOP_HOME/bin/start-dfs.sh
  and
start-mapred.sh script on my master node. If passwordless ssh is
configured,
this script will start the required services on each node. You
  shouldn't
have to start the services on each node individually. The secondary
namenode
is specified in the conf/masters file. The node where you call the
start-*.sh script becomes the namenode(for start-dfs) or
  jobtracker(for
start-mapred). The node mentioned in the masters file becomes the
  2ndary
namenode, and the datanodes and tasktrackers are the nodes which are
mentioned in the slaves file.
   
HTH,
Hari
   
On Tue, Nov 23, 2010 at 11:43 PM, Ricky Ho rickyphyl...@yahoo.com
   wrote:
   
 I setup the cluster configuration in masters, slaves,
core-site.xml,
 hdfs-site.xml, mapred-site.xml and copy to all the machines.

 And I login to one of the machines and use the following to start
  the
 cluster.
 for service in /etc/init.d/hadoop-0.20-*; do sudo $service start;
  done

 I expect this command will SSH to all the other machines (based on
  the
 master
 and slaves files) to start the corresponding daemons, but
  obviously
   it
is
 not
 doing that in my setup.

 Am I missing something in my setup ?

 Also, where do I specify where the Secondary Name Node is run.

 Rgds,
 Ricky





   
   
   
   
  
  
  
  
   --
   -Thanks and Regards,
   Rahul Patodi
   Associate Software Engineer,
   Impetus Infotech (India) Private Limited,
   www.impetus.com
   Mob:09907074413
  
  
 
 
  --
  -Thanks and Regards,
  Rahul Patodi
  Associate Software Engineer,
  Impetus Infotech (India) Private Limited,
  www.impetus.com
  Mob:09907074413
 




-- 
Todd Lipcon
Software Engineer, Cloudera


is HDFS-788 resolved?

2010-11-23 Thread Manhee Jo
Hi there,

Is
https://issues.apache.org/jira/browse/HDFS-788
resolved?

What actually happens if the smaller partition on some datanodes gets full
while writing a block?
Is it possible that those datanodes are recognized as dead, triggering a
replication storm among some hundreds of machines?

Thanks,
Manhee