Re: HDFS Safemode and EC2 EBS?

2009-06-25 Thread Tom White
Hi Chris,

You should really start all the slave nodes to be sure that you don't
lose data. If you start fewer than #nodes - #replication + 1 nodes
then you are virtually guaranteed to lose blocks. Starting 6 nodes out
of 10 will cause the filesystem to remain in safe mode, as you've
seen.
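
To make that concrete with your numbers, assuming the default replication
factor of 3 (dfs.replication):

    minimum slaves to start = #nodes - #replication + 1
                            = 10 - 3 + 1
                            = 8

With only 6 of the 10 slaves running, some blocks will almost certainly have
all of their replicas on stopped nodes, so the namenode never reaches the
reported-block threshold (dfs.safemode.threshold.pct, 0.999 by default) and
stays in safe mode.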

BTW I've just created a Jira for EBS support
(https://issues.apache.org/jira/browse/HADOOP-6108) which you might be
interested in.
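
In the meantime, you can see exactly what's missing by running fsck and a
datanode report on the namenode (standard commands, nothing EBS-specific):

    hadoop fsck / -files -blocks -locations   # lists files and blocks with missing replicas
    hadoop dfsadmin -report                   # shows which datanodes have checked in
    hadoop dfsadmin -safemode get             # reports whether safe mode is on

Forcing safe mode off with 'hadoop dfsadmin -safemode leave' will let jobs
run, but any files with missing blocks will stay unreadable until the
remaining slaves (and their EBS volumes) come back.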

Cheers,
Tom

On Thu, Jun 25, 2009 at 3:51 PM, Chris Curtin wrote:
> Hi,
>
> I am using 0.19.0 on EC2. The Hadoop execution and HDFS directories are on
> EBS volumes mounted to each node in my EC2 cluster; only the Hadoop install
> itself is in the AMI. We have 10 EBS volumes, and when the cluster starts it
> randomly picks one for each slave. We don't always start all 10 slaves,
> depending on what type of work we are going to do.
>
> Every third or fourth start of the cluster, the namenode goes into safe mode
> and won't come out automatically. Restarting the datanodes and task trackers
> on each of the slaves doesn't help, and there isn't much in the log files
> besides the message about waiting for the percentage of blocks to become
> available. Forcing it out of safe mode lets the cluster start working.
>
> My only thought is that something is being stored on one of the EBS volumes
> that isn't mounted when we start a smaller configuration (say, 6 nodes
> instead of 10). But isn't HDFS fault-tolerant, so that it carries on when a
> node is missing?
>
> Any advice on why the namenode and datanodes can't find all the data blocks,
> or on where to look for more information about what might be going on?
>
> Thanks,
>
> Chris
>


HDFS Safemode and EC2 EBS?

2009-06-25 Thread Chris Curtin
Hi,

I am using 0.19.0 on EC2. The Hadoop execution and HDFS directories are on
EBS volumes mounted to each node in my EC2 cluster; only the Hadoop install
itself is in the AMI. We have 10 EBS volumes, and when the cluster starts it
randomly picks one for each slave. We don't always start all 10 slaves,
depending on what type of work we are going to do.

Every third or fourth start of the cluster, the namenode goes into safe mode
and won't come out automatically. Restarting the datanodes and task trackers
on each of the slaves doesn't help, and there isn't much in the log files
besides the message about waiting for the percentage of blocks to become
available. Forcing it out of safe mode lets the cluster start working.

My only thought is that something is being stored on one of the EBS volumes
that isn't mounted when we start a smaller configuration (say, 6 nodes
instead of 10). But isn't HDFS fault-tolerant, so that it carries on when a
node is missing?

Any advice on why the namenode and datanodes can't find all the data blocks,
or on where to look for more information about what might be going on?

Thanks,

Chris