Re: hadoop 0.20 append - some clarifications

2011-02-14 Thread M. C. Srivas
The problem you describe occurs with NFS also.

Basically, single-site-semantics are very hard to achieve on a networked
file system.


On Mon, Feb 14, 2011 at 8:21 PM, Gokulakannan M  wrote:

> I agree that HDFS doesn't strongly follow POSIX semantics. But it would
> have been better if this issue were fixed.
>
>
>
>  _
>
> From: Ted Dunning [mailto:tdunn...@maprtech.com]
> Sent: Monday, February 14, 2011 10:18 PM
> To: gok...@huawei.com
> Cc: common-user@hadoop.apache.org; hdfs-u...@hadoop.apache.org;
> dhr...@gmail.com
> Subject: Re: hadoop 0.20 append - some clarifications
>
>
>
> HDFS definitely doesn't follow anything like POSIX file semantics.
>
>
>
> They may be a vague inspiration for what HDFS does, but generally the
> behavior of HDFS is not tightly specified.  Even the unit tests have some
> real surprising behavior.
>
> On Mon, Feb 14, 2011 at 7:21 AM, Gokulakannan M  wrote:
>
>
>
> >> I think that in general, the behavior of any program reading data from
> an
> HDFS file before hsync or close is called is pretty much undefined.
>
>
>
> In Unix, users can read a file in parallel while another user is writing it.
> And I suppose the sync feature design is based on that.
>
> So at any point of time during the file write, parallel users should be
> able
> to read the file.
>
>
>
> https://issues.apache.org/jira/browse/HDFS-142?focusedCommentId=12663958&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12663958
>
>  _
>
> From: Ted Dunning [mailto:tdunn...@maprtech.com]
> Sent: Friday, February 11, 2011 2:14 PM
> To: common-user@hadoop.apache.org; gok...@huawei.com
> Cc: hdfs-u...@hadoop.apache.org; dhr...@gmail.com
> Subject: Re: hadoop 0.20 append - some clarifications
>
>
>
> I think that in general, the behavior of any program reading data from an
> HDFS file before hsync or close is called is pretty much undefined.
>
>
>
> If you don't wait until some point where part of the file is defined, you
> can't expect any particular behavior.
>
> On Fri, Feb 11, 2011 at 12:31 AM, Gokulakannan M 
> wrote:
>
> I am not concerned about the sync behavior.
>
> The thing is the reader reading non-flushed(non-synced) data from HDFS as
> you have explained in previous post.(in hadoop 0.20 append branch)
>
> I identified one specific scenario where the above statement is not holding
> true.
>
> Following is how you can reproduce the problem.
>
> 1. add debug point at createBlockOutputStream() method in DFSClient and run
> your HDFS write client in debug mode
>
> 2. allow client to write 1 block to HDFS
>
> 3. for the 2nd block, the flow will come to the debug point mentioned in
> 1(do not execute the createBlockOutputStream() method). hold here.
>
> 4. in parallel, try to read the file from another client
>
> Now you will get an error saying that file cannot be read.
>
>
>
>  _
>
> From: Ted Dunning [mailto:tdunn...@maprtech.com]
> Sent: Friday, February 11, 2011 11:04 AM
> To: gok...@huawei.com
> Cc: common-user@hadoop.apache.org; hdfs-u...@hadoop.apache.org;
> c...@boudnik.org
> Subject: Re: hadoop 0.20 append - some clarifications
>
>
>
> It is a bit confusing.
>
>
>
> SequenceFile.Writer#sync isn't really sync.
>
>
>
> There is SequenceFile.Writer#syncFs which is more what you might expect to
> be sync.
>
>
>
> Then there is HADOOP-6313 which specifies hflush and hsync.  Generally, if
> you want portable code, you have to reflect a bit to figure out what can be
> done.
>
> On Thu, Feb 10, 2011 at 8:38 PM, Gokulakannan M  wrote:
>
> Thanks Ted for clarifying.
>
> So the sync is to just flush the current buffers to datanode and persist
> the
> block info in namenode once per block, isn't it?
>
>
>
> Regarding the reader being able to see unflushed data, I faced an issue in the
> following scenario:
>
> 1. a writer is writing a 10MB file(block size 2 MB)
>
> 2. wrote the file upto 4MB (2 finalized blocks in current and nothing in
> blocksBeingWritten directory in DN) . So 2 blocks are written
>
> 3. client calls addBlock for the 3rd block on namenode and not yet created
> outputstream to DN(or written anything to DN). At this point of time, the
> namenode knows about the 3rd block but the datanode doesn't.
>
> 4. at point 3, a reader is trying to read the file and he is getting
> exception and not able to read the file as the datanode's getBlockInfo
> returns null to the client(of course DN doesn't know about the 3rd block
> yet)
>
> In this situation the reader cannot see the file. But when the block
> writing
> is in progress , the read is successful.
>
> Is this a bug that needs to be handled in append branch?
>
>
>
> >> -Original Message-
> >> From: Konstantin Boudnik [mailto:c...@boudnik.org]
> >> Sent: Friday, February 11, 2011 4:09 AM
> >>To: common-user@hadoop.apache.org
> >> Subj

RE: hadoop 0.20 append - some clarifications

2011-02-14 Thread Gokulakannan M
I agree that HDFS doesn't strongly follow POSIX semantics. But it would have
been better if this issue were fixed.

 

  _  

From: Ted Dunning [mailto:tdunn...@maprtech.com] 
Sent: Monday, February 14, 2011 10:18 PM
To: gok...@huawei.com
Cc: common-user@hadoop.apache.org; hdfs-u...@hadoop.apache.org;
dhr...@gmail.com
Subject: Re: hadoop 0.20 append - some clarifications

 

HDFS definitely doesn't follow anything like POSIX file semantics.

 

They may be a vague inspiration for what HDFS does, but generally the
behavior of HDFS is not tightly specified.  Even the unit tests have some
real surprising behavior.

On Mon, Feb 14, 2011 at 7:21 AM, Gokulakannan M  wrote:

 

>> I think that in general, the behavior of any program reading data from an
HDFS file before hsync or close is called is pretty much undefined.

 

In Unix, users can read a file in parallel while another user is writing it.
And I suppose the sync feature design is based on that.

So at any point of time during the file write, parallel users should be able
to read the file.

 

https://issues.apache.org/jira/browse/HDFS-142?focusedCommentId=12663958&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12663958

  _  

From: Ted Dunning [mailto:tdunn...@maprtech.com] 
Sent: Friday, February 11, 2011 2:14 PM
To: common-user@hadoop.apache.org; gok...@huawei.com
Cc: hdfs-u...@hadoop.apache.org; dhr...@gmail.com
Subject: Re: hadoop 0.20 append - some clarifications

 

I think that in general, the behavior of any program reading data from an
HDFS file before hsync or close is called is pretty much undefined.

 

If you don't wait until some point where part of the file is defined, you
can't expect any particular behavior.

On Fri, Feb 11, 2011 at 12:31 AM, Gokulakannan M  wrote:

I am not concerned about the sync behavior.

The thing is the reader reading non-flushed(non-synced) data from HDFS as
you have explained in previous post.(in hadoop 0.20 append branch)

I identified one specific scenario where the above statement is not holding
true.

Following is how you can reproduce the problem.

1. add debug point at createBlockOutputStream() method in DFSClient and run
your HDFS write client in debug mode

2. allow client to write 1 block to HDFS

3. for the 2nd block, the flow will come to the debug point mentioned in
1(do not execute the createBlockOutputStream() method). hold here.

4. in parallel, try to read the file from another client

Now you will get an error saying that file cannot be read.
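For reference, a minimal sketch of the reader side of step 4 might look like the
following. It is illustrative only, not code from this thread: the path and class
name are placeholders, and it assumes a 0.20-append client pointed at the same
cluster as the paused writer.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelReader {
  public static void main(String[] args) throws IOException {
    // Placeholder path: whatever file the writer paused in step 3 is writing.
    Path file = new Path("/user/test/append-test-file");
    FileSystem fs = FileSystem.get(new Configuration());
    try {
      // With the writer paused as in step 3, the open or the read is expected
      // to fail: the NN knows about the 2nd block but no DN has it yet.
      FSDataInputStream in = fs.open(file);
      byte[] buf = new byte[4096];
      int n = in.read(buf);
      System.out.println("read " + n + " bytes");
      in.close();
    } catch (IOException e) {
      System.err.println("read failed, as described above: " + e);
    }
  }
}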



 _

From: Ted Dunning [mailto:tdunn...@maprtech.com]
Sent: Friday, February 11, 2011 11:04 AM
To: gok...@huawei.com
Cc: common-user@hadoop.apache.org; hdfs-u...@hadoop.apache.org;
c...@boudnik.org
Subject: Re: hadoop 0.20 append - some clarifications



It is a bit confusing.



SequenceFile.Writer#sync isn't really sync.



There is SequenceFile.Writer#syncFs which is more what you might expect to
be sync.



Then there is HADOOP-6313 which specifies hflush and hsync.  Generally, if
you want portable code, you have to reflect a bit to figure out what can be
done.
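As an illustration only (not code from this thread), a hedged sketch of that
reflection approach for a SequenceFile.Writer could look like this. It assumes you
want syncFs() when the running Hadoop build has it (0.20-append and later), and it
deliberately does not fall back to Writer#sync(), which only writes a sync marker:

import java.lang.reflect.Method;
import org.apache.hadoop.io.SequenceFile;

public final class PortableSyncFs {
  // Returns true if this Hadoop build has SequenceFile.Writer#syncFs() and it
  // was invoked; returns false on builds that lack it.
  public static boolean syncFsIfAvailable(SequenceFile.Writer writer) throws Exception {
    try {
      Method syncFs = SequenceFile.Writer.class.getMethod("syncFs");
      syncFs.invoke(writer);   // flushes the underlying stream out to the datanodes
      return true;
    } catch (NoSuchMethodException notInThisRelease) {
      return false;            // plain 0.20: no syncFs(); the caller must decide what to do
    }
  }
}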

On Thu, Feb 10, 2011 at 8:38 PM, Gokulakannan M  wrote:

Thanks Ted for clarifying.

So the sync is to just flush the current buffers to datanode and persist the
block info in namenode once per block, isn't it?



Regarding the reader being able to see unflushed data, I faced an issue in the
following scenario:

1. a writer is writing a 10MB file(block size 2 MB)

2. wrote the file upto 4MB (2 finalized blocks in current and nothing in
blocksBeingWritten directory in DN) . So 2 blocks are written

3. client calls addBlock for the 3rd block on namenode and not yet created
outputstream to DN(or written anything to DN). At this point of time, the
namenode knows about the 3rd block but the datanode doesn't.

4. at point 3, a reader is trying to read the file and he is getting
exception and not able to read the file as the datanode's getBlockInfo
returns null to the client(of course DN doesn't know about the 3rd block
yet)

In this situation the reader cannot see the file. But when the block writing
is in progress , the read is successful.

Is this a bug that needs to be handled in append branch?



>> -Original Message-
>> From: Konstantin Boudnik [mailto:c...@boudnik.org]
>> Sent: Friday, February 11, 2011 4:09 AM
>>To: common-user@hadoop.apache.org
>> Subject: Re: hadoop 0.20 append - some clarifications

>> You might also want to check append design doc published at HDFS-265



I was asking about the hadoop 0.20 append branch. I suppose HDFS-265's
design doc won't apply to it.



 _

From: Ted Dunning [mailto:tdunn...@maprtech.com]
Sent: Thursday, February 10, 2011 9:29 PM
To: common-user@hadoop.apache.org; gok...@huawei.com
Cc: hdfs-u...@hadoop.apache.org
Subject: Re: hadoop 0.20 app

Re: Is this a fair summary of HDFS failover?

2011-02-14 Thread Mark Kerzner
I completely agree, and I am using your and the group's postings to define
the direction and approaches, but I am also trying every solution - and I am
beginning to do just that with the AvatarNode now.

Thank you,
Mark

On Mon, Feb 14, 2011 at 4:43 PM, M. C. Srivas  wrote:

> I understand you are writing a book "Hadoop in Practice".  If so, it's
> important that what's recommended in the book be verified in
> practice. (I mean, beyond simply posting in this newsgroup - for instance,
> the recommendations on NN fail-over should be tried out first before
> writing about how to do it). Otherwise you won't know whether your
> recommendations really work.
>
>
>
> On Mon, Feb 14, 2011 at 12:31 PM, Mark Kerzner  >wrote:
>
> > Thank you, M. C. Srivas, that was enormously useful. I understand it now,
> > but just to be complete, I have re-formulated my points according to your
> > comments:
> >
> >   - In 0.20 the Secondary NameNode performs snapshotting. Its data can be
> >   used to recreate the HDFS if the Primary NameNode fails. The procedure
> is
> >   manual and may take hours, and there is also data loss since the last
> >   snapshot;
> >   - In 0.21 there is a Backup Node (HADOOP-4539), which aims to help with
> >   HA and act as a cold spare. The data loss is less than with Secondary
> NN,
> >   but it is still manual and potentially error-prone, and it takes hours;
> >   - There is an AvatarNode patch available for 0.20, and Facebook runs
> its
> >   cluster that way, but the patch submitted to Apache requires testing
> and
> > the
> >   developers adopting it must do some custom configurations and also
> > exercise
> >   care in their work.
> >
> > As a conclusion, when building an HA HDFS cluster, one needs to follow
> the
> > best
> > practices outlined by Tom White
> > <http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf>,
> > and may still need to resort to specialized NFS filers for running the
> > NameNode.
> >
> > Sincerely,
> > Mark
> >
> >
> >
> > On Mon, Feb 14, 2011 at 11:50 AM, M. C. Srivas 
> wrote:
> >
> > > The summary is quite inaccurate.
> > >
> > > On Mon, Feb 14, 2011 at 8:48 AM, Mark Kerzner 
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > is it accurate to say that
> > > >
> > > >   - In 0.20 the Secondary NameNode acts as a cold spare; it can be
> used
> > > to
> > > >   recreate the HDFS if the Primary NameNode fails, but with the delay
> > of
> > > >   minutes if not hours, and there is also some data loss;
> > > >
> > >
> > >
> > > The Secondary NN is not a spare. It is used to augment the work of the
> > > Primary, by offloading some of its work to another machine. The work
> > > offloaded is "log rollup" or "checkpointing". This has been a source of
> > > constant confusion (some named it incorrectly as a "secondary" and now
> we
> > > are stuck with it).
> > >
> > > The Secondary NN certainly cannot take over for the Primary. It is not
> > its
> > > purpose.
> > >
> > > Yes, there is data loss.
> > >
> > >
> > >
> > >
> > > >   - in 0.21 there are streaming edits to a Backup Node (HADOOP-4539),
> > > which
> > > >   replaces the Secondary NameNode. The Backup Node can be used as a
> > warm
> > > >   spare, with the failover being a matter of seconds. There can be
> > > multiple
> > > >   Backup Nodes, for additional insurance against failure, and
> previous
> > > best
> > > >   common practices apply to it;
> > > >
> > >
> > >
> > > There is no "Backup NN" in the manner you are thinking of. It is
> > completely
> > > manual, and requires restart of the "whole world", and takes about 2-3
> > > hours
> > > to happen. If you are lucky, you may have only a little data loss
> (people
> > > have lost entire clusters due to this -- from what I understand, you
> are
> > > far
> > > better off resurrecting the Primary instead of trying to bring up a
> > Backup
> > > NN).
> > >
> > > In any case, when you run it like you mention above, you will have to
> > > (a) make sure that the primary is dead
> > > (b) edit hdfs-site.xml on *every* datanode to point to the new IP
> address
> > > of
> > > the backup, and restart each datanode.
> > > (c) wait for 2-3 hours for all the block-reports from every restarted
> DN
> > to
> > > finish
> > >
> > > 2-3 hrs afterwards:
> > > (d) after that, restart all TT and the JT to connect to the new NN
> > > (e) finally, restart all the clients (eg, HBase, Oozie, etc)
> > >
> > > Many companies, including Yahoo! and Facebook, use a couple of NetApp
> > > filers
> > > to hold the actual data that the NN writes. The two NetApp filers are
> run
> > > in
> > > "HA" mode with NVRAM copying.  But the NN remains a single point of
> > > failure,
> > > and there is probably some data loss.
> > >
> > >
> > >
> > > >   - 0.22 will have further improvements to the HDFS performance, such
> > > >   as HDFS-1093.
> > > >
> > > > Does the paper on HDFS Reliability by Tom
> > > > White<
> > > >
> > http://www.cloudera.com/wp-content/uploads/201

Re: Is this a fair summary of HDFS failover?

2011-02-14 Thread M. C. Srivas
I understand you are writing a book "Hadoop in Practice".  If so, it's
important that what's recommended in the book be verified in
practice. (I mean, beyond simply posting in this newsgroup - for instance,
the recommendations on NN fail-over should be tried out first before writing
about how to do it). Otherwise you won't know whether your recommendations
really work.



On Mon, Feb 14, 2011 at 12:31 PM, Mark Kerzner wrote:

> Thank you, M. C. Srivas, that was enormously useful. I understand it now,
> but just to be complete, I have re-formulated my points according to your
> comments:
>
>   - In 0.20 the Secondary NameNode performs snapshotting. Its data can be
>   used to recreate the HDFS if the Primary NameNode fails. The procedure is
>   manual and may take hours, and there is also data loss since the last
>   snapshot;
>   - In 0.21 there is a Backup Node (HADOOP-4539), which aims to help with
>   HA and act as a cold spare. The data loss is less than with Secondary NN,
>   but it is still manual and potentially error-prone, and it takes hours;
>   - There is an AvatarNode patch available for 0.20, and Facebook runs its
>   cluster that way, but the patch submitted to Apache requires testing and
> the
>   developers adopting it must do some custom configurations and also
> exercise
>   care in their work.
>
> As a conclusion, when building an HA HDFS cluster, one needs to follow the
> best
> practices outlined by Tom White
> <http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf>,
> and may still need to resort to specialized NFS filers for running the
> NameNode.
>
> Sincerely,
> Mark
>
>
>
> On Mon, Feb 14, 2011 at 11:50 AM, M. C. Srivas  wrote:
>
> > The summary is quite inaccurate.
> >
> > On Mon, Feb 14, 2011 at 8:48 AM, Mark Kerzner 
> > wrote:
> >
> > > Hi,
> > >
> > > is it accurate to say that
> > >
> > >   - In 0.20 the Secondary NameNode acts as a cold spare; it can be used
> > to
> > >   recreate the HDFS if the Primary NameNode fails, but with the delay
> of
> > >   minutes if not hours, and there is also some data loss;
> > >
> >
> >
> > The Secondary NN is not a spare. It is used to augment the work of the
> > Primary, by offloading some of its work to another machine. The work
> > offloaded is "log rollup" or "checkpointing". This has been a source of
> > constant confusion (some named it incorrectly as a "secondary" and now we
> > are stuck with it).
> >
> > The Secondary NN certainly cannot take over for the Primary. It is not
> its
> > purpose.
> >
> > Yes, there is data loss.
> >
> >
> >
> >
> > >   - in 0.21 there are streaming edits to a Backup Node (HADOOP-4539),
> > which
> > >   replaces the Secondary NameNode. The Backup Node can be used as a
> warm
> > >   spare, with the failover being a matter of seconds. There can be
> > multiple
> > >   Backup Nodes, for additional insurance against failure, and previous
> > best
> > >   common practices apply to it;
> > >
> >
> >
> > There is no "Backup NN" in the manner you are thinking of. It is
> completely
> > manual, and requires restart of the "whole world", and takes about 2-3
> > hours
> > to happen. If you are lucky, you may have only a little data loss (people
> > have lost entire clusters due to this -- from what I understand, you are
> > far
> > better off resurrecting the Primary instead of trying to bring up a
> Backup
> > NN).
> >
> > In any case, when you run it like you mention above, you will have to
> > (a) make sure that the primary is dead
> > (b) edit hdfs-site.xml on *every* datanode to point to the new IP address
> > of
> > the backup, and restart each datanode.
> > (c) wait for 2-3 hours for all the block-reports from every restarted DN
> to
> > finish
> >
> > 2-3 hrs afterwards:
> > (d) after that, restart all TT and the JT to connect to the new NN
> > (e) finally, restart all the clients (eg, HBase, Oozie, etc)
> >
> > Many companies, including Yahoo! and Facebook, use a couple of NetApp
> > filers
> > to hold the actual data that the NN writes. The two NetApp filers are run
> > in
> > "HA" mode with NVRAM copying.  But the NN remains a single point of
> > failure,
> > and there is probably some data loss.
> >
> >
> >
> > >   - 0.22 will have further improvements to the HDFS performance, such
> > >   as HDFS-1093.
> > >
> > > Does the paper on HDFS Reliability by Tom White
> > > <http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf>
> > > still represent the current state of things?
> > >
> >
> >
> > See Dhruba's blog-post about the Avatar NN + some custom "stackable HDFS"
> > code on all the clients + Zookeeper + the dual NetApp filers.
> >
> > It helps Facebook do manual, controlled, fail-over during software
> > upgrades,
> > at the cost of some performance loss on the DataNodes (the DataNodes have
> > to
> > do 2x block reports, and each block-report is expensive, so it limits the
> > DataNode a bit).  The article does not talk about dataloss when

Re: Is this a fair summary of HDFS failover?

2011-02-14 Thread Ted Dunning
Note that the document purports to be from 2008 and, at best, was uploaded just
about a year ago.

That it is still pretty accurate is kind of a tribute to either the
stability of hbase or the stagnation, depending on how you read it.

On Mon, Feb 14, 2011 at 12:31 PM, Mark Kerzner wrote:

> As a conclusion, when building an HA HDFS cluster, one needs to follow the
> best
> practices outlined by Tom White
> <http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf>,
> and may still need to resort to specialized NFS filers for running the
> NameNode.
>


Re: Is this a fair summary of HDFS failover?

2011-02-14 Thread Mark Kerzner
Thank you, M. C. Srivas, that was enormously useful. I understand it now,
but just to be complete, I have re-formulated my points according to your
comments:

   - In 0.20 the Secondary NameNode performs snapshotting. Its data can be
   used to recreate the HDFS if the Primary NameNode fails. The procedure is
   manual and may take hours, and there is also data loss since the last
   snapshot;
   - In 0.21 there is a Backup Node (HADOOP-4539), which aims to help with
   HA and act as a cold spare. The data loss is less than with Secondary NN,
   but it is still manual and potentially error-prone, and it takes hours;
   - There is an AvatarNode patch available for 0.20, and Facebook runs its
   cluster that way, but the patch submitted to Apache requires testing and the
   developers adopting it must do some custom configurations and also exercise
   care in their work.

As a conclusion, when building an HA HDFS cluster, one needs to follow the best
practices outlined by Tom White
<http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf>,
and may still need to resort to specialized NFS filers for running the
NameNode.

Sincerely,
Mark



On Mon, Feb 14, 2011 at 11:50 AM, M. C. Srivas  wrote:

> The summary is quite inaccurate.
>
> On Mon, Feb 14, 2011 at 8:48 AM, Mark Kerzner 
> wrote:
>
> > Hi,
> >
> > is it accurate to say that
> >
> >   - In 0.20 the Secondary NameNode acts as a cold spare; it can be used
> to
> >   recreate the HDFS if the Primary NameNode fails, but with the delay of
> >   minutes if not hours, and there is also some data loss;
> >
>
>
> The Secondary NN is not a spare. It is used to augment the work of the
> Primary, by offloading some of its work to another machine. The work
> offloaded is "log rollup" or "checkpointing". This has been a source of
> constant confusion (some named it incorrectly as a "secondary" and now we
> are stuck with it).
>
> The Secondary NN certainly cannot take over for the Primary. It is not its
> purpose.
>
> Yes, there is data loss.
>
>
>
>
> >   - in 0.21 there are streaming edits to a Backup Node (HADOOP-4539),
> which
> >   replaces the Secondary NameNode. The Backup Node can be used as a warm
> >   spare, with the failover being a matter of seconds. There can be
> multiple
> >   Backup Nodes, for additional insurance against failure, and previous
> best
> >   common practices apply to it;
> >
>
>
> There is no "Backup NN" in the manner you are thinking of. It is completely
> manual, and requires restart of the "whole world", and takes about 2-3
> hours
> to happen. If you are lucky, you may have only a little data loss (people
> have lost entire clusters due to this -- from what I understand, you are
> far
> better off resurrecting the Primary instead of trying to bring up a Backup
> NN).
>
> In any case, when you run it like you mention above, you will have to
> (a) make sure that the primary is dead
> (b) edit hdfs-site.xml on *every* datanode to point to the new IP address
> of
> the backup, and restart each datanode.
> (c) wait for 2-3 hours for all the block-reports from every restarted DN to
> finish
>
> 2-3 hrs afterwards:
> (d) after that, restart all TT and the JT to connect to the new NN
> (e) finally, restart all the clients (eg, HBase, Oozie, etc)
>
> Many companies, including Yahoo! and Facebook, use a couple of NetApp
> filers
> to hold the actual data that the NN writes. The two NetApp filers are run
> in
> "HA" mode with NVRAM copying.  But the NN remains a single point of
> failure,
> and there is probably some data loss.
>
>
>
> >   - 0.22 will have further improvements to the HDFS performance, such
> >   as HDFS-1093.
> >
> > Does the paper on HDFS Reliability by Tom White
> > <http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf>
> > still represent the current state of things?
> >
>
>
> See Dhruba's blog-post about the Avatar NN + some custom "stackable HDFS"
> code on all the clients + Zookeeper + the dual NetApp filers.
>
> It helps Facebook do manual, controlled, fail-over during software
> upgrades,
> at the cost of some performance loss on the DataNodes (the DataNodes have
> to
> do 2x block reports, and each block-report is expensive, so it limits the
> DataNode a bit).  The article does not talk about dataloss when the
> fail-over is initiated manually, so I don't know about that.
>
>
> http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html
>
>
>
>
> >
> > Thank you. Sincerely,
> > Mark
> >
>


Re: Debugging and fixing Safemode

2011-02-14 Thread Matthew Foley
Hi Sandhya,
the threshold for leaving safemode automatically is configurable; it defaults 
to 0.999, but you can change parameter "dfs.namenode.safemode.threshold-pct" to 
a different floating-point number in your config.  It is set to almost 100% by 
default, on the theory that (a) if you didn't hit 100% it means some of your 
datanodes didn't come up or suffered data loss, and (b) you might want to know 
about that before letting the cluster start writing and changing files.

When the cluster comes out of safe mode, it should automatically fix any 
under-replicated blocks; you don't need to take action to fix them yourself.  
But any files that are damaged by loss of ALL replicas of a block will appear 
corrupted to applications.

You can run dfs fsck to identify problem files, and move them to lost+found or 
delete them.
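To make that concrete, a minimal sequence with the stock 0.20 command-line tools
might look like the following (the grep pattern is approximate; check your fsck
output for the exact wording):

% hadoop dfsadmin -safemode get                     # is the NN still in safe mode?
% hadoop fsck / -files -blocks -locations > /tmp/fsck.txt
% grep -i "under replicated" /tmp/fsck.txt          # which files/blocks are short
% hadoop fsck / -move                               # or -delete: handle files that
                                                    # lost ALL replicas of a block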

Hope this helps,
--Matt


On Feb 11, 2011, at 11:53 AM, Edupuganti, Sandhya wrote:

Our Namenode is going into Safemode after every restart. It reports the ratio 
to be .98xxx whereas it is looking for 0.999 to leave the safe mode. So I'm 
guessing there must be one or two files that are under replicated.

Is there any way I can find out which files are under-replicated, so that I can
re-copy them if I have them, or delete them.

I don't want to end up with a silent NameNode stuck in safemode next time,
causing all our jobs to fail.

Any pointers will be greatly appreciated

Many Thanks
Sandhya



Check/Compare mappers output

2011-02-14 Thread maha
Hi all,

  I want to know how I can check/compare my mappers' keys and values.

Example:

  My Mappers have the following to filter documents before being output:

  String doc1;

  if (!doc1.equals("d1"))
      output.collect(new Text('#' + doc1 + '#'),
          new Text('#' + word1.substring(word1.indexOf(',') + 1, word1.indexOf('>')) + '#'));
 

 Yet the intermediate output still includes "d1":
#d1##1#
#d1##2#
#d1##1#
#d1##5#
#d1##3#
..

  I put '#' to see if there was a space or newline included. Any ideas?

Thank you,

Maha

Re: Problem with running the job, no default queue

2011-02-14 Thread Koji Noguchi
Hi Shivani,

You probably don't want to ask M45-specific questions on the hadoop.apache mailing
list.

Try

% hadoop queue -showacls

It should show which queues you're allowed to submit to. If it doesn't give you
any queues, you need to request one.
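If the job is your own code, you can also set the queue programmatically; here is a
minimal sketch with the old mapred API (the queue name "myqueue" is a placeholder
for whatever -showacls reports you may submit to):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class QueueExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(QueueExample.class);
    conf.setQueueName("myqueue");   // same effect as -Dmapred.job.queue.name=myqueue
    // ... set input/output paths, mapper and reducer as usual ...
    JobClient.runJob(conf);
  }
}

For the bundled examples (which go through ToolRunner), passing
-Dmapred.job.queue.name=myqueue right after the program name should have the same
effect.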

Koji



On 2/9/11 9:10 PM, "Shivani Rao"  wrote:

I tried a simple example job on Yahoo! M45. The job fails because a default
queue does not exist.
Output is attached below. From the Apache Hadoop mailing list, I found this
post (specific to M45) that addressed this problem by setting the property
-Dmapred.job.queue.name=*myqueue*
(http://web.archiveorange.com/archive/v/3inw3ySGHmNRR9Bm14Uv)

There is also documentation for the capacity scheduler, but I do not have
write access to the files in the conf directory, so I do not know how I can
configure the capacity scheduler there.

I am also posting this question on the general lists, just in case.

$hadoop jar /grid/0/gs/hadoop/current/hadoop-examples.jar pi 10 1
Number of Maps  = 10
Samples per Map = 1
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
11/02/10 04:19:22 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 
83705 for sgrao
11/02/10 04:19:22 INFO security.TokenCache: Got dt for 
hdfs://grit-nn1.yahooresearchcluster.com/user/sgrao/.staging/job_201101150035_26053;uri=68.180.138.10:8020;t.service=68.180.138.10:8020
11/02/10 04:19:22 INFO mapred.FileInputFormat: Total input paths to process : 10
11/02/10 04:19:23 INFO mapred.JobClient: Cleaning up the staging area 
hdfs://grit-nn1.yahooresearchcluster.com/user/sgrao/.staging/job_201101150035_26053
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Queue "default" 
does not exist
at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3680)
at sun.reflect.GeneratedMethodAccessor32.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1301)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1297)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)

at org.apache.hadoop.ipc.Client.call(Client.java:951)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:223)
at org.apache.hadoop.mapred.$Proxy6.submitJob(Unknown Source)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:818)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:752)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:752)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:726)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1156)
at org.apache.hadoop.examples.PiEstimator.estimate(PiEstimator.java:297)
at org.apache.hadoop.examples.PiEstimator.run(PiEstimator.java:342)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.examples.PiEstimator.main(PiEstimator.java:351)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)



Re: something doubts about map/reduce

2011-02-14 Thread Harsh J
Hello,

2011/2/14 Wang LiChuan[王立传] :
> So my question is this:
>
> If I return "false" from "isSplitable" to tell the framework that a file is
> non-splittable, how many map tasks will I have when doing map/reduce?
> Will I have 100 maps, each one handling a file? Or do you have any
> suggestions on how to handle this kind of situation?

Yes, this is what should happen if the splits aren't made. InputSplits
are created for the full length of each input file in this case.
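For concreteness, a minimal sketch of such an input format with the old mapred API
(the base class TextInputFormat is only an example here, since the original format
isn't shown):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// With isSplitable() returning false, the framework builds one InputSplit that
// covers each whole file, so 100 input files give 100 map tasks, each map
// reading its file from beginning to end.
public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}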

-- 
Harsh J
www.harshj.com


something doubts about map/reduce

2011-02-14 Thread Wang LiChuan [王立传]
HI My Friends:

 I have been researching Hadoop and map/reduce, but before I can go on
I have one question, and I can't find it in the FAQ. Please consider this situation:

1.   I created 100 files, each of course bigger than the default 64MB
block size (such as 1GB), so each will definitely be split into many blocks in HDFS

2.   When handling each file, I want to treat the file as non-splittable,
because when handling one part of a file I need all of the other parts

So my question is this:

If I return “false” from “isSplitable” to tell the framework that a file is
non-splittable, how many map tasks will I have when doing map/reduce?
Will I have 100 maps, each one handling a file? Or do you have any
suggestions on how to handle this kind of situation?

Thanks!

Best wishes,

Yours 

Wang LiChuan

2011-2-14



Re: HBase crashes when one server goes down

2011-02-14 Thread Jean-Daniel Cryans
Please use the hbase mailing list for HBase-related questions.

Regarding your issue, we'll need more information to help you out.
Have you checked the logs? If you see exceptions in there, did you
google them trying to figure out what's going on?

Finally, does your setup meet all the requirements?
http://hbase.apache.org/notsoquick.html#requirements

J-D

On Mon, Feb 14, 2011 at 9:49 AM, Rodrigo Barreto  wrote:
> Hi,
>
> We are new with Hadoop, we have just configured a cluster with 3 servers and
> everything is working OK except when one server goes down: Hadoop/HDFS
> continues working but HBase stops, and queries do not return results
> until we restart HBase. The HBase configuration is copied below; please
> help us.
>
> ## HBASE-SITE.XML ###
>
> 
>        
> <configuration>
>   <property>
>     <name>hbase.zookeeper.quorum</name>
>     <value>master,slave1,slave2</value>
>     <description>The directory shared by region servers.</description>
>   </property>
>   <property>
>     <name>hbase.rootdir</name>
>     <value>hdfs://master:54310/hbase</value>
>   </property>
>   <property>
>     <name>hbase.cluster.distributed</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>hbase.master</name>
>     <value>master:6</value>
>     <description>The host and port that the HBase master runs at.</description>
>   </property>
>   <property>
>     <name>dfs.replication</name>
>     <value>2</value>
>     <description>Default block replication. The actual number of replications
>     can be specified when the file is created. The default is used if
>     replication is not specified in create time.</description>
>   </property>
> </configuration>
>
>
> Thanks,
>
> Rodrigo Barreto.
>


Re: Is this a fair summary of HDFS failover?

2011-02-14 Thread M. C. Srivas
The summary is quite inaccurate.

On Mon, Feb 14, 2011 at 8:48 AM, Mark Kerzner  wrote:

> Hi,
>
> is it accurate to say that
>
>   - In 0.20 the Secondary NameNode acts as a cold spare; it can be used to
>   recreate the HDFS if the Primary NameNode fails, but with the delay of
>   minutes if not hours, and there is also some data loss;
>


The Secondary NN is not a spare. It is used to augment the work of the
Primary, by offloading some of its work to another machine. The work
offloaded is "log rollup" or "checkpointing". This has been a source of
constant confusion (some named it incorrectly as a "secondary" and now we
are stuck with it).

The Secondary NN certainly cannot take over for the Primary. It is not its
purpose.

Yes, there is data loss.




>   - in 0.21 there are streaming edits to a Backup Node (HADOOP-4539), which
>   replaces the Secondary NameNode. The Backup Node can be used as a warm
>   spare, with the failover being a matter of seconds. There can be multiple
>   Backup Nodes, for additional insurance against failure, and previous best
>   common practices apply to it;
>


There is no "Backup NN" in the manner you are thinking of. It is completely
manual, and requires restart of the "whole world", and takes about 2-3 hours
to happen. If you are lucky, you may have only a little data loss (people
have lost entire clusters due to this -- from what I understand, you are far
better off resurrecting the Primary instead of trying to bring up a Backup
NN).

In any case, when you run it like you mention above, you will have to
(a) make sure that the primary is dead
(b) edit hdfs-site.xml on *every* datanode to point to the new IP address of
the backup, and restart each datanode.
(c) wait for 2-3 hours for all the block-reports from every restarted DN to
finish

2-3 hrs afterwards:
(d) after that, restart all TT and the JT to connect to the new NN
(e) finally, restart all the clients (eg, HBase, Oozie, etc)

Many companies, including Yahoo! and Facebook, use a couple of NetApp filers
to hold the actual data that the NN writes. The two NetApp filers are run in
"HA" mode with NVRAM copying.  But the NN remains a single point of failure,
and there is probably some data loss.



>   - 0.22 will have further improvements to the HDFS performance, such
>   as HDFS-1093.
>
> Does the paper on HDFS Reliability by Tom White
> <http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf>
> still represent the current state of things?
>


See Dhruba's blog-post about the Avatar NN + some custom "stackable HDFS"
code on all the clients + Zookeeper + the dual NetApp filers.

It helps Facebook do manual, controlled, fail-over during software upgrades,
at the cost of some performance loss on the DataNodes (the DataNodes have to
do 2x block reports, and each block-report is expensive, so it limits the
DataNode a bit).  The article does not talk about dataloss when the
fail-over is initiated manually, so I don't know about that.

http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html




>
> Thank you. Sincerely,
> Mark
>


HBase crashes when one server goes down

2011-02-14 Thread Rodrigo Barreto
Hi,

We are new with Hadoop, we have just configured a cluster with 3 servers and
everything is working OK except when one server goes down: Hadoop/HDFS
continues working but HBase stops, and queries do not return results
until we restart HBase. The HBase configuration is copied below; please
help us.

## HBASE-SITE.XML ###



<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>master,slave1,slave2</value>
    <description>The directory shared by region servers.</description>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master:54310/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.master</name>
    <value>master:6</value>
    <description>The host and port that the HBase master runs at.</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>Default block replication. The actual number of replications
    can be specified when the file is created. The default is used if
    replication is not specified in create time.</description>
  </property>
</configuration>


Thanks,

Rodrigo Barreto.


Re: Reduce Failed at Tasktrackers

2011-02-14 Thread Jairam Chandar
Did you get a response/solution/workaround to this problem?
I am getting the same error. 

-Jairam





Is this a fair summary of HDFS failover?

2011-02-14 Thread Mark Kerzner
Hi,

is it accurate to say that

   - In 0.20 the Secondary NameNode acts as a cold spare; it can be used to
   recreate the HDFS if the Primary NameNode fails, but with the delay of
   minutes if not hours, and there is also some data loss;
   - in 0.21 there are streaming edits to a Backup Node (HADOOP-4539), which
   replaces the Secondary NameNode. The Backup Node can be used as a warm
   spare, with the failover being a matter of seconds. There can be multiple
   Backup Nodes, for additional insurance against failure, and previous best
   common practices apply to it;
   - 0.22 will have further improvements to the HDFS performance, such
   as HDFS-1093.

Does the paper on HDFS Reliability by Tom White
<http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf> still
represent the current state of things?

Thank you. Sincerely,
Mark


Re: hadoop 0.20 append - some clarifications

2011-02-14 Thread Ted Dunning
HDFS definitely doesn't follow anything like POSIX file semantics.

They may be a vague inspiration for what HDFS does, but generally the
behavior of HDFS is not tightly specified.  Even the unit tests have some
real surprising behavior.

On Mon, Feb 14, 2011 at 7:21 AM, Gokulakannan M  wrote:

>
>
> >> I think that in general, the behavior of any program reading data from
> an HDFS file before hsync or close is called is pretty much undefined.
>
>
>
> In Unix, users can read a file in parallel while another user is writing it.
> And I suppose the sync feature design is based on that.
>
> So at any point of time during the file write, parallel users should be
> able to read the file.
>
>
>
>
> https://issues.apache.org/jira/browse/HDFS-142?focusedCommentId=12663958&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12663958
>  --
>
> *From:* Ted Dunning [mailto:tdunn...@maprtech.com]
> *Sent:* Friday, February 11, 2011 2:14 PM
> *To:* common-user@hadoop.apache.org; gok...@huawei.com
> *Cc:* hdfs-u...@hadoop.apache.org; dhr...@gmail.com
> *Subject:* Re: hadoop 0.20 append - some clarifications
>
>
>
> I think that in general, the behavior of any program reading data from an
> HDFS file before hsync or close is called is pretty much undefined.
>
>
>
> If you don't wait until some point where part of the file is defined, you
> can't expect any particular behavior.
>
> On Fri, Feb 11, 2011 at 12:31 AM, Gokulakannan M 
> wrote:
>
> I am not concerned about the sync behavior.
>
> The thing is the reader reading non-flushed(non-synced) data from HDFS as
> you have explained in previous post.(in hadoop 0.20 append branch)
>
> I identified one specific scenario where the above statement is not holding
> true.
>
> Following is how you can reproduce the problem.
>
> 1. add debug point at createBlockOutputStream() method in DFSClient and run
> your HDFS write client in debug mode
>
> 2. allow client to write 1 block to HDFS
>
> 3. for the 2nd block, the flow will come to the debug point mentioned in
> 1(do not execute the createBlockOutputStream() method). hold here.
>
> 4. in parallel, try to read the file from another client
>
> Now you will get an error saying that file cannot be read.
>
>
>
>  _
>
> From: Ted Dunning [mailto:tdunn...@maprtech.com]
> Sent: Friday, February 11, 2011 11:04 AM
> To: gok...@huawei.com
> Cc: common-user@hadoop.apache.org; hdfs-u...@hadoop.apache.org;
> c...@boudnik.org
> Subject: Re: hadoop 0.20 append - some clarifications
>
>
>
> It is a bit confusing.
>
>
>
> SequenceFile.Writer#sync isn't really sync.
>
>
>
> There is SequenceFile.Writer#syncFs which is more what you might expect to
> be sync.
>
>
>
> Then there is HADOOP-6313 which specifies hflush and hsync.  Generally, if
> you want portable code, you have to reflect a bit to figure out what can be
> done.
>
> On Thu, Feb 10, 2011 at 8:38 PM, Gokulakannan M  wrote:
>
> Thanks Ted for clarifying.
>
> So the sync is to just flush the current buffers to datanode and persist
> the
> block info in namenode once per block, isn't it?
>
>
>
> Regarding the reader being able to see unflushed data, I faced an issue in the
> following scenario:
>
> 1. a writer is writing a 10MB file(block size 2 MB)
>
> 2. wrote the file upto 4MB (2 finalized blocks in current and nothing in
> blocksBeingWritten directory in DN) . So 2 blocks are written
>
> 3. client calls addBlock for the 3rd block on namenode and not yet created
> outputstream to DN(or written anything to DN). At this point of time, the
> namenode knows about the 3rd block but the datanode doesn't.
>
> 4. at point 3, a reader is trying to read the file and he is getting
> exception and not able to read the file as the datanode's getBlockInfo
> returns null to the client(of course DN doesn't know about the 3rd block
> yet)
>
> In this situation the reader cannot see the file. But when the block
> writing
> is in progress , the read is successful.
>
> Is this a bug that needs to be handled in append branch?
>
>
>
> >> -Original Message-
> >> From: Konstantin Boudnik [mailto:c...@boudnik.org]
> >> Sent: Friday, February 11, 2011 4:09 AM
> >>To: common-user@hadoop.apache.org
> >> Subject: Re: hadoop 0.20 append - some clarifications
>
> >> You might also want to check append design doc published at HDFS-265
>
>
>
> I was asking about the hadoop 0.20 append branch. I suppose HDFS-265's
> design doc won't apply to it.
>
>
>
>  _
>
> From: Ted Dunning [mailto:tdunn...@maprtech.com]
> Sent: Thursday, February 10, 2011 9:29 PM
> To: common-user@hadoop.apache.org; gok...@huawei.com
> Cc: hdfs-u...@hadoop.apache.org
> Subject: Re: hadoop 0.20 append - some clarifications
>
>
>
> Correct is a strong word here.
>
>
>
> There is actually an HDFS unit test that checks to see if partially written
> and unflushed data is visible.  The basic rule of thumb is that you need to
> synchronize readers and writers outside of HDFS.  There is no guar

RE: hadoop 0.20 append - some clarifications

2011-02-14 Thread Gokulakannan M
 

>> I think that in general, the behavior of any program reading data from an
HDFS file before hsync or close is called is pretty much undefined.

 

In Unix, users can read a file in parallel while another user is writing it.
And I suppose the sync feature design is based on that.

So at any point of time during the file write, parallel users should be able
to read the file.

 

https://issues.apache.org/jira/browse/HDFS-142?focusedCommentId=12663958&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12663958

  _  

From: Ted Dunning [mailto:tdunn...@maprtech.com] 
Sent: Friday, February 11, 2011 2:14 PM
To: common-user@hadoop.apache.org; gok...@huawei.com
Cc: hdfs-u...@hadoop.apache.org; dhr...@gmail.com
Subject: Re: hadoop 0.20 append - some clarifications

 

I think that in general, the behavior of any program reading data from an
HDFS file before hsync or close is called is pretty much undefined.

 

If you don't wait until some point where part of the file is defined, you
can't expect any particular behavior.

On Fri, Feb 11, 2011 at 12:31 AM, Gokulakannan M  wrote:

I am not concerned about the sync behavior.

The thing is the reader reading non-flushed(non-synced) data from HDFS as
you have explained in previous post.(in hadoop 0.20 append branch)

I identified one specific scenario where the above statement is not holding
true.

Following is how you can reproduce the problem.

1. add debug point at createBlockOutputStream() method in DFSClient and run
your HDFS write client in debug mode

2. allow client to write 1 block to HDFS

3. for the 2nd block, the flow will come to the debug point mentioned in
1(do not execute the createBlockOutputStream() method). hold here.

4. in parallel, try to read the file from another client

Now you will get an error saying that file cannot be read.



 _

From: Ted Dunning [mailto:tdunn...@maprtech.com]
Sent: Friday, February 11, 2011 11:04 AM
To: gok...@huawei.com
Cc: common-user@hadoop.apache.org; hdfs-u...@hadoop.apache.org;
c...@boudnik.org
Subject: Re: hadoop 0.20 append - some clarifications



It is a bit confusing.



SequenceFile.Writer#sync isn't really sync.



There is SequenceFile.Writer#syncFs which is more what you might expect to
be sync.



Then there is HADOOP-6313 which specifies hflush and hsync.  Generally, if
you want portable code, you have to reflect a bit to figure out what can be
done.

On Thu, Feb 10, 2011 at 8:38 PM, Gokulakannan M  wrote:

Thanks Ted for clarifying.

So the sync is to just flush the current buffers to datanode and persist the
block info in namenode once per block, isn't it?



Regarding the reader being able to see unflushed data, I faced an issue in the
following scenario:

1. a writer is writing a 10MB file(block size 2 MB)

2. wrote the file upto 4MB (2 finalized blocks in current and nothing in
blocksBeingWritten directory in DN) . So 2 blocks are written

3. client calls addBlock for the 3rd block on namenode and not yet created
outputstream to DN(or written anything to DN). At this point of time, the
namenode knows about the 3rd block but the datanode doesn't.

4. at point 3, a reader is trying to read the file and he is getting
exception and not able to read the file as the datanode's getBlockInfo
returns null to the client(of course DN doesn't know about the 3rd block
yet)

In this situation the reader cannot see the file. But when the block writing
is in progress , the read is successful.

Is this a bug that needs to be handled in append branch?



>> -Original Message-
>> From: Konstantin Boudnik [mailto:c...@boudnik.org]
>> Sent: Friday, February 11, 2011 4:09 AM
>>To: common-user@hadoop.apache.org
>> Subject: Re: hadoop 0.20 append - some clarifications

>> You might also want to check append design doc published at HDFS-265



I was asking about the hadoop 0.20 append branch. I suppose HDFS-265's
design doc won't apply to it.



 _

From: Ted Dunning [mailto:tdunn...@maprtech.com]
Sent: Thursday, February 10, 2011 9:29 PM
To: common-user@hadoop.apache.org; gok...@huawei.com
Cc: hdfs-u...@hadoop.apache.org
Subject: Re: hadoop 0.20 append - some clarifications



Correct is a strong word here.



There is actually an HDFS unit test that checks to see if partially written
and unflushed data is visible.  The basic rule of thumb is that you need to
synchronize readers and writers outside of HDFS.  There is no guarantee that
data is visible or invisible after writing, but there is a guarantee that it
will become visible after sync or close.
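A hedged sketch of that guarantee on the writer side (API names as on the
0.20-append branch; the path is a placeholder, not from this thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VisibleAfterSync {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path("/user/test/visible"), true);
    byte[] record = "one record\n".getBytes("UTF-8");
    out.write(record);   // readers may or may not see this yet: undefined
    out.sync();          // 0.20-append sync(); trunk after HADOOP-6313: hflush()/hsync()
    // from here on, a reader that opens the file should see `record`
    out.close();         // close() gives the same visibility guarantee
  }
}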

On Thu, Feb 10, 2011 at 7:11 AM, Gokulakannan M  wrote:

Is this the correct behavior, or is my understanding wrong?






 



Re: Hbase documentations

2011-02-14 Thread Bibek Paudel
On Mon, Feb 14, 2011 at 11:55 AM, Matthew John
 wrote:
> Hi guys,
>
> can someone send me good documentation on HBase (other than the
> Hadoop wiki)? I am also looking for a good HBase tutorial.
>

Have you checked this: http://hbase.apache.org/book.html ?

-b

> Regards,
> Matthew
>


RE: recommendation on HDDs

2011-02-14 Thread Michael Segel

Steve is right, and to try and add more clarification...

> Interesting choice; the 7 core in a single CPU option is something else 
> to consider. Remember also this is a moving target, what anyone says is 
> valid now (Feb 2011) will be seen as quaint in two years time. Even a 
> few months from now, what is the best value for a cluster will have moved on.

I've never seen a 7-core chip; 4, 6, and now 8 cores, yes (not counting
hyper-threading).
The point Steve is making is that it's true that the price point for picking the
optimum hardware continues to move, and what we see today doesn't mean we won't
see a better optimal configuration. More importantly, what is optimal for
one user isn't going to be optimal for another.

The other issue that we hadn't even talked about is whether you want to go 'white
box' and build your own, or your IT shop picks up the phone and calls Dell, HP,
IBM, or whomever supplies your hardware. That too will limit your options and
affect your budget.

In addition you have to look at what realistic expectations are. There are a
lot of factors that you have to weigh when making hardware decisions, including
how clean your developers' code is going to be and what resources they will
require to run on your cloud.

And if you run Hbase or Cloudbase, you add more variables.

The key is finding out which combination of variables are going to be the most 
important for you to get the most out of your hardware.

Ok... 
I'll get off my soapbox for now and go get my first cup of coffee. :-)

-Mike


> Date: Mon, 14 Feb 2011 11:23:13 +
> From: ste...@apache.org
> To: common-user@hadoop.apache.org
> Subject: Re: recommendation on HDDs
> 
> On 12/02/11 16:26, Michael Segel wrote:
> >
> > All,
> >
> > I'd like to clarify somethings...
> >
> > First the concept is to build out a cluster of commodity hardware.
> > So when you do your shopping you want to get the most bang for your buck. 
> > That is the 'sweet spot' that I'm talking about.
> > When you look at your E5500 or E5600 chip sets, you will want to go with 4 
> > cores per CPU, dual CPU and a clock speed around 2.53GHz or so.
> > (Faster chips are more expensive and the performance edge falls off so you 
> > end up paying a premium.)
> 
> Interesting choice; the 7 core in a single CPU option is something else 
> to consider. Remember also this is a moving target, what anyone says is 
> valid now (Feb 2011) will be seen as quaint in two years time. Even a 
> > few months from now, what is the best value for a cluster will have moved on.
> 
> >
> > Looking at your disks, you start with using the on board SATA controller. 
> > Why? Because it means you don't have to pay for a controller card.
> > If you are building a cluster for general purpose computing... Assuming 1U 
> > boxes you have room for 4 3.5" SATA which still give you the best 
> > performance for your buck.
> > Can you go with 2.5"? Yes, but you are going to be paying a premium.
> >
> > Price wise, a 2TB SATA II 7200 RPM drive is going to be your best deal. You 
> > could go with SATA III drives if your motherboard supports the SATA III 
> > ports, but you're still paying a slight premium.
> >
> > The OP felt that all he would need was 1TB of disk and was considering 4 
> > 250GB drives. (More spindles...yada yada yada...)
> >
> > My suggestion is to forget that nonsense and go with one 2 TB drive because 
> > its a better deal and if you want to add more disk to the node, you can. 
> > (Its easier to add disk than it is to replace it.)
> >
> > Now do you need to create a spare OS drive? No. Some people who have an 
> > internal 3.5 space sometimes do. That's ok, and you can put your hadoop 
> > logging there. (Just make sure you have a lot of disk space...)
> 
> One advantage of a specific drive for OS and log (in a separate 
> partition) is you can re-image it without losing data you care about, 
> and swap in a replacement fast. If you have a small cluster set up for 
> hotswap, that reduces the time a node is down -just have a spare OS HDD 
> ready to put in. OS disks are the ones you care about when they fail, 
> the others are more "mildly concerned about the failure rate" than 
> something to page you over.
> 
> >
> > The truth is that there really isn't any single *right* answer. There are a 
> > lot of options and budget constraints as well as physical constraints like 
> > power, space, and location of the hardware.
> 
> +1. don't forget weight either.
> 
> >
> > Also you may be building out a cluster whose main purpose is to be a backup
> > location for your cluster. So your production cluster has lots of nodes. 
> > Your backup cluster has lots of disks per node because your main focus is 
> > as much storage per node.
> >
> > So here you may end up buying a 4U rack box, load it up with 3.5" drives 
> > and a couple of SATA controller cards. You care less about performance but 
> > more about storage space. Here you may say 3TB SATA drives w 12 or more per

Re: recommendation on HDDs

2011-02-14 Thread Steve Loughran

On 12/02/11 16:26, Michael Segel wrote:


All,

I'd like to clarify somethings...

First the concept is to build out a cluster of commodity hardware.
So when you do your shopping you want to get the most bang for your buck. That 
is the 'sweet spot' that I'm talking about.
When you look at your E5500 or E5600 chip sets, you will want to go with 4 
cores per CPU, dual CPU and a clock speed around 2.53GHz or so.
(Faster chips are more expensive and the performance edge falls off so you end 
up paying a premium.)


Interesting choice; the 7 core in a single CPU option is something else 
to consider. Remember also this is a moving target, what anyone says is 
valid now (Feb 2011) will be seen as quaint in two years time. Even a 
few months from now, what is the best value for a cluster will have moved on.




Looking at your disks, you start with using the on board SATA controller. Why? 
Because it means you don't have to pay for a controller card.
If you are building a cluster for general purpose computing... Assuming 1U boxes you 
have room for 4 3.5" SATA which still give you the best performance for your 
buck.
Can you go with 2.5"? Yes, but you are going to be paying a premium.

Price wise, a 2TB SATA II 7200 RPM drive is going to be your best deal. You 
could go with SATA III drives if your motherboard supports the SATA III ports, 
but you're still paying a slight premium.

The OP felt that all he would need was 1TB of disk and was considering 4 250GB 
drives. (More spindles...yada yada yada...)

My suggestion is to forget that nonsense and go with one 2 TB drive because its 
a better deal and if you want to add more disk to the node, you can. (Its 
easier to add disk than it is to replace it.)

Now do you need to create a spare OS drive? No. Some people who have an 
internal 3.5 space sometimes do. That's ok, and you can put your hadoop logging 
there. (Just make sure you have a lot of disk space...)


One advantage of a specific drive for OS and log (in a separate 
partition) is you can re-image it without losing data you care about, 
and swap in a replacement fast. If you have a small cluster set up for 
hotswap, that reduces the time a node is down -just have a spare OS HDD 
ready to put in. OS disks are the ones you care about when they fail, 
the others are more "mildly concerned about the failure rate" than 
something to page you over.




The truth is that there really isn't any single *right* answer. There are a lot 
of options and budget constraints as well as physical constraints like power, 
space, and location of the hardware.


+1. don't forget weight either.



Also you may be building out a cluster whose main purpose is to be a backup
location for your cluster. So your production cluster has lots of nodes, and your
backup cluster has lots of disks per node because your main focus is as much
storage per node as possible.

So here you may end up buying a 4U rack box and loading it up with 3.5" drives and a
couple of SATA controller cards. You care less about performance and more about
storage space. Here you may use 3TB SATA drives, with 12 or more per box (I don't know
how many you can fit into a 4U chassis these days). So you have 10 DN backing up a
100+ DN cluster in your main data center. But that's another story.


You can get 12 HDDs in a 1U if you ask nicely. But in a small cluster
there's a cost: that server can be a big chunk of your filesystem, and
if it goes down there's up to 24TB worth of replication going to take 
place over the rest of the network, so you'll need at least 24TB of 
spare capacity on the other machines, ignoring bandwidth issues.




I think the main take away you should have is that if you look at the price 
point... your best price per GB is on a 2TB drive until the prices drop on 3TB 
drives.
Since the OP believes that their requirement is 1TB per node... a single 2TB 
would be the best choice. It allows for additional space and you really 
shouldn't be too worried about disk i/o being your bottleneck.



One less thing to worry about is good.


Hbase documentations

2011-02-14 Thread Matthew John
Hi guys,

can someone send me good documentation on HBase (other than the
Hadoop wiki)? I am also looking for a good HBase tutorial.

Regards,
Matthew


Re: recommendation on HDDs

2011-02-14 Thread Steve Loughran

On 10/02/11 22:25, Michael Segel wrote:


Shrinivas,

Assuming you're in the US, I'd recommend the following:

Go with 2TB 7200 SATA hard drives.
(Not sure what type of hardware you have)

What  we've found is that in the data nodes, there's an optimal configuration 
that balances price versus performance.

While your chasis may hold 8 drives, how many open SATA ports are on the 
motherboard? Since you're using JBOD, you don't want the additional expense of 
having to purchase a separate controller card for the additional drives.



I'm not going to disagree about cost, but I will note that a single 
controller can become a bottleneck once you add a lot of disks to it; it 
generates lots of interrupts that go to the same core, which then ends
up at 100% CPU and overloading. With two controllers the work can get 
spread over two CPUs, moving the bottlenecks back into the IO channels.


For that reason I'd limit the #of disks for a single controller at 
around 4-6.


Remember as well as storage capacity, you need disk space for logs, 
spill space, temp dirs, etc. This is why 2TB HDDs are looking appealing 
these days


Speed? 10K RPM has a faster seek time and possibly bandwidth but you pay 
in capital and power. If the HDFS blocks are laid out well, seek time 
isn't so important, so consider saving the money and putting it elsewhere.


The other big question with Hadoop is RAM and CPU, and the answer there 
is "it depends". RAM depends on the algorithm, as can the CPU:spindle 
ratio ... I recommend 1 core to 1 spindle as a good starting point. In a 
large cluster the extra capital costs of a second CPU compared to the 
amount of extra servers and storage that you could get for the same 
money speaks in favour of more servers, but in smaller clusters the 
spreadsheets say different things.


-Steve

(disclaimer, I work for a server vendor :)