I'm there. Thanks St.Ack.

On Wed, Sep 22, 2010 at 11:59 PM, Stack <st...@duboce.net> wrote:
> Hey George:
>
> James Kennedy is working on getting transactional hbase working w/
> hbase TRUNK. Watch HBASE-2641 for the drop of changes needed in core
> to make it so his github THBase can use HBase core.
>
> St.Ack
>
> On Mon, Sep 20, 2010 at 5:43 PM, Ryan Rawson <ryano...@gmail.com> wrote:
>> hi,
>>
>> sorry i don't. i think the current transactional/indexed person is
>> working on bringing it up to 0.89; perhaps they would enjoy your help
>> in testing or porting the code?
>>
>> I'll poke a few people into replying.
>>
>> -ryan
>>
>> On Mon, Sep 20, 2010 at 5:19 PM, George P. Stathis <gstat...@traackr.com> wrote:
>>> On Mon, Sep 20, 2010 at 4:55 PM, Ryan Rawson <ryano...@gmail.com> wrote:
>>>
>>>> When you say replication, what exactly do you mean? In normal HDFS, as
>>>> you write, the data is sent to 3 nodes, yes, but with the flaw I
>>>> outlined it doesn't matter, because the datanodes and namenode will
>>>> pretend a data block just didn't exist if it wasn't closed properly.
>>>
>>> That's the part I was not understanding. I do now. Thanks.
>>>
>>>> So even with the most careful white-glove handling of hbase, you will
>>>> eventually have a crash and you will lose data w/o 0.89/CDH3 et al.
>>>> You can circumvent this by storing the data elsewhere and spooling it
>>>> into hbase, or perhaps just not minding if you lose data (yes, those
>>>> applications exist).
>>>>
>>>> Looking at those JIRAs in question, the first is already on trunk,
>>>> which is 0.89. The second isn't, alas. At this point the transactional
>>>> hbase just isn't being actively maintained by any committer and we are
>>>> reliant on kind people's contributions. So I can't promise when it
>>>> will hit 0.89/0.90.
>>>
>>> Are you aware of any indexing alternatives in 0.89?
>>>
>>>> -ryan
>>>>
>>>> On Mon, Sep 20, 2010 at 1:21 PM, George P. Stathis <gstat...@traackr.com> wrote:
>>>>> Thanks for the response, Ryan. I have no doubt that 0.89 can be used
>>>>> in production and that it has strong support. I just wanted to avoid
>>>>> moving to it now because we have limited resources and it would put a
>>>>> dent in our roadmap if we were to fast-track the migration now.
>>>>> Specifically, we are using HBASE-2438 and HBASE-2426 to support
>>>>> pagination across indexes. So we either have to migrate those to 0.89
>>>>> or somehow go stock and be able to support pagination across region
>>>>> servers.
>>>>>
>>>>> Of course, if the choice is between migrating or losing more data,
>>>>> data safety comes first. But if we can buy two or three more months
>>>>> of time and avoid region server crashes (like you did for a year),
>>>>> maybe we can go that route for now. What do we need to do to achieve
>>>>> that?
>>>>>
>>>>> -GS
>>>>>
>>>>> PS: Out of curiosity, I understand the WAL log append issue for a
>>>>> single regionserver when it comes to losing the data on a single
>>>>> node. But if that data is also being replicated on another region
>>>>> server, why wouldn't it be available there? Or is the WAL log shared
>>>>> across multiple region servers (maybe that's what I'm missing)?
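A minimal sketch of what the "go stock" pagination mentioned above looks like, assuming the 0.20-era HBase client API: the only stock building block is PageFilter, and it is applied separately on each region server, so the client still has to cap the page itself and remember the last row key to start the next page. The table name, page size and row handling below are made up for illustration.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.PageFilter;

    public class PagedScan {
      public static void main(String[] args) throws Exception {
        // 0.20-era construction; later releases use HBaseConfiguration.create()
        HTable table = new HTable(new HBaseConfiguration(), "my_index_table");

        final int pageSize = 25;
        Scan scan = new Scan();
        // PageFilter only limits rows per region, not per scan...
        scan.setFilter(new PageFilter(pageSize));

        ResultScanner scanner = table.getScanner(scan);
        int returned = 0;
        byte[] lastRow = null;
        for (Result r : scanner) {
          // ...so the client must also stop at pageSize itself.
          if (++returned > pageSize) break;
          lastRow = r.getRow();
          // process the row here
        }
        scanner.close();
        // The next page re-runs the scan with a start row just past lastRow.
      }
    }

This only illustrates the stock-API behaviour being referred to; it is not a substitute for whatever HBASE-2438/2426 provide on the indexed tables.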
>>>>> On Mon, Sep 20, 2010 at 3:52 PM, Ryan Rawson <ryano...@gmail.com> wrote:
>>>>>> Hey,
>>>>>>
>>>>>> The problem is that the stock 0.20 hadoop won't let you read from a
>>>>>> non-closed file. It will report that length as 0. So if a
>>>>>> regionserver crashes, that last WAL log that is still open becomes 0
>>>>>> length and the data within it unreadable. That, specifically, is the
>>>>>> problem of data loss. You could always make it so your regionservers
>>>>>> rarely crash - this is possible, btw, and I did it for over a year.
>>>>>>
>>>>>> But you will want to run CDH3 or the append-branch releases to get
>>>>>> the series of patches that fix this hole. It also happens that only
>>>>>> 0.89 runs on it. I would like to avoid the hadoop "everyone uses 0.20
>>>>>> forever" problem and talk about what we could do to help you get on
>>>>>> 0.89. Over here at SU we've made a commitment to the future of 0.89
>>>>>> and are running it in production. Let us know what else you'd need.
>>>>>>
>>>>>> -ryan
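For readers wondering what the "series of patches" amounts to operationally: on an append-capable build (CDH3, or the branch-0.20-append releases), the sync/append path also has to be switched on in the site configuration. A minimal sketch of that setting, assuming such a build is already installed; on stock Apache 0.20.x the property exists but sits on top of the broken implementation described above, so it is not a fix by itself.

    <!-- hdfs-site.xml; typically also mirrored in hbase-site.xml (or the
         hdfs-site.xml placed on HBase's classpath) so HBase sees it -->
    <property>
      <name>dfs.support.append</name>
      <value>true</value>
      <description>
        Enable the sync/append code path so a WAL file that was still open
        when a regionserver died can be recovered by HBase on restart.
      </description>
    </property>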
>>>>>> On Mon, Sep 20, 2010 at 12:39 PM, George P. Stathis <gstat...@traackr.com> wrote:
>>>>>>> Thanks Todd. We are not quite ready to move to 0.89 yet. We have made
>>>>>>> custom modifications to the transactional contrib sources, which are
>>>>>>> now taken out of 0.89. We are planning on moving to 0.90 when it
>>>>>>> comes out and, at that point, either migrate our customizations or
>>>>>>> move back to the out-of-the-box features (which will require a
>>>>>>> re-write of our code).
>>>>>>>
>>>>>>> We are well aware of the CDH distros, but at the time we started with
>>>>>>> hbase, there was none that included HBase. I think CDH3 is the first
>>>>>>> one to include HBase, correct? And is 0.89 the only version
>>>>>>> supported?
>>>>>>>
>>>>>>> Moreover, are we saying that there is no way to prevent stock hbase
>>>>>>> 0.20.6 and hadoop 0.20.2 from losing data when a single node goes
>>>>>>> down? It does not matter if the data is replicated, it will still get
>>>>>>> lost?
>>>>>>>
>>>>>>> -GS
>>>>>>>
>>>>>>> On Sun, Sep 19, 2010 at 5:58 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>>>>>>> Hi George,
>>>>>>>>
>>>>>>>> The data loss problems you mentioned below are known issues when
>>>>>>>> running on stock Apache 0.20.x hadoop.
>>>>>>>>
>>>>>>>> You should consider upgrading to CDH3b2, which includes a number of
>>>>>>>> HDFS patches that allow HBase to durably store data. You'll also
>>>>>>>> have to upgrade to HBase 0.89 - we ship a version as part of CDH
>>>>>>>> that will work well.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> -Todd
>>>>>>>>
>>>>>>>> On Sun, Sep 19, 2010 at 6:57 AM, George P. Stathis <gstat...@traackr.com> wrote:
>>>>>>>>> Hi folks. I'd like to run the following data loss scenario by you
>>>>>>>>> to see if we are doing something obviously wrong with our setup
>>>>>>>>> here.
>>>>>>>>>
>>>>>>>>> Setup:
>>>>>>>>>
>>>>>>>>> - Hadoop 0.20.1
>>>>>>>>> - HBase 0.20.3
>>>>>>>>> - 1 master node running the Namenode, SecondaryNamenode, JobTracker,
>>>>>>>>>   HMaster and 1 Zookeeper (no zookeeper quorum right now)
>>>>>>>>> - 4 child nodes running a Datanode, TaskTracker and RegionServer each
>>>>>>>>> - dfs.replication is set to 2
>>>>>>>>> - Host: Amazon EC2
>>>>>>>>>
>>>>>>>>> Up until yesterday, we were frequently experiencing HBASE-2077
>>>>>>>>> <https://issues.apache.org/jira/browse/HBASE-2077>, which kept
>>>>>>>>> bringing our RegionServers down. What we realized, though, is that
>>>>>>>>> we were losing data (a few hours' worth) with just one out of four
>>>>>>>>> regionservers going down. This is problematic since we are supposed
>>>>>>>>> to replicate at x2 across 4 nodes, so at least one other node
>>>>>>>>> should theoretically be able to serve the data that the downed
>>>>>>>>> regionserver can't.
>>>>>>>>>
>>>>>>>>> Questions:
>>>>>>>>>
>>>>>>>>> - When a regionserver goes down unexpectedly, the only data that
>>>>>>>>>   theoretically gets lost is whatever didn't make it to the WAL,
>>>>>>>>>   right? Or wrong? (See the sketch after this list.) E.g.
>>>>>>>>>   http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html
>>>>>>>>> - We ran a hadoop fsck on our cluster and verified the replication
>>>>>>>>>   factor, as well as that there were no under-replicated blocks. So
>>>>>>>>>   why was our data not available from another node?
>>>>>>>>> - If the log gets rolled every 60 minutes by default (we haven't
>>>>>>>>>   touched the defaults), how can we lose data from up to 24 hours
>>>>>>>>>   ago?
>>>>>>>>> - When the downed regionserver comes back up, shouldn't that data
>>>>>>>>>   be available again? Ours wasn't.
>>>>>>>>> - In such scenarios, is there a recommended approach for restoring
>>>>>>>>>   the regionserver that goes down? We just brought them back up by
>>>>>>>>>   logging on to the node itself and manually restarting them first.
>>>>>>>>>   Now we have automated crons that listen for their ports and
>>>>>>>>>   restart them within two minutes if they go down.
>>>>>>>>> - Are there ways to recover such lost data?
>>>>>>>>> - Are versions 0.89 / 0.90 addressing any of these issues?
>>>>>>>>> - Curiosity question: when a regionserver goes down, does the
>>>>>>>>>   master try to replicate that node's data on another node to
>>>>>>>>>   satisfy the dfs.replication ratio?
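On the first question in the list above: besides server-side edits that never reached the WAL, writes can also be missing simply because they were still sitting in the client-side write buffer, or because a Put was told to skip the WAL. A minimal sketch of the relevant 0.20-era client knobs; the table, family and values are made up.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DurabilityKnobs {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "events");

        // With autoFlush off, puts collect in the client write buffer and
        // are not on any regionserver (or in its WAL) until flushCommits().
        table.setAutoFlush(false);

        Put p = new Put(Bytes.toBytes("row-1"));
        p.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));

        // Optional speed/durability trade-off: skipping the WAL means the
        // edit is gone if the regionserver dies before the memstore is
        // flushed, even on an append-capable HDFS. Left commented out to
        // keep the default (WAL on).
        // p.setWriteToWAL(false);

        table.put(p);
        table.flushCommits();   // push the buffered edits to the regionserver
      }
    }

Neither knob explains losing hours of already-acknowledged data, though; that points back at the unclosed-WAL problem described earlier in the thread.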
>>>>>>>>> For now, we have upgraded our HBase to 0.20.6, which is supposed to
>>>>>>>>> contain the HBASE-2077
>>>>>>>>> <https://issues.apache.org/jira/browse/HBASE-2077> fix (but no one
>>>>>>>>> has verified that yet). Lars' blog also suggests that Hadoop 0.21.0
>>>>>>>>> is the way to go to avoid the file-append issues, but it's not
>>>>>>>>> production ready yet. Should we stick to Hadoop 0.20.1? Upgrade to
>>>>>>>>> 0.20.2?
>>>>>>>>>
>>>>>>>>> Any tips here are definitely appreciated. I'll be happy to provide
>>>>>>>>> more information as well.
>>>>>>>>>
>>>>>>>>> -GS
>>>>>>>>
>>>>>>>> --
>>>>>>>> Todd Lipcon
>>>>>>>> Software Engineer, Cloudera