Do you have any suggestions of things I should look at to confirm/deny these
possibilities?

The tables are very small and inactive (probably only 50-100 rows changing
per day).

Thanks,
Jacques

On Thu, Aug 4, 2011 at 9:09 AM, Ryan Rawson <ryano...@gmail.com> wrote:

> Another possibility is the logs were not replayed correctly during the
> region startup.  We put in a lot of tests to cover this case, so it
> should not be so.
>
> Essentially the WAL replay looks at the current HFiles state, then
> decides which log entries to replay or skip. This is because a log
> might have more data than what is strictly missing from the HFiles.
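>
> Roughly, the per-edit decision is something like this (just a sketch of
> the idea, not the actual code -- the names here are made up):
>
> // Sketch: replay a WAL edit only if it is newer than what the store
> // has already flushed into HFiles.
> boolean shouldReplay(long editSeqId, long maxSeqIdInHFiles) {
>   // edits at or below the highest sequence id already persisted in
>   // HFiles are skipped; anything newer goes back into the memstore
>   // before the region is brought online
>   return editSeqId > maxSeqIdInHFiles;
> }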
>
> If the data that is missing is over 6 hours old, that is a very weird
> bug; it suggests to me that either an hfile is missing for some
> reason, or the WAL replay didn't include some edits for some reason.
>
> -ryan
>
> On Thu, Aug 4, 2011 at 8:38 AM, Jacques <whs...@gmail.com> wrote:
> > Thanks for the feedback.  So you're inclined to think it would be at
> > the dfs layer?
> >
> > Is it accurate to say the most likely places where the data could
> > have been lost were:
> > 1. wal writes didn't actually get written to disk (no log entries to
> > suggest any issues)
> > 2. wal corrupted (no log entries suggest any trouble reading the log)
> > 3. not all split logs were read by regionservers  (?? is there any way to
> > ensure this either way... should I look at the filesystem some place?)
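> >
> > (For 3, I could imagine poking at the filesystem for leftover split-log
> > output under the region directories -- something like the sketch below,
> > though the path layout is just my assumption and may not match what
> > 0.90 actually writes:)
> >
> > // Hypothetical check, not a known-good procedure: list anything still
> > // sitting under a region's recovered-edits directory after startup.
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.FileStatus;
> > import org.apache.hadoop.fs.FileSystem;
> > import org.apache.hadoop.fs.Path;
> >
> > public class FindLeftoverEdits {
> >   public static void main(String[] args) throws Exception {
> >     FileSystem fs = FileSystem.get(new Configuration());
> >     // assumed layout: /hbase/<table>/<encoded-region>/recovered.edits/*
> >     FileStatus[] leftovers =
> >         fs.globStatus(new Path("/hbase/*/*/recovered.edits/*"));
> >     if (leftovers != null) {
> >       for (FileStatus f : leftovers) {
> >         System.out.println(f.getPath() + " (" + f.getLen() + " bytes)");
> >       }
> >     }
> >   }
> > }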
> >
> > Do you think the type of network partition I'm talking about is
> > adequately covered in existing tests? (Specifically running an
> > external zk cluster?)
> >
> > Have you heard if anyone else has been having problems with the
> > second 90.4 rc?
> >
> > Thanks again for your help.  I'm following up with the MapR guys as well.
> >
> > Jacques
> >
> > On Wed, Aug 3, 2011 at 3:49 PM, Jean-Daniel Cryans
> > <jdcry...@apache.org> wrote:
> >
> >> Hi Jacques,
> >>
> >> Sorry to hear about that.
> >>
> >> Regarding MapR, I personally don't have hands-on experience so it's a
> >> little bit hard for me to help you. You might want to ping them and
> >> ask their opinion (and I know they are watching, Ted? Srivas?)
> >>
> >> What I can do is tell you whether things look normal from the HBase
> >> point of view, but I see you're not running with DEBUG so I might miss
> >> some information.
> >>
> >> Looking at the master log, it tells us that it was able to split the
> >> logs correctly.
> >>
> >> Looking at a few regionserver logs, it doesn't seem to say that it had
> >> issues replaying the logs so that's good too.
> >>
> >> About the memstore questions, it's almost purely size-based (64MB). I
> >> say almost because we limit the number of WALs a regionserver can
> >> carry so that when it reaches that limit it force flushes the
> >> memstores with older edits. There's also a thread that rolls the
> >> latest log if it's more than an hour old, so in the extreme case it
> >> could take 32 hours for an edit in the memstore to make it to a
> >> StoreFile. It used to be that without appends rolling those files
> >> often would prevent losses older than 1 hour, but I haven't seen those
> >> issues since we started using appends. But you're not using HDFS, and
> >> I don't have MapR experience, so I can't really go any further...
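> >>
> >> To put rough numbers on that worst case (property names from memory,
> >> so double-check them against hbase-default.xml):
> >>
> >> long flushSize  = 64L * 1024 * 1024; // hbase.hregion.memstore.flush.size
> >> int  maxLogs    = 32;                // hbase.regionserver.maxlogs
> >> long rollPeriod = 60L * 60 * 1000L;  // hbase.regionserver.logroll.period
> >> // An edit sits in the memstore until the size flush trips, or until its
> >> // WAL is among the oldest when the maxlogs cap is hit.  With one roll
> >> // per hour on a quiet region that can be up to:
> >> long worstCaseMs = maxLogs * rollPeriod; // 32 * 1h = ~32 hours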
> >>
> >> J-D
> >>
> >> On Tue, Aug 2, 2011 at 3:44 PM, Jacques <whs...@gmail.com> wrote:
> >> > Given the hardy reviews and timing, we recently shifted from 90.3
> >> > (apache) to 90.4rc2 (the July 24th one that Stack posted -- 0.90.4,
> >> > r1150278).
> >> >
> >> > We had a network switch go down last night which caused an apparent
> >> > network partition between two of our region servers and one or more
> >> > zk nodes.  (We're still piecing together the situation).  Anyway,
> >> > things *seemed* to recover fine.  However, this morning we realized
> >> > that we lost some data that was generated just before the problems
> >> > occurred.
> >> >
> >> > It looks like h002 went down nearly immediately at around 8pm while
> >> > h001 didn't go down until around 8:10pm (somewhat confused by this).
> >> > We're thinking that this may have contributed to the problem.  The
> >> > particular table that had data issues is a very small table with a
> >> > single region that was running on h002 when it went down.
> >> >
> >> > We know the corruption/lack of edits affected two tables.  It extended
> >> > across a number of rows and actually appears to reach back up to data
> >> > inserted 6 hours earlier (estimate).  The two tables we can verify
> >> > errors on are each probably at most 10-20k rows (each row <1k).  In
> >> > some places rows that were added are completely missing, and in
> >> > others just some cell edits are missing.  As an aside, I was thinking
> >> > there was a time-based memstore flush in addition to a size-based
> >> > one.  But upon reviewing the hbase default configuration, I don't see
> >> > mention of it.  Is this purely size based?
> >> >
> >> > We don't have the tools in place to verify exactly what other data
> >> > or tables may have been impacted.
> >> >
> >> > The log files are at the paste bin links below.  The whole cluster
> >> > is 8 nodes + master, 3 zk nodes running on separate machines.  We
> >> > run with mostly standard settings but do have the following settings:
> >> > heap: 12gb
> >> > regionsize 4gb, (due to lots of cold data and not enough servers,
> >> > avg 300 regions/server)
> >> > mslab: 4m/512k (due to somewhat frequent updates to larger objects
> >> > in the 200-500k size range)
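> >> >
> >> > (Roughly how those overrides look against the config keys, as I
> >> > understand the property names -- treat this as a paraphrase rather
> >> > than a paste of our actual files; heap itself is HBASE_HEAPSIZE in
> >> > hbase-env.sh:)
> >> >
> >> > import org.apache.hadoop.conf.Configuration;
> >> > import org.apache.hadoop.hbase.HBaseConfiguration;
> >> >
> >> > public class OurOverrides {
> >> >   static Configuration conf() {
> >> >     Configuration conf = HBaseConfiguration.create();
> >> >     // regionsize 4gb
> >> >     conf.setLong("hbase.hregion.max.filesize", 4L * 1024 * 1024 * 1024);
> >> >     // mslab: 4m chunk size / 512k max allocation
> >> >     conf.setLong("hbase.hregion.memstore.mslab.chunksize", 4L * 1024 * 1024);
> >> >     conf.setLong("hbase.hregion.memstore.mslab.max.allocation", 512L * 1024);
> >> >     return conf;
> >> >   }
> >> > }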
> >> >
> >> > We've been using hbase for about a year now and have been nothing
> >> > but happy with it.  The failure state that we had last night (where
> >> > only some region servers cannot talk to some zk servers) seems like
> >> > a strange one.
> >> >
> >> > Any thoughts? (beyond chiding for switching to an rc)  Any opinions
> >> > on whether we should roll back to 90.3 (or 90.3+cloudera)?
> >> >
> >> > Thanks for any help,
> >> > Jacques
> >> >
> >> > master: http://pastebin.com/aG8fm2KZ
> >> > h001: http://pastebin.com/nLLk06EC
> >> > h002: http://pastebin.com/0wPFuZDx
> >> > h003: http://pastebin.com/3ZMV01mA
> >> > h004: http://pastebin.com/0YVefuqS
> >> > h005: http://pastebin.com/N90LDjvs
> >> > h006: http://pastebin.com/gM8umekW
> >> > h007: http://pastebin.com/0TVvX68d
> >> > h008: http://pastebin.com/mV968Cem
> >> >
> >>
> >
>
