Thanks for the feedback.  So you're inclined to think it would be at the DFS
layer?

Is it accurate to say the most likely places where the data could have been
lost are:
1. WAL writes didn't actually get written to disk (no log entries suggest
any issues)
2. WAL corrupted (no log entries suggest any trouble reading the log)
3. not all split logs were read by the regionservers (is there any way to
confirm this either way?  Should I look somewhere on the filesystem?)
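As a concrete starting point for item 3, here's a rough sketch of where one
might look on the filesystem -- the paths assume HBase's default /hbase root
directory and the 0.90-era layout, and MapR's layout may differ, so treat
them as assumptions to verify rather than definitive locations:

```shell
# Hedged sketch: places where WAL-split state shows up on the filesystem
# (paths assume the default /hbase root dir; verify against your setup).
hadoop fs -ls /hbase/.logs                     # WALs not yet split away
hadoop fs -ls /hbase/.corrupt                  # WAL files the splitter could not parse
hadoop fs -lsr /hbase | grep recovered.edits   # split output awaiting replay by regionservers
```

Leftover files under a region's recovered.edits directory would suggest split
output that a regionserver never replayed.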

Do you think the type of network partition I'm talking about is adequately
covered in existing tests? (Specifically running an external zk cluster?)

Have you heard if anyone else has been having problems with the second 90.4
RC?

Thanks again for your help.  I'm following up with the MapR guys as well.

Jacques

On Wed, Aug 3, 2011 at 3:49 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:

> Hi Jacques,
>
> Sorry to hear about that.
>
> Regarding MapR, I personally don't have hands-on experience so it's a
> little bit hard for me to help you. You might want to ping them and
> ask their opinion (and I know they are watching, Ted? Srivas?)
>
> What I can do is tell you whether things look normal from the HBase
> point of view, but I see you're not running with DEBUG, so I might miss
> some information.
>
> Looking at the master log, it tells us that it was able to split the
> logs correctly.
>
> Looking at a few regionserver logs, it doesn't seem to say that it had
> issues replaying the logs so that's good too.
>
> About the memstore questions, it's almost purely size-based (64MB). I
> say almost because we limit the number of WALs a regionserver can
> carry, so when it reaches that limit it force-flushes the memstores
> holding the oldest edits. There's also a thread that rolls the latest
> log if it's more than an hour old, so in the extreme case it could
> take 32 hours for an edit in the memstore to make it to a StoreFile.
> It used to be that, without appends, rolling those files often would
> prevent losing edits older than an hour, but I haven't seen those
> issues since we started using appends. But you're not using HDFS, and
> I don't have MapR experience, so I can't really go any further...
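For reference, the flush behavior described above maps onto a few
hbase-site.xml knobs. The values below are the 0.90-era defaults as I
understand them -- a sketch to make the mechanics concrete, not a
recommendation; double-check the names and defaults against your version:

```xml
<!-- Defaults behind the flush behavior described above (0.90-era names assumed). -->
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>67108864</value> <!-- flush a memstore once it reaches 64 MB -->
</property>
<property>
  <name>hbase.regionserver.maxlogs</name>
  <value>32</value> <!-- past this many WALs, force-flush memstores holding the oldest edits -->
</property>
<property>
  <name>hbase.regionserver.logroll.period</name>
  <value>3600000</value> <!-- roll the latest WAL hourly; 32 logs x 1 hour = the 32-hour worst case -->
</property>
```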
>
> J-D
>
> On Tue, Aug 2, 2011 at 3:44 PM, Jacques <whs...@gmail.com> wrote:
> > Given the hardy reviews and timing, we recently shifted from 90.3
> > (apache) to 90.4rc2 (the July 24th one that Stack posted -- 0.90.4,
> > r1150278).
> >
> > We had a network switch go down last night which caused an apparent
> > network partition between two of our region servers and one or more
> > zk nodes.  (We're still piecing together the situation.)  Anyway,
> > things *seemed* to recover fine.  However, this morning we realized
> > that we lost some data that was generated just before the problems
> > occurred.
> >
> > It looks like h002 went down nearly immediately at around 8pm while
> > h001 didn't go down until around 8:10pm (somewhat confused by this).
> > We're thinking that this may have contributed to the problem.  The
> > particular table that had data issues is a very small table with a
> > single region that was running on h002 when it went down.
> >
> > We know the corruption/lack of edits affected two tables.  It
> > extended across a number of rows and appears to reach back as far as
> > data inserted roughly 6 hours earlier.  The two tables we can verify
> > errors on each hold at most 10-20k rows of <1k each.  In some places,
> > rows that were added are missing entirely; in others, only some cell
> > edits are missing.  As an aside, I was thinking there was a
> > time-based memstore flush in addition to a size-based one, but upon
> > reviewing the hbase default configuration I don't see mention of it.
> > Is the flush purely size-based?
> >
> > We don't have the tools in place to verify exactly what other data
> > or tables may have been impacted.
> >
> > The log files are at the pastebin links below.  The whole cluster is
> > 8 nodes + master, with 3 zk nodes running on separate machines.  We
> > run with mostly standard settings, but do have the following
> > overrides:
> > heap: 12gb
> > regionsize: 4gb (due to lots of cold data and not enough servers;
> > avg 300 regions/server)
> > mslab: 4m/512k (due to somewhat frequent updates to larger objects
> > in the 200-500k size range)
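For concreteness, the region-size and MSLAB overrides above would correspond
to something like the following hbase-site.xml fragment. This is a sketch
assuming 0.90-era property names (the heap is set separately, via
HBASE_HEAPSIZE in hbase-env.sh); verify the names against your version:

```xml
<!-- Sketch of the non-default settings described above (0.90-era names assumed). -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>4294967296</value> <!-- 4 GB max region size before split -->
</property>
<property>
  <name>hbase.hregion.memstore.mslab.chunksize</name>
  <value>4194304</value> <!-- 4 MB MSLAB chunks -->
</property>
<property>
  <name>hbase.hregion.memstore.mslab.max.allocation</name>
  <value>524288</value> <!-- 512 KB max allocation served from an MSLAB chunk -->
</property>
```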
> >
> > We've been using hbase for about a year now and have been nothing
> > but happy with it.  The failure state that we had last night (where
> > only some region servers cannot talk to some zk servers) seems like
> > a strange one.
> >
> > Any thoughts? (Beyond chiding us for switching to an RC.)  Any
> > opinions on whether we should roll back to 90.3 (or 90.3+cloudera)?
> >
> > Thanks for any help,
> > Jacques
> >
> > master: http://pastebin.com/aG8fm2KZ
> > h001: http://pastebin.com/nLLk06EC
> > h002: http://pastebin.com/0wPFuZDx
> > h003: http://pastebin.com/3ZMV01mA
> > h004: http://pastebin.com/0YVefuqS
> > h005: http://pastebin.com/N90LDjvs
> > h006: http://pastebin.com/gM8umekW
> > h007: http://pastebin.com/0TVvX68d
> > h008: http://pastebin.com/mV968Cem
> >
>
