I will take a look and see what I can figure out.

Thanks for your help.

Jacques

On Thu, Aug 4, 2011 at 9:52 AM, Ryan Rawson <ryano...@gmail.com> wrote:

> The regionserver logs that talk about the hlog replay might shed some
> light; they should tell you what entries were skipped, etc.  Also have a
> look at the hfile structure of the regions and see if there are holes.
> The HFile.main tool can come in handy here; you can run it as:
>
> hbase org.apache.hadoop.hbase.io.hfile.HFile
>
> and it will print its usage.
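>
> For example, to dump the file info and key/values of a single hfile
> (the path below is hypothetical and the flags are from memory -- trust
> the printed usage on your build over this), something like:
>
> hbase org.apache.hadoop.hbase.io.hfile.HFile -v -m -p \
>     -f /hbase/mytable/<region-encoded-name>/<family>/<hfile>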
>
> MapR might be able to give you audit logs for the time in question;
> that could be useful as well.
>
>
>
> On Thu, Aug 4, 2011 at 9:40 AM, Jacques <whs...@gmail.com> wrote:
> > Do you have any suggestions of things I should look at to confirm/deny
> > these possibilities?
> >
> > The tables are very small and inactive (probably only 50-100 rows
> > changing per day).
> >
> > Thanks,
> > Jacques
> >
> > On Thu, Aug 4, 2011 at 9:09 AM, Ryan Rawson <ryano...@gmail.com> wrote:
> >
> >> Another possibility is that the logs were not replayed correctly during
> >> region startup.  We put in a lot of tests to cover this case, so it
> >> should not happen.
> >>
> >> Essentially the WAL replay looks at the current HFiles state, then
> >> decides which log entries to replay or skip. This is because a log
> >> might have more data than what is strictly missing from the HFiles.
> >>
> >> If the data that is missing is over 6 hours old, that is a very weird
> >> bug; it suggests to me that either an hfile is missing for some
> >> reason, or the WAL replay didn't include some entries for some reason.
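> >>
> >> A rough way to sanity-check that decision (assuming the file-info key
> >> names are the same in your build -- I'm going from memory here) is to
> >> look at the max sequence id recorded in each hfile's metadata and see
> >> whether it is older than the edits you lost, e.g.:
> >>
> >> hbase org.apache.hadoop.hbase.io.hfile.HFile -m -f <hfile path> | grep -i seq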
> >>
> >> -ryan
> >>
> >> On Thu, Aug 4, 2011 at 8:38 AM, Jacques <whs...@gmail.com> wrote:
> >> > Thanks for the feedback.  So you're inclined to think it would be at
> >> > the DFS layer?
> >> >
> >> > Is it accurate to say the most likely places where the data could have
> >> > been lost were:
> >> > 1. WAL writes didn't actually get written to disk (no log entries
> >> > suggest any issues)
> >> > 2. WAL corrupted (no log entries suggest any trouble reading the log)
> >> > 3. not all split logs were read by regionservers (is there any way to
> >> > verify this either way... should I look at the filesystem some place?)
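> >> >
> >> > For #3, I'm guessing -- assuming MapR exposes the standard 0.90 layout
> >> > under the hbase root -- that it's a matter of checking whether anything
> >> > is still sitting in the write-ahead-log directories, something like:
> >> >
> >> > hadoop fs -ls /hbase/.logs
> >> > hadoop fs -ls /hbase/.oldlogs
> >> >
> >> > Is that the right place to look?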
> >> >
> >> > Do you think the type of network partition I'm talking about is
> >> > adequately covered in existing tests? (Specifically when running an
> >> > external zk cluster?)
> >> >
> >> > Have you heard whether anyone else has been having problems with the
> >> > second 0.90.4 rc?
> >> >
> >> > Thanks again for your help.  I'm following up with the MapR guys as
> >> > well.
> >> >
> >> > Jacques
> >> >
> >> > On Wed, Aug 3, 2011 at 3:49 PM, Jean-Daniel Cryans <jdcry...@apache.org>
> >> > wrote:
> >> >
> >> >> Hi Jacques,
> >> >>
> >> >> Sorry to hear about that.
> >> >>
> >> >> Regarding MapR, I personally don't have hands-on experience so it's a
> >> >> little bit hard for me to help you. You might want to ping them and
> >> >> ask their opinion (and I know they are watching, Ted? Srivas?)
> >> >>
> >> >> What I can do is tell you whether things look normal from the HBase
> >> >> point of view, but I see you're not running with DEBUG so I might
> >> >> miss some information.
> >> >>
> >> >> Looking at the master log, it tells us that it was able to split the
> >> >> logs correctly.
> >> >>
> >> >> Looking at a few regionserver logs, they don't seem to say that they
> >> >> had issues replaying the logs, so that's good too.
> >> >>
> >> >> About the memstore questions, it's almost purely size-based (64MB). I
> >> >> say almost because we limit the number of WALs a regionserver can
> >> >> carry so that when it reaches that limit it force flushes the
> >> >> memstores with older edits. There's also a thread that rolls the
> >> >> latest log if it's more than an hour old, so in the extreme case it
> >> >> could take 32 hours for an edit in the memstore to make it to a
> >> >> StoreFile. It used to be that without appends rolling those files
> >> >> often would prevent losses older than 1 hour, but I haven't seen those
> >> >> issues since we started using appends. But you're not using HDFS, and
> >> >> I don't have MapR experience, so I can't really go any further...
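> >> >>
> >> >> For reference, the knobs involved are roughly these (property names
> >> >> from memory, so double-check them against your 0.90.4 defaults):
> >> >>
> >> >>   hbase.hregion.memstore.flush.size    - memstore flush threshold (64MB)
> >> >>   hbase.regionserver.maxlogs           - force-flush once this many WALs pile up (32)
> >> >>   hbase.regionserver.logroll.period    - roll the latest log this often (1 hour)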
> >> >>
> >> >> J-D
> >> >>
> >> >> On Tue, Aug 2, 2011 at 3:44 PM, Jacques <whs...@gmail.com> wrote:
> >> >> > Given the hearty reviews and timing, we recently shifted from 0.90.3
> >> >> > (apache) to 0.90.4rc2 (the July 24th one that Stack posted -- 0.90.4,
> >> >> > r1150278).
> >> >> >
> >> >> > We had a network switch go down last night, which caused an apparent
> >> >> > network partition between two of our region servers and one or more
> >> >> > zk nodes.  (We're still piecing together the situation.)  Anyway,
> >> >> > things *seemed* to recover fine.  However, this morning we realized
> >> >> > that we lost some data that was generated just before the problems
> >> >> > occurred.
> >> >> >
> >> >> > It looks like h002 went down nearly immediately at around 8pm, while
> >> >> > h001 didn't go down until around 8:10pm (we're somewhat confused by
> >> >> > this).  We're thinking that this may have contributed to the problem.
> >> >> > The particular table that had data issues is a very small table with
> >> >> > a single region that was running on h002 when it went down.
> >> >> >
> >> >> > We know the corruption/lack of edits affected two tables.  It
> >> >> > extended across a number of rows and actually appears to reach back
> >> >> > to data inserted roughly 6 hours earlier.  The two tables we can
> >> >> > verify errors on each probably have at most 10-20k rows of <1k each.
> >> >> > In some places, rows that were added are completely missing; in
> >> >> > others, only some cell edits are missing.  As an aside, I was
> >> >> > thinking there was a time-based memstore flush in addition to a
> >> >> > size-based one, but upon reviewing the HBase default configuration I
> >> >> > don't see mention of it.  Is flushing purely size-based?
> >> >> >
> >> >> > We don't have the tools in place to verify exactly what other data
> >> >> > or tables may have been impacted.
> >> >> >
> >> >> > The log files are at the pastebin links below.  The whole cluster is
> >> >> > 8 nodes + master, with 3 ZK nodes running on separate machines.  We
> >> >> > run with mostly standard settings but do have the following:
> >> >> > heap: 12GB
> >> >> > region size: 4GB (due to lots of cold data and not enough servers,
> >> >> > avg 300 regions/server)
> >> >> > mslab: 4MB/512KB (due to somewhat frequent updates to larger objects
> >> >> > in the 200-500KB size range)
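> >> >> >
> >> >> > (Concretely, that corresponds to roughly the following -- property
> >> >> > names from memory, so treat them as approximate rather than exact:
> >> >> > HBASE_HEAPSIZE=12288 in hbase-env.sh, and in hbase-site.xml
> >> >> > hbase.hregion.max.filesize = 4294967296,
> >> >> > hbase.hregion.memstore.mslab.chunksize = 4194304, and
> >> >> > hbase.hregion.memstore.mslab.max.allocation = 524288.)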
> >> >> >
> >> >> > We've been using HBase for about a year now and have been nothing
> >> >> > but happy with it.  The failure state that we had last night (where
> >> >> > only some region servers could not talk to some zk servers) seems
> >> >> > like a strange one.
> >> >> >
> >> >> > Any thoughts (beyond chiding us for switching to an rc)?  Any
> >> >> > opinions on whether we should roll back to 0.90.3 (or
> >> >> > 0.90.3+cloudera)?
> >> >> >
> >> >> > Thanks for any help,
> >> >> > Jacques
> >> >> >
> >> >> > master: http://pastebin.com/aG8fm2KZ
> >> >> > h001: http://pastebin.com/nLLk06EC
> >> >> > h002: http://pastebin.com/0wPFuZDx
> >> >> > h003: http://pastebin.com/3ZMV01mA
> >> >> > h004: http://pastebin.com/0YVefuqS
> >> >> > h005: http://pastebin.com/N90LDjvs
> >> >> > h006: http://pastebin.com/gM8umekW
> >> >> > h007: http://pastebin.com/0TVvX68d
> >> >> > h008: http://pastebin.com/mV968Cem
> >> >> >
> >> >>
> >> >
> >>
> >
>
