On Thu, Aug 4, 2011 at 9:01 PM, Todd Lipcon <t...@cloudera.com> wrote:
> On Thu, Aug 4, 2011 at 8:36 PM, lohit <lohit.vijayar...@gmail.com> wrote:
> > 2011/8/4 Ryan Rawson <ryano...@gmail.com>
> >
> >> Yes, that is what JD is referring to, the so-called IO fence.
> >>
> >> It works like so:
> >> - the regionserver is appending to an HLog, and continues to do so
> >>   because it hasn't gotten the ZK "kill yourself" signal yet
> >> - the hmaster splits the logs
> >> - the hmaster yanks the writer from under the regionserver, and the
> >>   RS then starts to kill itself
> >>
> > Can you tell us more about how this is done with HDFS? If the RS has
> > the lease, how did the master get hold of that lease? Or is it
> > removing the file?
>
> In older versions, it would call append(), which recovered the lease
> as long as the soft lease timeout had expired. More recently, it calls
> an HDFS "recoverLease" API that provides fencing.

Looks like we need a patch in both HBase and MapR ... even if MapR had
leases, this piece of code in FSUtils.java prevents it from being called:

    if (!(fs instanceof DistributedFileSystem)) {
      return;
    }

Someone will be issuing a patch for both MapR and HBase to fix this in a
couple of days. (I am on vacation.)

> >>
> >> This can happen because ZK can deliver the session-lost message late,
> >> and there is a race.
> >>
> >> -ryan
> >>
> >> On Thu, Aug 4, 2011 at 8:13 PM, M. C. Srivas <mcsri...@gmail.com> wrote:
> >> > On Thu, Aug 4, 2011 at 10:34 AM, Jean-Daniel Cryans
> >> > <jdcry...@apache.org> wrote:
> >> >
> >> >> > Thanks for the feedback. So you're inclined to think it would be
> >> >> > at the dfs layer?
> >> >>
> >> >> That's where the evidence seems to point.
> >> >>
> >> >> > Is it accurate to say the most likely places where the data could
> >> >> > have been lost were:
> >> >> > 1. wal writes didn't actually get written to disk (no log entries
> >> >> > to suggest any issues)
> >> >>
> >> >> Most likely.
> >> >>
> >> >> > 2. wal corrupted (no log entries suggest any trouble reading the
> >> >> > log)
> >> >>
> >> >> In that case the logs would scream (and I didn't see that in the
> >> >> logs I looked at).
> >> >>
> >> >> > 3. not all split logs were read by regionservers (?? is there any
> >> >> > way to ensure this either way... should I look at the filesystem
> >> >> > some place?)
> >> >>
> >> >> Some regions would have recovered-edits files, but that seems
> >> >> highly unlikely. With DEBUG enabled we could have seen which files
> >> >> were split by the master and which ones were created for the
> >> >> regions, and then which were read by the region servers.
> >> >>
> >> >> > Do you think the type of network partition I'm talking about is
> >> >> > adequately covered in existing tests? (Specifically running an
> >> >> > external ZK cluster?)
> >> >>
> >> >> The IO fencing was only tested with HDFS; I don't know what happens
> >> >> in that case with MapR. What I mean is that when the master splits
> >> >> the logs, it takes ownership of the HDFS writer lease (only one per
> >> >> file) so that it can safely close the log file. Then after that it
> >> >> checks if there are any new log files that were created (the region
> >> >> server could have rolled a log while the master was splitting them)
> >> >> and will restart if that situation happens, until it's able to own
> >> >> all the files and split them.
> >> >
> >> > JD, I didn't think the master explicitly dealt with writer leases.
> >> >
> >> > Does HBase rely on single-writer semantics on the log file? That is,
> >> > if the master and a RS both decide to mucky-muck with a log file, do
> >> > you expect the FS to lock out one of the writers?
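To make the fence concrete: what Todd and J-D describe boils down to the
master forcibly recovering the writer lease before it reads the log. A
minimal sketch, assuming the DistributedFileSystem.recoverLease(Path)
call Todd mentions (the class name and the retry interval here are
illustrative, not the actual HBase code):

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class LeaseFence {
      /**
       * Take over the writer lease on a log file before splitting it,
       * so a region server that is still alive cannot keep appending.
       */
      public static void fence(FileSystem fs, Path log)
          throws IOException, InterruptedException {
        if (!(fs instanceof DistributedFileSystem)) {
          return;  // the FSUtils guard quoted above: non-HDFS is skipped
        }
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        // recoverLease() returns true once the NameNode has recovered
        // the lease and closed the file; retry until the fence holds.
        while (!dfs.recoverLease(log)) {
          Thread.sleep(1000);
        }
      }
    }

Once recoverLease() returns true, any write the old region server still
attempts on its open stream fails, which is exactly the lock-out behavior
being asked about.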
> >> >
> >> >> > Have you heard if anyone else has been having problems with the
> >> >> > second 90.4 RC?
> >> >>
> >> >> Nope, we run it here on our dev cluster and didn't encounter any
> >> >> issue (with the code or node failure).
> >> >>
> >> >> > Thanks again for your help. I'm following up with the MapR guys
> >> >> > as well.
> >> >>
> >> >> Good idea!
> >> >>
> >> >> J-D
> >> >
> >
> > --
> > Have a Nice Day!
> > Lohit
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
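PS: for anyone curious what the FSUtils change could look like, here is
one hypothetical shape (a sketch of the idea only, not the actual patch;
the reflection probe and the class name are my assumptions): instead of
requiring DistributedFileSystem, probe the concrete FileSystem for a
recoverLease(Path) method so that any filesystem implementing lease
recovery gets fenced:

    import java.lang.reflect.Method;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RecoverLeaseShim {
      /**
       * Returns true once the lease on the file is recovered, or
       * immediately if this filesystem has no lease concept at all.
       */
      public static boolean tryRecoverLease(FileSystem fs, Path p)
          throws Exception {
        Method m;
        try {
          m = fs.getClass().getMethod("recoverLease", Path.class);
        } catch (NoSuchMethodException e) {
          return true;  // no leases to recover; nothing to fence
        }
        return (Boolean) m.invoke(fs, p);
      }
    }

That keeps the HDFS path working unchanged, while a MapR client that
later exposes a compatible recoverLease() would participate in the same
fence without further HBase changes.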