The normal behavior would be for the HMaster to make the hlog read-only before processing it... that's very simple fencing, and it works on all POSIX or close-to-POSIX systems. Does that not work on HDFS?
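Concretely, the chmod-style fence I have in mind looks roughly like the sketch below, written against Hadoop's FileSystem API (the class and method names are made up for illustration, not anything in HBase). One caveat, as far as I understand it: HDFS only checks permissions when a stream is opened, so a region server that already holds the writer lease is not cut off by a permission change -- which is where the recoverLease() discussion below comes in.

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsAction;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class ReadOnlyFenceSketch {
      // Drop write permission on the hlog before the master processes it.
      // On a strictly POSIX filesystem this blocks further appends; on HDFS
      // an already-open writer keeps its lease and can keep writing, which
      // is the gap the lease-recovery discussion below is about.
      public static void fenceLog(FileSystem fs, Path hlog) throws IOException {
        FsPermission readOnly =
            new FsPermission(FsAction.READ, FsAction.READ, FsAction.NONE);
        fs.setPermission(hlog, readOnly);
      }
    }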
On Fri, Aug 5, 2011 at 7:07 AM, M. C. Srivas <mcsri...@gmail.com> wrote:
>
> On Thu, Aug 4, 2011 at 9:01 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> On Thu, Aug 4, 2011 at 8:36 PM, lohit <lohit.vijayar...@gmail.com> wrote:
>> > 2011/8/4 Ryan Rawson <ryano...@gmail.com>
>> >
>> >> Yes, that is what JD is referring to, the so-called IO fence.
>> >>
>> >> It works like so:
>> >> - regionserver is appending to an HLog, continues to do so, hasn't
>> >>   gotten the ZK "kill yourself" signal yet
>> >> - hmaster splits the logs
>> >> - the hmaster yanks the writer from under the regionserver, and the RS
>> >>   then starts to kill itself
>> >>
>> > Can you tell more about how this is done with HDFS. If the RS has the
>> > lease, how did the master get hold of that lease? Or is it removing the
>> > file?
>>
>> In older versions, it would call append(), which recovered the lease so
>> long as the soft lease timeout had expired. More recently, it calls an
>> HDFS "recoverLease" API that provides fencing.
>>
> Looks like we need a patch in both HBase and MapR ... even if MapR had
> leases, this piece of code in FSUtils.java prevents it being called:
>
>   if (!(fs instanceof DistributedFileSystem)) {
>     return;
>   }
>
> Someone will be issuing a patch for both MapR and HBase to fix this in a
> couple of days. (I am on vacation.)
>
>> >> This can happen because ZK can deliver the session-lost message late,
>> >> and there is a race.
>> >>
>> >> -ryan
>> >>
>> >> On Thu, Aug 4, 2011 at 8:13 PM, M. C. Srivas <mcsri...@gmail.com> wrote:
>> >> > On Thu, Aug 4, 2011 at 10:34 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>> >> >
>> >> >> > Thanks for the feedback. So you're inclined to think it would be at
>> >> >> > the dfs layer?
>> >> >>
>> >> >> That's where the evidence seems to point.
>> >> >>
>> >> >> > Is it accurate to say the most likely places where the data could
>> >> >> > have been lost were:
>> >> >> > 1. wal writes didn't actually get written to disk (no log entries
>> >> >> >    to suggest any issues)
>> >> >>
>> >> >> Most likely.
>> >> >>
>> >> >> > 2. wal corrupted (no log entries suggest any trouble reading the log)
>> >> >>
>> >> >> In that case the logs would scream (and I didn't see that in the logs
>> >> >> I looked at).
>> >> >>
>> >> >> > 3. not all split logs were read by regionservers (?? is there any
>> >> >> >    way to ensure this either way... should I look at the filesystem
>> >> >> >    some place?)
>> >> >>
>> >> >> Some regions would have recovered edits files, but that seems highly
>> >> >> unlikely. With DEBUG enabled we could have seen which files were split
>> >> >> by the master and which ones were created for the regions, and then
>> >> >> which were read by the region servers.
>> >> >>
>> >> >> > Do you think the type of network partition I'm talking about is
>> >> >> > adequately covered in existing tests? (Specifically running an
>> >> >> > external zk cluster?)
>> >> >>
>> >> >> The IO fencing was only tested with HDFS; I don't know what happens in
>> >> >> that case with MapR. What I mean is that when the master splits the
>> >> >> logs, it takes ownership of the HDFS writer lease (only one per file)
>> >> >> so that it can safely close the log file. Then after that it checks if
>> >> >> there are any new log files that were created (the region server could
>> >> >> have rolled a log while the master was splitting them) and will
>> >> >> restart if that situation happens, until it's able to own all files
>> >> >> and split them.
>> >> >
>> >> > JD, I didn't think the master explicitly dealt with writer leases.
>> >> >
>> >> > Does HBase rely on single-writer semantics on the log file? That is,
>> >> > if the master and a RS both decide to mucky-muck with a log file, do
>> >> > you expect the FS to lock out one of the writers?
>> >> >
>> >> >> > Have you heard if anyone else has been having problems with the
>> >> >> > second 90.4 rc?
>> >> >>
>> >> >> Nope, we run it here on our dev cluster and didn't encounter any issue
>> >> >> (with the code or node failure).
>> >> >>
>> >> >> > Thanks again for your help. I'm following up with the MapR guys as
>> >> >> > well.
>> >> >>
>> >> >> Good idea!
>> >> >>
>> >> >> J-D
>> >
>> > --
>> > Have a Nice Day!
>> > Lohit
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
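For reference, the lease-based fencing Todd and J-D describe above amounts to something like the sketch below. This is not HBase's actual FSUtils code, just an illustration that assumes the boolean-returning recoverLease(Path) exposed by DistributedFileSystem; the instanceof check is the piece Srivas quotes above, which the proposed MapR/HBase patch would presumably relax so that other filesystems with lease semantics get a chance to fence as well.

    import java.io.IOException;
    import java.io.InterruptedIOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class LeaseFenceSketch {
      // Recover (steal) the writer lease on a dead region server's hlog so
      // the RS can no longer append, before the master splits the file.
      public static void recoverFileLease(FileSystem fs, Path log) throws IOException {
        if (!(fs instanceof DistributedFileSystem)) {
          // The early return discussed above: non-HDFS filesystems never fence.
          return;
        }
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        try {
          // recoverLease() returns true once the lease is recovered and the
          // file is closed; poll until then.
          while (!dfs.recoverLease(log)) {
            Thread.sleep(1000);
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new InterruptedIOException("Interrupted recovering lease on " + log);
        }
      }
    }

The point is that lease recovery itself is the fence: once the master owns the lease and the file is closed, further writes from a straggling region server should fail, which is what closes the race Ryan describes.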