On Thu, Aug 4, 2011 at 10:34 AM, Jean-Daniel Cryans <jdcry...@apache.org>wrote:
> > Thanks for the feedback. So you're inclined to think it would be at the > dfs > > layer? > > That's where the evidence seems to point. > > > > > Is it accurate to say the most likely places where the data could have > been > > lost were: > > 1. wal writes didn't actually get written to disk (no log entries to > suggest > > any issues) > > Most likely. > > > 2. wal corrupted (no log entries suggest any trouble reading the log) > > In that case the logs would scream (and I didn't see that in the logs > I looked at). > > > 3. not all split logs were read by regionservers (?? is there any way to > > ensure this either way... should I look at the filesystem some place?) > > Some regions would have recovered edits files, but that seems highly > unlikely. With DEBUG enabled we could have seen which files were split > by the master and which ones were created for the regions, and then > which were read by the region servers. > > > > > Do you think the type of network partition I'm talking about is > adequately > > covered in existing tests? (Specifically running an external zk cluster?) > > The IO fencing was only tested with HDFS, I don't know what happens in > that case with MapR. What I mean is that when the master splits the > logs, it takes ownership of the HDFS writer lease (only one per file) > so that it can safely close the log file. Then after that it checks if > there are any new log files that were created (the region server could > have rolled a log while the master was splitting them) and will > restart if that situation happens until it's able to own all files and > split them. > JD, I didn't think the master explicitly dealt with writer leases. Does HBase rely on single-writer semantics on the log file? That is, if the master and a RS both decide to mucky-muck with a log file, you expect the FS to lock out one of the writers? > > > > > Have you heard if anyone else is been having problems with the second > 90.4 > > rc? > > Nope, we run it here on our dev cluster and didn't encounter any issue > (with the code or node failure). > > > > > Thanks again for your help. I'm following up with the MapR guys as well. > > Good idea! > > J-D >