Is there not the WAL to handle a failed flush?
> On Oct 2, 2014, at 11:39 AM, Nick Dimiduk <[email protected]> wrote:
>
> In this case, didn't the RS creating the directories and flushing the files
> prevent data loss? Had the flush aborted due to lack of directories, that
> flush data would have been lost entirely.
>
>> On Thu, Oct 2, 2014 at 11:26 AM, Andrew Purtell <[email protected]> wrote:
>>
>> On Thu, Oct 2, 2014 at 11:17 AM, Buckley,Ron <[email protected]> wrote:
>>
>>> Also, once the original /hbase got mv'd, a few of the region servers did
>>> some flush's before they aborted. Those RS's actually created a new
>>> /hbase, with new table directories, but only containing the data from the
>>> flush.
>>
>>
>> Sounds like we should be creating flush files with createNonRecursive (even
>> though it's deprecated)
>>
>>
>>> On Thu, Oct 2, 2014 at 11:17 AM, Buckley,Ron <[email protected]> wrote:
>>>
>>> FWIW, in case something like this happens to someone else.
>>>
>>> To recover this, the first thing I tried was to just mv the /hbase
>>> directory back. That doesn’t work.
>>>
>>> To get back going had to completely shut down and restart.
>>>
>>> Also, once the original /hbase got mv'd, a few of the region servers did
>>> some flush's before they aborted. Those RS's actually created a new
>>> /hbase, with new table directories, but only containing the data from the
>>> flush.
>>>
>>>
>>> -----Original Message-----
>>> From: Buckley,Ron
>>> Sent: Thursday, October 02, 2014 2:09 PM
>>> To: hbase-user
>>> Subject: RE: Recovering hbase after a failure
>>>
>>> Nick,
>>>
>>> Good ideas. Compared file and region counts with our DR site. Things
>>> looks OK. Going to run some rowcounter's too.
>>>
>>> Feels like we got off easy.
>>>
>>> Ron
>>>
>>> -----Original Message-----
>>> From: Nick Dimiduk [mailto:[email protected]]
>>> Sent: Thursday, October 02, 2014 1:27 PM
>>> To: hbase-user
>>> Subject: Re: Recovering hbase after a failure
>>>
>>> Hi Ron,
>>>
>>> Yikes!
>>>
>>> Do you have any basic metrics regarding the amount of data in the system
>>> -- size of store files before the incident, number of records, &c?
>>>
>>> You could sift through the HDFS audit log and see if any files that were
>>> there previously have not been restored.
>>>
>>> -n
>>>
>>>> On Thu, Oct 2, 2014 at 10:18 AM, Buckley,Ron <[email protected]> wrote:
>>>>
>>>> We just had an event where, on our main hbase instance, the /hbase
>>>> directory got moved out from under the running system (Human error).
>>>>
>>>> HBase was really unhappy about that, but we were able to recover it
>>>> fairly easily and get back going.
>>>>
>>>> As far as I can tell, all the data and tables came back correct. But,
>>>> I'm pretty concerned that there may be some hidden corruption or data
>>>> loss.
>>>>
>>>> 'hbase hbck' runs clean and there are no new complaints in the logs.
>>>>
>>>> Can anyone think of anything else we should look at?
>>
>>
>>
>> --
>> Best regards,
>>
>> - Andy
>>
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)
>>
