[ https://issues.apache.org/jira/browse/HBASE-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029748#comment-13029748 ]

stack commented on HBASE-3820:
------------------------------

Sure.

bq. "dfs.exist(new Path("/"))" checks the dfs is available, at least, the dfs 
can be read.

That is true, but this is being done in a method named 'isFSWritable' (or some 
such thing)... we're testing existence, but the method name says we're testing 
whether it's writable, which we are not doing.  It's misleading.

bq. "!checkDfsSafeMode(conf)" checks the dfs is not in safemode. because only 
in safemode, it can't be written

This is true.  But this is not the only reason the fs can become unwritable.
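
For reference, a direct safe-mode probe against HDFS looks roughly like the 
below (a minimal sketch only; I'm guessing this is more or less what 
checkDfsSafeMode in the patch boils down to, and the exact FSConstants class 
depends on the Hadoop version):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.hdfs.DistributedFileSystem;
  import org.apache.hadoop.hdfs.protocol.FSConstants;

  public class SafeModeProbe {
    /**
     * Returns true if the HDFS namenode reports it is in safe mode.
     * Non-HDFS filesystems (e.g. the local fs used in tests) never are.
     */
    public static boolean isInSafeMode(Configuration conf) throws IOException {
      FileSystem fs = FileSystem.get(conf);
      if (!(fs instanceof DistributedFileSystem)) {
        return false;
      }
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      // SAFEMODE_GET only queries the current state; it does not change it.
      return dfs.setSafeMode(FSConstants.SafeModeAction.SAFEMODE_GET);
    }
  }

Note that the SAFEMODE_GET query is itself a round trip to the NN, which bears 
on whether we want to do it on every check.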

bq. If not, what about doing a test write of a temp file to check whether the 
dfs is writable?

I don't think this is a good idea.  This call is made often.  Doing this check 
would put a big load on the fs, at least I think it would.
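
To be concrete about what I understand the suggestion to be (a sketch only; the 
class and method names here are made up for illustration):

  import java.io.IOException;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class WriteProbe {
    /**
     * Probe writability by creating and deleting a zero-length temp file.
     * Each call costs several round trips to the NN (create, close, delete).
     */
    public static boolean isWritable(FileSystem fs, Path probeFile) {
      try {
        FSDataOutputStream out = fs.create(probeFile, true);
        out.close();
        fs.delete(probeFile, false);
        return true;
      } catch (IOException e) {
        return false;
      }
    }
  }

Because every call is several NN operations rather than one, I'd rather not do 
it each time we check the fs.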

Why not just use the method that was there previously and add the safe mode 
check?  Should we even check safe mode, though?  That's a trip to the NN (IIUC).  
Maybe we should just deal with the safe mode exception by going into a holding 
pattern if we get one while checking the fs (will we get a safe mode exception 
if we do an exists check?  I don't know).

bq. About the HLogSplitter, it doesn't only wait ten seconds. Ten seconds is 
just a waiting interval. If the dfs is in safemode, it will wait 10 seconds, 
and after that check again...

Ok.  Pardon me.  My misunderstanding.  Thanks for clarifying.

bq. I think, if the dfs can't be written while splitLog() runs, whether because 
the dfs is in safemode or for any other reason, it's fatal, so let the master 
abort.

If it's not writable, it would be nice if we could go into a holding pattern, 
especially if it's actually in safe mode -- it would be nice to ride out a safe 
mode if that were possible.  But I think that is a pretty big problem to 
solve; it will take some work.
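
For what it's worth, the holding pattern I have in mind is roughly the below (a 
sketch only, untested; the 10 second poll interval and the method name are made 
up, and a real version would also need a bound on the wait and a check against 
master shutdown):

  import java.io.IOException;
  import org.apache.hadoop.hdfs.DistributedFileSystem;
  import org.apache.hadoop.hdfs.protocol.FSConstants;

  public class SafeModeWait {
    /** Block until the namenode leaves safe mode, polling every 10 seconds. */
    public static void waitOutSafeMode(DistributedFileSystem dfs)
        throws IOException, InterruptedException {
      // SAFEMODE_GET only queries the current state; it does not toggle it.
      while (dfs.setSafeMode(FSConstants.SafeModeAction.SAFEMODE_GET)) {
        Thread.sleep(10 * 1000);
      }
    }
  }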

Maybe you should narrow the scope of this issue to dealing with safe mode while 
splitting?  Or to making the Master hold until the fs exits safe mode?

Thanks Jieshan for looking into this issue.

> Splitlog() executed while the namenode was in safemode may cause data-loss
> --------------------------------------------------------------------------
>
>                 Key: HBASE-3820
>                 URL: https://issues.apache.org/jira/browse/HBASE-3820
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.2
>            Reporter: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3820-MFSFix-90-V2.patch, HBASE-3820-MFSFix-90.patch
>
>
> I found this problem when the namenode went into safemode for some unclear 
> reason. 
> There's one patch about this problem:
>    try {
>       HLogSplitter splitter = HLogSplitter.createLogSplitter(
>         conf, rootdir, logDir, oldLogDir, this.fs);
>       try {
>         splitter.splitLog();
>       } catch (OrphanHLogAfterSplitException e) {
>         LOG.warn("Retrying splitting because of:", e);
>         // An HLogSplitter instance can only be used once.  Get new instance.
>         splitter = HLogSplitter.createLogSplitter(conf, rootdir, logDir,
>           oldLogDir, this.fs);
>         splitter.splitLog();
>       }
>       splitTime = splitter.getTime();
>       splitLogSize = splitter.getSize();
>     } catch (IOException e) {
>       checkFileSystem();
>       LOG.error("Failed splitting " + logDir.toString(), e);
>       master.abort("Shutting down HBase cluster: Failed splitting hlog files...", e);
>     } finally {
>       this.splitLogLock.unlock();
>     }
> And it did give some useful help to some extent, for the case where the 
> namenode process exited or was killed, but it did not consider the Namenode 
> safemode case.
>    I think the root cause is in the method checkFileSystem().
>    It is meant to check whether HDFS is working normally (reads and writes 
> could succeed), and that was maybe the original purpose of this method. 
> This is how the method is implemented:
>     DistributedFileSystem dfs = (DistributedFileSystem) fs;
>     try {
>       if (dfs.exists(new Path("/"))) {  
>         return;
>       }
>     } catch (IOException e) {
>       exception = RemoteExceptionHandler.checkIOException(e);
>     }
>    
>    I have checked the hdfs code and learned that while the namenode was in 
> safemode, dfs.exists(new Path("/")) returned true, because the file 
> system could still provide read-only service. So this method only checks 
> whether the dfs can be read. I think that's not reasonable.
>     
>    

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
