[jira] Commented: (HDFS-1220) Namenode unable to start due to truncated fstime
[ https://issues.apache.org/jira/browse/HDFS-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907907#action_12907907 ]

Todd Lipcon commented on HDFS-1220:

Dhruba: Didn't we take care of that in HDFS-970?

> Namenode unable to start due to truncated fstime
>
>                 Key: HDFS-1220
>                 URL: https://issues.apache.org/jira/browse/HDFS-1220
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.20.1
>            Reporter: Thanh Do
>
> - Summary: updating the fstime file on disk is not atomic, so if a crash
> happens in the middle of an update, the NameNode reads a stale fstime on its
> next reboot and is unable to start successfully.
>
> - Details:
> Basically, this involves 3 steps:
> 1) delete fstime file (timeFile.delete())
> 2) truncate fstime file (new FileOutputStream(timeFile))
> 3) write new time to fstime file (out.writeLong(checkpointTime))
> If a crash happens after step 2 and before step 3, then on the next reboot
> the NameNode gets an exception when reading the time (8 bytes) from an empty
> fstime file.
>
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do (than...@cs.wisc.edu) and
> Haryadi Gunawi (hary...@eecs.berkeley.edu)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
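
The failure described in the quoted issue can be reproduced outside Hadoop with a few lines of plain Java. The following is only a sketch, not NameNode code: the class name FstimeCrashSketch and its helper methods are hypothetical, and the "crash" is simulated by simply never performing step 3.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Hypothetical stand-alone sketch; none of these names come from Hadoop.
final class FstimeCrashSketch {

    // Performs steps 1 and 2 from the issue, then stops, as if the process
    // had crashed before step 3 (out.writeLong(checkpointTime)).
    static void updateUntilCrash(File timeFile) throws IOException {
        timeFile.delete();                                  // step 1: delete
        try (FileOutputStream out = new FileOutputStream(timeFile)) {
            // step 2 has created an empty file; the simulated crash happens
            // here, so nothing is ever written.
        }
    }

    // Reads the 8-byte checkpoint time the way a reader of fstime would.
    static long readCheckpointTime(File timeFile) throws IOException {
        try (DataInputStream in =
                 new DataInputStream(new FileInputStream(timeFile))) {
            return in.readLong(); // EOFException if fewer than 8 bytes remain
        }
    }

    // Returns true if the interrupted update leaves a file that cannot be read.
    static boolean crashLeavesUnreadableFile(File timeFile) throws IOException {
        updateUntilCrash(timeFile);
        try {
            readCheckpointTime(timeFile);
            return false;
        } catch (EOFException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("fstime", ".tmp");
        f.deleteOnExit();
        System.out.println("unreadable after crash: " + crashLeavesUnreadableFile(f));
    }
}
```

The point of the sketch is that the old value is destroyed before the new one exists, so there is a window in which the file holds neither.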
[jira] Commented: (HDFS-1220) Namenode unable to start due to truncated fstime
[ https://issues.apache.org/jira/browse/HDFS-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907903#action_12907903 ]

dhruba borthakur commented on HDFS-1220:

The rename does not actually sync all the data from the kernel buffers to disk. Thus, it is theoretically possible that even though the NN wrote out everything and the machine then rebooted, some data in any of the fstime/edits/fsimage files could be missing. I think we should issue an fsync() on all these files before closing them.
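
A minimal sketch of the combination discussed in this thread (write to a temporary file, fsync it, then rename it into place) might look like the following. It is an illustration under stated assumptions, not the change made in HDFS-955 or HDFS-970: the class AtomicFstimeWriter, the ".tmp" suffix, and both method names are hypothetical. It relies only on standard JDK calls (FileDescriptor.sync() and Files.move with ATOMIC_MOVE). Whether an atomic move replaces an existing target is implementation specific according to the JDK documentation, and the sketch does not fsync the parent directory.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical stand-alone sketch; none of these names come from Hadoop.
final class AtomicFstimeWriter {

    // Writes the new time to a sibling temp file, forces it to disk, and only
    // then renames it over the real file, so a reader sees either the old
    // 8 bytes or the new 8 bytes, never an empty file.
    static void writeCheckpointTime(Path timeFile, long checkpointTime)
            throws IOException {
        Path tmp = timeFile.resolveSibling(timeFile.getFileName() + ".tmp");
        try (FileOutputStream fos = new FileOutputStream(tmp.toFile());
             DataOutputStream out = new DataOutputStream(fos)) {
            out.writeLong(checkpointTime);
            out.flush();
            fos.getFD().sync(); // the fsync() suggested in the comment above
        }
        // On POSIX file systems this maps to rename(2), which replaces the
        // target; other platforms may throw if the target already exists.
        Files.move(tmp, timeFile, StandardCopyOption.ATOMIC_MOVE);
    }

    static long readCheckpointTime(Path timeFile) throws IOException {
        try (DataInputStream in =
                 new DataInputStream(Files.newInputStream(timeFile))) {
            return in.readLong();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("fstime-sketch");
        Path timeFile = dir.resolve("fstime");
        writeCheckpointTime(timeFile, System.currentTimeMillis());
        System.out.println("checkpoint time: " + readCheckpointTime(timeFile));
    }
}
```

Syncing before the rename matters because the rename alone only changes the directory entry; without the sync, the new name could point at data that never left the kernel buffers.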
[jira] Commented: (HDFS-1220) Namenode unable to start due to truncated fstime
[ https://issues.apache.org/jira/browse/HDFS-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882003#action_12882003 ]

Konstantin Shvachko commented on HDFS-1220:

I think in both issues (this and HDFS-1221) you are missing the point discussed in detail in HDFS-955. We first write everything (fsimage, edits, fstime, and VERSION) into a separate directory. If everything is done successfully, then this directory will be used as a startup point by the NN. If not, the old directory is still present, and the NN will recover it and start from the previous image.
Does this answer your concerns?
[jira] Commented: (HDFS-1220) Namenode unable to start due to truncated fstime
[ https://issues.apache.org/jira/browse/HDFS-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881526#action_12881526 ]

Thanh Do commented on HDFS-1220:

It is not exactly the same as HDFS-1221, although fstime suffered from corruption there too (which may lead to data loss). In this case, I think the update to fstime should be atomic, or the NameNode should somehow anticipate reading an empty fstime.
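
The second option in this comment, a reader that anticipates an empty fstime, could be sketched as below. This is an assumption-laden illustration rather than NameNode behavior: the class TolerantFstimeReader is hypothetical, and what the caller should do with an absent value (for example, fall back to another storage directory) is left open because the thread does not specify it.

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.OptionalLong;

// Hypothetical stand-alone sketch; none of these names come from Hadoop.
final class TolerantFstimeReader {

    // Returns the stored checkpoint time, or an empty OptionalLong when the
    // file is missing or shorter than the 8 bytes a long requires.
    static OptionalLong readCheckpointTime(File timeFile) throws IOException {
        if (!timeFile.exists() || timeFile.length() < Long.BYTES) {
            return OptionalLong.empty(); // truncated or absent: caller decides
        }
        try (DataInputStream in =
                 new DataInputStream(new FileInputStream(timeFile))) {
            return OptionalLong.of(in.readLong());
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("fstime", ".tmp"); // created empty
        f.deleteOnExit();
        System.out.println("time present: " + readCheckpointTime(f).isPresent());
    }
}
```

Returning an explicit "absent" value keeps the truncated-file case separate from genuine I/O errors, which still surface as IOException.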
[jira] Commented: (HDFS-1220) Namenode unable to start due to truncated fstime
[ https://issues.apache.org/jira/browse/HDFS-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881519#action_12881519 ]

Konstantin Shvachko commented on HDFS-1220:

Is this the same as HDFS-1221?
[jira] Commented: (HDFS-1220) Namenode unable to start due to truncated fstime
[ https://issues.apache.org/jira/browse/HDFS-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879666#action_12879666 ]

Todd Lipcon commented on HDFS-1220:

I believe we fixed this in trunk by saving to an fsimage_ckpt dir and then moving it into place atomically once all the files are on disk. See HDFS-955?