[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559813#comment-14559813 ]

Jing Zhao commented on HDFS-7991:

We can use this jira just to remove the original dfsadmin scripts and add a script hook, as Allen did in his patch. Allen, for your script patch: besides the secretshutdownhook being just a placeholder, it looks like you have not handled the HADOOP_OPTS issue yet, right?

Allow users to skip checkpoint when stopping NameNode

Key: HDFS-7991
URL: https://issues.apache.org/jira/browse/HDFS-7991
Project: Hadoop HDFS
Issue Type: Improvement
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
Attachments: HDFS-7991-shellpart.patch, HDFS-7991.000.patch, HDFS-7991.001.patch, HDFS-7991.002.patch, HDFS-7991.003.patch, HDFS-7991.004.patch

This is a follow-up jira of HDFS-6353. HDFS-6353 adds the functionality to check if saving namespace is necessary before stopping namenode. As [~kihwal] pointed out in this [comment|https://issues.apache.org/jira/browse/HDFS-6353?focusedCommentId=14380898&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14380898], in a secured cluster this new functionality requires the user to be kinit'ed.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555856#comment-14555856 ]

Vinayakumar B commented on HDFS-7991:

You never know whether the machine will always be up for the admin to execute the stop command and get the checkpoint. Also, AFAIK, in some real, big clusters executing the stop command is itself very rare, especially in cases where a standby is not available. What if the machine itself goes down suddenly after running for months/years, having millions of edits without a checkpoint? I have also seen cases where, due to overuse of open files/connections, I was not able to open an SSH terminal at all to execute a command. In such a case a restart of the NN is still going to take hours or days, depending on load. Then all the effort spent on discussion in this jira would go to waste.

Instead of doing everything at the end while stopping, why not implement a periodic check inside the Active NameNode itself to check for the checkpoint? This would be similar to {{FSNameSystem#NameNodeEditLogRoller}}, which was added to roll edits after reaching a threshold to avoid bigger edit logs. In fact, we can reuse that thread to check for the checkpoint as well, with a different interval. The interval may be a multiple of the configured checkpoint interval.

Anyway, doing a *checkpoint* in the Active NameNode is not a big deal. It is just saving the FsImage to all available disks. No big process of loading edits is involved, as the namespace is already up to date. So the NN can even do this by just acquiring {{writeLock()}} instead of entering and leaving safemode. The external {{saveNamespace()}} RPC can still retain its current behaviour. Since this problem can happen only if the Standby/Secondary NameNode has not been available for a long time, I feel it is okay for clients' operations to wait for saveNamespace() to be over.

Any thoughts?
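The rule suggested above (an interval that is a multiple of the configured checkpoint interval) can be sketched as a small shell predicate. This only illustrates the decision rule and is not NameNode code: the real check would live in Java inside the NameNode, and the function name `checkpoint_overdue` and its arguments are hypothetical. The period would typically be the value of `dfs.namenode.checkpoint.period`.

```shell
# Hypothetical helper, for illustration only.
# Succeeds (exit 0) when the last checkpoint is older than
# period * multiplier seconds, and fails (exit 1) otherwise.
#   $1 = epoch seconds of the last checkpoint
#   $2 = current epoch seconds
#   $3 = checkpoint period in seconds (e.g. dfs.namenode.checkpoint.period)
#   $4 = multiplier applied to the period (an assumed tuning knob)
checkpoint_overdue() {
  local last="$1" now="$2" period="$3" multiplier="$4"
  local age=$(( now - last ))
  [ "${age}" -gt $(( period * multiplier )) ]
}
```

For example, `checkpoint_overdue "${last}" "$(date +%s)" 3600 3` succeeds only when more than three one-hour periods have passed since `${last}`.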
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555786#comment-14555786 ]

Hadoop QA commented on HDFS-7991:

| (x) *{color:red}-1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 36s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 7m 29s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 37s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | shellcheck | 0m 5s | The applied patch generated 1 new shellcheck (v0.3.3) issues (total was 25, now 23). |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 32s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | common tests | 22m 48s | Tests passed in hadoop-common. |
| {color:red}-1{color} | hdfs tests | 161m 46s | Tests failed in hadoop-hdfs. |
| | | 218m 51s | |

|| Reason || Tests ||
| Failed unit tests | hadoop.hdfs.TestAppendSnapshotTruncate |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12734733/HDFS-7991-shellpart.patch |
| Optional Tests | shellcheck javadoc javac unit |
| git revision | trunk / cf2b569 |
| shellcheck | https://builds.apache.org/job/PreCommit-HDFS-Build/11096/artifact/patchprocess/diffpatchshellcheck.txt |
| hadoop-common test log | https://builds.apache.org/job/PreCommit-HDFS-Build/11096/artifact/patchprocess/testrun_hadoop-common.txt |
| hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/11096/artifact/patchprocess/testrun_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/11096/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/11096/console |

This message was automatically generated.
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556911#comment-14556911 ]

Allen Wittenauer commented on HDFS-7991:

bq. The current mechanism can be removed when a better working solution is available.

Be aware that any solution (such as that in the current shell code) that calls dfsadmin without doing the necessary work to authenticate is a backwards-incompatible change and breaks existing, secure deployments. (See [~kihwal]'s comment above.) That's before we even get to the HADOOP_OPTS munging problems and the issues they cause. So removing the current mechanism is an improvement: from not working to working namenode shutdown.
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556434#comment-14556434 ]

Allen Wittenauer commented on HDFS-7991:

bq. Any thoughts?

Just one, and this is the line that triggered it:

bq. Instead of doing everything at the end while stopping, why not implement a periodic check inside Active NameNode itself to check for the checkpoint.

I've been working under the assumption that the sites that are hitting this issue are running a secondary namenode. Is that not true? Doesn't the 2NN make this whole issue go away?

* If the answer is "The 2NN does make this issue go away", then this is a won't fix and we should yank out the broken bash code that's presently in trunk and causes my stops to actually *fail*.
* If the answer is "No, the 2NN has nothing to do with this", then [~vinayrpet]'s proposal (either separate or combined with the 2NN) is a MUCH better answer than hacking the hell out of this stuff.
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556586#comment-14556586 ]

Suresh Srinivas commented on HDFS-7991:

bq. Yup, and in those cases, that's what they pay vendors to fix. For those that don't, they roll back to the last good copy and move on.

The proposal here ensures that no vendor needs to be involved to remove a faulty editlog record. (BTW, I have not seen regex issues, only out-of-order editlog entries that could not be applied, or editlog records that became too big (n^2 growth) so that applying them became laboriously slow.)

bq. All of the discussion up until recently has been about fixing the broken bits in the shell code. If we want to switch the discussion to make the namenode checkpoint optional when it's sent a kill, that's great. It means we can clean out the shell code and make this entirely a Java-level fix, as it should be.

We can fix issues in the code. Currently the NN is sent kill -9 after a timeout. That needs to be changed to work with an NN shutdown hook. Also, an NN shutdown hook that ensures all the daemon services are stopped in the right order, without causing failures to namespace requests, requires careful design. It also requires putting the namenode into safemode. I think doing it outside, as done in the current approach, using save namespace, is much simpler and cleaner. But if you want to do it as part of shutdown, you are welcome to make that change. If that change takes some time, I prefer the current mechanism until it is ready. The current mechanism can be removed when a *better* working solution is available.
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556496#comment-14556496 ]

Allen Wittenauer commented on HDFS-7991:

bq. Ideally when 2NN or standby is working. But we have had many issues where checkpointing is not done by SNN or standby, for the following reasons:

OK, so these are not new issues at all and have been around for literally years (a decade now?). We had it happen at Y! back in 2007 and it's a story I often tell during talks.

bq. We need a way to be able to save namespace.

Then fix the NN-2NN relationship to provide better alerting when stuff goes wrong. Hacking the shell code is completely the wrong thing to do. (And, yes, the code in branch-2 and in trunk are clearly hacks. Heck, the branch-2 code doesn't even trigger if you are running the NN in non-daemon mode...)

... and, as has been pointed out, this hack does NOTHING to help in the case of hardware failure, when you want it most.

bq. Today operators who understand this situation do save namespace manually before stopping the namenode.

I don't think I can put enough lol's in here to express how many laughs this statement got from around the office. No, operators who understand this issue monitor the size of the edits file and the 2NN and then act appropriately. We don't do safemode-checkpoint-shutdown on every NN bring down.
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556546#comment-14556546 ]

Allen Wittenauer commented on HDFS-7991:

bq. Just in your previous comment it seemed to me you did not even understand the issue .

...

bq. 1. editlog had an issue and could not be consumed by 2NN or standby
bq. 2. checkpointing is lagging behind (see HDFS-7609)
bq. 3. There could many others bugs and issues (standby down etc) that could result in delayed checkpoint

I've seen every single one of these in production, either at Y! or at LI, during my time with Hadoop. My favorite is the time the regex was bugged so badly that it caused the 2NN to crash during log parsing because someone wrote a weird file name. So yeah, I'm pretty sure I do have a good grasp of exactly the issues you are talking about, having been on the receiving end of corrupted image files in the past and having to walk down to developer row to get them fixed.

bq. No one is proposing that operators need do [safemode-checkpoint-shutdown] on every NN bring down.

Oh? You mean like this completely broken code that is already sitting in trunk from the first attempt (HDFS-6353) to fix this issue?

{code}
if [[ ${COMMAND} == namenode ]] &&
   [[ ${HADOOP_DAEMON_MODE} == stop ]]; then
  hadoop_debug "Do checkpoint if necessary before stopping NameNode"
  export CLASSPATH
  ${JAVA} -Dproc_dfsadmin ${HADOOP_OPTS} org.apache.hadoop.hdfs.tools.DFSAdmin -safemode enter
  ${JAVA} -Dproc_dfsadmin ${HADOOP_OPTS} org.apache.hadoop.hdfs.tools.DFSAdmin -saveNamespace -beforeShutdown
  ${JAVA} -Dproc_dfsadmin ${HADOOP_OPTS} org.apache.hadoop.hdfs.tools.DFSAdmin -safemode leave
fi
{code}

I'm glad that we agree that this code should get removed since it's causing so many problems.

bq. In some cases checkpoint could not even be done because editlog was corrupt and could not be consumed by 2NN or standby (sorry, repeating myself).

Yup, and in those cases, that's what they pay vendors to fix. For those that don't, they roll back to the last good copy and move on.

bq. This jira proposes to save namespace when checkpointing has not happened for a long time.

All of the discussion up until recently has been about fixing the broken bits in the shell code. If we want to switch the discussion to making the namenode checkpoint optional when it's sent a kill, that's great. It means we can clean out the shell code and make this entirely a Java-level fix, as it should be.
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556684#comment-14556684 ]

Vinayakumar B commented on HDFS-7991:

bq. If doing checkpointing in the active namenode was possible without pausing the ongoing requests, we would not have moved checkpointing to either secondary or standby

Yes, agreed that we can't pause ongoing requests for a long time. I actually meant this for these critical situations, not always: saving the namespace directly looked better compared to a restart of the NN, which also requires someone to monitor the size of the edits and trigger saveNamespace/stop. Under normal conditions this would occur very rarely.

If user apps need to be informed about the situation, then the active NN can put itself into safemode before saving the namespace, as is done on an admin request. Anyway, I do not feel strongly about safemode or not; that was just a thought, since in practice saving just the fsImage to disk will take less time, though of course that again depends on its size.

But IMHO, to handle such abnormal cases, the NN itself should be able to take steps, instead of some admin finding out and taking steps.
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556511#comment-14556511 ]

Suresh Srinivas commented on HDFS-7991:

bq. I don't think I can put enough lol's in here to express how many laughs this statement got from around the office.

[~aw], I am glad it was amusing. I have a lot of respect for the operations background you bring. But that does not mean that others are clueless. Such an attitude is disrespectful and counterproductive. So please tone it down. There are many others who understand the operational aspects of the issue we are discussing in this jira and have seen many issues where users have gotten burnt.

bq. No, operators who understand this issue monitor the size of the edits file and the 2NN and then act appropriately.

Just in your previous comment it seemed to me you did not even understand the issue :). What do you mean by "act appropriately"?

bq. We don't do safemode-checkpoint-shutdown on every NN bring down.

Relax. No one is proposing that operators need to do that on every NN bring down. Not even the solution in this jira is proposing that, if you read it carefully.

When a checkpoint has not happened for a long time, NN startup could take a very long time (I have seen half a dozen cases where it took 3-5 days!). In some cases the checkpoint *could not even be done* because the editlog was corrupt and could not be consumed by the 2NN or standby (sorry, repeating myself). Some operators understand the issue that a checkpoint has not happened for a long time and do save namespace to avoid issues. Some don't. This jira proposes to save namespace when checkpointing has not happened for a long time.

What I see in this jira is that we have gone in circles, and I am not even sure the issues are understood well.
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556470#comment-14556470 ]

Suresh Srinivas commented on HDFS-7991:

bq. I've been working under the assumption that the sites that are hitting this issue are running a secondary namenode. Is that not true? Doesn't the 2NN make this whole issue go away?

Ideally, yes, when the 2NN or standby is working. But we have had many issues where checkpointing is not done by the SNN or standby, for the following reasons:
1. The editlog had an issue and could not be consumed by the 2NN or standby.
2. Checkpointing is lagging behind (see HDFS-7609).
3. There could be many other bugs and issues (standby down, etc.) that could result in a delayed checkpoint.

Repeating myself, this is very important functionality to avoid data loss and service unavailability. But we need a way to be able to save namespace. Today operators who understand this situation do save namespace manually before stopping the namenode. People who miss doing that run into production issues. This jira proposes automatically saving namespace to avoid issues. I don't understand why it is "hacking the hell out of stuff".

[~vinayrpet], some comments:

bq. What if machine itself goes down suddenly after running for months/years, having tons of millions of edits without checkpoint ?

Yes, there are times when saving namespace may not be possible. But in the large majority of cases, when HDFS issues are seen, inexperienced administrators just restart the cluster and run into this issue.

bq. Anyway doing checkpoint in Active NameNode is not a big deal

If doing checkpointing in the active namenode were possible without pausing the ongoing requests, we would not have moved checkpointing to either the secondary or the standby. That is also the reason why the namenode is first put into safemode, the write requests are quiesced, and then save namespace is called.
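For reference, the manual sequence described above (enter safemode, save the namespace, leave safemode) can be sketched with the stock `hdfs dfsadmin` commands. The wrapper function and the `HDFS_CMD` override are assumptions added for illustration so that the sequence can be dry-run with a stub; on a secured cluster the caller must already hold a valid Kerberos ticket.

```shell
# Hypothetical wrapper around the manual save-namespace sequence.
# HDFS_CMD is an assumed override (defaults to "hdfs") so the steps
# can be exercised against a stub instead of a live NameNode.
save_namespace_manually() {
  local hdfs_cmd="${HDFS_CMD:-hdfs}"
  local rc=0
  # Quiesce writes first; give up if safemode cannot be entered.
  "${hdfs_cmd}" dfsadmin -safemode enter || return 1
  # Write a fresh fsimage; remember the status but keep going.
  "${hdfs_cmd}" dfsadmin -saveNamespace || rc=$?
  # Leave safemode even if the save failed, then report the save status.
  "${hdfs_cmd}" dfsadmin -safemode leave || return 1
  return "${rc}"
}
```

Leaving safemode after a failed save is a choice made for this sketch; an operator who is about to stop the NameNode anyway might prefer to stay in safemode and investigate.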
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554800#comment-14554800 ]

Jing Zhao commented on HDFS-7991:

Thanks Allen. Yes, I also just realized that jmx may not be a good solution here.

bq. to do a REST or RPC call to ask the NN what it's doing

The same question here is: what if this RPC/REST call fails (or times out)? Should we retry, and how? Or should we kill the NameNode? To me this is not fundamentally different from the saveNamespace solution:
# We're using kill to trigger the shutdown hook which does the checkpoint. This can be mapped to the step of sending out a saveNamespace command to the NN.
# We then keep polling the state of the NameNode using a REST/RPC call, just like waiting for the response from the saveNamespace RPC.
# Both solutions finally need to answer the same question: what if the REST/RPC call fails?

bq. This will almost certainly break init.d/rc.d/service/launchd/whatever scripts.

Yes, but I think if the checkpoint is necessary at this time, breaking these scripts may not be that bad compared with killing the namenode and then waiting hours for the namenode to load edits, or even fixing corrupted edits.

bq. currently does not require a Kerberos credential

Regarding the auth part, how about directly parsing hdfs-site.xml to get the location of the namenode fsimage/edits directory? Then we can directly check whether the checkpoint is necessary by going through the fsimage/edits file names.
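The file-name check suggested above can be sketched in shell. This is a rough illustration under stated assumptions, not the patch: it relies on the usual on-disk naming in a storage directory (`fsimage_<txid>`, `edits_<start>-<end>`, `edits_inprogress_<start>`), and the function names are hypothetical. The start txid of an in-progress segment is only a lower bound on the latest transaction, and the sketch does not apply when edits live only on JournalNodes.

```shell
# Strip leading zeros so the shell does not read a txid as octal.
strip_zeros() {
  local n
  n=$(printf '%s' "$1" | sed 's/^0*//')
  printf '%s\n' "${n:-0}"
}

# Hypothetical helper: print a lower bound on the number of transactions
# written since the newest fsimage found in directory $1 (the "current"
# subdirectory of a NameNode storage directory).
txns_since_checkpoint() {
  local dir="$1" f base txid last_image=0 last_edit=0
  for f in "${dir}"/fsimage_* "${dir}"/edits_*; do
    [ -e "${f}" ] || continue
    base="${f##*/}"
    case "${base}" in
      *.md5) continue ;;                      # checksum side files
      fsimage_*)                              # fsimage_<txid>
        txid=$(strip_zeros "${base##*_}")
        [ "${txid}" -gt "${last_image}" ] && last_image="${txid}" ;;
      edits_inprogress_*)                     # open segment: start txid only
        txid=$(strip_zeros "${base##*_}")
        [ "${txid}" -gt "${last_edit}" ] && last_edit="${txid}" ;;
      edits_*-*)                              # finalized: edits_<start>-<end>
        txid=$(strip_zeros "${base##*-}")
        [ "${txid}" -gt "${last_edit}" ] && last_edit="${txid}" ;;
    esac
  done
  if [ "${last_edit}" -gt "${last_image}" ]; then
    echo $(( last_edit - last_image ))
  else
    echo 0
  fi
}
```

The directory could be looked up with `hdfs getconf -confKey dfs.namenode.name.dir`; note that the value may be a comma-separated list of URIs, which this sketch does not parse. The result could then be compared against a threshold such as `dfs.namenode.checkpoint.txns`.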
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554842#comment-14554842 ]

Allen Wittenauer commented on HDFS-7991:

bq. The same question here is what if this RPC/REST call fails (or timeout)? Should we retry and how? Or should we kill the NameNode? To me this is not fundamentally different from the saveNamespace solution

If the REST/RPC call fails or shows no progress over X timeout value (e.g., reset the timer every time we show progress), then the NN should be considered hung and it should get killed with prejudice. There's no reason why the REST/RPC port has to be shut down just because we are saving state. If that's happening now, that's a terrible design decision.

This should be pretty trivial to do:
1. Send the kill to the daemon to shut down.
2. See that we have a bash hook to call our special timeout function for this daemon instead of sleeping.
3. The timeout function calls a separate java program that queries the daemon. Decision point: a) on shutdown success, it exits; b) if the NN shutdown times out due to no progress, it exits with failure.
4. The bash code sees the exit with failure and sends kill -9.

If you want, I can write up the shell patch to do this after lunch. The shell part to enable this is tiny.

bq. Yes, but I think if the checkpoint is necessary at this time, breaking these scripts may not be that bad compared with killing the namenode then waiting hours for the namenode to load edits or even fixing corrupted edits.

You have a choice between a breaking change and a non-breaking change. This effectively shifts the burden from one dev writing code to hundreds/thousands. Hint: not all of those hundreds/thousands are nearly as nice as me. ;)

bq. how about directly parsing the hdfs-site.xml

Someone doesn't know about {{hdfs getconf}} ... ;)

bq. Then we can directly check if the checkpoint is necessary by going through the fsimage/edits file names.

So this fix isn't needed for the HA case?
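The four stop steps listed above can be sketched as one shell function. This is an illustration under assumptions, not the promised patch: the function name, its arguments, and the progress command are hypothetical, and the separate java query program from step 3 is replaced by any command that prints an opaque progress token.

```shell
# Hypothetical sketch of a progress-aware stop.
#   $1 = pid of the daemon
#   $2 = number of consecutive one-second polls without progress to tolerate
#   $3 = command that prints a progress token (stand-in for the query program)
# Returns 0 if the daemon exited on its own, 1 if it had to be killed.
stop_with_progress_timeout() {
  local pid="$1" max_idle="$2" progress_cmd="$3"
  local idle=0 last="" cur=""
  # Step 1: ordinary kill (SIGTERM); if it cannot be sent, treat as stopped.
  kill "${pid}" 2>/dev/null || return 0
  # Steps 2 and 3: poll instead of sleeping blindly.
  while kill -0 "${pid}" 2>/dev/null; do
    cur=$("${progress_cmd}" 2>/dev/null)
    if [ -n "${cur}" ] && [ "${cur}" != "${last}" ]; then
      idle=0                 # progress seen: reset the timer
      last="${cur}"
    else
      idle=$(( idle + 1 ))
    fi
    if [ "${idle}" -ge "${max_idle}" ]; then
      # Step 4: no progress for too long, consider the daemon hung.
      kill -9 "${pid}" 2>/dev/null
      return 1
    fi
    sleep 1
  done
  return 0
}
```

Because the idle counter resets whenever the token changes, a long but advancing checkpoint is not interrupted; only a daemon that stops reporting progress is sent kill -9.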
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555112#comment-14555112 ] Suresh Srinivas commented on HDFS-7991:
bq. bash code sees exit with failure and sends kill -9.
I think the goal of this jira should be to ensure save namespace is done when editlog size is huge. I have seen many cases where people either had to suffer loss of data or wait for more than 3 days for the namenode to start up while consuming all the pending editlogs. Blindly sending kill -9 is not an option in my opinion. Instead of emphasizing that the namenode stop functionality works, I would rather see save namespace work. Isn't there an environment variable that enables this functionality? For folks who want stop to not save namespace, or who want a different behavior, it can be used to go back to the previous behavior, right?
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555038#comment-14555038 ] Jing Zhao commented on HDFS-7991:
Thanks for the further explanation, Allen. Now I get your point: the client side will still use a separate java program to query the daemon. Then if we also let this java program send out the checkpoint check command, and considering our current RPC already has the capability to handle timeout and retry, I guess we can directly utilize the current saveNamespace RPC? Then the only difference from your proposal is to move your step 1 after step 3.
bq. If you want, I can write up the shell patch to do this after lunch. The shell part to enable this is tiny.
Thanks, Allen. That will be helpful.
bq. So this fix isn't needed for the HA case?
For HA, since we're only stopping the local NameNode, the checkpoint can be independent. But one thing I still need to confirm is whether we can get enough information about the number of transactions outside the fsimage from the local NN directory if no local edits are stored (i.e., journals are only in JNs). I will explore further on this.
bq. You have a choice between a breaking change and a non-breaking change. This effectively shifts the burden from one dev writing code to hundreds/thousands.
Looks like this is the main and maybe only place where we have different opinions. In your proposal, if the java program or the checkpoint process times out, we should send out kill -9. My thoughts:
# If the NameNode is healthy, the java program or the checkpoint checking should go through smoothly. This should be the normal case.
# The timeout should be rare. But if it happens, the NameNode may have some issue or a checkpoint is necessary. Then I think it's worth doing an extra check on the NameNode, since killing the NN now can lead to hours of downtime which may really kill the admins.
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555300#comment-14555300 ] Allen Wittenauer commented on HDFS-7991:
bq. Then if we also let this java program send out the checkpoint check command, and considering our current RPC already has the capability to handle timeout and retry, I guess we can directly utilize the current saveNamespace RPC?
I would keep it simple: shutdown also triggers the logic for whether a checkpoint is necessary. There's zero value in waiting for the helper app to trigger it. This also means the helper app is extremely simple: an unauthenticated call that asks "is checkpoint still happening?" "Is checkpoint still happening?" "What about now?" "Are we down yet Papa Smurf?" This way we also fix [~sureshms]'s issue:
bq. Blindly sending kill -9 is not an option in my opinion.
That's why it's not blind. The helper app's *sole* purpose should be to provide the hint to the shell code if things are so screwed up that kill -9 is the only way out. This way all of the key, important logic is in Java code and the one thing the Java code probably shouldn't do (kill) is left to the shell code.
bq. Instead of emphasizing namenode stop functionality works, I would rather see save namespace work.
To the person who isn't looking at the code, these are effectively one and the same. If I'm stopping the namenode, I expect it to do what is necessary to come back up in a sane state. Why should an admin have to make the decision here when the NN itself knows the state best? Telling me to run save namespace is dumb: "Why didn't you just do it yourself, you stupid program?" :D
bq. Isn't there an environment variable that enables this functionality? For folks who want stop to not save namespace or a different behavior, it can be used to go back to the previous behavior, right?
The # of times this is going to be needed should approach zero... and in those cases, a Java property (or properties!) is *way* better. Some clueless person is going to tell others "Hey, set this to make your system shut down faster." The Java apps can read the properties and do whatever is needed/desired. This also means they can prompt to say "are you sure?" because this is the type of operation (shutdown w/out checkpoint) that sounds like it should never happen in an automated way.
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554593#comment-14554593 ] Allen Wittenauer commented on HDFS-7991:
bq. Another way is that, instead of issuing the saveNamespace command directly, the script checks the time of the latest checkpoint and the total number of transactions first (maybe through the jmxget command).
jmxget is rarely installed. Some other way to get the data will need to be supplied, almost certainly stuck away in dfsadmin or something. There's also the problem of JMX not being turned on by default (hint: we can't.) But the other part:
bq. If it is necessary to do a checkpoint, the script will abort and print out some warning msg asking the admin to run dfsadmin -saveNamespace.
No can do. This will almost certainly break init.d/rc.d/service/launchd/whatever scripts.
bq. The third option is to move the checkpoint logic into the shutdown hook of the NameNode. The biggest challenge here is the sync between the server and the script, i.e., to decide when and whether to kill the NN in the script. The script may have to poll the current state of the NameNode and guess whether the NameNode is still doing a checkpoint or it hangs somewhere else. Currently I do not see an easy way to achieve this.
IMO, this is still the best answer. With a SMOP of the code (at least in trunk. dunno don't care about the disaster zone known as branch-2), it should be relatively trivial to write a hook that uses the almost ubiquitous wget, curl, or something stuck away in hadoop-common to do a REST or RPC call to ask the NN what it's doing. (and, of course, that call would be in a function that could be replaced if the user needed to use something else. best bet: shove it in the hdfs shellprofile). The ONLY big deal is going to be that {{hdfs --daemon stop namenode}} currently does not require a Kerberos credential. Of course that has large implications for boot scripts needing to kinit. Unless we make sure this REST or RPC call doesn't require auth, that will change that requirement.
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554598#comment-14554598 ] Allen Wittenauer commented on HDFS-7991: (It just occurred to me that auth is a big problem with the current patch too...)
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551234#comment-14551234 ] Jing Zhao commented on HDFS-7991:
Recently we saw several clusters from our customers where the NameNodes were stopped without checking/doing a checkpoint. This led to hours of downtime for loading large amounts of editlog (some clusters also hit the issue reported by HDFS-7609, which makes things worse). I had an offline discussion with [~cnauroth] and [~jnp] about this functionality. Here is a summary of the options we came up with:
# The solution developed in the current patch: the script sends a saveNamespace request to the NameNode before stopping it, and the NameNode does an extra checkpoint if necessary, based on the time of the latest checkpoint and the total number of transactions outside of the checkpoint. The drawback of this method is that if the checkpoint is necessary, the admin will see the stop command blocked for 10min or more. The admin can also get confused if the saveNamespace command fails.
# Another way is that, instead of issuing the saveNamespace command directly, the script first checks the time of the latest checkpoint and the total number of transactions (maybe through the jmxget command). If it is necessary to do a checkpoint, the script will abort and print out a warning msg asking the admin to run dfsadmin -saveNamespace. This avoids the long wait from solution #1. Also, if the jmxget command fails, the admin can use a command argument to force stopping the NameNode if he/she can confirm the checkpoint is not necessary.
# The third option is to move the checkpoint logic into the shutdown hook of the NameNode. The biggest challenge here is the sync between the server and the script, i.e., to decide when and whether to kill the NN in the script. The script may have to poll the current state of the NameNode and guess whether the NameNode is still doing a checkpoint or is hung somewhere else. Currently I do not see an easy way to achieve this.
For now we think #2 may be the best solution. I will update the patch accordingly. [~aw], could you please also share your thoughts here? Thanks.
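Option #2 hinges on one decision: is a checkpoint overdue, given the time since the latest checkpoint and the number of transactions outside it? A minimal Python sketch of that decision follows. The function names and the `force` flag are hypothetical, the default thresholds are assumed to mirror the usual `dfs.namenode.checkpoint.period` (seconds) and `dfs.namenode.checkpoint.txns` settings, and how the two inputs are obtained (jmxget, dfsadmin, or something else) is left open, as it is in the discussion.

```python
def checkpoint_needed(
    seconds_since_checkpoint: float,
    txns_since_checkpoint: int,
    period_seconds: float = 3600,      # assumed default checkpoint period
    txn_threshold: int = 1_000_000,    # assumed default transaction threshold
) -> bool:
    """A checkpoint is considered overdue if either threshold is reached."""
    return (
        seconds_since_checkpoint >= period_seconds
        or txns_since_checkpoint >= txn_threshold
    )


def stop_decision(
    seconds_since_checkpoint: float,
    txns_since_checkpoint: int,
    force: bool = False,
) -> str:
    """Return what the stop script would do under option #2.

    force is a hypothetical stand-in for the "command argument to force
    stopping" mentioned in the comment.
    """
    if force:
        return "stop"
    if checkpoint_needed(seconds_since_checkpoint, txns_since_checkpoint):
        return "abort: run dfsadmin -saveNamespace first"
    return "stop"
```

For example, `stop_decision(7200, 5)` aborts because two hours have passed since the last checkpoint, while `stop_decision(7200, 5, force=True)` stops anyway.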
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484681#comment-14484681 ] Hadoop QA commented on HDFS-7991: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12723793/HDFS-7991.004.patch against trunk revision 5b8a3ae. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.util.TestByteArrayManager Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/10200//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/10200//console This message is automatically generated.
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388796#comment-14388796 ] Allen Wittenauer commented on HDFS-7991:
bq. (since the stop command only waits 5s)
This is easily fixed by just increasing the timeout or adding other logic such as asking if the NN is still alive, etc. But in any case, it occurred to me this morning that the current code just flat out won't work in practice. The problem is that HADOOP_OPTS has the NN's configuration inside it. So, for example, if a user sets the heap size to 64g, then dfsadmin is going to run with a 64g heap as well. Same thing with gc logs and any other custom JVM setting. The code absolutely must shell out another bin/hdfs process to get the proper HADOOP_OPTS setting. I suspect it will actually have to use a subshell plus parameter captures so that the environment is clean, due to various {{export}} statements throughout the code and in a lot of users' *-env.sh files.
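The HADOOP_OPTS problem described above comes down to launching the client command in an environment that does not carry the daemon's JVM settings. The Python sketch below only illustrates that idea and is not the shell fix under discussion: `helper_invocation` and `DAEMON_ONLY_VARS` are hypothetical names, the list of variables to drop is an assumption, and real deployments also have `*-env.sh` files and shell profiles that re-export settings, which this sketch ignores.

```python
from typing import Dict, List, Mapping, Tuple

# Assumption: these are the variables that carry NameNode-specific JVM flags
# (heap size, GC logging, ...) by the time the stop logic runs.
DAEMON_ONLY_VARS = ("HADOOP_OPTS", "HADOOP_NAMENODE_OPTS")


def helper_invocation(
    parent_env: Mapping[str, str],
) -> Tuple[List[str], Dict[str, str]]:
    """Build the command line and a cleaned environment for the client call.

    The daemon-only variables are dropped so that a fresh bin/hdfs process
    can rebuild its own, client-sized HADOOP_OPTS.
    """
    env = {k: v for k, v in parent_env.items() if k not in DAEMON_ONLY_VARS}
    argv = ["hdfs", "dfsadmin", "-saveNamespace"]
    return argv, env
```

The pair could then be handed to something like `subprocess.run(argv, env=env)`. With a parent environment containing `HADOOP_OPTS=-Xmx64g`, the child would start without that 64g heap setting.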
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14389198#comment-14389198 ] Allen Wittenauer commented on HDFS-7991: bq. But it's hard to know if the NN is still doing checkpoint or NN is stuck in somewhere else. Why can't we ask via REST? bq. can we just simply capture the value of HADOOP_OPTS before appending HADOOP_NAMENODE_OPTS to it, and use the captured value for this checkpoint? Possible? Maybe. Simply? no. It's going to get very messy because you need to juggle pretty much the entire shell state: HADOOP_CLIENT_OPTS, _finalize, logfile settings, etc, all need to get saved off and/or manipulated in order to provide the same/similar execution environment that dfsadmin uses... and that's before we even talk about what happens with custom shell profiles. bq. Looks like this way equals to using a dfsadmin command in the NN's machine. It might look that way at the Java level, but at the shell level it's going to be chaos. It will definitely cause all sorts of problems given how open the shell level has always been.
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389178#comment-14389178 ] Jing Zhao commented on HDFS-7991:
bq. This is easily fixed by just increasing the timeout or adding other logic such as asking if the NN is still alive, etc.
But it's hard to know whether the NN is still doing a checkpoint or is stuck somewhere else. It is also hard to get a deterministic bound for the timeout value.
bq. The problem is that HADOOP_OPTS has the NN's configuration inside it. So, for example, if a user sets the heap size to 64g
Good catch. I will try to fix this in a later patch.
bq. The code absolutely must shell out another bin/hdfs process to get the proper HADOOP_OPTS setting. I suspect it will actually have to use a subshell plus parameter captures so that the environment is clean due to various export statements throughout the code and in a lot of user's *-env.sh files.
One question here is: can we simply capture the value of {{HADOOP_OPTS}} before appending {{HADOOP_NAMENODE_OPTS}} to it, and use the captured value for this checkpoint? This looks equivalent to running a dfsadmin command on the NN's machine.
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387475#comment-14387475 ] Jing Zhao commented on HDFS-7991:
I see the problem and the current solution in this way:
# The issue this feature is targeting is corruption that occurs while the standby/secondary NN is doing a checkpoint. This corruption is usually in an old checkpoint or in the editlog. If the NN is shut down before the issue is solved, the corruption may block the NN from starting up normally again.
# In practice we usually solve this by letting the currently running NN do a checkpoint (through the -saveNamespace command). It is very rare for this checkpoint to fail, since it simply dumps the in-memory information to disk (i.e., the possible fsimage/editlog corruption is bypassed).
# It is hard to let the NN do this checkpoint verification itself before shutdown, since the checkpoint may take minutes, and before finishing the checkpoint the NN may have already been killed by the shell script (since the stop command only waits 5s).
# Based on #2 and #3 above, in most normal cases, using the -saveNamespace command before shutdown can satisfy our requirement, i.e., checking if there is editlog corruption and saving the current in-memory namespace to bypass the corruption.
# Even if -saveNamespace fails (which is rare), the admin now has a chance to check the cause of the failure, and he/she can take further steps to verify whether there is corruption or the checkpoint can be skipped. I think this is better than the scenario where the NN is shut down directly and the admin has to manually fix the corruption.
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387412#comment-14387412 ] Allen Wittenauer commented on HDFS-7991: looks like you posted a new version. shellcheck probably passes now.
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387430#comment-14387430 ] Allen Wittenauer commented on HDFS-7991: Ok, the new one does do error checking. But I'm still sort of left with ... now what? What's the ops person supposed to do?
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387400#comment-14387400 ] Allen Wittenauer commented on HDFS-7991:
-1
a) Take a look at http://wiki.apache.org/hadoop/UnixShellScriptProgrammingGuide.
b) Why are we trying to fix this at the shell level instead of at the Java level?
c) HDFS_CHECKPOINT_BEFORE_STOP_NAMENODE This should be HADOOP_HDFS_ blah, not HDFS_blah.
d) There's no way this passes shellcheck.
e) error messages are *way* too long for a single line.
f) where is the hadoop-env.sh documentation to match this new env var?
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387422#comment-14387422 ] Allen Wittenauer commented on HDFS-7991:
Frankly, I'm sort of leaning towards -1 on the feature itself. It is a very bad idea to do this at the shell level, where it has no way to know how or why things are broken. This really feels like a "throw it over the fence and let the shell code sort it out" exercise. I mean from HDFS-8003:
bq. With new changes in HDFS-7991, if the feature is on, the shell code will exit if the checkpoint fails and the NN will not be stopped.
You realize this isn't true, right? There is no error checking in this patch or the previous one that got committed that stops the shell code. It just continues plowing on through.
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387419#comment-14387419 ]

Jing Zhao commented on HDFS-7991:

bq. b) Why are we trying to fix this at the shell level instead of at the Java level?

This has been answered [here|https://issues.apache.org/jira/browse/HDFS-8003?focusedCommentId=14387349&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14387349] and [here|https://issues.apache.org/jira/browse/HDFS-8003?focusedCommentId=14384256&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14384256]. From your comment in HDFS-8003, I still do not see a valid point against doing this in the shell script.

bq. d) There's no way this passes shellcheck.

At least it passes shellcheck on my local machine. Can you post the warning message you see?

bq. -1

Just to clarify, is this a -1 on the feature itself, or does it just mean you want me to address your comments? If it's the latter, I will try to address your comments in the next patch.
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387716#comment-14387716 ]

Hadoop QA commented on HDFS-7991:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708232/HDFS-7991.002.patch against trunk revision d9ac5ee.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:
  org.apache.hadoop.hdfs.TestLeaseRecovery2
  org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/10114//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/10114//console

This message is automatically generated.
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387772#comment-14387772 ]

Hadoop QA commented on HDFS-7991:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708251/HDFS-7991.003.patch against trunk revision cc0a01c.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:
  org.apache.hadoop.hdfs.TestFileCreation

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/10117//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/10117//console

This message is automatically generated.
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382403#comment-14382403 ]

Kihwal Lee commented on HDFS-7991:

I am afraid the default behavior will break existing management scripts/infrastructure built on the hadoop commands. If we are to do this in the shell script, we could add a check for an additional shell variable. If this feature is to be on by default, people will be able to turn it off by setting this variable in hadoop-env.sh, which is normally a part of config. If this variable is not set AND -skipcheckpoint is not specified, saveNamespace will be attempted on shutdown.

Regarding what the default should be, I prefer things to remain compatible, but others might think the benefit outweighs the inconvenience. I am fine either way as long as there is a simple way to disable it and stay compatible.

In the patch, did you intend to check the return code right after the first command?
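As a rough sketch of the decision rule proposed above (checkpoint only when the opt-out variable is unset AND -skipcheckpoint was not given), something like the following could be used. The variable name `HADOOP_HDFS_SKIP_CHECKPOINT_ON_STOP` and the function name are made up for this example; neither the comment nor the patch names the variable.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: decide whether to attempt saveNamespace before stopping
# the NameNode. HADOOP_HDFS_SKIP_CHECKPOINT_ON_STOP is an illustrative name
# (an assumption), imagined as being set in hadoop-env.sh to opt out.

# Returns 0 (attempt the checkpoint) only when the opt-out variable is
# unset or empty AND -skipcheckpoint is not among the arguments.
should_checkpoint_on_stop() {
  local arg

  if [[ -n "${HADOOP_HDFS_SKIP_CHECKPOINT_ON_STOP:-}" ]]; then
    return 1
  fi
  for arg in "$@"; do
    if [[ "${arg}" == "-skipcheckpoint" ]]; then
      return 1
    fi
  done
  return 0
}

# Report the decision for the current environment and arguments.
if should_checkpoint_on_stop "$@"; then
  echo "would attempt saveNamespace before stop"
else
  echo "would skip checkpoint"
fi
```

Keeping the opt-out in an environment variable means existing management scripts can stay unchanged: the site config disables the behavior once, and individual invocations can still pass -skipcheckpoint.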
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383061#comment-14383061 ]

Hadoop QA commented on HDFS-7991:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12707599/HDFS-7991.001.patch against trunk revision 61df1b2.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/10080//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/10080//console

This message is automatically generated.
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381099#comment-14381099 ]

Jing Zhao commented on HDFS-7991:

[~kihwal], do you think this patch can address your comments?
[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode
[ https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381301#comment-14381301 ]

Hadoop QA commented on HDFS-7991:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12707373/HDFS-7991.000.patch against trunk revision 44809b8.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:
  org.apache.hadoop.tracing.TestTracing

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/10073//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/10073//console

This message is automatically generated.