[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

[
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556911#comment-14556911
]

Allen Wittenauer commented on HDFS-7991:

bq. The current mechanism can be removed when better working solution is
available.

Be aware that any solution (such as that in the current shell code) that calls
dfsadmin without doing the necessary work to authenticate is a backwards
incompatible change and breaks existing, secure deployments. (See [~kihwal]'s
comment above). That's before we even get to HADOOP_OPTS munging problems and
the issues that causes.

So removing the current mechanism is an improvement: from not working to
working namenode shutdown.

Allow users to skip checkpoint when stopping NameNode
-

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode


[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556434#comment-14556434
 ] 

Allen Wittenauer commented on HDFS-7991:


bq. Any thoughts?

Just one, and this is the line that triggered it:

bq. Instead of doing everything at the end while stopping, why not implement a 
periodic check inside Active NameNode itself to check for the checkpoint.

I've been working under the assumption that the sites that are hitting this 
issue are running a secondary namenode.  Is that not true?  Doesn't the 2NN 
make this whole issue go away?  

* If the answer is "The 2NN does make this issue go away," then this is a 
"won't fix" and we should yank out the broken bash code that's presently in 
trunk and causes my stops to actually *fail*.

* If the answer is "No, the 2NN has nothing to do with this," then 
[~vinayrpet]'s suggestion (either separate or combined with the 2NN) is a MUCH 
better answer than hacking the hell out of this stuff.

 Allow users to skip checkpoint when stopping NameNode
 -

 Key: HDFS-7991
 URL: https://issues.apache.org/jira/browse/HDFS-7991
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Attachments: HDFS-7991-shellpart.patch, HDFS-7991.000.patch, 
 HDFS-7991.001.patch, HDFS-7991.002.patch, HDFS-7991.003.patch, 
 HDFS-7991.004.patch


 This is a follow-up jira of HDFS-6353. HDFS-6353 adds the functionality to 
 check if saving namespace is necessary before stopping namenode. As [~kihwal] 
 pointed out in this 
 [comment|https://issues.apache.org/jira/browse/HDFS-6353?focusedCommentId=14380898&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14380898],
  in a secured cluster this new functionality requires the user to be kinit'ed.




[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-22 Thread Suresh Srinivas (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556586#comment-14556586
]

Suresh Srinivas commented on HDFS-7991:
---

bq. Yup, and in those cases, that's what they pay vendors to fix. For those of
that don't, they roll back to the last good copy and move on.
The proposal here ensures that no vendor needs to be involved to remove a
faulty editlog record. (BTW, I have not seen regex issues, only out-of-order
editlog entries that could not be applied, or editlog records that became too
big (n^2 growth), so that applying them became laboriously slow.)

bq. All of the discussion up until recently has been about fixing the broken
bits in the shell code. If we want to switch the discussion to make the
namenode checkpoint optional when it's sent a kill, that's great. It means we
can clean out the shell code and make this entirely a Java-level fix, as it
should be.
We can fix issues in the code. Currently the NN is sent kill -9 after a
timeout. That needs to be changed to work with an NN shutdown hook. Also, the
NN shutdown hook, and ensuring that all the daemon services are shut down in
the right order without causing failures to the namespace, requires careful
design. It also requires putting the namenode into safemode. I think doing it
outside, as done in the current approach, using save namespace, is much simpler
and cleaner. But if you want to do it as part of shutdown, you are welcome to
make that change. If that change takes some time, I prefer the current
mechanism until it is ready. The current mechanism can be removed when a
*better* working solution is available.


[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

[
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556496#comment-14556496
]

Allen Wittenauer commented on HDFS-7991:

bq. Ideally when 2NN or standby is working. But we have had many issues where
checkpointing is not done by SNN or standby, for the following reasons:

OK, so these are not new issues at all and have been around for literally years
(decade now?). We had it happen at Y! back in 2007 and it's a story I often
tell during talks.

bq. We need a way to be able to save namespace.

Then fix the NN-2NN relationship to provide better alerting when stuff goes
wrong. Hacking the shell code (and, yes, the code in branch-2 and in trunk are
clearly hacks. Heck, the branch-2 doesn't even trigger if you are running NN
in non-daemon mode...) is completely the wrong thing to do.

... and as has been pointed out, this hack does NOTHING to help in the case of
hardware failure, when you want it most.

bq. Today operators who understand this situation do save namespace manually
before stopping the namenode.

I don't think I can put enough lol's in here to express how many laughs this
statement got from around the office. No, operators who understand this issue
monitor the size of the edits file and the 2NN and then act appropriately. We
don't do safemode-checkpoint-shutdown on every NN bring down.


[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode


[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556546#comment-14556546
 ] 

Allen Wittenauer commented on HDFS-7991:


bq. Just in your previous comment it seemed to me you did not even understand 
the issue . 

...

bq. 1. editlog had an issue and could not be consumed by 2NN or standby
bq. 2. checkpointing is lagging behind (see HDFS-7609)
bq. 3. There could many others bugs and issues (standby down etc) that could 
result in delayed checkpoint

I've seen every single one of these in production, either at Y! or at LI during 
my time with Hadoop.   My favorite is the time the regex was bugged so bad it 
caused the 2NN to crash during the log parsing because someone wrote a weird 
file name.  

So yeah, I'm pretty sure I do have a good grasp of exactly the issues you are 
talking about, having been on the receiving end of corrupted image files in the 
past and having to walk down to developer row to get them fixed.

bq.  No one is proposing that operators need do 
[safemode-checkpoint-shutdown] on every NN bring down.

Oh?  You mean like this completely broken code that is already sitting in trunk 
from the first attempt (HDFS-6353) to fix this issue?

{code}
  if [[ "${COMMAND}" == "namenode" ]] &&
     [[ "${HADOOP_DAEMON_MODE}" == "stop" ]]; then
    hadoop_debug "Do checkpoint if necessary before stopping NameNode"
    export CLASSPATH
    "${JAVA}" -Dproc_dfsadmin ${HADOOP_OPTS} \
      org.apache.hadoop.hdfs.tools.DFSAdmin -safemode enter
    "${JAVA}" -Dproc_dfsadmin ${HADOOP_OPTS} \
      org.apache.hadoop.hdfs.tools.DFSAdmin -saveNamespace -beforeShutdown
    "${JAVA}" -Dproc_dfsadmin ${HADOOP_OPTS} \
      org.apache.hadoop.hdfs.tools.DFSAdmin -safemode leave
  fi
{code}

I'm glad that we agree that this code should get removed since it's causing so 
many problems.

bq. In some cases checkpoint could not even be done because editlog was corrupt 
and could not be consumed by 2NN or standby (sorry, repeating myself).

Yup, and in those cases, that's what they pay vendors to fix.  For those 
that don't, they roll back to the last good copy and move on.  

bq. This jira proposes to save namespace when checkpointing has not happened 
for a long time.

All of the discussion up until recently has been about fixing the broken bits 
in the shell code.  If we want to switch the discussion to make the namenode 
checkpoint optional when it's sent a kill, that's great. It means we can clean 
out the shell code and make this entirely a Java-level fix, as it should be.


[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-22 Thread Vinayakumar B (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556684#comment-14556684
 ] 

Vinayakumar B commented on HDFS-7991:
-

bq. If doing checkpointing in the active namenode was possible without pausing 
the ongoing requests, we would not have moved checkpointing to either secondary 
or standby
Yes, I agree that we can't pause ongoing requests for a long time.  I actually 
meant that for these critical situations only, not always, saving the namespace 
directly looked better compared to a restart of the NN, which also requires 
someone to monitor the size of the edits and trigger saveNamespace/stop. Under 
normal conditions this would occur very rarely. If user apps need to be 
informed about the situation, then the active NN can put itself into safemode 
before saving the namespace, as is done on an admin request.
Anyway, I don't feel strongly about safemode or not; that was just a thought, 
since in practice saving just the fsImage to disk will take less time, though 
of course that again depends on its size.
But IMHO, to handle such abnormal cases, the NN itself should be able to take 
steps, instead of some admin finding out and taking steps.


[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-22 Thread Suresh Srinivas (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556511#comment-14556511
 ] 

Suresh Srinivas commented on HDFS-7991:
---

bq. I don't think I can put enough lol's in here to express how many laughs 
this statement got from around the office.
[~aw], I am glad it was amusing. I have a lot of respect for the operations 
background you bring. But that does not mean that others are clueless. Such an 
attitude is disrespectful and counterproductive. So please tone it down. 

There are many others who understand operational aspects of the issue we are 
discussing in this jira and have seen many issues where users have gotten 
burnt. 

bq. No, operators who understand this issue monitor the size of the edits file 
and the 2NN and then act appropriately.
Just in your previous comment it seemed to me you did not even understand the 
issue :). What do you mean by act appropriately?

bq. We don't do safemode-checkpoint-shutdown on every NN bring down.
Relax. No one is proposing that operators need to do that on every NN bring down. 
Not even the solution in this jira is proposing that, if you read it carefully. 
When checkpoint has not happened for a long time, NN startup could take a very 
long time (I have seen half a dozen cases where it took 3-5 days!). In some 
cases checkpoint *could not even be done* because editlog was corrupt and could 
not be consumed by 2NN or standby (sorry, repeating myself). Some operators 
understand the issue that checkpoint has not happened for a long time and do 
save namespace to avoid issues. Some don't. This jira proposes to save 
namespace when checkpointing has not happened for a long time.

What I see in this jira is we have gone in circles and I am not even sure 
issues are understood well.


[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-22 Thread Suresh Srinivas (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556470#comment-14556470
 ] 

Suresh Srinivas commented on HDFS-7991:
---

bq. I've been working under the assumption that the sites that are hitting this 
issue are running a secondary namenode. Is that not true? Doesn't the 2NN make 
this whole issue go away?

Ideally, yes, when the 2NN or standby is working. But we have had many issues 
where checkpointing is not done by the SNN or standby, for the following reasons:
1. editlog had an issue and could not be consumed by 2NN or standby
2. checkpointing is lagging behind (see HDFS-7609)
3. There could be many other bugs and issues (standby down etc.) that could 
result in a delayed checkpoint

Repeating myself, this is very important functionality to avoid data loss and 
service unavailability. But we need a way to be able to save namespace. Today 
operators who understand this situation do save namespace manually before 
stopping the namenode. People who miss doing that run into production issues. 
This jira proposes automatically saving namespace to avoid issues. I don't 
understand why it is "hacking the hell out of stuff".

[~vinayrpet], some comments:
bq. What if machine itself goes down suddenly after running for months/years, 
having tons of millions of edits without checkpoint ?
Yes, there are times when saving the namespace may not be possible. But in the 
large majority of cases, when HDFS issues are seen, inexperienced 
administrators just restart the cluster and run into this issue. 

bq. Anyway doing checkpoint in Active NameNode is not a big deal
If checkpointing in the active namenode were possible without pausing the 
ongoing requests, we would not have moved checkpointing to either the secondary 
or the standby. That is also the reason why the namenode is first put into 
safemode, the write requests are quiesced, and then save namespace is called.
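
For reference, the manual sequence described above (enter safemode, then save 
the namespace) can be sketched as a small wrapper around {{hdfs dfsadmin}}. 
This is only an illustration: the function name {{manual_checkpoint}} and the 
{{HDFS_CMD}} override are made up for this sketch, and on a secured cluster the 
caller would already need to be kinit'ed.

```shell
# Sketch only. HDFS_CMD is a hypothetical override so the sequence can be
# dry-run (for example HDFS_CMD=echo); by default the real hdfs CLI is used.
HDFS_CMD=${HDFS_CMD:-hdfs}

# manual_checkpoint: quiesce writes, then write a fresh fsimage.
# Returns non-zero if either step fails.
manual_checkpoint() {
  # Safemode stops namespace modifications so the image is consistent.
  "${HDFS_CMD}" dfsadmin -safemode enter || return 1
  if ! "${HDFS_CMD}" dfsadmin -saveNamespace; then
    # Saving failed: do not leave the NameNode stuck in safemode.
    "${HDFS_CMD}" dfsadmin -safemode leave
    return 1
  fi
}
```

If this runs right before {{hdfs --daemon stop namenode}}, the NameNode is 
deliberately left in safemode, since it is about to be stopped anyway.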


[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-21 Thread Jing Zhao (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554800#comment-14554800
]

Jing Zhao commented on HDFS-7991:
-

Thanks Allen. Yes, I also just realized that jmx may not be a good solution
here.

bq. to do a REST or RPC call to ask the NN what it's doing
The same question here is: what if this RPC/REST call fails (or times out)?
Should we retry, and how? Or should we kill the NameNode? To me this is not
fundamentally different from the saveNamespace solution:
# We're using kill to trigger the shutdown hook which does the checkpoint. This
can be mapped to the step sending out a saveNamespace command to NN.
# We then keep polling the state of the NameNode using a REST/RPC call, just
like waiting for the response from the saveNamespace RPC.
# Both solutions finally need to answer the same question: what if the REST/RPC
call fails?

bq. This will almost certainly break init.d/rc.d/service/launchd/whatever
scripts.
Yes, but I think if the checkpoint is necessary at this time, breaking these
scripts may not be that bad compared with killing the namenode then waiting
hours for the namenode to load edits or even fixing corrupted edits.

bq. currently does not require a Kerberos credential
Regarding the auth part, how about directly parsing hdfs-site.xml and
getting the namenode fsimage/edits directory location? Then we can directly
check whether a checkpoint is necessary by going through the fsimage/edits file
names.
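
A minimal sketch of that file-name check, assuming the usual storage layout 
under {{<name dir>/current}} ({{fsimage_<txid>}}, {{edits_<first>-<last>}}, 
{{edits_inprogress_<first>}}). The function names are hypothetical, and the 
result is only a lower bound, since the in-progress segment exposes its first 
txid rather than its last.

```shell
# Sketch only: estimate how many transactions follow the newest fsimage by
# looking at file names, without any RPC call or Kerberos ticket.

# txns_since_checkpoint DIR
# DIR is the "current" subdirectory of a NameNode storage directory.
txns_since_checkpoint() {
  local dir=$1 image=0 edits=0 f name id
  for f in "${dir}"/*; do
    name=${f##*/}
    case ${name} in
      fsimage_*.md5)      continue ;;                      # checksum files
      fsimage_*)          id=${name#fsimage_} ;;           # last txid in the image
      edits_inprogress_*) id=${name#edits_inprogress_} ;;  # first txid of open segment
      edits_*-*)          id=${name##*-} ;;                # last txid of finalized segment
      *)                  continue ;;
    esac
    [[ ${id} =~ ^[0-9]+$ ]] || continue
    id=$((10#${id}))   # force base 10: txids are zero-padded
    if [[ ${name} == fsimage_* ]]; then
      if (( id > image )); then image=${id}; fi
    else
      if (( id > edits )); then edits=${id}; fi
    fi
  done
  if (( edits > image )); then
    echo $(( edits - image ))
  else
    echo 0
  fi
}

# checkpoint_needed DIR THRESHOLD
# Succeeds if at least THRESHOLD transactions are pending.
checkpoint_needed() {
  (( $(txns_since_checkpoint "$1") >= $2 ))
}
```

The directory could come from {{hdfs getconf -confKey dfs.namenode.name.dir}} 
(the value may be a comma-separated list of URIs, which this sketch does not 
parse), and the threshold could mirror {{dfs.namenode.checkpoint.txns}}.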


[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode


[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554842#comment-14554842
 ] 

Allen Wittenauer commented on HDFS-7991:


bq. The same question here is what if this RPC/REST call fails (or timeout)? 
Should we retry and how? Or should we kill the NameNode? To me this is not 
fundamentally different from the saveNamespace solution

If the REST/RPC call fails or shows no progress over X timeout value (e.g., 
reset the timer every time we show progress), then the NN should be considered 
hung and it should get killed with prejudice. There's no reason why the 
REST/RPC port has to be shut down just because we are saving state.  If that's 
happening now, that's a terrible design decision.

This should be pretty trivial to do: 

1. send the kill to the daemon to shut down
2. see that we have a bash hook to call our special timeout function for this 
daemon instead of sleeping
3. timeout function calls a separate java program that queries the daemon. 
Decision point: a) on shutdown success, it exits. b) if NN shutdown times out 
due to no progress, exit with failure
4. bash code sees exit with failure and sends kill -9.

If you want, I can write up the shell patch to do this after lunch.  The shell 
part to enable this is tiny.
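
To make the four steps above concrete, here is a rough sketch of what such a 
timeout function could look like. It is not the actual patch: the function name 
{{stop_with_progress_check}} is invented, and the probe command passed to it 
stands in for the separate java program (exit 0 meaning "still making 
progress", non-zero meaning "no progress").

```shell
# Sketch only.
# stop_with_progress_check PID TIMEOUT PROBE [ARGS...]
#   PID     daemon to stop
#   TIMEOUT seconds without progress before giving up
#   PROBE   hypothetical helper command; exit 0 means "still making progress"
# Returns 0 on clean shutdown, 1 if the daemon had to be killed with -9.
stop_with_progress_check() {
  local pid=$1 timeout=$2
  shift 2
  local idle=0
  # Step 1: ask the daemon to shut down (this runs its shutdown hook).
  kill "${pid}" 2>/dev/null || return 0   # already gone
  # Steps 2 and 3: poll instead of sleeping for a fixed time.
  while kill -0 "${pid}" 2>/dev/null; do
    if "$@"; then
      idle=0                 # progress reported: reset the timer
    else
      idle=$((idle + 1))
    fi
    if (( idle >= timeout )); then
      # Step 4: no progress for TIMEOUT seconds, so the daemon is hung.
      kill -9 "${pid}" 2>/dev/null
      return 1
    fi
    sleep 1
  done
  return 0
}
```

The kill -9 stays in shell code, while the decision about whether the NN is 
still doing useful work is delegated entirely to the probe.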

bq. Yes, but I think if the checkpoint is necessary at this time, breaking 
these scripts may not be that bad compared with killing the namenode then 
waiting hours for the namenode to load edits or even fixing corrupted edits.

You have a choice between a breaking change and a non-breaking change.  This 
effectively shifts the burden from one dev writing code to hundreds/thousands.  
Hint: not all of those hundreds/thousands are nearly as nice as me. ;)

bq. how about directly parsing the hdfs-site.xml 

Someone doesn't know about {{hdfs getconf}} ... ;)

bq. Then we can directly check if the checkpoint is necessary by going through 
the fsimage/edits file names.

So this fix isn't needed for the HA case?


[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-21 Thread Suresh Srinivas (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555112#comment-14555112
 ] 

Suresh Srinivas commented on HDFS-7991:
---

bq. bash code sees exit with failure and sends kill -9.
I think the goal of this jira should be to ensure save namespace is done when 
the editlog size is huge. I have seen many cases where people either had to 
suffer loss of data or wait more than 3 days for the namenode to start up while 
consuming all the pending editlogs. 

Blindly sending kill -9 is not an option in my opinion. Instead of emphasizing 
that namenode stop functionality works, I would rather see save namespace work. 
Isn't there an environment variable that enables this functionality? For folks 
who want stop to not save namespace, or who want a different behavior, it can 
be used to go back to the previous behavior, right?


[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-21 Thread Jing Zhao (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555038#comment-14555038
]

Jing Zhao commented on HDFS-7991:
-

Thanks for the further explanation, Allen. Now I get your point: the client
side will still use a separate java program to query the daemon. Then if we
also let this java program send out the checkpoint check command, and
considering our current RPC already has the capability to handle timeout and
retry, I guess we can directly utilize the current saveNamespace RPC? Then the
only difference from your proposal is to move your step 1 after step 3.

bq. If you want, I can write up the shell patch to do this after lunch. The
shell part to enable this is tiny.
Thanks, Allen. That will be helpful.

bq. So this fix isn't needed for the HA case?
For HA, since we're only stopping the local NameNode, the checkpoint can be
independent. But one thing I still need to confirm is whether we can get enough
information about the number of transactions outside the fsimage from the local
NN directory if no local edits are stored (i.e., journals are only in JNs). I
will explore further on this.

bq. You have a choice between a breaking change and a non-breaking change. This
effectively shifts the burden from one dev writing code to hundreds/thousands.
Looks like this is the main and maybe only place where we have different
opinions. In your proposal, if the java program or the checkpoint process times
out, we should send out kill -9. My thoughts:
# If the NameNode is healthy, the java program or the checkpoint check
should go through smoothly. This should be the normal case.
# The timeout should be rare. But if it happens, the NameNode may have some
issue or a checkpoint is necessary. Then I think it's worth doing an extra
check of the NameNode, since killing the NN now can lead to hours of downtime,
which may really kill the admins.


[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode


[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555300#comment-14555300
 ] 

Allen Wittenauer commented on HDFS-7991:


bq. Then if we also let this java program send out the checkpoint check 
command, and considering our current RPC already has the capability to handle 
timeout and retry, I guess we can directly utilize the current saveNamespace 
RPC?

I would keep it simple:  shutdown also triggers the logic for "if checkpoint is 
necessary".  There's zero value in waiting for the helper app to trigger it. 
This also means the helper app is extremely simple:  an unauthenticated call 
that does "Is checkpoint still happening? Is checkpoint still happening? What 
about now? Are we down yet, Papa Smurf?"  This way we also fix [~sureshms]'s 
issue:

bq. Blindly sending kill -9 is not an option in my opinion. 

That's why it's not blind.  The helper app's *sole* purpose should be to 
provide the hint to the shell code if things are so screwed up that kill -9 is 
the only way out.  This way all of the key, important logic is in Java code and 
the one thing the Java code probably shouldn't do (kill) is left to the shell 
code.

bq. Instead of emphasizing namenode stop functionality works, I would rather 
see save namespace work.

To the person who isn't looking at the code, these are effectively one and the 
same. If I'm stopping the namenode, I expect it to do what is necessary to come 
back up in a sane state.  Why should an admin have to make the decision here 
when the NN itself knows the state best?  Telling me to run save namespace is 
dumb:  "Why didn't you just do it yourself, you stupid program?" :D

bq.  Isn't there an environment variable that enables this functionality? For 
folks who want stop to not save namespace or a different behavior, it can be be 
used to go back to the previous behavior, right?

The # of times this is going to be needed should approach zero... and in those 
cases, a Java property (or properties!) is *way* better.  Some clueless person 
is going to tell others "Hey, set this to make your system shut down faster."  
The Java apps can read the properties and do whatever is needed/desired.  This 
also means they can prompt to say "are you sure?" because this is the type of 
operation (shutdown w/out checkpoint) that sounds like it should never happen 
in an automated way.


[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

[
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554593#comment-14554593
]

Allen Wittenauer commented on HDFS-7991:

bq. Another way is that, instead of issuing the saveNamespace command directly,
the script checks the time of the latest checkpoint and the total number of
transactions first (maybe through the jmxget command).

jmxget is rarely installed. Some other way to get the data will need to be
supplied, almost certainly stuck away in dfsadmin or something. There's also
the problem of JMX not being turned on by default (hint: we can't.) But the
other part:

bq. If it is necessary to do a checkpoint, the script will abort and print out
some warning msg asking the admin to run dfsadmin -saveNamespace.

No can do. This will almost certainly break
init.d/rc.d/service/launchd/whatever scripts.

bq. The third option is to move the checkpoint logic into the shutdown hook of
the NameNode. The biggest challenge here is the sync between the server and the
script, i.e., to decide when and whether to kill the NN in the script. The
script may have to polling the current state of the NameNode and guess whether
the NameNode is still doing a checkpoint or it hangs somewhere else. Currently
I do not see an easy way to achieve this.

IMO, this is still the best answer. With a SMOP of the code (at least in trunk;
dunno, don't care about the disaster zone known as branch-2), it should be
relatively trivial to write a hook that uses the almost ubiquitous wget, curl,
or something stuck away in hadoop-common to do a REST or RPC call to ask the NN
what it's doing. (And, of course, that call would be in a function that could
be replaced if the user needed to use something else. Best bet: shove it in the
hdfs shellprofile.)

The ONLY big deal is going to be that {{hdfs --daemon stop namenode}} currently
does not require a Kerberos credential. Of course, requiring one has large
implications for boot scripts, which would then need to kinit. Unless we make
sure this REST or RPC call doesn't require auth, that requirement will change.
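
A minimal sketch of the polling side of such a hook, written in Python only for illustration (the real hook would live in the shell code). Everything here is an assumption: {{fetch_state}} stands in for whatever REST or RPC call ends up asking the NN what it is doing, the state string "CHECKPOINTING" is hypothetical, and the call is assumed not to need a Kerberos credential.

```python
import time
from typing import Callable


def wait_for_namenode(fetch_state: Callable[[], str],
                      timeout_s: float = 600.0,
                      interval_s: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> str:
    """Poll the NN until it stops reporting a checkpoint, or give up.

    fetch_state is a hypothetical, replaceable callable (for example a
    wrapper around curl or wget) that returns the NN's current state.
    Returns "safe_to_stop", "timed_out", or "unreachable".
    """
    deadline = clock() + timeout_s
    while True:
        try:
            state = fetch_state()
        except OSError:
            # The NN did not answer at all: it is hung or already gone.
            return "unreachable"
        if state != "CHECKPOINTING":  # hypothetical state name
            return "safe_to_stop"
        if clock() >= deadline:
            return "timed_out"
        sleep(interval_s)
```

Passing the status call in as a function mirrors the point above that it should be replaceable if a site needs something other than the default.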

Allow users to skip checkpoint when stopping NameNode
-

Key: HDFS-7991
URL: https://issues.apache.org/jira/browse/HDFS-7991
Project: Hadoop HDFS
Issue Type: Improvement
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
Labels: BB2015-05-TBR
Attachments: HDFS-7991.000.patch, HDFS-7991.001.patch,
HDFS-7991.002.patch, HDFS-7991.003.patch, HDFS-7991.004.patch


[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode


[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554598#comment-14554598
 ] 

Allen Wittenauer commented on HDFS-7991:


(It just occurred to me that auth is a big problem with the current patch 
too...)


[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-19 Thread Jing Zhao (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551234#comment-14551234
]

Jing Zhao commented on HDFS-7991:
-

We recently saw several customer clusters where the NameNodes were stopped
without checking for or doing a checkpoint. This led to hours of downtime
while loading large amounts of editlog (some clusters also hit the issue
reported by HDFS-7609, which makes things worse).

I had an offline discussion with [~cnauroth] and [~jnp] about this
functionality. Here is the summary of the options we can come up with:
# The solution developed in the current patch: the script sends a saveNamespace
request to the NameNode before stopping it, and the NameNode does an extra
checkpoint if necessary, based on the time of the latest checkpoint and the
total number of transactions outside of the checkpoint. The drawback of this
method is that if the checkpoint is necessary, the admin will see the stop
command blocked for 10 min or more. The admin can also get confused if the
saveNamespace command fails.
# Another way is that, instead of issuing the saveNamespace command directly,
the script first checks the time of the latest checkpoint and the total number
of transactions (maybe through the jmxget command). If a checkpoint is
necessary, the script aborts and prints a warning message asking the admin to
run dfsadmin -saveNamespace. This avoids the long wait of solution #1. Also, if
the jmxget command fails, the admin can use a command argument to force
stopping the NameNode if he/she can confirm the checkpoint is not necessary.
# The third option is to move the checkpoint logic into the shutdown hook of
the NameNode. The biggest challenge here is the sync between the server and the
script, i.e., to decide when and whether to kill the NN in the script. The
script may have to poll the current state of the NameNode and guess whether
the NameNode is still doing a checkpoint or is hung somewhere else. Currently
I do not see an easy way to achieve this.

For now we think #2 may be the best solution. I will update the patch
accordingly. [~aw], could you please also share your thoughts here? Thanks.
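
To make option #2 concrete, here is a minimal sketch of the decision the stop script would have to make, in Python for illustration only. The function names and the "stop"/"abort" return values are hypothetical. The default thresholds are chosen to resemble dfs.namenode.checkpoint.period (3600 seconds) and dfs.namenode.checkpoint.txns (1000000), which is an assumption about what the script would compare against, not what any patch does.

```python
from typing import Optional, Tuple


def checkpoint_needed(seconds_since_checkpoint: float,
                      txns_since_checkpoint: int,
                      period_s: float = 3600.0,
                      txn_limit: int = 1_000_000) -> bool:
    """True if either the age or the transaction count crosses its limit."""
    return (seconds_since_checkpoint >= period_s
            or txns_since_checkpoint >= txn_limit)


def stop_decision(stats: Optional[Tuple[float, int]],
                  force: bool = False) -> str:
    """Decide whether the stop script should proceed.

    stats is (seconds since last checkpoint, transactions since last
    checkpoint), or None if the lookup (e.g. jmxget) failed.
    """
    if force:
        # The admin has confirmed that a checkpoint is not necessary.
        return "stop"
    if stats is None:
        # Without the numbers the script cannot tell; ask the admin.
        return "abort"
    seconds_since, txns_since = stats
    if checkpoint_needed(seconds_since, txns_since):
        # The admin should run dfsadmin -saveNamespace first.
        return "abort"
    return "stop"
```

For example, stop_decision((120.0, 500)) lets the stop proceed, while stop_decision(None) aborts unless force=True is passed.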


[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-04-07 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484681#comment-14484681
 ] 

Hadoop QA commented on HDFS-7991:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12723793/HDFS-7991.004.patch
  against trunk revision 5b8a3ae.

{color:green}+1 @author{color}.  The patch does not contain any @author tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.util.TestByteArrayManager

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/10200//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/10200//console

This message is automatically generated.


[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-03-31 Thread Allen Wittenauer (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388796#comment-14388796
]

Allen Wittenauer commented on HDFS-7991:

bq. (since the stop command only waits 5s)

This is easily fixed by just increasing the timeout or adding other logic,
such as asking if the NN is still alive, etc.

But in any case, it occurred to me this morning that the current code just flat
out won't work in practice. The problem is that HADOOP_OPTS has the NN's
configuration inside it. So, for example, if a user sets the heap size to 64g,
then dfsadmin is going to run with a 64g heap as well. Same thing with gc logs
and any other custom JVM setting.

The code absolutely must shell out another bin/hdfs process to get the proper
HADOOP_OPTS setting. I suspect it will actually have to use a subshell plus
parameter captures so that the environment is clean, due to the various
{{export}} statements throughout the code and in a lot of users' *-env.sh files.
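
As a rough illustration of the "clean environment" idea (in Python rather than bash, purely as a sketch), the child process below is started with the daemon-only settings stripped from its environment, so a client command does not inherit the NN's 64g heap. The list of variables to drop and both function names are assumptions for this example; a real fix would have to cover far more shell state than this.

```python
import os
import subprocess
import sys

# Assumption: the variables that carry NameNode-only JVM settings.
DAEMON_ONLY_VARS = ("HADOOP_OPTS", "HADOOP_NAMENODE_OPTS")


def clean_client_env(env, drop=DAEMON_ONLY_VARS):
    """Return a copy of env without the daemon-only variables."""
    return {k: v for k, v in env.items() if k not in drop}


def run_in_clean_env(argv, env=None):
    """Run argv as a separate process with the daemon settings removed."""
    base = dict(os.environ if env is None else env)
    return subprocess.run(argv, env=clean_client_env(base),
                          capture_output=True, text=True, check=False)


if __name__ == "__main__":
    # Demonstration: the child does not see the NN's HADOOP_OPTS.
    demo_env = dict(os.environ, HADOOP_OPTS="-Xmx64g")
    child = [sys.executable, "-c",
             "import os; print(os.environ.get('HADOOP_OPTS', 'unset'))"]
    print(run_in_clean_env(child, env=demo_env).stdout.strip())
```

In the stop path the command would be the checkpoint request (for instance dfsadmin -saveNamespace) instead of the demonstration child used here.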


[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-03-31 Thread Allen Wittenauer (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389198#comment-14389198
]

Allen Wittenauer commented on HDFS-7991:

bq. But it's hard to know whether the NN is still doing a checkpoint or is
stuck somewhere else.

Why can't we ask via REST?

bq. can we just simply capture the value of HADOOP_OPTS before appending
HADOOP_NAMENODE_OPTS to it, and use the captured value for this checkpoint?

Possible? Maybe. Simply? No. It's going to get very messy because you need to
juggle pretty much the entire shell state: HADOOP_CLIENT_OPTS, _finalize,
logfile settings, etc. all need to get saved off and/or manipulated in order to
provide the same or a similar execution environment to the one dfsadmin uses...
and that's before we even talk about what happens with custom shell profiles.

bq. That looks equivalent to running a dfsadmin command on the NN's machine.

It might look that way at the Java level, but at the shell level it's going to
be chaos. It will definitely cause all sorts of problems given how open the
shell level has always been.


[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-03-31 Thread Jing Zhao (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389178#comment-14389178
]

Jing Zhao commented on HDFS-7991:
-

bq. This is easily fixed by just increasing the timeout or adding other logic,
such as asking if the NN is still alive, etc.

But it's hard to know whether the NN is still doing a checkpoint or is stuck
somewhere else. Also, it is hard to get a deterministic bound for the timeout
value.

bq. The problem is that HADOOP_OPTS has the NN's configuration inside it. So,
for example, if a user sets the heap size to 64g

Good catch. I will try to fix this in a later patch.

bq. The code absolutely must shell out another bin/hdfs process to get the
proper HADOOP_OPTS setting. I suspect it will actually have to use a subshell
plus parameter captures so that the environment is clean, due to the various
export statements throughout the code and in a lot of users' *-env.sh files.

One question here: can we simply capture the value of {{HADOOP_OPTS}}
before appending {{HADOOP_NAMENODE_OPTS}} to it, and use the captured value for
this checkpoint? That looks equivalent to running a dfsadmin command on the
NN's machine.


[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-03-30 Thread Jing Zhao (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387475#comment-14387475
 ] 

Jing Zhao commented on HDFS-7991:
-

I see the problem and the current solution in this way:
# The issue this feature targets is corruption hit while the
standby/secondary NN is doing a checkpoint. This corruption is usually in an old
checkpoint or in the editlog. If the NN is shut down before the issue is solved,
the corruption may block the NN from starting up normally again.
# In practice we usually solve this by letting the currently running NN do a
checkpoint (through the -saveNamespace command). It is very rare for this
checkpoint to fail, since it simply dumps the in-memory information to
disk (i.e., the possible fsimage/editlog corruption is bypassed).
# It is hard to let the NN do this checkpoint verification itself before
shutdown, since the checkpoint may take minutes, and before the checkpoint
finishes the NN may already have been killed by the shell script (since the
stop command only waits 5s).
# Based on #2 and #3 above, in most normal cases, using the
-saveNamespace command before shutdown satisfies our requirement, i.e.,
checking whether there is editlog corruption and saving the current in-memory
namespace to bypass the corruption.
# Even if -saveNamespace fails (which is rare), the admin now has a
chance to check the cause of the failure and can take further steps to
verify whether there is corruption or whether the checkpoint can be skipped. I
think this is better than the scenario where the NN is shut down directly and
the admin has to manually fix the corruption.


[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode


[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387412#comment-14387412
 ] 

Allen Wittenauer commented on HDFS-7991:


looks like you posted a new version.  shellcheck probably passes now.


[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode


[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387430#comment-14387430
 ] 

Allen Wittenauer commented on HDFS-7991:


Ok, the new one does do error checking.  But I'm still sort of left with ... 
now what?  What's the ops person supposed to do?


[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode


[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387400#comment-14387400
 ] 

Allen Wittenauer commented on HDFS-7991:


-1

a) Take a look at  
http://wiki.apache.org/hadoop/UnixShellScriptProgrammingGuide.

b) Why are we trying to fix this at the shell level instead of at the Java 
level? 

c) HDFS_CHECKPOINT_BEFORE_STOP_NAMENODE

This should be HADOOP_HDFS_blah, not HDFS_blah.

d) There's no way this passes shellcheck.

e) error messages are *way* too long for a single line.

f) where is the hadoop-env.sh documentation to match this new env var?




[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode