[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-26 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559813#comment-14559813
 ] 

Jing Zhao commented on HDFS-7991:
-

We can use this jira just to remove the original dfsadmin scripts and add a 
script hook as Allen did in his patch.

Allen, for your script patch: besides the fact that secretshutdownhook is just 
a placeholder, it looks like you have not handled the HADOOP_OPTS issue, right?

 Allow users to skip checkpoint when stopping NameNode
 -

 Key: HDFS-7991
 URL: https://issues.apache.org/jira/browse/HDFS-7991
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
 Attachments: HDFS-7991-shellpart.patch, HDFS-7991.000.patch, 
 HDFS-7991.001.patch, HDFS-7991.002.patch, HDFS-7991.003.patch, 
 HDFS-7991.004.patch


 This is a follow-up jira of HDFS-6353. HDFS-6353 adds the functionality to 
 check if saving namespace is necessary before stopping namenode. As [~kihwal] 
 pointed out in this 
 [comment|https://issues.apache.org/jira/browse/HDFS-6353?focusedCommentId=14380898&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14380898],
  in a secured cluster this new functionality requires the user to be kinit'ed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-22 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555856#comment-14555856
 ] 

Vinayakumar B commented on HDFS-7991:
-

You can never be sure that the machine will still be up for the admin to 
execute the stop command and trigger the checkpoint. Also, AFAIK, in some real, 
large clusters executing the stop command is itself very rare, especially in 
the cases where a standby is not available.

What if the machine itself goes down suddenly after running for months/years, 
with many millions of edits and no checkpoint? I have also seen cases where, 
due to overuse of open files/connections, I could not even open an SSH terminal 
to execute the command.
In that case a restart of the NN is still going to take hours/days, depending 
on load. Then all the effort spent on the discussion in this Jira would go to 
waste.

Instead of doing everything at the end while stopping, why not implement a 
periodic check inside the Active NameNode itself to see whether a checkpoint is 
due? This would be similar to {{FSNamesystem#NameNodeEditLogRoller}}, which was 
added to roll edits after they reach a threshold, to avoid oversized edit logs. 
In fact, we could re-use that same thread to check for the checkpoint as well, 
with a different interval. The interval could be a multiple of the configured 
checkpoint interval.
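
As a rough illustration of the kind of check described above (a sketch only, not the NameNode's actual logic): the function name, its arguments, and the multiplier of 3 in the usage line are assumptions made for this example, and the 3600-second figure stands in for whatever checkpoint interval is configured.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when the last checkpoint is older
# than multiplier x checkpoint interval. All arguments are plain integers.
checkpoint_overdue() {
  local last_checkpoint_epoch=$1   # time of the last checkpoint, epoch seconds
  local now_epoch=$2               # current time, epoch seconds
  local interval_secs=$3           # configured checkpoint interval, seconds
  local multiplier=$4              # assumed: how many intervals may be missed
  (( now_epoch - last_checkpoint_epoch > interval_secs * multiplier ))
}
```

A periodic caller might use it as `checkpoint_overdue "${last}" "$(date +%s)" 3600 3 && echo "checkpoint overdue"`.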

Anyway, doing a *checkpoint* in the Active NameNode is not a big deal. It's 
just saving the FsImage to all available disks. There is no expensive loading 
of edits involved, as the in-memory namespace is already up to date. So the NN 
could even do this by just acquiring {{writeLock()}} instead of entering and 
leaving safemode. The external {{saveNamespace()}} RPC can still retain its 
current behaviour. 

Since this problem can happen only if the Standby/Secondary NameNode has not 
been available for a long time, I feel it's okay for client operations to wait 
for saveNamespace() to finish.

Any thoughts?



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555786#comment-14555786
 ] 

Hadoop QA commented on HDFS-7991:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  14m 36s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear to include any new or modified tests.  Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac |   7m 29s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc |   9m 37s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | shellcheck |   0m  5s | The applied patch generated  1 new shellcheck (v0.3.3) issues (total was 25, now 23). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install |   1m 32s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | common tests |  22m 48s | Tests passed in hadoop-common. |
| {color:red}-1{color} | hdfs tests | 161m 46s | Tests failed in hadoop-hdfs. |
| | | 218m 51s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.hdfs.TestAppendSnapshotTruncate |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12734733/HDFS-7991-shellpart.patch |
| Optional Tests | shellcheck javadoc javac unit |
| git revision | trunk / cf2b569 |
| shellcheck | https://builds.apache.org/job/PreCommit-HDFS-Build/11096/artifact/patchprocess/diffpatchshellcheck.txt |
| hadoop-common test log | https://builds.apache.org/job/PreCommit-HDFS-Build/11096/artifact/patchprocess/testrun_hadoop-common.txt |
| hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/11096/artifact/patchprocess/testrun_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/11096/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/11096/console |


This message was automatically generated.



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-22 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556911#comment-14556911
 ] 

Allen Wittenauer commented on HDFS-7991:


bq.  The current mechanism can be removed when better working solution is 
available.

Be aware that any solution (such as the one in the current shell code) that 
calls dfsadmin without doing the necessary work to authenticate is a 
backwards-incompatible change and breaks existing, secure deployments (see 
[~kihwal]'s comment above). That's before we even get to the HADOOP_OPTS 
munging problems and the issues they cause.  

So removing the current mechanism is an improvement:  from not working to 
working namenode shutdown.



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-22 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556434#comment-14556434
 ] 

Allen Wittenauer commented on HDFS-7991:


bq. Any thoughts?

Just one, and this is the line that triggered it:

bq. Instead of doing everything at the end while stopping, why not implement a 
periodic check inside Active NameNode itself to check for the checkpoint.

I've been working under the assumption that the sites that are hitting this 
issue are running a secondary namenode.  Is that not true?  Doesn't the 2NN 
make this whole issue go away?  

* If the answer is "The 2NN does make this issue go away", then this is a 
won't fix and we should yank out the broken bash code that's presently in trunk 
and causes my stops to actually *fail*.

* If the answer is "No, the 2NN has nothing to do with this", then 
[~vinayrpet]'s proposal (either separate or combined with the 2NN) is a MUCH 
better answer than hacking the hell out of this stuff.



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-22 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556586#comment-14556586
 ] 

Suresh Srinivas commented on HDFS-7991:
---

bq. Yup, and in those cases, that's what they pay vendors to fix. For those of 
that don't, they roll back to the last good copy and move on. 
The proposal here ensures that no vendor needs to be involved to remove a 
faulty editlog record. (BTW, I have not seen regex issues, only out-of-order 
editlog entries that could not be applied, or editlog records that became too 
big (n^2 growth), so that applying them became laboriously slow.)

bq. All of the discussion up until recently has been about fixing the broken 
bits in the shell code. If we want to switch the discussion to make the 
namenode checkpoint optional when it's sent a kill, that's great. It means we 
can clean out the shell code and make this entirely a Java-level fix, as it 
should be.
We can fix issues in the code. Currently the NN is sent kill -9 after a 
timeout. That needs to be changed to work with an NN shutdown hook. Also, the 
NN shutdown hook, and ensuring that all the daemon services are shut down in 
the right order without causing failures to the namespace, requires careful 
design. It also requires putting the namenode into safemode. I think doing it 
from outside, as in the current approach using save namespace, is much simpler 
and cleaner. But if you want to do it as part of shutdown, you are welcome to 
make that change. If that change takes some time, I prefer to keep the current 
mechanism until it is ready. The current mechanism can be removed when a 
*better* working solution is available.



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-22 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556496#comment-14556496
 ] 

Allen Wittenauer commented on HDFS-7991:


bq. Ideally when 2NN or standby is working. But we have had many issues where 
checkpointing is not done by SNN or standby, for the following reasons:

OK, so these are not new issues at all and have been around for literally years 
(decade now?). We had it happen at Y! back in 2007 and it's a story I often 
tell during talks. 

bq. We need a way to be able to save namespace. 

Then fix the NN-2NN relationship to provide better alerting when stuff goes 
wrong.  Hacking the shell code (and, yes, the code in branch-2 and in trunk are 
clearly hacks.  Heck, the branch-2 code doesn't even trigger if you are running 
the NN in non-daemon mode...) is completely the wrong thing to do.

... and, as has been pointed out, this hack does NOTHING to help in the case of 
hardware failure, when you want it most.

bq. Today operators who understand this situation do save namespace manually 
before stopping the namenode.

I don't think I can put enough lol's in here to express how many laughs this 
statement got from around the office. No, operators who understand this issue 
monitor the size of the edits file and the 2NN and then act appropriately.  We 
don't do safemode-checkpoint-shutdown on every NN bring down.




[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-22 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556546#comment-14556546
 ] 

Allen Wittenauer commented on HDFS-7991:


bq. Just in your previous comment it seemed to me you did not even understand 
the issue . 

...

bq. 1. editlog had an issue and could not be consumed by 2NN or standby
bq. 2. checkpointing is lagging behind (see HDFS-7609)
bq. 3. There could many others bugs and issues (standby down etc) that could 
result in delayed checkpoint

I've seen every single one of these in production, either at Y! or at LI during 
my time with Hadoop.   My favorite is the time the regex was bugged so bad it 
caused the 2NN to crash during the log parsing because someone wrote a weird 
file name.  

So yeah, I'm pretty sure I do have a good grasp of exactly the issues you are 
talking about, having been on the receiving end of corrupted image files in the 
past and having to walk down to developer row to get them fixed.

bq.  No one is proposing that operators need do 
[safemode-checkpoint-shutdown] on every NN bring down.

Oh?  You mean like this completely broken code that has been sitting in trunk 
since the first attempt (HDFS-6353) to fix this issue?

{code}
  if [[ ${COMMAND} == namenode ]] &&
     [[ ${HADOOP_DAEMON_MODE} == stop ]]; then
    hadoop_debug "Do checkpoint if necessary before stopping NameNode"
    export CLASSPATH
    ${JAVA} -Dproc_dfsadmin ${HADOOP_OPTS} \
      org.apache.hadoop.hdfs.tools.DFSAdmin -safemode enter
    ${JAVA} -Dproc_dfsadmin ${HADOOP_OPTS} \
      org.apache.hadoop.hdfs.tools.DFSAdmin -saveNamespace -beforeShutdown
    ${JAVA} -Dproc_dfsadmin ${HADOOP_OPTS} \
      org.apache.hadoop.hdfs.tools.DFSAdmin -safemode leave
  fi
{code}

I'm glad that we agree that this code should get removed since it's causing so 
many problems.

bq. In some cases checkpoint could not even be done because editlog was corrupt 
and could not be consumed by 2NN or standby (sorry, repeating myself).

Yup, and in those cases, that's what they pay vendors to fix.  For those that 
don't, they roll back to the last good copy and move on.  

bq. This jira proposes to save namespace when checkpointing has not happened 
for a long time.

All of the discussion up until recently has been about fixing the broken bits 
in the shell code.  If we want to switch the discussion to make the namenode 
checkpoint optional when it's sent a kill, that's great. It means we can clean 
out the shell code and make this entirely a Java-level fix, as it should be.



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-22 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556684#comment-14556684
 ] 

Vinayakumar B commented on HDFS-7991:
-

bq. If doing checkpointing in the active namenode was possible without pausing 
the ongoing requests, we would not have moved checkpointing to either secondary 
or standby
Yes, I agree that we can't pause ongoing requests for a long time.  What I 
actually meant was that in these critical situations (not always), saving the 
namespace directly looks better than restarting the NN, which also requires 
someone to monitor the size of the edits and trigger saveNamespace/stop. Under 
normal conditions this would occur very rarely. If user apps need to be 
informed about the situation, the active NN can put itself into safemode before 
saving the namespace, as is done on an admin request.
Anyway, I don't feel strongly about whether to use safemode or not; that was 
just a thought, since in practice saving just the fsImage to disk takes less 
time (though of course that again depends on its size).
But IMHO, to handle such abnormal cases the NN itself should be able to take 
steps, instead of relying on some admin to find out and take them.



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-22 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556511#comment-14556511
 ] 

Suresh Srinivas commented on HDFS-7991:
---

bq. I don't think I can put enough lol's in here to express how many laughs 
this statement got from around the office.
[~aw], I am glad it was amusing. I have a lot of respect for the operations 
background you bring. But that does not mean that others are clueless. Such an 
attitude is disrespectful and counterproductive. So please tone it down. 

There are many others who understand operational aspects of the issue we are 
discussing in this jira and have seen many issues where users have gotten 
burnt. 

bq. No, operators who understand this issue monitor the size of the edits file 
and the 2NN and then act appropriately.
Just in your previous comment it seemed to me you did not even understand the 
issue :). What do you mean by act appropriately?

bq. We don't do safemode-checkpoint-shutdown on every NN bring down.
Relax. No one is proposing that operators need to do that on every NN bring 
down. Not even the solution in this jira proposes that, if you read it 
carefully. When a checkpoint has not happened for a long time, NN startup can 
take a very long time (I have seen half a dozen cases where it took 3-5 days!). 
In some cases the checkpoint *could not even be done* because the editlog was 
corrupt and could not be consumed by the 2NN or standby (sorry, repeating 
myself). Some operators understand that a checkpoint has not happened for a 
long time and save the namespace to avoid issues. Some don't. This jira 
proposes to save the namespace when checkpointing has not happened for a long 
time.

What I see in this jira is that we have gone in circles, and I am not even sure 
the issues are well understood.



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-22 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556470#comment-14556470
 ] 

Suresh Srinivas commented on HDFS-7991:
---

bq. I've been working under the assumption that the sites that are hitting this 
issue are running a secondary namenode. Is that not true? Doesn't the 2NN make 
this whole issue go away?

Ideally, yes, when the 2NN or standby is working. But we have had many issues 
where checkpointing is not done by the SNN or standby, for the following 
reasons:
1. editlog had an issue and could not be consumed by 2NN or standby
2. checkpointing is lagging behind (see HDFS-7609)
3. There could be many other bugs and issues (standby down, etc.) that could 
result in a delayed checkpoint

Repeating myself, this is very important functionality for avoiding data loss 
and service unavailability. We need a way to be able to save the namespace. 
Today operators who understand this situation save the namespace manually 
before stopping the namenode. People who miss doing that run into production 
issues. This jira proposes saving the namespace automatically to avoid those 
issues. I don't understand why that is "hacking the hell out of this stuff".

[~vinayrpet], some comments:
bq. What if machine itself goes down suddenly after running for months/years, 
having tons of millions of edits without checkpoint ?
Yes, there are times when saving the namespace may not be possible. But in the 
large majority of cases, when HDFS issues are seen, inexperienced 
administrators just restart the cluster and run into this issue. 

bq. Anyway doing checkpoint in Active NameNode is not a big deal
If checkpointing in the active namenode were possible without pausing the 
ongoing requests, we would not have moved checkpointing to either the secondary 
or the standby. That is also the reason why the namenode is first put into 
safemode and write requests are quiesced before save namespace is called.



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-21 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554800#comment-14554800
 ] 

Jing Zhao commented on HDFS-7991:
-

Thanks Allen. Yes, I also just realized that jmx may not be a good solution 
here.

bq. to do a REST or RPC call to ask the NN what it's doing
The same question applies here: what if this RPC/REST call fails (or times 
out)? Should we retry, and how? Or should we kill the NameNode? To me this is 
not fundamentally different from the saveNamespace solution:
# We're using kill to trigger the shutdown hook which does the checkpoint. This 
can be mapped to the step sending out a saveNamespace command to NN.
# We then keep polling the state of the NameNode using a REST/RPC call, just 
like waiting for the response from the saveNamespace RPC.
# Both solutions finally need to answer the same question: what if the REST/RPC 
call fails?

bq. This will almost certainly break init.d/rc.d/service/launchd/whatever 
scripts.
Yes, but I think if the checkpoint is necessary at this time, breaking these 
scripts may not be that bad compared with killing the namenode then waiting 
hours for the namenode to load edits or even fixing corrupted edits.

bq. currently does not require a Kerberos credential
Regarding the auth part, how about directly parsing hdfs-site.xml to get the 
location of the namenode fsimage/edits directory? Then we can directly check 
whether a checkpoint is necessary by going through the fsimage/edits file 
names.
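
To illustrate the file-name approach, here is a minimal sketch. It assumes the name directory's current/ subdirectory holds files named fsimage_<txid> and edits_<start>-<end>; the function names and the threshold argument are made up for this example. In-progress segments (edits_inprogress_<start>) are ignored, so the backlog is underestimated.

```shell
#!/usr/bin/env bash
# Sketch only: estimate the number of transactions logged since the last
# checkpoint from file names alone. Function names are hypothetical.

# Print the highest txid among fsimage_<txid> files (0 if none are found).
latest_fsimage_txid() {
  local dir=$1 f name max=0 txid
  for f in "${dir}"/fsimage_*; do
    name=${f##*/}
    # Skip side files such as fsimage_<txid>.md5 and unmatched globs.
    [[ ${name} =~ ^fsimage_([0-9]+)$ ]] || continue
    txid=$((10#${BASH_REMATCH[1]}))   # 10# stops leading zeros meaning octal
    if (( txid > max )); then max=${txid}; fi
  done
  echo "${max}"
}

# Print the highest end txid among finalized edits_<start>-<end> segments.
latest_edits_txid() {
  local dir=$1 f name max=0 txid
  for f in "${dir}"/edits_*; do
    name=${f##*/}
    [[ ${name} =~ ^edits_[0-9]+-([0-9]+)$ ]] || continue
    txid=$((10#${BASH_REMATCH[1]}))
    if (( txid > max )); then max=${txid}; fi
  done
  echo "${max}"
}

# Succeed (exit 0) when more than <threshold> transactions are not yet
# covered by the newest fsimage.
checkpoint_needed() {
  local dir=$1 threshold=$2 image edits
  image=$(latest_fsimage_txid "${dir}")
  edits=$(latest_edits_txid "${dir}")
  (( edits - image > threshold ))
}
```

For example, `checkpoint_needed /path/to/name/current 1000000` would succeed only if more than a million finalized transactions follow the newest image; both the path and the threshold here are placeholders.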



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-21 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554842#comment-14554842
 ] 

Allen Wittenauer commented on HDFS-7991:


bq. The same question here is what if this RPC/REST call fails (or timeout)? 
Should we retry and how? Or should we kill the NameNode? To me this is not 
fundamentally different from the saveNamespace solution

If the REST/RPC call fails or shows no progress over X timeout value (e.g., 
reset the timer every time we show progress), then the NN should be considered 
hung and it should get killed with prejudice. There's no reason why the 
REST/RPC port has to be shut down just because we are saving state.  If that's 
happening now, that's a terrible design decision.

This should be pretty trivial to do: 

1. send the kill to the daemon to shut down
2. see that we have a bash hook to call our special timeout function for this 
daemon instead of sleeping
3. timeout function calls a separate java program that queries the daemon. 
Decision point: a) shutdown success, it exits. b) if NN shutdown times out due 
to no progress, exit with failure
4. bash code sees exit with failure and sends kill -9.
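
The progress-based timeout in steps 3 and 4 can be sketched roughly as follows. This is an illustration only: the proposal calls for a bash hook plus a Java helper, and wait_for_shutdown, poll and the progress token are hypothetical names. The timer resets whenever the reported progress changes, and the function gives up only after stall_timeout seconds without change.

```python
import time


def wait_for_shutdown(poll, stall_timeout, interval=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Wait for a daemon to stop, tolerating slow but steady progress.

    poll() must return None once the daemon is down, otherwise any value
    describing current progress (hypothetical, e.g. a checkpoint step).
    Returns True on clean shutdown, False if progress stalled.
    """
    last_token = object()      # sentinel that never equals a real token
    last_change = clock()
    while True:
        token = poll()
        if token is None:
            return True        # step 3a: shutdown succeeded
        now = clock()
        if token != last_token:
            last_token = token
            last_change = now  # progress seen, so reset the timer
        elif now - last_change >= stall_timeout:
            return False       # step 3b: no progress, report failure
        sleep(interval)
```

A False result corresponds to the "exit with failure" that the shell code would answer with kill -9.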

If you want, I can write up the shell patch to do this after lunch.  The shell 
part to enable this is tiny.

bq. Yes, but I think if the checkpoint is necessary at this time, breaking 
these scripts may not be that bad compared with killing the namenode then 
waiting hours for the namenode to load edits or even fixing corrupted edits.

You have a choice between a breaking change and a non-breaking change.  This 
effectively shifts the burden from one dev writing code to hundreds/thousands.  
Hint: not all of those hundreds/thousands are nearly as nice as me. ;)

bq. how about directly parsing the hdfs-site.xml 

Someone doesn't know about {{hdfs getconf}} ... ;)

bq. Then we can directly check if the checkpoint is necessary by going through 
the fsimage/edits file names.

So this fix isn't needed for the HA case?



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-21 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555112#comment-14555112
 ] 

Suresh Srinivas commented on HDFS-7991:
---

bq. bash code sees exit with failure and sends kill -9.
I think the goal of this jira should be to ensure save namespace is done when 
the editlog size is huge. I have seen many cases where people either had to 
suffer loss of data or wait for more than 3 days for the namenode to start up, 
consuming all the pending editlogs. 

Blindly sending kill -9 is not an option in my opinion. Instead of emphasizing 
that the namenode stop functionality works, I would rather see save namespace 
work. Isn't there an environment variable that enables this functionality? For 
folks who want stop to not save namespace or a different behavior, it can be 
used to go back to the previous behavior, right?



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-21 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555038#comment-14555038
 ] 

Jing Zhao commented on HDFS-7991:
-

Thanks for the further explanation, Allen. Now I get your point: the client 
side will still use a separate java program to query the daemon. Then if we 
also let this java program send out the checkpoint check command, and 
considering our current RPC already has the capability to handle timeout and 
retry, I guess we can directly utilize the current saveNamespace RPC? Then the 
only difference from your proposal is to move your step 1 after step 3.

bq. If you want, I can write up the shell patch to do this after lunch. The 
shell part to enable this is tiny.
Thanks, Allen. That will be helpful.

bq. So this fix isn't needed for the HA case?
For HA, since we're only stopping the local NameNode, the checkpoint can be 
independent. But one thing I still need to confirm is whether we can get enough 
information about the number of transactions outside the fsimage from the local 
NN directory if no local edits are stored (i.e., journals are only in JNs). I 
will explore further on this.

bq. You have a choice between a breaking change and a non-breaking change. This 
effectively shifts the burden from one dev writing code to hundreds/thousands.
Looks like this is the main and maybe only place where we have different 
opinions. In your proposal, if the java program or the checkpoint process times 
out, we should send out kill -9. My thoughts:
# If the NameNode is healthy, the java program or the checkpoint checking 
should go through smoothly. This should be the normal case.
# The timeout should be rare. But if it happens, the NameNode may have some 
issue or a checkpoint is necessary. Then I think it's worth doing an extra check 
on the NameNode, since killing the NN now can lead to hours of downtime which 
may really kill the admins.



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-21 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555300#comment-14555300
 ] 

Allen Wittenauer commented on HDFS-7991:


bq. Then if we also let this java program send out the checkpoint check 
command, and considering our current RPC already has the capability to handle 
timeout and retry, I guess we can directly utilize the current saveNamespace 
RPC?

I would keep it simple: shutdown also triggers the logic for whether a 
checkpoint is necessary. There's zero value in waiting for the helper app to 
trigger it. This also means the helper app is extremely simple: an 
unauthenticated call that asks "is checkpoint still happening? Is checkpoint 
still happening? What about now? Are we down yet, Papa Smurf?" This way we also 
fix [~sureshms]'s issue:

bq. Blindly sending kill -9 is not an option in my opinion. 

That's why it's not blind.  The helper app's *sole* purpose should be to 
provide the hint to the shell code if things are so screwed up that kill -9 is 
the only way out.  This way all of the key, important logic is in Java code and 
the one thing the Java code probably shouldn't do (kill) is left to the shell 
code.

bq. Instead of emphasizing namenode stop functionality works, I would rather 
see save namespace work.

To the person who isn't looking at the code, these are effectively one and the 
same. If I'm stopping the namenode, I expect it to do what is necessary to come 
back up in a sane state.  Why should an admin have to make the decision here 
when the NN itself knows the state best?  Telling me to run save namespace is 
dumb:  Why didn't you just do it yourself, you stupid program? :D

bq. Isn't there an environment variable that enables this functionality? For 
folks who want stop to not save namespace or a different behavior, it can be 
used to go back to the previous behavior, right?

The # of times this is going to be needed should approach zero... and in those 
cases, a Java property (or properties!) is *way* better. Some clueless person 
is going to tell others "Hey, set this to make your system shut down faster." 
The Java apps can read the properties and do whatever is needed/desired. This 
also means they can prompt to say "are you sure?" because this is the type of 
operation (shutdown w/out checkpoint) that sounds like it should never happen 
in an automated way.



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-21 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554593#comment-14554593
 ] 

Allen Wittenauer commented on HDFS-7991:


bq. Another way is that, instead of issuing the saveNamespace command directly, 
the script checks the time of the latest checkpoint and the total number of 
transactions first (maybe through the jmxget command).

jmxget is rarely installed.  Some other way to get the data will need to be 
supplied, almost certainly stuck away in dfsadmin or something. There's also 
the problem of JMX not being turned on by default (hint: we can't.) But the 
other part:

bq.  If it is necessary to do a checkpoint, the script will abort and print out 
some warning msg asking the admin to run dfsadmin -saveNamespace. 

No can do. This will almost certainly break 
init.d/rc.d/service/launchd/whatever scripts.

bq. The third option is to move the checkpoint logic into the shutdown hook of 
the NameNode. The biggest challenge here is the sync between the server and the 
script, i.e., to decide when and whether to kill the NN in the script. The 
script may have to poll the current state of the NameNode and guess whether 
the NameNode is still doing a checkpoint or it hangs somewhere else. Currently 
I do not see an easy way to achieve this.

IMO, this is still the best answer. With a SMOP of the code (at least in trunk. 
dunno, don't care about the disaster zone known as branch-2), it should be 
relatively trivial to write a hook that uses the almost ubiquitous wget, curl, 
or something stuck away in hadoop-common to do a REST or RPC call to ask the NN 
what it's doing. (and, of course, that call would be in a function that could 
be replaced if the user needed to use something else. best bet: shove it in the 
hdfs shellprofile).  

The ONLY big deal is going to be that {{hdfs --daemon stop namenode}} currently 
does not require a Kerberos credential. Of course that has large implications 
for boot scripts needing to kinit. Unless we make sure this REST or RPC call 
doesn't require auth, that requirement will change.




[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-21 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554598#comment-14554598
 ] 

Allen Wittenauer commented on HDFS-7991:


(It just occurred to me that auth is a big problem with the current patch 
too...)



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-05-19 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551234#comment-14551234
 ] 

Jing Zhao commented on HDFS-7991:
-

Recently we saw several clusters from our customers where the NameNodes 
were stopped without checking/doing a checkpoint. This led to hours of downtime 
for loading large amounts of editlog (some clusters also hit the issue reported 
by HDFS-7609, which makes things worse).

I had an offline discussion with [~cnauroth] and [~jnp] about this 
functionality. Here is the summary of the options we can come up with:
# The solution developed in the current patch: the script sends a saveNamespace 
request to the NameNode before stopping it, and the NameNode does an extra 
checkpoint if necessary based on the time of the latest checkpoint and the 
total number of transactions outside of the checkpoint. The drawback of this 
method is that if the checkpoint is necessary, the admin will see the stop 
command blocked for 10min or more. The admin can also get confused if the 
saveNamespace command fails.
# Another way is that, instead of issuing the saveNamespace command directly, 
the script first checks the time of the latest checkpoint and the total number 
of transactions (maybe through the jmxget command). If it is necessary to do 
a checkpoint, the script will abort and print out some warning msg asking the 
admin to run dfsadmin -saveNamespace. This avoids the long wait of 
solution #1. Also, if the jmxget command fails, the admin can use some command 
argument to force stopping the NameNode if he/she can confirm the checkpoint is 
not necessary.
# The third option is to move the checkpoint logic into the shutdown hook of 
the NameNode. The biggest challenge here is the sync between the server and the 
script, i.e., to decide when and whether to kill the NN in the script. The 
script may have to poll the current state of the NameNode and guess whether 
the NameNode is still doing a checkpoint or is hung somewhere else. Currently 
I do not see an easy way to achieve this.
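
As a rough illustration of option #2, the decision the script would make might look like the sketch below. The function names are hypothetical, and the default thresholds are placeholders borrowed from the stock values of dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns; the actual patch may use different criteria.

```python
def checkpoint_needed(seconds_since_checkpoint, txns_since_checkpoint,
                      period_secs=3600, txn_limit=1_000_000):
    """True when either the age or the size threshold is exceeded."""
    return (seconds_since_checkpoint >= period_secs
            or txns_since_checkpoint >= txn_limit)


def stop_decision(seconds_since_checkpoint, txns_since_checkpoint,
                  force=False, **limits):
    """Return 'stop' or 'abort' for the stop script (hypothetical values).

    force models the command argument that lets the admin skip the check
    after confirming a checkpoint is not necessary.
    """
    if force:
        return "stop"
    if checkpoint_needed(seconds_since_checkpoint,
                         txns_since_checkpoint, **limits):
        # The script would print a warning asking the admin to run
        # dfsadmin -saveNamespace before retrying.
        return "abort"
    return "stop"
```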

For now we think #2 may be the best solution. I will update the patch 
accordingly. [~aw], could you please also share your thoughts here? Thanks.




[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-04-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484681#comment-14484681
 ] 

Hadoop QA commented on HDFS-7991:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12723793/HDFS-7991.004.patch
  against trunk revision 5b8a3ae.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.util.TestByteArrayManager

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/10200//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HDFS-Build/10200//console

This message is automatically generated.



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-03-31 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388796#comment-14388796
 ] 

Allen Wittenauer commented on HDFS-7991:


bq. (since the stop command only waits 5s)

This is easily fixed by just increasing the timeout or adding other logic 
such as asking if the NN is still alive, etc.

But in any case, it occurred to me this morning that the current code just flat 
out won't work in practice.  The problem is that HADOOP_OPTS has the NN's 
configuration inside it.  So, for example, if a user sets the heap size to 64g, 
then dfsadmin is going to run with a 64g heap as well. Same thing with gc logs 
and any other custom JVM setting.

The code absolutely must shell out another bin/hdfs process to get the proper 
HADOOP_OPTS setting. I suspect it will actually have to use a subshell plus 
parameter captures so that the environment is clean, due to various {{export}} 
statements throughout the code and in a lot of users' *-env.sh files.
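
The "clean environment" idea can be illustrated with a small sketch: launch the client command as a separate process whose environment no longer carries the NameNode's JVM settings. The real fix would live in the shell scripts; client_env, run_save_namespace and the list of variables to scrub are assumptions for illustration, and that list is almost certainly incomplete.

```python
import os
import subprocess

# Assumption: variables that carry daemon-specific JVM settings.
NN_ONLY_VARS = ("HADOOP_OPTS", "HADOOP_NAMENODE_OPTS")


def client_env(parent_env, scrub=NN_ONLY_VARS):
    """Copy parent_env without the daemon-specific variables."""
    return {k: v for k, v in parent_env.items() if k not in scrub}


def run_save_namespace(hdfs_bin="bin/hdfs", env=None):
    """Run dfsadmin -saveNamespace in a scrubbed environment.

    Returns the exit code of the child process.
    """
    scrubbed = client_env(os.environ if env is None else env)
    result = subprocess.run([hdfs_bin, "dfsadmin", "-saveNamespace"],
                            env=scrubbed)
    return result.returncode
```

With a 64g NameNode heap in HADOOP_OPTS, the child process here starts without that setting and lets bin/hdfs compute its own client options.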



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-03-31 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389198#comment-14389198
 ] 

Allen Wittenauer commented on HDFS-7991:


bq. But it's hard to know if the NN is still doing checkpoint or NN is stuck in 
somewhere else. 

Why can't we ask via REST?

bq.  can we just simply capture the value of HADOOP_OPTS before appending 
HADOOP_NAMENODE_OPTS to it, and use the captured value for this checkpoint? 

Possible? Maybe. Simply? no.  It's going to get very messy because you need to 
juggle pretty much the entire shell state: HADOOP_CLIENT_OPTS, _finalize, 
logfile settings, etc, all need to get saved off and/or manipulated in order to 
provide the same/similar execution environment that dfsadmin uses... and that's 
before we even talk about what happens with custom shell profiles.

bq. Looks like this way equals to using a dfsadmin command in the NN's machine.

It might look that way at the Java level, but at the shell level it's going to 
be chaos.  It will definitely cause all sorts of problems given how open the 
shell level has always been.



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-03-31 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389178#comment-14389178
 ] 

Jing Zhao commented on HDFS-7991:
-

bq. This is easily fixed by just increasing the timeout or adding other logic 
such as asking if the NN is still alive, etc.

But it's hard to know if the NN is still doing a checkpoint or is stuck 
somewhere else. Also, it is hard to get a deterministic bound for the timeout 
value.

bq. The problem is that HADOOP_OPTS has the NN's configuration inside it. So, 
for example, if a user sets the heap size to 64g

Good catch. I will try to fix this in a later patch.

bq. The code absolutely must shell out another bin/hdfs process to get the 
proper HADOOP_OPTS setting. I suspect it will actually have to use a subshell 
plus parameter captures so that the environment is clean due to various export 
statements throughout the code and in a lot of user's *-env.sh files.

One question here is: can we simply capture the value of {{HADOOP_OPTS}} 
before appending {{HADOOP_NAMENODE_OPTS}} to it, and use the captured value for 
this checkpoint? That looks equivalent to running a dfsadmin command on the 
NN's machine.



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-03-30 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387475#comment-14387475
 ] 

Jing Zhao commented on HDFS-7991:
-

I see the problem and the current solution in this way:
# The issue this feature is targeting is corruption that occurs while the 
standby/secondary NN is doing a checkpoint. This corruption is usually in an 
old checkpoint or in the editlog. If the NN is shut down before solving the 
issue, the corruption may block the NN from starting up normally again.
# In practice we usually solve this by letting the currently running NN do a 
checkpoint (through the -saveNamespace command). It is very rare for this 
checkpoint to fail, since it simply dumps the in-memory information to 
disk (i.e., the possible fsimage/editlog corruption is bypassed).
# It is hard to let the NN do this checkpoint verification itself before 
shutdown since the checkpoint may take minutes, and before finishing the 
checkpoint the NN may have already been killed by the shell script (since the 
stop command only waits 5s).
# Based on #2 and #3 above, in most normal cases, using the 
-saveNamespace command before shutdown can satisfy our requirement, i.e., 
checking if there is editlog corruption and saving the current in-memory 
namespace to bypass the corruption.
# Even if -saveNamespace fails (which is rare), the admin now has a 
chance to check the cause of the failure and can take further steps to 
verify whether there is corruption or the checkpoint can be skipped. I think 
this is better than the scenario where the NN is shut down directly and the 
admin has to manually fix the corruption.



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-03-30 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387412#comment-14387412
 ] 

Allen Wittenauer commented on HDFS-7991:


looks like you posted a new version.  shellcheck probably passes now.



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-03-30 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387430#comment-14387430
 ] 

Allen Wittenauer commented on HDFS-7991:


Ok, the new one does do error checking.  But I'm still sort of left with ... 
now what?  What's the ops person supposed to do?



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-03-30 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387400#comment-14387400
 ] 

Allen Wittenauer commented on HDFS-7991:


-1

a) Take a look at  
http://wiki.apache.org/hadoop/UnixShellScriptProgrammingGuide.

b) Why are we trying to fix this at the shell level instead of at the Java 
level? 

c) HDFS_CHECKPOINT_BEFORE_STOP_NAMENODE

This should be HADOOP_HDFS_ blah, not HDFS_blah.

d) There's no way this passes shellcheck.

e) error messages are *way* too long for a single line.

f) where is the hadoop-env.sh documentation to match this new env var?





[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-03-30 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387422#comment-14387422
 ] 

Allen Wittenauer commented on HDFS-7991:


Frankly, I'm sort of leaning towards -1 on the feature itself. 

It is a very bad idea to do this at the shell level, where it has no way to 
know how or why things are broken.  This really feels like a "throw it over 
the fence and let the shell code sort it out" exercise.

I mean from HDFS-8003:

bq. With new changes in HDFS-7991, if the feature is on, the shell code will 
exit if the checkpoint fails and the NN will not be stopped.

You realize this isn't true, right?  There is no error checking in this patch 
or the previous one that got committed that stops the shell code.  It just 
continues plowing on through. 
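
For illustration only, the behavior described in the quoted claim (do not stop 
the NN when the checkpoint fails) could be sketched as below. This is not taken 
from any HDFS-7991 patch. run_checkpoint, stop_namenode, and CHECKPOINT_RC are 
hypothetical stand-ins for the real checkpoint and stop steps.

```shell
#!/usr/bin/env bash
# Sketch only. run_checkpoint and stop_namenode are hypothetical stand-ins
# for the real checkpoint and stop steps; CHECKPOINT_RC simulates the
# checkpoint's exit status so the control flow can be exercised.
run_checkpoint() {
  return "${CHECKPOINT_RC:-0}"
}

stop_namenode() {
  echo "stopping namenode"
}

stop_with_checkpoint() {
  # Check the checkpoint's exit status before doing anything else.
  if ! run_checkpoint; then
    echo "ERROR: checkpoint failed; NameNode not stopped." >&2
    return 1
  fi
  stop_namenode
}
```

Without the explicit status check, the script would simply run the stop step 
after a failed checkpoint, which is the "plowing on through" behavior noted 
above.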





[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-03-30 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387419#comment-14387419
 ] 

Jing Zhao commented on HDFS-7991:
-

bq. b) Why are we trying to fix this at the shell level instead of at the Java 
level?

This has been answered [here | 
https://issues.apache.org/jira/browse/HDFS-8003?focusedCommentId=14387349&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14387349]
 and [here | 
https://issues.apache.org/jira/browse/HDFS-8003?focusedCommentId=14384256&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14384256].
 From your comments in HDFS-8003, I still do not see a valid argument against 
doing this in the shell script.

bq. d) There's no way this passes shellcheck.

At least it passes shellcheck on my local machine. Can you post the warning 
message you see?

bq. -1

Just to clarify, is this a -1 on the feature itself, or does it just mean you 
want me to address your comments? If it's the latter, I will try to address 
your comments in the next patch.



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-03-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387716#comment-14387716
 ] 

Hadoop QA commented on HDFS-7991:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12708232/HDFS-7991.002.patch
  against trunk revision d9ac5ee.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for 
this patch. Also please list what manual steps were performed to verify 
this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestLeaseRecovery2
  org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/10114//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HDFS-Build/10114//console

This message is automatically generated.



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-03-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387772#comment-14387772
 ] 

Hadoop QA commented on HDFS-7991:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12708251/HDFS-7991.003.patch
  against trunk revision cc0a01c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for 
this patch. Also please list what manual steps were performed to verify 
this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The following test timeouts occurred in 
hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.TestFileCreation

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/10117//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HDFS-Build/10117//console

This message is automatically generated.



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-03-26 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382403#comment-14382403
 ] 

Kihwal Lee commented on HDFS-7991:
--

I am afraid the default behavior will break existing management 
scripts/infrastructure built on the hadoop commands.  If we are to do this in 
the shell script, we could add a check for an additional shell variable. If 
this feature is to be on by default, people will be able to turn it off by 
setting this variable in hadoop-env.sh, which is normally a part of config. If 
this variable is not set AND -skipcheckpoint is not specified, saveNamespace 
will be attempted on shutdown.
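
A minimal sketch of the check described above, under stated assumptions: the 
variable name HADOOP_HDFS_SKIP_CHECKPOINT_ON_STOP is hypothetical (the comment 
does not name one), and any non-empty value is assumed to disable the 
checkpoint. Only -skipcheckpoint comes from the discussion itself.

```shell
#!/usr/bin/env bash
# Sketch only. HADOOP_HDFS_SKIP_CHECKPOINT_ON_STOP is a hypothetical variable
# name; -skipcheckpoint is the option named in this thread.

# Succeeds (exit status 0) when saveNamespace should be attempted.
should_checkpoint() {
  local arg
  # Any non-empty value, e.g. set in hadoop-env.sh, disables the checkpoint.
  if [[ -n "${HADOOP_HDFS_SKIP_CHECKPOINT_ON_STOP:-}" ]]; then
    return 1
  fi
  # An explicit -skipcheckpoint on the command line also disables it.
  for arg in "$@"; do
    if [[ "${arg}" == "-skipcheckpoint" ]]; then
      return 1
    fi
  done
  return 0
}
```

A stop script could then guard the checkpoint step with 
`if should_checkpoint "$@"; then ... fi`, so that existing management scripts 
keep their behavior once the variable is set in the site configuration.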

Regarding what the default should be, I prefer things to remain compatible, but 
others might think the benefit outweighs the inconvenience.  I am fine either 
way as long as there is a simple way to disable it and stay compatible.

In the patch, did you intend to check the return code right after the first 
command? 



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-03-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383061#comment-14383061
 ] 

Hadoop QA commented on HDFS-7991:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12707599/HDFS-7991.001.patch
  against trunk revision 61df1b2.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for 
this patch. Also please list what manual steps were performed to verify 
this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/10080//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HDFS-Build/10080//console

This message is automatically generated.



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-03-25 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381099#comment-14381099
 ] 

Jing Zhao commented on HDFS-7991:
-

[~kihwal], do you think this patch can address your comments?



[jira] [Commented] (HDFS-7991) Allow users to skip checkpoint when stopping NameNode

2015-03-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381301#comment-14381301
 ] 

Hadoop QA commented on HDFS-7991:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12707373/HDFS-7991.000.patch
  against trunk revision 44809b8.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for 
this patch. Also please list what manual steps were performed to verify 
this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.tracing.TestTracing

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/10073//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HDFS-Build/10073//console

This message is automatically generated.
