[jira] [Commented] (HDFS-3192) Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs

2012-04-07 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13249181#comment-13249181
 ] 

Todd Lipcon commented on HDFS-3192:
---

The state diagram is included in the design doc attached to HDFS-2185. Please 
comment with an example scenario in which you think there is an incorrect 
behavior - I don't know of any aside from HADOOP-8217, but if you know of some 
I'd be really happy to address them rather than find out about them from a 
broken customer :)

 Active NN should exit when it has not received a getServiceStatus() rpc from 
 ZKFC for timeout secs
 --

 Key: HDFS-3192
 URL: https://issues.apache.org/jira/browse/HDFS-3192
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Reporter: Hari Mankude
Assignee: Hari Mankude



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3192) Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs

2012-04-06 Thread Hari Mankude (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13249099#comment-13249099
 ] 

Hari Mankude commented on HDFS-3192:


Would it be possible to post state transition diagram for the failover 
controller and its interactions with NN? There are concerns about the 
correctness of situations where zkfc2 is directing nn1 to change states and 
vice versa. 

 Active NN should exit when it has not received a getServiceStatus() rpc from 
 ZKFC for timeout secs
 --

 Key: HDFS-3192
 URL: https://issues.apache.org/jira/browse/HDFS-3192
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Reporter: Hari Mankude
Assignee: Hari Mankude



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3192) Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs

2012-04-04 Thread Suresh Srinivas (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13246059#comment-13246059
 ] 

Suresh Srinivas commented on HDFS-3192:
---

Hari, agree with Aaron that this should not be a subtask of HDFS-3092.

 Active NN should exit when it has not received a getServiceStatus() rpc from 
 ZKFC for timeout secs
 --

 Key: HDFS-3192
 URL: https://issues.apache.org/jira/browse/HDFS-3192
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Reporter: Hari Mankude
Assignee: Hari Mankude



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3192) Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs

2012-04-04 Thread Hari Mankude (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13246661#comment-13246661
 ] 

Hari Mankude commented on HDFS-3192:


bq.Why?

I think it's an advantage that the FC may die and come back, or that you may 
start the FCs after the NNs.

Well, if FC restarts the health monitoring within the timeout period, then NN 
will not die. However, if FC is having a gc pause or is not restarting, then NN 
should die. This is the first level of protection where in if NN is healthy, it 
can stonith itself.



 Active NN should exit when it has not received a getServiceStatus() rpc from 
 ZKFC for timeout secs
 --

 Key: HDFS-3192
 URL: https://issues.apache.org/jira/browse/HDFS-3192
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Reporter: Hari Mankude
Assignee: Hari Mankude



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3192) Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs

2012-04-04 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13246671#comment-13246671
 ] 

Todd Lipcon commented on HDFS-3192:
---

Why add multiple stonith paths, given we need external stonith anyway? It just 
adds to the complexity by increasing the number of scenarios we have to debug, 
etc.

That is to say: if the ZKFC dies, then it will lose its lock, and the other 
node will stonith this one when it takes over. What's the benefit of having it 
abort itself at the same time? In fact, it seems to be detrimental, because if 
it stays up, the other node can do a graceful transitionToStandby() call rather 
than having to do something more drastic like a full abort.

 Active NN should exit when it has not received a getServiceStatus() rpc from 
 ZKFC for timeout secs
 --

 Key: HDFS-3192
 URL: https://issues.apache.org/jira/browse/HDFS-3192
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Reporter: Hari Mankude
Assignee: Hari Mankude



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3192) Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs

2012-04-04 Thread Hari Mankude (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13246687#comment-13246687
 ] 

Hari Mankude commented on HDFS-3192:


bq. Why add multiple stonith paths, given we need external stonith anyway? It 
just adds to the complexity by increasing the number of scenarios we have to 
debug, etc.

I thought we are not going to have external stonith using special devices and 
that is mainly the reason why we are going through hoops to implement fencing 
in journal daemons. 

bq. That is to say: if the ZKFC dies, then it will lose its lock, and the other 
node will stonith this one when it takes over. What's the benefit of having it 
abort itself at the same time? In fact, it seems to be detrimental, because if 
it stays up, the other node can do a graceful transitionToStandby() call rather 
than having to do something more drastic like a full abort.

I disagree about two items here.

1. Why is the behaviour different from what happens when zkfc loses the 
ephemeral node? Currently zkfc when it loses the ephemeral node will shutdown 
the active NN. Similarly if active NN does not hear from zkfc, it implies that 
zkfc is dead, going through gc pause essentially resulting in loss of ephemeral 
node. 

2.  If active NN loses quorum, it has to shutdown. There is no way to do a 
transitionToStandby() especially since the log is updated after NN metadata is 
updated and there is no way to roll back the last update. This is just one of 
the issues that we are aware of where a rollback would be necessary. There 
might be other situations where rollback is required. In fact, one of the most 
of the difficult APIs to implement correctly would be transitionToStandby() 
from active state. 

 Active NN should exit when it has not received a getServiceStatus() rpc from 
 ZKFC for timeout secs
 --

 Key: HDFS-3192
 URL: https://issues.apache.org/jira/browse/HDFS-3192
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Reporter: Hari Mankude
Assignee: Hari Mankude



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3192) Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs

2012-04-04 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13246718#comment-13246718
 ] 

Todd Lipcon commented on HDFS-3192:
---

bq. I thought we are not going to have external stonith using special devices 
and that is mainly the reason why we are going through hoops to implement 
fencing in journal daemons.

In the current design, which uses a filer, we *require* external stonith 
devices. There is no correct way of doing it without either stonith or storage 
fencing.

The proposal with the journal-daemon based fencing is essentailly the same as 
storage fencing - just that we do it with our own software storage instead of a 
NAS/SAN.

bq. Why is the behaviour different from what happens when zkfc loses the 
ephemeral node? Currently zkfc when it loses the ephemeral node will shutdown 
the active NN

No, it doesn't - it will transition it to standby. But, as I commented 
elsewhere, this is redundant, because the _new_ active is actually going to 
fence it anyway before taking over.

bq. Similarly if active NN does not hear from zkfc, it implies that zkfc is 
dead, going through gc pause essentially resulting in loss of ephemeral node.

But this can reduce uptime. For example, imagine an administrator accidentally 
changes the ACL on zookeeper. This causes both ZKFCs to get an authentication 
error and crash at the same time. With your design, both NNs will then commit 
suicide. With the existing implementation, the system will continue to run in 
its existing state -- i.e no new failovers will occur, but whoever is active 
will remain active.

bq.  If active NN loses quorum, it has to shutdown

Yes, it has to shut down _before_ it does any edits, or it has to be fenced by 
the next active. Notification of session loss is asynchronous. The same is true 
of your proposal. In either case it can take arbitrarily long before it 
notices that it should not be active. So we still require that the new active 
fence it before it becomes active. So, this proposal doesn't solve any problems.

bq. In fact, one of the most of the difficult APIs to implement correctly would 
be transitionToStandby() from active state.

We already have that implemented. It syncs any existing edits, and then stops 
allowing new ones. We allow failover from one node to another without aborting, 
so long as it's graceful. This is perfectly correct. If we need to do a 
non-graceful failover, we fence the node by STONITH or by disallowing further 
access to the edit logs (which indirectly causes the node to abort, since 
logSync() fails).

It seems you're trying to solve problems we've already solved.

 Active NN should exit when it has not received a getServiceStatus() rpc from 
 ZKFC for timeout secs
 --

 Key: HDFS-3192
 URL: https://issues.apache.org/jira/browse/HDFS-3192
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Reporter: Hari Mankude
Assignee: Hari Mankude



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3192) Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs

2012-04-04 Thread Hari Mankude (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13246752#comment-13246752
 ] 

Hari Mankude commented on HDFS-3192:


bq.   bq. Why is the behaviour different from what happens when zkfc loses the 
ephemeral node? Currently zkfc when it loses the ephemeral node will shutdown 
the active NN

bq. No, it doesn't - it will transition it to standby. But, as I commented 
elsewhere, this is redundant, because the new active is actually going to fence 
it anyway before taking over.

Well this is incorrect behaviour. It does not handle the situation that I 
mentioned earlier about requiring rollbacks. The transition to standby will 
have result in old active having incorrect in-memory state. The only way around 
this is to shutdown the active. The reason is that as soon zkfc on NN1 has lost 
the ephemeral znode, it is possible that zkfc on NN2 has taken over the znode 
and NN2 has started fencing the journals. There is no way to gracefully 
coordinate this with NN1. This would result in NN1 getting quorum loss which in 
turn could leave the in-memory state in NN1 in an inconsistent shape. Do you 
agree again that in-memory state of NN1 is inconsistent with the editlogs?

bq. Similarly if active NN does not hear from zkfc, it implies that zkfc is 
dead, going through gc pause essentially resulting in loss of ephemeral node.

bq. But this can reduce uptime. For example, imagine an administrator 
accidentally changes the ACL on zookeeper. This causes both ZKFCs to get an 
authentication error and crash at the same time. With your design, both NNs 
will then commit suicide. With the existing implementation, the system will 
continue to run in its existing state – i.e no new failovers will occur, but 
whoever is active will remain active.

Firstly, how often does some change ACLs in zookeeper? Secondly, why is ZKFC 
dying when this happens? ZKFC must be more robust than NN. NN is a resource 
that is controlled by ZKFC. We should make zkfc more robust to handle zookeeper 
acl changes if this is a common occurance.

bq. If active NN loses quorum, it has to shutdown

bq. Yes, it has to shut down before it does any edits, or it has to be fenced 
by the next active. Notification of session loss is asynchronous. The same is 
true of your proposal. In either case it can take arbitrarily long before it 
notices that it should not be active. So we still require that the new active 
fence it before it becomes active. So, this proposal doesn't solve any problems.

My proposal was not meant to handle active NN losing quorum. My proposal is 
shutdown NN when ZKFC has died or is in a gc pause. 

My comment was with regards to earlier comment regarding doing a 
transitionToStandby(). Do you agree that active NN has invalid in-memory state 
and cannot go through transitionToStandby() when it loses quorum? There seems 
to be two solutions. 
1. Implement rollback for various types of editlog entries and then do 
transitionToStandby() OR
2. Shutdown NN when it loses quorum
Does this sound right?









 Active NN should exit when it has not received a getServiceStatus() rpc from 
 ZKFC for timeout secs
 --

 Key: HDFS-3192
 URL: https://issues.apache.org/jira/browse/HDFS-3192
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Reporter: Hari Mankude
Assignee: Hari Mankude



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3192) Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs

2012-04-04 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13246778#comment-13246778
 ] 

Todd Lipcon commented on HDFS-3192:
---

I think there is confusion here over the terminology loses quorum.

I agree completely about the following: if any NN fails to sync its edit logs, 
it needs to abort. This is already what it does today - no changes necessary.
If the edit log happens to be implemented using a quorum protocol (HDFS-3092 or 
HDFS-3077 for example), then that behavior should be maintained. The 
JournalManager implementation needs to throw an exception in response to 
logSync(). That will cause the NN to abort.

That's all that's necessary for correctness - an NN won't ack success to a 
write unless it successfully syncs it, and will abort rather than rollback, 
since we have no rollback capability.

In the above sense, loses quorum really means loses write access to the edit 
logs.


If instead you're talking about loses quorum as loses its ZK session, then 
no abort is necessary, because it may still be able to write to its edits. So 
long as it's getting success back from editLog.logSync(), then the edits are 
being persisted. It is the responsibility of the next active to fence access to 
the shared edits. It may do so in one of two ways:
1) Edits fencing: ensure that the next write to the edits mechanism throws IOE. 
In the case of FileJournalManager on NAS, this is done via an RPC to the NAS 
system to fence the given export.
2) STONITH: ensure that the next write fails because power has been yanked from 
the machine.

Alternatively, the new active may first try a graceful transition:
3) Gracefully ask the prior active to stop writing. The prior active flushes 
anything buffered, successfully syncs, and then enters standby mode.


Notably, self-stonith upon losing the ZK lease is not an option, because it 
may take arbitrarily long before it notices. EG:
1) NN1 writing to edits log
2) ZKFC1 loses lease, but doesn't know about it yet
3) ZKFC2 gets lease
4) NN2 becomes active, starts writing logs
5) NN1 writes some edits. World explodes.
6) ZKFC1 gets asynchronous notification from ZK that it lots its session. 
Anything you do at this point is _too late_.

Before step 4, NN2 must use a fencing mechanism. *Regardless* of whatever steps 
NN1 or ZKFC1 might take in step 6.


 Active NN should exit when it has not received a getServiceStatus() rpc from 
 ZKFC for timeout secs
 --

 Key: HDFS-3192
 URL: https://issues.apache.org/jira/browse/HDFS-3192
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Reporter: Hari Mankude
Assignee: Hari Mankude



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3192) Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs

2012-04-04 Thread Hari Mankude (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13246805#comment-13246805
 ] 

Hari Mankude commented on HDFS-3192:


Excellent comment regarding quorum and active aborting when it cannot write to 
(n/2 +1) number of editlog entries. I was getting worried that 
transitionToStandby() without NN restart was being suggested in this scenario 
also. Thanks for clarifying this for me.

Before we get back to ZKFC dead implies NN should die situation, let us 
understand a specific scenario here

bq. 1) NN1 writing to edits log
bq. 2) ZKFC1 loses lease, but doesn't know about it yet
bq. 3) ZKFC2 gets lease
bq. 4) NN2 becomes active, starts writing logs
bq. 5) NN1 writes some edits. World explodes.
bq. 6) ZKFC1 gets asynchronous notification from ZK that it lots its session. 
Anything you do at this point is too late.

Let us assume that NN1 has no edit logs to write. The reason could be that we 
implement an ip failover also and DFSClients now automatically start talking to 
NN2 after the ip failover. There are no edit log entries being created at NN1. 
So, NN1 stays active and never behaves as a standby.

So in step #6, irrespective of when ZKFC1 gets the notification,  ZKFC1 has to 
restart NN1. Otherwise, we don't know as to how long NN1 will stay in limbo.

Now coming to the scenario that I was referring to, ZKFC on NN1 is either dead 
or is in gc. NN1 could stay immobile similarly. Also, NN1 could resign much 
earlier without having go through uncontrolled abort via fencing. It is always 
a much stronger correctness argument when NN1 can self-resign with the 
information that is available rather than being forced to resign.







 Active NN should exit when it has not received a getServiceStatus() rpc from 
 ZKFC for timeout secs
 --

 Key: HDFS-3192
 URL: https://issues.apache.org/jira/browse/HDFS-3192
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Reporter: Hari Mankude
Assignee: Hari Mankude



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3192) Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs

2012-04-04 Thread Hari Mankude (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13246806#comment-13246806
 ] 

Hari Mankude commented on HDFS-3192:


bq.Excellent comment regarding quorum and active aborting when it cannot write 
to (n/2 +1) number of editlog entries.

Should say editlog locations.

 Active NN should exit when it has not received a getServiceStatus() rpc from 
 ZKFC for timeout secs
 --

 Key: HDFS-3192
 URL: https://issues.apache.org/jira/browse/HDFS-3192
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Reporter: Hari Mankude
Assignee: Hari Mankude



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3192) Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs

2012-04-04 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13246818#comment-13246818
 ] 

Todd Lipcon commented on HDFS-3192:
---

bq. So in step #6, irrespective of when ZKFC1 gets the notification, ZKFC1 has 
to restart NN1. Otherwise, we don't know as to how long NN1 will stay in limbo.

Can you explain why it has to restart, instead of just transitioning to 
standby? What do you mean by in limbo here?


bq. Also, NN1 could resign much earlier without having go through uncontrolled 
abort via fencing
Before issuing an uncontrolled abort, the ZKFC2 will always try to do a 
graceful fence -- ie ask it to self-resign via an RPC. See the 
{{tryGracefulFence}} function in the {{FailoverController}} class.

Having the other node asking it to resign is better than having it ask itself 
to resign -- the reason being that this is the only way the other node can be 
sure that it's in the clear to start writing to the logs. (a 
self-resignation might come too late). Since the other node always has to 
verify the resignation before it starts to write, there's nothing extra gained 
by having it resign itself first. It's just a redundancy.



 Active NN should exit when it has not received a getServiceStatus() rpc from 
 ZKFC for timeout secs
 --

 Key: HDFS-3192
 URL: https://issues.apache.org/jira/browse/HDFS-3192
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Reporter: Hari Mankude
Assignee: Hari Mankude



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3192) Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs

2012-04-04 Thread Hari Mankude (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13246852#comment-13246852
 ] 

Hari Mankude commented on HDFS-3192:


bq.Can you explain why it has to restart, instead of just transitioning to 
standby? What do you mean by in limbo here?

in limbo implies that NN1 thinks that it is active even though NN2 has taken 
over since it has not tried to access editlogs. So, it is not behaving as 
standby and keeping up with active. Are you suggesting that ZKFC1 does 
transitionToStandby() when it loses znode? On an active NN, there is a high 
probability that it might abort. Also, does transitionToStandby() guarantee 
that all the active-state threads have quisced? 

bq.Before issuing an uncontrolled abort, the ZKFC2 will always try to do a 
graceful fence – ie ask it to self-resign via an RPC. See the 
tryGracefulFence function in the FailoverController class.

I don't think that doing tryGraceFulFence() from NN2 to NN1 is safe. First of 
all, this is opening up one more channel of communication between NN1 and NN2 
and this is subject to various races sequences, split-brain etc. I think 
self-resign is much safer than trygracefulfence(). So far, I dont see a lack of 
correctness argument in our discussion. Is my description correct here?




 Active NN should exit when it has not received a getServiceStatus() rpc from 
 ZKFC for timeout secs
 --

 Key: HDFS-3192
 URL: https://issues.apache.org/jira/browse/HDFS-3192
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Reporter: Hari Mankude
Assignee: Hari Mankude



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3192) Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs

2012-04-04 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13246865#comment-13246865
 ] 

Todd Lipcon commented on HDFS-3192:
---

bq. Are you suggesting that ZKFC1 does transitionToStandby() when it loses 
znode? 

It currently does, but I don't think it has to -- since ZKFC2 will call 
NN1.transitionToStandby (see below).

bq. On an active NN, there is a high probability that it might abort

There are two possible scenarios:
1) *the local node finds out about the session expiration before the standby.* 
In this case, it will call transitionToStandby, the local node will flush and 
close its edit logs, and gracefully transition.
2) *the local node finds out after the other node.* In this case, the other 
node will have already initiated the fencing process.
2a) If the local node is still accessible, then the other node will have 
already called transitionToStandby(), in which case our own call will be a 
no-op (since we're already in standby state). Everything is correct, because 
the transitionToStandby() call flushes everything and gracefully closes its 
edit log writer.
2b) If the local node is inaccessible (eg network down) then the other node 
initiates non-graceful fencing. If it does STONITH, then our node will go down, 
and the discussion is moot. If it does storage fencing, then our node no longer 
has access to write to storage. This will prevent transitionToStandby() from 
succeeding, since it will try to finalize its current edit log segment (which 
involves mutating the fenced-off storage). So, it will correctly abort.

bq. I don't think that doing tryGraceFulFence() from NN2 to NN1 is safe. First 
of all, this is opening up one more channel of communication between NN1 and 
NN2 and this is subject to various races sequences, split-brain etc.

Doing RPC to your own NN is subject to way more race conditions because we have 
no way of enforcing an ordering between NN1 going standby and NN2 becoming 
active. NN2 *has* to verify that NN1 is either standby or effectively dead 
before becoming active. The only way to do that is to first (a) ask it to be 
standby, or (b) fence.


The lack of correct-ness in relying on self-resign is the example I gave above:

{quote}
1) NN1 writing to edits log
2) ZKFC1 loses lease, but doesn't know about it yet
3) ZKFC2 gets lease
4) NN2 becomes active, starts writing logs
5) NN1 writes some edits. World explodes.
6) ZKFC1 gets asynchronous notification from ZK that it lots its session. 
Anything you do at this point is too late.
{quote}

The self-resign in step 6 is insufficient. We have to fence between step 3 
and step 4. Whatever NN1 happens to do _after_ that point doesn't help anything 
because it's too late.


 Active NN should exit when it has not received a getServiceStatus() rpc from 
 ZKFC for timeout secs
 --

 Key: HDFS-3192
 URL: https://issues.apache.org/jira/browse/HDFS-3192
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Reporter: Hari Mankude
Assignee: Hari Mankude



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3192) Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs

2012-04-04 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13246868#comment-13246868
 ] 

Todd Lipcon commented on HDFS-3192:
---

Maybe the confusion is about readers? The one hole we have now is that, if 
there are no writes happening on the active, then it may happily continue to 
think it's active until the next write. I'd be in favor of solving this by 
adding code which writes a no-op edit to the edit log once every second or 
so. This bounds the amount of time in which it may think it's active and 
respond to read requests -- since the no-op edit will cause an abort if it 
loses its write access. Does that satisfy the issue you're raising?

 Active NN should exit when it has not received a getServiceStatus() rpc from 
 ZKFC for timeout secs
 --

 Key: HDFS-3192
 URL: https://issues.apache.org/jira/browse/HDFS-3192
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Reporter: Hari Mankude
Assignee: Hari Mankude



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3192) Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs

2012-04-04 Thread Hari Mankude (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13246949#comment-13246949
 ] 

Hari Mankude commented on HDFS-3192:


bq.2a) If the local node is still accessible, then the other node will have 
already called transitionToStandby(), in which case our own call will be a 
no-op (since we're already in standby state). Everything is correct, because 
the transitionToStandby() call flushes everything and gracefully closes its 
edit log writer.

This is very dangerous and can result in all sorts of races.

1. ZKFC2 initiates transitionToStandby() to NN1.
2. Meanwhile, RPC does not start on NN1.
3. ZKFC2 loses the znode.
4. ZKFC1 now takes over
5. ZKFC1 does a becomeActive() on NN1.
6. transitionToStandby() starts executing converting NN1 to standby. There are 
no active NNs in the cluster now.

ZKFC should communicate ONLY with its local NN. Otherwise, it will result in 
all sorts of messy race conditions. Communication between NNs should be via 
zookeeper znodes, editlogs and datanodes.

bq.1) NN1 writing to edits log
2) ZKFC1 loses lease, but doesn't know about it yet
3) ZKFC2 gets lease
4) NN2 becomes active, starts writing logs
5) NN1 writes some edits. World explodes.
6) ZKFC1 gets asynchronous notification from ZK that it lots its session. 
Anything you do at this point is too late.

bq.Doing RPC to your own NN is subject to way more race conditions because we 
have no way of enforcing an ordering between NN1 going standby and NN2 becoming 
active. NN2 has to verify that NN1 is either standby or effectively dead before 
becoming active. The only way to do that is to first (a) ask it to be standby, 
or (b) fence.

I disagree. Doing RPC to your own NN is the safest mechanism that is available 
in the HA environment. It is definitely safer than doing the RPC to a remote 
NN. Do you agree? 
I would like to make sure that I consider fencing required also and I am not 
suggesting this method as an alternative to fencing. Instead, this method will 
ensure that there are lesser situations where complicated algorithm of fencing 
would have to be used and ensures that there is less probability of error.

bq. The self-resign in step 6 is insufficient. We have to fence between step 
3 and step 4. Whatever NN1 happens to do after that point doesn't help anything 
because it's too late.

I am not talking about self-resign in this situation. Self-resign as per this 
jira will happen only if ZKFC1 is dead. In the above example, ZKFC1 is not 
dead. 
For the above example, ZKFC1 should abort NN1 when znode state change has 
happened and restart NN1. 




 Active NN should exit when it has not received a getServiceStatus() rpc from 
 ZKFC for timeout secs
 --

 Key: HDFS-3192
 URL: https://issues.apache.org/jira/browse/HDFS-3192
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Reporter: Hari Mankude
Assignee: Hari Mankude



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3192) Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs

2012-04-03 Thread Hari Mankude (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13245889#comment-13245889
 ] 

Hari Mankude commented on HDFS-3192:


FC via healthmonitor will periodically poll the status of the NN via 
getServiceStatus(). If active NN has not heard from FC for timeout number of 
seconds, it should exit. This timeout has to be tied to the timeout of the 
ephemeral node in zookeeper and it should be a fraction of the ephemeral node 
timeout.

This ensures that if FC has died, then active NN also failfasts.

 Active NN should exit when it has not received a getServiceStatus() rpc from 
 ZKFC for timeout secs
 --

 Key: HDFS-3192
 URL: https://issues.apache.org/jira/browse/HDFS-3192
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Reporter: Hari Mankude
Assignee: Hari Mankude



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3192) Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs

2012-04-03 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13245895#comment-13245895
 ] 

Todd Lipcon commented on HDFS-3192:
---

Why?

I think it's an _advantage_ that the FC may die and come back, or that you may 
start the FCs after the NNs.

 Active NN should exit when it has not received a getServiceStatus() rpc from 
 ZKFC for timeout secs
 --

 Key: HDFS-3192
 URL: https://issues.apache.org/jira/browse/HDFS-3192
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, name-node
Reporter: Hari Mankude
Assignee: Hari Mankude



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira