[jira] [Commented] (HDFS-3797) QJM: add segment txid as a parameter to journal() RPC

2012-08-15 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435410#comment-13435410
 ] 

Todd Lipcon commented on HDFS-3797:
---

bq. have you considered adding a test case that ensures that a JN which 
experiences this scenario will return to participating in the quorum after the 
next finalize/new segment?

I plan to add this test as part of HDFS-3726. 


Will fix the nit and commit momentarily.

 QJM: add segment txid as a parameter to journal() RPC
 -

 Key: HDFS-3797
 URL: https://issues.apache.org/jira/browse/HDFS-3797
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha
Affects Versions: QuorumJournalManager (HDFS-3077)
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Minor
 Attachments: hdfs-3797.txt


 During fault testing of QJM, I saw the following issue:
 1) NN sends txn 5 to JN
 2) NN gets partitioned from JN while JN remains up. The next two RPCs are 
 missed during the partition:
 2a) finalizeSegment(1-5)
 2b) startSegment(6)
 3) NN sends txn 6 to JN
 This caused one of the JNs to end up with a single segment 1-10 while the 
 others had two segments: 1-5 and 6-10. This broke some invariants of the QJM 
 protocol and prevented the recovery protocol from running properly.
 This can be addressed on the client side by HDFS-3726, which would cause the 
 NN to not send the RPC in #3. But it makes sense to also add an extra safety 
 check here on the server side: with every journal() call, we can send the 
 segment's txid. Then if the JN and the client get out of sync, the JN can 
 reject the RPCs.
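
 The server-side check described above can be sketched roughly as follows. 
 This is a minimal illustration, not the code from the attached patch: the 
 class JournalSketch, the exception OutOfSyncException, and the method 
 signatures are all hypothetical stand-ins for the real JournalNode code, and 
 the sketch assumes txns 1-5 arrive as a single batch.

```java
// Hypothetical, simplified stand-in for a JournalNode; not the real HDFS class.
class JournalSketch {

    // Hypothetical exception: the writer and the journal disagree on the open segment.
    static class OutOfSyncException extends RuntimeException {
        OutOfSyncException(String msg) {
            super(msg);
        }
    }

    private static final long NO_SEGMENT = -1;

    private long curSegmentTxId = NO_SEGMENT; // first txid of the currently open segment
    private long highestTxId = 0;             // last txid accepted so far

    void startSegment(long txid) {
        curSegmentTxId = txid;
    }

    void finalizeSegment(long startTxId, long endTxId) {
        if (startTxId != curSegmentTxId) {
            throw new OutOfSyncException("no open segment starting at txid " + startTxId);
        }
        curSegmentTxId = NO_SEGMENT;
    }

    // The extra safety check: every journal() call names the segment it belongs to,
    // and the journal rejects the call if that is not the segment it has open.
    void journal(long segmentTxId, long firstTxId, int numTxns) {
        if (segmentTxId != curSegmentTxId) {
            throw new OutOfSyncException("writer is in segment " + segmentTxId
                + " but journal has segment " + curSegmentTxId + " open");
        }
        highestTxId = firstTxId + numTxns - 1;
    }

    long getHighestTxId() {
        return highestTxId;
    }

    public static void main(String[] args) {
        JournalSketch jn = new JournalSketch();
        jn.startSegment(1);
        jn.journal(1, 1, 5); // step 1: txns up to 5 arrive in segment 1
        // steps 2a/2b: finalizeSegment(1-5) and startSegment(6) never reach this JN
        try {
            jn.journal(6, 6, 1); // step 3: txn 6, tagged with segment txid 6
        } catch (OutOfSyncException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

 In this sketch the out-of-sync JN refuses txn 6 instead of appending it to 
 the still-open segment that starts at txid 1, so it never grows a segment 
 1-10 that the other JNs do not have.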

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3797) QJM: add segment txid as a parameter to journal() RPC

2012-08-14 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434626#comment-13434626
 ] 

Aaron T. Myers commented on HDFS-3797:
--

Patch looks pretty good to me. One question: have you considered adding a test 
case that ensures that a JN which experiences this scenario will return to 
participating in the quorum after the next finalize/new segment?

Nit: looks like the method comment for testMissFinalizeAndNextStart got messed 
up a little bit: +   **/

