[jira] [Updated] (HDFS-2742) HA: observed dataloss in replication stress test

2012-01-31 Thread Todd Lipcon (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-2742:
--

Attachment: hdfs-2742.txt

Previous patch accidentally dropped a change from one of the earlier review passes by Eli 
(only enable incremental block tracking for the HA NN). Re-incorporating it.

> HA: observed dataloss in replication stress test
> 
>
> Key: HDFS-2742
> URL: https://issues.apache.org/jira/browse/HDFS-2742
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: data-node, ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Blocker
> Attachments: hdfs-2742.txt, hdfs-2742.txt, hdfs-2742.txt, 
> hdfs-2742.txt, hdfs-2742.txt, hdfs-2742.txt, hdfs-2742.txt, hdfs-2742.txt, 
> log-colorized.txt
>
>
> The replication stress test case failed over the weekend because one of the 
> replicas went missing. Still diagnosing the issue, but the chain of events 
> seems to have been something like:
> - a block report was generated on one of the nodes while the block was being 
> written; thus the block report listed the block as RBW
> - when the standby replayed this queued message, it was replayed after the 
> file was marked complete, so it marked this replica as corrupt
> - it asked the DN holding the corrupt replica to delete it and, I think, 
> removed it from the block map at that time
> - That DN then did another block report before receiving the deletion. This 
> caused the replica to be re-added to the block map, since it was "FINALIZED" now.
> - Replication was lowered on the file; the NN counted the above replica as 
> non-corrupt and asked for the other replicas to be deleted.
> - All replicas were lost.
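
To make the failure mode above concrete, here is a minimal illustrative sketch of the kind of state check involved. It is not the actual BlockManager code (the class, method, and field names are simplified stand-ins), but it shows why replaying a stale RBW report against a file that has since been completed looks like corruption:

{code:java}
// Illustrative sketch only; simplified stand-ins, not the actual BlockManager logic.
enum BlockUCState { UNDER_CONSTRUCTION, COMPLETE }
enum ReplicaState { RBW, FINALIZED }

class ReplicaCheckSketch {
  /**
   * Roughly why the standby marks the replica corrupt: a block report taken
   * mid-write lists the replica as RBW; if that report is only processed after
   * the file is already COMPLETE, the state mismatch is treated as corruption
   * even though the DN has since finalized a perfectly good copy.
   */
  static boolean treatAsCorrupt(BlockUCState storedState, long storedGenStamp,
                                ReplicaState reportedState, long reportedGenStamp) {
    if (reportedGenStamp != storedGenStamp) {
      return true;                                    // gen stamp mismatch
    }
    return storedState == BlockUCState.COMPLETE
        && reportedState != ReplicaState.FINALIZED;   // stale RBW vs. completed file
  }
}
{code}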





[jira] [Updated] (HDFS-2742) HA: observed dataloss in replication stress test

2012-01-30 Thread Todd Lipcon (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-2742:
--

Attachment: hdfs-2742.txt

Attached patch adds the two extra tests described above, and addresses the dup 
import. Also rebased on tip of branch.






[jira] [Updated] (HDFS-2742) HA: observed dataloss in replication stress test

2012-01-27 Thread Todd Lipcon (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-2742:
--

Attachment: hdfs-2742.txt

Fixed all the nits above except for the indentation - I didn't see any place 
with improper indentation.

{quote}
I think BM should distinguish between corrupt and out-of-date replicas. The 
new case in processFirstBlockReport in this patch, and the place where we mark 
reported RBW replicas for completed blocks as corrupt, are using "corrupt" as a 
proxy for "please delete". I wasn't able to come up with additional bugs with a 
similar cause, but it would be easier to reason about if only truly corrupt 
replicas were marked as such. Can punt to a separate jira, if you agree.
{quote}
I don't entirely follow what you're getting at here... so let's open a new JIRA 
:)

bq. In FSNamesystem#isSafeModeTrackingBlocks, shouldn't we assert haEnabled is 
enabled if we're in SM and shouldIncrementallyTrackBlocks is true, instead of 
short-circuiting? We currently wouldn't know if we violate this condition 
because we'll return false whenever haEnabled is false.

I did the check for haEnabled in FSNamesystem rather than SafeModeInfo, since 
when HA is disabled it means we can avoid the volatile read of safeModeInfo. 
This is to avoid having any impact on the non-HA case. Is that what you're 
referring to? I'm not sure specifically what you're asking for in this change...

I changed {{setBlockTotal}} to only set {{shouldIncrementallyTrackBlocks}} to 
true when HA is enabled, and added {{assert haEnabled}} in 
{{adjustBlockTotals}}. Does that address your comment?
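
For clarity, here's a rough sketch of the pattern being described (the class and field names are simplified stand-ins for FSNamesystem/SafeModeInfo, not the actual patch code): the cheap haEnabled check short-circuits before the volatile safe-mode reference is ever read, incremental tracking is only switched on for an HA NN, and the adjustment path asserts that HA is enabled.

{code:java}
// Rough sketch of the guard pattern; illustrative only, not the actual patch code.
class SafeModeTrackingSketch {
  private final boolean haEnabled;               // fixed at NN startup
  private volatile SafeModeInfoSketch safeMode;  // null once safe mode has been left

  SafeModeTrackingSketch(boolean haEnabled) {
    this.haEnabled = haEnabled;
  }

  /** Hot path: the non-HA case never even reads the volatile field. */
  boolean isSafeModeTrackingBlocks() {
    if (!haEnabled) {
      return false;                              // cheap short-circuit for non-HA NNs
    }
    SafeModeInfoSketch sm = safeMode;
    return sm != null && sm.shouldIncrementallyTrackBlocks;
  }

  class SafeModeInfoSketch {
    boolean shouldIncrementallyTrackBlocks;

    void setBlockTotal(int total) {
      // only an HA NN (a standby replaying edits) needs incremental tracking
      shouldIncrementallyTrackBlocks = haEnabled;
      // ... record the total block count ...
    }

    void adjustBlockTotals(int deltaSafe, int deltaTotal) {
      assert haEnabled : "block totals should only be adjusted on an HA NN";
      // ... apply the deltas to the safe/total block counts ...
    }
  }
}
{code}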







[jira] [Updated] (HDFS-2742) HA: observed dataloss in replication stress test

2012-01-23 Thread Todd Lipcon (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-2742:
--

Attachment: hdfs-2742.txt

Updated on tip of branch. This applies on top of HDFS-2791 since one of the 
test cases fails without it.






[jira] [Updated] (HDFS-2742) HA: observed dataloss in replication stress test

2012-01-20 Thread Todd Lipcon (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-2742:
--

Attachment: hdfs-2742.txt

OK, here's a patch which addresses several issues since the previous revision. 
I've been testing on a real cluster running HBase with a combination of graceful 
failovers, full restarts, etc., and I think I've ironed out the bugs. I also added 
a number of new asserts to expose any places where we might have further bugs 
(and I'm running my cluster with assertions enabled).






[jira] [Updated] (HDFS-2742) HA: observed dataloss in replication stress test

2012-01-16 Thread Todd Lipcon (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-2742:
--

Attachment: hdfs-2742.txt

Updated patch addresses issues around append while the SBN is in safe mode, and 
cleans up some more of the code/logging/etc. I ran the full suite of unit tests 
last night and they passed.






[jira] [Updated] (HDFS-2742) HA: observed dataloss in replication stress test

2012-01-14 Thread Todd Lipcon (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-2742:
--

Attachment: hdfs-2742.txt

Attached patch fixes the issue described here, though I hope to write a few 
more test cases around the new code, and also add a little code to flush the 
new queue when the NN transitions to Active.

The approach is a combination of my original proposal with some of the ideas 
from Suresh above:

- Change the PendingDataNodeMessages queue to be a block-specific queue rather 
than queueing entire messages. This actually helps the SBN stay more up to date 
and reduces memory usage, so it's good for other reasons as well. I moved it 
into BlockManager and added some simple true unit tests for it.
- When we get a block which is "bad" due to state reasons (e.g. RBW on a COMPLETE 
block, a wrong gen stamp, or a future gen stamp), we queue it.
- When the edit log loader hits an OP_ADD or OP_CLOSE which modifies a block, 
we flush the queue for that block. Any messages that are now "correct" are 
applied, while "incorrect" messages end up getting re-queued by the same logic 
as above (see the sketch after this list).
- I had to modify a couple of the TestHASafeMode tests since the expected 
behavior is slightly different.
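
A rough sketch of the per-block queueing idea follows (illustrative only; the class and method names here are simplified stand-ins, not the actual PendingDataNodeMessages code in the patch):

{code:java}
// Simplified sketch of the per-block queue idea; illustrative only, not the patch code.
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

class PendingBlockMessagesSketch {
  /** A queued "this DN reported block X in state S" message. */
  static class ReportedBlockInfo {
    final String datanode;
    final long blockId;
    final long genStamp;
    final String replicaState;   // e.g. "RBW", "FINALIZED"
    ReportedBlockInfo(String datanode, long blockId, long genStamp, String replicaState) {
      this.datanode = datanode;
      this.blockId = blockId;
      this.genStamp = genStamp;
      this.replicaState = replicaState;
    }
  }

  // One small queue per block, rather than one queue of whole DN messages.
  private final Map<Long, Queue<ReportedBlockInfo>> queueByBlock =
      new HashMap<Long, Queue<ReportedBlockInfo>>();

  /** Called when a reported replica doesn't make sense against current NN state
   *  (e.g. RBW for a COMPLETE block, or a wrong/future gen stamp). */
  void enqueue(ReportedBlockInfo msg) {
    Queue<ReportedBlockInfo> q = queueByBlock.get(msg.blockId);
    if (q == null) {
      q = new ArrayDeque<ReportedBlockInfo>();
      queueByBlock.put(msg.blockId, q);
    }
    q.add(msg);
  }

  /** Called by the edit-log loader when an OP_ADD / OP_CLOSE touches blockId:
   *  drain only that block's queue and try each message again; messages that
   *  still don't match NN state simply get re-queued via enqueue(). */
  Queue<ReportedBlockInfo> takeMessagesFor(long blockId) {
    Queue<ReportedBlockInfo> q = queueByBlock.remove(blockId);
    return q != null ? q : new ArrayDeque<ReportedBlockInfo>();
  }
}
{code}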

There are still some edge cases around what happens if an RBW block report is 
delayed so long that it arrives after the OP_ADD and OP_CLOSE have been 
replicated. This is true in trunk as well, though - I opened HDFS-2791, which 
describes the issue. I think this will be very rare in practice, so I would like 
to address it separately from this JIRA.






[jira] [Updated] (HDFS-2742) HA: observed dataloss in replication stress test

2012-01-08 Thread Todd Lipcon (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-2742:
--

Attachment: hdfs-2742.txt

Attached patch adds a test to trigger the issue, and a fix.






[jira] [Updated] (HDFS-2742) HA: observed dataloss in replication stress test

2012-01-02 Thread Todd Lipcon (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-2742:
--

Attachment: log-colorized.txt

Here's the log with some color added to make it easier to read; view it with 
"less -R". The block in question is blk_6553001497155157032_1081.

