[
https://issues.apache.org/jira/browse/HADOOP-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646950#action_12646950
]
Steve Loughran commented on HADOOP-2483:
----------------------------------------
I'm assuming the goal here is to see how the system handles network,
host and disk failures. A good first step would be to ask: which problems happen
most often, and which are traumatic enough to set everyone's pagers off? Those
are the ones to care about.
* Disk failures could be mocked with a filesystem that simulates problems:
bad data, missing data, even hanging reads and writes (see the first sketch
after this list).
* Network failures are harder to simulate because there are so many kinds:
DNS failures and exceptions at every stage of an IO operation are all
candidates. Perhaps we could have a special mock IPC client that raises these
exceptions in test runs (see the second sketch after this list).
* This is the kind of thing virtualized clusters are good for, though their
odd timing quirks leave you wondering what is really going on.
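As a rough sketch of the disk-failure idea: a fault-injecting stream wrapper
could flip bytes, throw, or stall on demand. The class name and the fault
probabilities below are made up for illustration; this is not an existing
Hadoop class.

{code:java}
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Random;

/** Wraps any InputStream and injects bad data, read failures, and hangs. */
public class FaultyInputStream extends FilterInputStream {
  private final Random rand = new Random();
  private final double corruptProb;  // chance of flipping a byte (bad data)
  private final double failProb;     // chance of throwing IOException (missing data)
  private final double hangProb;     // chance of stalling the read (hanging I/O)

  public FaultyInputStream(InputStream in, double corruptProb,
                           double failProb, double hangProb) {
    super(in);
    this.corruptProb = corruptProb;
    this.failProb = failProb;
    this.hangProb = hangProb;
  }

  @Override
  public int read() throws IOException {
    maybeFail();
    int b = super.read();
    if (b >= 0 && rand.nextDouble() < corruptProb) {
      b ^= 0xFF;                     // simulate corrupted data coming off disk
    }
    return b;
    // bulk read(byte[], int, int) would need the same treatment in a real harness
  }

  private void maybeFail() throws IOException {
    if (rand.nextDouble() < failProb) {
      throw new IOException("injected disk read failure");
    }
    if (rand.nextDouble() < hangProb) {
      try {
        Thread.sleep(60000L);        // simulate a read that hangs
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }
}
{code}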
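And a similarly hedged sketch of the mock IPC client idea, using a plain JDK
dynamic proxy rather than the real org.apache.hadoop.ipc client. FaultyRpc and
its failure rate are invented for illustration, and it assumes the wrapped
interface methods declare IOException, as Hadoop IPC protocols do.

{code:java}
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.Random;

/** Wraps an RPC-style interface and randomly injects network-flavoured failures. */
public final class FaultyRpc {
  private FaultyRpc() {}

  @SuppressWarnings("unchecked")
  public static <T> T wrap(final T target, final Class<T> iface, final double failRate) {
    final Random rand = new Random();
    return (T) Proxy.newProxyInstance(iface.getClassLoader(),
        new Class<?>[] { iface },
        new InvocationHandler() {
          public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
            if (rand.nextDouble() < failRate) {
              // Alternate between a slow-network and a dead-network failure mode.
              if (rand.nextBoolean()) {
                throw new SocketTimeoutException("injected RPC timeout");
              }
              throw new ConnectException("injected connection failure");
            }
            try {
              return method.invoke(target, args);
            } catch (InvocationTargetException e) {
              throw e.getCause();    // rethrow the real exception from the target
            }
          }
        });
  }
}
{code}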
> Large-scale reliability tests
> -----------------------------
>
> Key: HADOOP-2483
> URL: https://issues.apache.org/jira/browse/HADOOP-2483
> Project: Hadoop Core
> Issue Type: Test
> Components: mapred
> Reporter: Arun C Murthy
> Assignee: Devaraj Das
> Fix For: 0.20.0
>
>
> The fact that we do not have any large-scale reliability tests bothers me.
> I'll be the first to admit that it isn't the easiest of tasks, but I'd like to
> start a discussion around this... especially given that the code-base is
> growing to an extent that interactions due to small changes are very hard to
> predict.
> One of the simple scripts I run for every patch I work on does something very
> simple: it runs sort500 (or greater), randomly picks n tasktrackers from
> ${HADOOP_CONF_DIR}/conf/slaves and kills them; a similar script kills and then
> restarts the tasktrackers.
> This helps check a fair number of reliability stories: lost tasktrackers,
> task failures, etc. Clearly this isn't enough to cover everything, but it's a
> start.
> Let's discuss - what do we do for HDFS? We need more for Map-Reduce!
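One possible shape for the kill script described above, sketched in Java under
the assumption of passwordless SSH to the slave hosts; the class name,
command-line arguments, and pkill pattern are illustrative only, not part of
any existing Hadoop test harness.

{code:java}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Picks n random hosts from a slaves file and kills the TaskTracker on each. */
public class KillRandomTaskTrackers {
  public static void main(String[] args) throws IOException, InterruptedException {
    String slavesFile = args[0];             // path to the slaves file (assumed)
    int n = Integer.parseInt(args[1]);       // how many tasktrackers to kill

    List<String> hosts = new ArrayList<String>();
    BufferedReader r = new BufferedReader(new FileReader(slavesFile));
    for (String line; (line = r.readLine()) != null; ) {
      line = line.trim();
      if (!line.isEmpty()) {
        hosts.add(line);
      }
    }
    r.close();

    Collections.shuffle(hosts);
    for (String host : hosts.subList(0, Math.min(n, hosts.size()))) {
      // Kill the TaskTracker JVM over SSH; a restart variant would start it again afterwards.
      Process p = new ProcessBuilder("ssh", host, "pkill -f TaskTracker").start();
      p.waitFor();
      System.out.println("killed TaskTracker on " + host);
    }
  }
}
{code}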