Large-scale reliability tests
-----------------------------

                 Key: HADOOP-2483
                 URL: https://issues.apache.org/jira/browse/HADOOP-2483
             Project: Hadoop
          Issue Type: Test
          Components: test
            Reporter: Arun C Murthy


The fact that we do not have any large-scale reliability tests bothers me. I'll 
be the first to admit that it isn't the easiest of tasks, but I'd like to start 
a discussion around this... especially given that the code-base is growing to 
an extent where interactions caused by small changes are very hard to predict.

One of the simple scripts I run for every patch I work on does something very 
simple: it runs sort500 (or greater), randomly picks n tasktrackers from 
${HADOOP_CONF_DIR}/conf/slaves, and then kills them; a similar script kills 
and restarts the tasktrackers. 
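The script described above could be sketched roughly as follows. This is a hypothetical reconstruction, not the actual script: the function name `pick_and_kill`, the `DRY_RUN` flag, and the use of `hadoop-daemon.sh stop tasktracker` over ssh are all assumptions for illustration.

```shell
# Hypothetical sketch of the fault-injection script described above.
# pick_and_kill SLAVES_FILE N: pick N random hosts from the slaves file
# and stop the tasktracker on each. With DRY_RUN=1 it only prints the
# hosts it would act on, which is handy for testing the script itself.
pick_and_kill() {
  slaves=$1
  n=$2
  # shuf -n picks N random lines (hosts) from the slaves file.
  shuf -n "$n" "$slaves" | while read -r host; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "would kill tasktracker on $host"
    else
      # Stop the tasktracker daemon on the chosen host.
      ssh "$host" "\$HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker"
    fi
  done
}
```

The kill-and-restart variant would be the same loop with a `start tasktracker` call (after a short sleep) following the stop.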

This helps in checking a fair number of reliability stories: lost tasktrackers, 
task failures, etc. Clearly this isn't good enough to cover everything, but 
it's a start.

Let's discuss: what do we do for HDFS? We need more for Map-Reduce!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
