So we moved 50 machines to a data center for a beta cluster of a new search engine based on Nutch and Hadoop. We fired all of the machines up and started fetching, and almost immediately began seeing JVM crashes and checksum/IO errors that caused jobs to fail, tasks to fail, and random data corruption. After digging through and fixing the problems, we came up with some observations that may seem obvious but may also help someone else avoid the same problems.
First, checksum errors are not a normal occurrence. If you are experiencing a lot of checksum errors or data corruption, you could very well have a hardware problem. As developers we often tend to think the "problem" is somewhere in our code. Sometimes that is just not the case. Systems such as Nutch and Hadoop can put extreme load on machines and keep it sustained for long periods of time, and under that load hardware problems will surface. When debugging problems, don't limit yourself to just the software. If you think hardware might be bad, you can use programs such as badblocks and memtest to test it. In our cluster, 2 machines out of 50 had hardware problems. When they were removed from the cluster, most of the errors disappeared.

Second, running Nutch and Hadoop in a distributed environment means that jobs share data across machines. So if a tasktracker or datanode keeps failing on a single machine (or a single set of machines), that machine could have hardware problems. If you continually get checksum errors across many machines and it seems random, it probably isn't. More than likely it is caused by a single machine (or set of machines) with hardware problems sharing bad data (i.e. map outputs, etc.) with the other machines. The symptom is one or more tasks failing with checksum or similar IO-related errors and then completing successfully when restarted on a different machine. So the point is: in a distributed system, where you see a problem occurring might not be where it is being caused.

Third, running on multiprocessor and multi-core machines on Linux, as many of us do, means some JVM changes that aren't very well documented. For one, such machines will use a parallel garbage collector. You used to have to set -XX:+UseParallelGC to enable this, but in later versions of Java 5 and in Java 6 it is enabled by default on multi-processor machines. Single-processor machines will still use the serial collector. If you start seeing random JVM crashes and the core dump files contain lines like this:

  VM_Operation (0x69a55580): parallel gc failed allocation

it means a parallel garbage collector bug bit the JVM. To turn off parallel garbage collection, use the -XX:+UseSerialGC option. This can be set in the child opts in hadoop-default.xml and in the HADOOP_OPTS variable in hadoop-env.sh (a rough example of both settings is at the bottom of this mail).

So part of this is just to rant, but I also hope some of this information helps someone else avoid spending a week tracking down hardware and weird JVM problems.

Dennis Kubes
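P.S. For anyone who wants it spelled out, below is roughly what those serial GC settings look like in our configs. The property name mapred.child.java.opts and the -Xmx value are from my own setup and may differ in your Hadoop version, so check your own hadoop-default.xml before copying this verbatim.

  # in conf/hadoop-env.sh -- JVM options for the Hadoop daemons
  export HADOOP_OPTS="-XX:+UseSerialGC"

  <!-- in conf/hadoop-site.xml (overrides hadoop-default.xml) -->
  <!-- JVM options passed to the child task JVMs -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx200m -XX:+UseSerialGC</value>
  </property>

With the child opts set this way, the map and reduce task JVMs fall back to the serial collector as well, not just the daemons.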
