I'm not sure about that. I wrote a small checksum program to test it (see the code below). Once the block size being checksummed grows beyond 8192 bytes, I don't see much further performance improvement, so I don't think going to 64MB would bring us any benefit. I did change io.bytes.per.checksum to 131072 in Hadoop, and the job ran about 4 or 5 minutes faster (the total reduce time is about 35 minutes).

import java.util.zip.CRC32;
import java.util.zip.Checksum;

public class Test1 {
    public static void main(String[] args) {
        Checksum sum = new CRC32();
        // Buffer size checksummed per update; I varied this (512, 8192, ...) between runs.
        byte[] bs = new byte[512];
        final int tot_size = 64 * 1024 * 1024;   // checksum 64MB in total
        long time = System.nanoTime();
        for (int k = 0; k < tot_size / bs.length; k++) {
            // Fill the buffer with a simple pattern each pass, then checksum it
            // (the fill is included in the timing).
            for (int i = 0; i < bs.length; i++)
                bs[i] = (byte) i;
            sum.update(bs, 0, bs.length);
        }
        System.out.println("takes " + (System.nanoTime() - time) / 1000 / 1000 + " ms");
    }
}
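
For reference, here is a rough sketch of what the change amounts to. The loop prints the checksum metadata per 64MB block at a few chunk sizes (4 bytes of CRC per chunk), and the Configuration call at the end shows one way to set io.bytes.per.checksum to 131072 programmatically (it can equally be set in the config files; the call assumes the standard org.apache.hadoop.conf.Configuration API).

import org.apache.hadoop.conf.Configuration;

public class ChecksumOverhead {
    public static void main(String[] args) {
        final long blockSize = 64L * 1024 * 1024;   // one 64MB DFS block

        // 4 bytes of CRC-32 per checksum chunk.
        for (int bytesPerChecksum : new int[] {512, 8192, 131072}) {
            long checksumBytes = (blockSize / bytesPerChecksum) * 4;
            System.out.println(bytesPerChecksum + " bytes/chunk -> "
                    + (checksumBytes / 1024) + "KB of checksum data per block");
        }
        // 512 -> 512KB (the figure Milind mentions below), 8192 -> 32KB,
        // 131072 -> 2KB.

        // One way to apply the new chunk size.
        Configuration conf = new Configuration();
        conf.setInt("io.bytes.per.checksum", 131072);
    }
}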


On 01/05/2011 05:03 PM, Milind Bhandarkar wrote:
I agree with Jay B. Checksumming is usually the culprit for high CPU on clients
and datanodes. Plus, a checksum of 4 bytes for every 512 bytes means that for a
64MB block the checksum data will be 512KB, i.e. 128 ext3 blocks. Changing it to
generate 1 ext3 checksum block per DFS block will speed up read/write without
any loss of reliability.

- milind

---
Milind Bhandarkar
(mbhandar...@linkedin.com)
(650-776-3236)