I'm not sure about that. I wrote a small checksum program to test this.
Once the chunk size used for checksumming grows beyond 8192 bytes, I
don't see much performance improvement; see the code below. I don't
think going all the way to 64MB would bring us any benefit.
I did change io.bytes.per.checksum to 131072 in Hadoop, and the job ran
about 4 or 5 minutes faster (the total reduce time is about 35 minutes).
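For reference, the property change was just this (I put it in my site
configuration file; the exact file name depends on your Hadoop version):

```xml
<property>
  <name>io.bytes.per.checksum</name>
  <value>131072</value>
</property>
```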
import java.util.zip.CRC32;
import java.util.zip.Checksum;

public class Test1 {
    public static void main(String[] args) {
        Checksum sum = new CRC32();
        byte[] bs = new byte[512]; // chunk size; vary this to test
        final int tot_size = 64 * 1024 * 1024; // checksum 64MB in total

        long time = System.nanoTime();
        for (int k = 0; k < tot_size / bs.length; k++) {
            for (int i = 0; i < bs.length; i++)
                bs[i] = (byte) i;
            sum.update(bs, 0, bs.length);
        }
        // elapsed time in milliseconds
        System.out.println("takes " + (System.nanoTime() - time) / 1000 / 1000);
    }
}
On 01/05/2011 05:03 PM, Milind Bhandarkar wrote:
I agree with Jay B. Checksumming is usually the culprit for high CPU on clients
and datanodes. Plus, a checksum of 4 bytes for every 512 bytes means that for a
64MB block, the checksums total 512KB, i.e. 128 ext3 blocks. Changing it to
generate 1 ext3 checksum block per DFS block will speed up read/write without
any loss of reliability.
- milind
---
Milind Bhandarkar
(mbhandar...@linkedin.com)
(650-776-3236)
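Milind's arithmetic can be sanity-checked quickly. This is just a sketch of
the overhead calculation; the 4KB ext3 block size is an assumption about the
local filesystem, not something Hadoop dictates:

```java
public class ChecksumOverhead {
    public static void main(String[] args) {
        final long dfsBlock = 64L * 1024 * 1024; // 64MB DFS block
        final int bytesPerChecksum = 512;        // default io.bytes.per.checksum
        final int checksumSize = 4;              // CRC32 stored as 4 bytes
        final int ext3Block = 4096;              // assumed 4KB ext3 block size

        long checksumBytes = dfsBlock / bytesPerChecksum * checksumSize;
        long ext3Blocks = checksumBytes / ext3Block;

        // prints "512KB of checksums, 128 ext3 blocks"
        System.out.println(checksumBytes / 1024 + "KB of checksums, "
                + ext3Blocks + " ext3 blocks");
    }
}
```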