Hey all,
I've been profiling our map/reduce applications quite a bit over the
last few weeks, trying to get some performance improvements in our
jobs, and I noticed an interesting bottleneck in Hadoop itself that I
thought I should bring up.
FSDataOutputStream appears to compute a CRC of the data being written
via FSOutputSummer.write1, using the built-in Java CRC32
implementation to do so. However, out of a 41-second reducer main
thread, this CRC call is taking up around 13 seconds, or about 32%.
That dwarfs the actual writing time (FSOutputSummer.flushBuffer),
which only takes 1.9s (5%). This seems like an incredibly large
amount of overhead to pay.
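For anyone who hasn't looked at that code path, the shape of it is
roughly the following. This is a minimal sketch of the
checksum-on-write pattern, not the actual FSOutputSummer code (which
also chunks the data and emits the checksum bytes alongside it):

import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.CRC32;

// Illustrative only: a stream that CRCs every byte before passing it
// on, mirroring the write-then-checksum pattern described above.
public class ChecksummedOutputStream extends FilterOutputStream {
    private final CRC32 crc = new CRC32();

    public ChecksummedOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        crc.update(b);
        out.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        crc.update(b, off, len);  // this update() is where the time goes
        out.write(b, off, len);
    }

    public long getChecksum() {
        return crc.getValue();
    }
}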
To my surprise, there's already a faster checksum implementation in
the Java standard library called Adler32, whose Javadoc describes it
as "almost as reliable as a CRC-32 but can be computed much faster".
This sounds very attractive indeed. Some quick tests indicate that
Adler32 is about 3x as fast.
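In case anyone wants to reproduce that, here's the kind of quick test
I mean. It's an unscientific micro-benchmark with a made-up 64 KB
buffer size and round count, so take the exact ratio with a grain of
salt:

import java.util.Random;
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

public class ChecksumBench {
    // Time how long it takes to checksum `data` `rounds` times.
    static long time(Checksum sum, byte[] data, int rounds) {
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sum.reset();
            sum.update(data, 0, data.length);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        byte[] data = new byte[64 * 1024];  // roughly one buffered write
        new Random(42).nextBytes(data);
        int rounds = 10_000;

        // Warm up the JIT before taking the real measurements.
        time(new CRC32(), data, rounds);
        time(new Adler32(), data, rounds);

        System.out.printf("CRC32:   %d ms%n",
            time(new CRC32(), data, rounds) / 1_000_000);
        System.out.printf("Adler32: %d ms%n",
            time(new Adler32(), data, rounds) / 1_000_000);
    }
}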
Is there any reason why CRC32 was chosen, or why Adler32 wouldn't be
an acceptable checksum? I understand that Adler32 is weak for small
messages (small as in hundreds of bytes), but since this is behind a
buffered writer, the messages should all be thousands of bytes to
begin with. Worst case, I guess we could select the checksum
algorithm based on the size of the message, using CRC32 for small
messages and Adler32 for bigger ones.
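Something along these lines, purely hypothetical (the 1 KB cutoff is
a guess and would need real tuning):

import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Hypothetical helper, not a proposed API: pick a checksum
// implementation based on chunk size, since Adler32's weakness only
// shows up on short messages.
public class ChecksumSelector {
    private static final int ADLER_THRESHOLD = 1024;  // guessed cutoff

    public static Checksum forLength(int len) {
        return len < ADLER_THRESHOLD ? new CRC32() : new Adler32();
    }
}

Of course the verifying side would then need to know which algorithm
produced each checksum, e.g. via a flag stored alongside it, so the
single-algorithm option is simpler if the size concern turns out not
to matter.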
-Bryan