What are the algorithms and codecs Hadoop uses to compress data and pass it around between mappers and reducers? I'm curious to understand what effect, if any, they have on double-precision values.
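For reference, this is the kind of intermediate (map-output) compression setting I have in mind. This is only a minimal sketch using the old mapred API; the codec choice here is illustrative, not what my job actually uses:

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class MapOutputCompressionExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();

    // Compress the intermediate map output that gets shuffled to the reducers.
    conf.setCompressMapOutput(true);

    // Codec choice is just an example; DefaultCodec, BZip2Codec, etc. are alternatives.
    conf.setMapOutputCompressorClass(GzipCodec.class);

    // Equivalent raw properties (pre-YARN names):
    //   mapred.compress.map.output = true
    //   mapred.map.output.compression.codec = org.apache.hadoop.io.compress.GzipCodec
  }
}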
So far my trainer (MAHOUT-627) uses unscaled EM training, and I'm about to start work on using log-scaled values to improve accuracy and minimize underflow. It will be interesting to compare the accuracy of the unscaled and log-scaled variants, hence my curiosity.
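By "log-scaled" I mean keeping probabilities in log space and combining them with the standard log-sum-exp trick so tiny values don't underflow to zero. A rough sketch of that idea (illustrative only, not code from MAHOUT-627):

public final class LogSumExp {
  private LogSumExp() {}

  /** Returns log(exp(a) + exp(b)) without leaving log space. */
  public static double logAdd(double a, double b) {
    if (Double.isInfinite(a) && a < 0) return b;   // a represents log(0)
    if (Double.isInfinite(b) && b < 0) return a;   // b represents log(0)
    double max = Math.max(a, b);
    // Factoring out the max keeps both exponentials in a safe range.
    return max + Math.log(Math.exp(a - max) + Math.exp(b - max));
  }

  public static void main(String[] args) {
    // Two tiny probabilities whose product/sum would underflow if handled directly.
    double logP1 = -745.0;
    double logP2 = -746.0;
    System.out.println("log(p1 + p2) = " + logAdd(logP1, logP2));
  }
}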
