I've been benchmarking Hadoop streaming against plain old command-line grep.

The setup: a single box with 4 processors, with the job configured to run 4 tasks at a time. The input is a 54 GB file with fewer than 100 bytes per line (DFS block size 128 MB), and I grep for a pattern that matches about 2% of the lines in the data set. The streaming job is launched with -mapper "/bin/grep myregexp" and -numReduceTasks 0.

Results: the MapReduce job completes in about 45 minutes on average, while command-line grep over the same data takes about 7 minutes. I repeated the test with a much smaller file (1 GB) and still saw roughly 3 minutes for MapReduce versus 7 seconds for plain grep.

Does anyone know of a better/faster way to do grep via streaming? Is there a better, more optimized version written in Java or Python? And finally, why does the approach I'm using take so long? I've determined that some of the time goes to writing the mapper output, but could read-side overhead really account for that much?

Thanks for your help!

--
Theodore Van Rooy
http://greentheo.scroggles.com
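
P.S. For concreteness, here is roughly what I'm comparing. The streaming jar location, HDFS paths, local file path, and the pattern are placeholders, not my exact setup:

  # Streaming job: grep as the mapper, no reduce phase
  # (jar and input/output paths are illustrative)
  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
      -input /user/theo/biglog \
      -output /user/theo/grep-out \
      -mapper "/bin/grep myregexp" \
      -numReduceTasks 0

  # Command-line baseline on the same data sitting on local disk
  time grep myregexp /local/data/biglog > matches.txt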
I set the job to run 4 tasks at a time per box, with one box (with 4 processors). The file is a 54 Gb file with <100 bytes per line (DFS block size 128 MB). I grep an item that shows up in about 2% of the lines in the data set. And then I set -mapper "/bin/grep myregexp" -numReduceTasks 0 MapReduce gives me a time to complete on average of about 45 minutes. Command Line Unix gives me a time to complete of about 7 minutes. Then I did the same with a much smaller file (1 GB) and still got MR=3min, Linux=7seconds) Does anyone know of a better/faster way to do grep via streaming? Is there a better, more optimized version written in Java or Python? Last, why would the method I am using take so long? I've determined that some of the time is write time (output) from the mappers... but could it really be that much overhead due to read time? Thanks for your help! -- Theodore Van Rooy http://greentheo.scroggles.com