On Mon, May 11, 2009 at 12:08 PM, Todd Lipcon <t...@cloudera.com> wrote:
> In addition to Jason's suggestion, you could also see about setting some of
> Hadoop's directories to subdirs of /dev/shm. If the dataset is really small,
> it should be easy to re-load it onto the cluster if it's lost, so even
> putting dfs.data.dir in /dev/shm might be worth trying. You'll probably
> also want mapred.local.dir in /dev/shm.
>
> Note that if in fact you don't have enough RAM to do this, you'll start
> swapping and your performance will suck like crazy :)
>
> That said, you may find that even with all storage in RAM your jobs are
> still too slow. Hadoop isn't optimized for this kind of small-job
> performance quite yet. You may find that task setup time dominates the job.
> I think it's entirely reasonable to shoot for sub-60-second jobs down the
> road, and I'd find it interesting to hear what the results are now. Hope
> you report back!
>
> -Todd
>
> On Sun, May 10, 2009 at 2:30 PM, Matt Bowyer
> <mattbowy...@googlemail.com> wrote:
>
>> Hi,
>>
>> I am trying to do 'on demand map reduce' - something which will return in
>> reasonable time (a few seconds).
>>
>> My dataset is relatively small and can fit into my datanode's memory. Is
>> it possible to keep a block in the datanode's memory so on the next job
>> the response will be much quicker? The majority of the time spent during
>> the job run appears to be during the 'HDFS_BYTES_READ' part of the job.
>> I have tried using setNumTasksToExecutePerJvm but the block still seems
>> to be cleared from memory after the job.
>>
>> thanks!
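Todd's /dev/shm suggestion amounts to pointing the DataNode storage and the MapReduce scratch space at a tmpfs (RAM-backed) mount. A minimal hadoop-site.xml sketch of what that could look like (the subdirectory paths under /dev/shm are hypothetical, pick your own):

```xml
<!-- hadoop-site.xml: keep HDFS blocks and mapred scratch space in tmpfs. -->
<!-- WARNING: anything under /dev/shm is lost on reboot, so only do this -->
<!-- for datasets that are trivial to re-load onto the cluster. -->
<property>
  <name>dfs.data.dir</name>
  <value>/dev/shm/hadoop/dfs/data</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/dev/shm/hadoop/mapred/local</value>
</property>
```

Make sure the machine actually has enough free RAM to hold the dataset plus the normal task working set, or the swapping Todd warns about will wipe out any gains.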
Also, if your data set is small you can reduce overhead (and parallelism) by lowering the number of mappers and reducers:

-Dmapred.map.tasks=11 -Dmapred.reduce.tasks=3

Or maybe even go as low as:

-Dmapred.map.tasks=1 -Dmapred.reduce.tasks=1

I use this tactic on jobs with small data sets where the processing time is much less than the overhead of starting multiple mappers/reducers and shuffling data.
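The same knobs can be fixed in the site or job configuration instead of being passed with -D on every run. A sketch of the equivalent properties, including the JVM-reuse setting that setNumTasksToExecutePerJvm controls under the hood (the values here are illustrative, not recommendations):

```xml
<!-- Keep task counts at the minimum for tiny datasets. -->
<property>
  <name>mapred.map.tasks</name>
  <value>1</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
</property>
<!-- Reuse each task JVM for an unlimited number of tasks (-1 = no limit); -->
<!-- this is the property JobConf.setNumTasksToExecutePerJvm writes. -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
```

Note that mapred.map.tasks is only a hint to the framework; the actual split count still depends on the input format and block layout, so a single small file is the surest way to get one map task.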