In addition to Jason's suggestion, you could also try setting some of Hadoop's directories to subdirectories of /dev/shm. If the dataset is really small, it should be easy to re-load it onto the cluster if it's lost, so even putting dfs.data.dir in /dev/shm might be worth trying. You'll probably also want mapred.local.dir in /dev/shm as well.
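Roughly what that might look like (property names as in Hadoop 0.x-era config files; the /dev/shm subdirectory paths here are just illustrative, and the directories need to exist and be writable by the Hadoop user):

```xml
<!-- hdfs-site.xml: keep DataNode block storage on tmpfs (illustrative path) -->
<property>
  <name>dfs.data.dir</name>
  <value>/dev/shm/hdfs/data</value>
</property>

<!-- mapred-site.xml: keep intermediate map output on tmpfs too -->
<property>
  <name>mapred.local.dir</name>
  <value>/dev/shm/mapred/local</value>
</property>
```

Since /dev/shm is tmpfs, everything under it is gone after a reboot, so this only makes sense when the data is trivially re-loadable, as above.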
Note that if in fact you don't have enough RAM to do this, you'll start swapping and your performance will suck like crazy :)

That said, you may find that even with all storage in RAM your jobs are still too slow. Hadoop isn't optimized for this kind of small-job performance quite yet. You may find that task setup time dominates the job. I think it's entirely reasonable to shoot for sub-60-second jobs down the road, and I'd find it interesting to hear what the results are now. Hope you report back!

-Todd

On Sun, May 10, 2009 at 2:30 PM, Matt Bowyer <mattbowy...@googlemail.com> wrote:

> Hi,
>
> I am trying to do 'on demand map reduce' - something which will return in
> reasonable time (a few seconds).
>
> My dataset is relatively small and can fit into my datanode's memory. Is it
> possible to keep a block in the datanode's memory so on the next job the
> response will be much quicker? The majority of the time spent during the job
> run appears to be during the 'HDFS_BYTES_READ' part of the job. I have tried
> using the setNumTasksToExecutePerJvm but the block still seems to be cleared
> from memory after the job.
>
> thanks!