On Mon, May 11, 2009 at 12:08 PM, Todd Lipcon <t...@cloudera.com> wrote:
> In addition to Jason's suggestion, you could also see about setting some of
> Hadoop's directories to subdirs of /dev/shm. If the dataset is really small,
> it should be easy to re-load it onto the cluster if it's lost, so even
> putting dfs.data.dir in /dev/shm might be worth trying.
> You'll probably also want mapred.local.dir in /dev/shm
>
> Note that if in fact you don't have enough RAM to do this, you'll start
> swapping and your performance will suck like crazy :)
>
> That said, you may find that even with all storage in RAM your jobs are
> still too slow. Hadoop isn't optimized for this kind of small-job
> performance quite yet. You may find that task setup time dominates the job.
> I think it's entirely reasonable to shoot for sub-60-second jobs down the
> road, and I'd find it interesting to hear what the results are now. Hope you
> report back!
>
> -Todd
>
> On Sun, May 10, 2009 at 2:30 PM, Matt Bowyer <mattbowy...@googlemail.com> wrote:
>
>> Hi,
>>
>> I am trying to do 'on demand map reduce' - something which will return in
>> reasonable time (a few seconds).
>>
>> My dataset is relatively small and can fit into my datanode's memory. Is it
>> possible to keep a block in the datanode's memory so the response on the
>> next job will be much quicker? The majority of the time spent during the
>> job run appears to be during the 'HDFS_BYTES_READ' part of the job. I have
>> tried using setNumTasksToExecutePerJvm, but the block still seems to be
>> cleared from memory after the job.
>>
>> thanks!
>>
>

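For reference, Todd's /dev/shm suggestion would look roughly like this in the
config file (property names are the 0.20-era ones; the /dev/shm subdirectories
are just illustrative, and you need enough free RAM to back them):

```xml
<!-- hadoop-site.xml (or hdfs-site.xml / mapred-site.xml on 0.20+) -->
<!-- Paths below are illustrative; create the directories before starting
     the daemons, and remember /dev/shm is wiped on reboot. -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dev/shm/hadoop/dfs/data</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/dev/shm/hadoop/mapred/local</value>
  </property>
</configuration>
```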
Also, if your data set is small, you can reduce overhead (and parallelism)
by lowering the number of mappers and reducers:

-Dmapred.map.tasks=11
-Dmapred.reduce.tasks=3

Or maybe even go as low as:

-Dmapred.map.tasks=1
-Dmapred.reduce.tasks=1

I use this tactic on jobs with small data sets where the processing
time is much less than the overhead of starting multiple mappers/
reducers and shuffling data.
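On the command line that looks something like the sketch below (jar, class,
and paths are placeholders; the -D flags are only picked up if the job parses
generic options, e.g. via ToolRunner):

```shell
# Placeholder jar/class/paths; run against a job that uses
# ToolRunner / GenericOptionsParser so -D properties are honored.
hadoop jar myjob.jar MyJob \
    -Dmapred.map.tasks=1 \
    -Dmapred.reduce.tasks=1 \
    input/ output/
```

Note that mapred.map.tasks is only a hint to the framework; the actual map
count is driven by the InputFormat's splits, so for a small data set in a
single block you should get one mapper anyway.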
