On Thu, Aug 20, 2009 at 10:30 AM, Harish Mallipeddi <
harish.mallipe...@gmail.com> wrote:

> On Thu, Aug 20, 2009 at 2:39 PM, roman kolcun <roman.w...@gmail.com>
> wrote:
>
> >
> > Hello Harish,
> >
> > I know that TaskTracker creates separate threads (up to
> > mapred.tasktracker.map.tasks.maximum) which execute the map() function.
> > However, I haven't found the piece of code which associates a FileSplit
> > with a given map thread. Is the split's data downloaded locally by the
> > TaskTracker itself or by the MapTask?
> >
> >
> >
> Yes, this is done by the MapTask.


Thanks, I'll take a closer look at it.
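For my own reference, here is a minimal sketch of how I currently understand the split being consumed. It uses the public org.apache.hadoop.mapred API rather than the internal MapTask code, so the internals may well differ, and the driver method below is only my own illustration:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class SplitReadSketch {
    // Hypothetical driver: reads one FileSplit the way I believe a map task
    // does, i.e. the task opens the split itself through the InputFormat's
    // RecordReader; the bytes are streamed from HDFS as records are read,
    // not downloaded beforehand by the TaskTracker.
    public static void readSplit(JobConf conf, Path file, long start, long length)
            throws IOException {
        TextInputFormat inputFormat = new TextInputFormat();
        inputFormat.configure(conf);

        // A split is only (path, offset, length, hosts).
        FileSplit split = new FileSplit(file, start, length, (String[]) null);
        RecordReader<LongWritable, Text> reader =
                inputFormat.getRecordReader(split, conf, Reporter.NULL);

        LongWritable key = reader.createKey();
        Text value = reader.createValue();
        while (reader.next(key, value)) {
            // map(key, value, ...) would be invoked here by the map runner
        }
        reader.close();
    }
}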

>
>
> >
> > I know I can increase the input split size by changing
> > 'mapred.min.split.size'; however, the file is split sequentially, and
> > very rarely are two consecutive HDFS blocks stored on the same node.
> > This means that data locality will not be exploited, because every map()
> > will have to download part of its split over the network.
> >
> > Roman Kolcun
> >
>
> I see what you mean - you want to modify the Hadoop code to allocate
> multiple (non-sequential) data-local blocks to one MapTask.


That's exactly what I want to do.
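Just to make sure we are talking about the same thing, here is a rough sketch of the split-construction step I have in mind: grouping a file's blocks by host instead of taking them sequentially. The class and method names are mine, not Hadoop's; only the FileSystem/BlockLocation calls are the real API.

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class HostGroupedBlocks {
    // Hypothetical helper: group the blocks of a file by the first host that
    // stores a replica, so that one (multi-block) split could be built per
    // host. This only sketches the idea, not a real getSplits() implementation.
    public static Map<String, List<BlockLocation>> groupByHost(JobConf conf, Path input)
            throws IOException {
        FileSystem fs = input.getFileSystem(conf);
        FileStatus status = fs.getFileStatus(input);
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        Map<String, List<BlockLocation>> byHost =
                new HashMap<String, List<BlockLocation>>();
        for (BlockLocation block : blocks) {
            String[] hosts = block.getHosts();
            String host = hosts.length > 0 ? hosts[0] : "unknown";
            List<BlockLocation> list = byHost.get(host);
            if (list == null) {
                list = new ArrayList<BlockLocation>();
                byHost.put(host, list);
            }
            list.add(block);
        }
        // Each per-host list could then become one split covering several
        // non-contiguous (offset, length) ranges of the same file.
        return byHost;
    }
}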


> I don't know if you'll achieve much by doing all that work.


Basically I would like to emulate a larger DFS block size. I've run two
word count benchmarks on a cluster of 10 machines with a 100 GB file. With a
64 MB block size it took 2035 seconds; when I increased it to 256 MB it took
1694 seconds - a 16.76% reduction in runtime.
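For reference, the block size only takes effect when a file is written, so the 256 MB run meant re-writing the input. A rough sketch of that step (the path is just a placeholder, and I'm assuming the pre-0.21 property name dfs.block.size):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithLargerBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.block.size is read by the client at create() time, so it only
        // affects files written with this configuration.
        conf.setLong("dfs.block.size", 256L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);
        // Placeholder path for the benchmark input.
        FSDataOutputStream out =
                fs.create(new Path("/benchmarks/wordcount/input.txt"));
        // ... stream the 100 GB of input through 'out' ...
        out.close();
    }
}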


> Hadoop lets you reuse the launched JVMs for multiple MapTasks. That should
> minimize the overhead of launching MapTasks. Increasing the DFS blocksize
> for the input files is another means to achieve the same effect.
>
>

Do you think that this difference could be eliminated just by reusing JVMs?
I am doing this as a project for my university degree, so I really hope it will
lower the processing time significantly. I would also like to make it work for
arbitrary block sizes.
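As far as I can tell, enabling JVM reuse on its own is a one-line job setting, so it should be easy to compare against - a sketch (setNumTasksToExecutePerJvm corresponds to mapred.job.reuse.jvm.num.tasks):

import org.apache.hadoop.mapred.JobConf;

public class EnableJvmReuse {
    public static void configure(JobConf conf) {
        // -1 lets a launched JVM run an unlimited number of tasks of the same
        // job; any value greater than 1 enables reuse. This sets
        // mapred.job.reuse.jvm.num.tasks under the hood.
        conf.setNumTasksToExecutePerJvm(-1);
    }
}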

Thank you for your help.

Roman Kolcun
