On Thu, Aug 20, 2009 at 10:30 AM, Harish Mallipeddi <harish.mallipe...@gmail.com> wrote:
> On Thu, Aug 20, 2009 at 2:39 PM, roman kolcun <roman.w...@gmail.com> wrote:
>>
>> Hello Harish,
>>
>> I know that TaskTracker creates separate threads (up to
>> mapred.tasktracker.map.tasks.maximum) which execute the map() function.
>> However, I haven't found the piece of code which associates a FileSplit
>> with a given map thread. Is it downloaded locally in the TaskTracker
>> or in MapTask?
>
> Yes, this is done by the MapTask.

Thanks, I will have a closer look into it.

>> I know I can increase the input size per task by changing
>> 'mapred.min.split.size'; however, the file is split sequentially, and
>> very rarely are two consecutive HDFS blocks stored on a single node. This
>> means that data locality will not be exploited, because every map() will
>> have to download part of the file over the network.
>>
>> Roman Kolcun
>
> I see what you mean - you want to modify the hadoop code to allocate
> multiple (non-sequential) data-local blocks to one MapTask.

That's exactly what I want to do.

> I don't know if you'll achieve much by doing all that work.

Basically I would like to emulate a larger DFS block size. I've performed two word count benchmarks on a cluster of 10 machines with a 100GB file. With a 64MB block size it took 2035 seconds; when I increased it to 256MB it took 1694 seconds - a 16.76% reduction in runtime.

> Hadoop lets you reuse the launched JVMs for multiple MapTasks. That should
> minimize the overhead of launching MapTasks.
> Increasing the DFS blocksize for the input files is another means to
> achieve the same effect.

Do you think that this overhead could be eliminated by reusing JVMs? I am doing this as a project for my university degree, so I really hope it will lower the processing time significantly, and I would like it to work across different block sizes.

Thank you for your help.

Roman Kolcun
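For what it's worth, both knobs Harish mentions can be tried without code changes, by passing them per-job on the command line. A sketch using the 0.20-era property names (assumes the job's main class uses GenericOptionsParser so that -D options are picked up; the jar name, class name, and paths below are placeholders):

```shell
# Reuse each task JVM for an unlimited number of tasks (-1) instead of
# forking a fresh JVM per MapTask, and raise the minimum split size to
# 256 MB (268435456 bytes) so each MapTask processes more input.
hadoop jar wordcount.jar WordCount \
  -D mapred.job.reuse.jvm.num.tasks=-1 \
  -D mapred.min.split.size=268435456 \
  /input /output
```

Note that a larger mapred.min.split.size only concatenates consecutive blocks into one split, which is exactly the locality problem described above; the DFS block size itself (dfs.block.size) is fixed when a file is written, so emulating a larger block size for an existing file would indeed need scheduler-side changes.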