Hi Rob,

I'll try to answer this. As I understand it, if you use MultithreadedMapper on the word count example with TextInputFormat, and you have 2 threads and 2 lines in your input split, the RecordReader would read line 1 and give it to map thread 1, and line 2 to map thread 2. So the same map logic would be applied to these two lines in parallel, each thread working on a whole line. This is the default behavior.

Regards,
Bejoy K S
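To make the dispatch described above concrete, here is a minimal stand-alone sketch. It does not use any Hadoop classes and is not how MultithreadedMapper is actually implemented; the names RecordDispatchSketch, Result, run and callsPerLine are hypothetical and exist only for illustration. It assumes a single shared source of lines (standing in for the RecordReader) and a fixed pool of threads that each pull one whole line at a time and run the same word count logic on it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical simulation only: no Hadoop classes are used here.
class RecordDispatchSketch {

    // Collects what the simulated map threads produced.
    static final class Result {
        final Map<String, Integer> wordCounts = new ConcurrentHashMap<>();
        // line index -> number of map() calls that received that line
        final Map<Integer, Integer> callsPerLine = new ConcurrentHashMap<>();
    }

    // Stand-in for a word count map(): it always receives one whole line.
    static void map(int lineIndex, String line, Result out) {
        out.callsPerLine.merge(lineIndex, 1, Integer::sum);
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.wordCounts.merge(word, 1, Integer::sum);
            }
        }
    }

    // Runs numThreads workers that pull lines from one shared cursor.
    static Result run(List<String> lines, int numThreads) {
        Result out = new Result();
        // Shared cursor standing in for the single record reader.
        AtomicInteger nextLine = new AtomicInteger(0);
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        for (int t = 0; t < numThreads; t++) {
            pool.execute(() -> {
                int i;
                // Each thread claims the next unread line; a line is never split.
                while ((i = nextLine.getAndIncrement()) < lines.size()) {
                    map(i, lines.get(i), out);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("map threads did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        Result r = run(Arrays.asList("foo bar lambda beta", "foo beta"), 2);
        Map<String, Integer> counts = new TreeMap<>(r.wordCounts);
        Map<Integer, Integer> calls = new TreeMap<>(r.callsPerLine);
        System.out.println("word counts: " + counts);
        System.out.println("map() calls per line: " + calls);
    }
}
```

In this sketch every line index ends up with exactly one map() call, so a 6-word line is tokenized entirely by whichever thread picked it up, not 3 words per thread. Which thread gets which line depends on scheduling, so with only two lines one thread may well process both.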
From handheld, please excuse typos.

-----Original Message-----
From: Rob Stewart <robstewar...@gmail.com>
Date: Fri, 10 Feb 2012 18:39:44
To: <common-user@hadoop.apache.org>
Reply-To: common-user@hadoop.apache.org
Subject: Re: Combining MultithreadedMapper threadpool size & map.tasks.maximum

Thanks, this is a lot clearer. One final question...

On 10 February 2012 14:20, Harsh J <ha...@cloudera.com> wrote:
> Hello again,
>
> On Fri, Feb 10, 2012 at 7:31 PM, Rob Stewart <robstewar...@gmail.com> wrote:
>> OK, take word count. The <k,v> to the map is <null,"foo bar lambda
>> beta">. The canonical Hadoop program would tokenize this line of text
>> and output <"foo",1> and so on. How would the multithreadedmapper know
>> how to further divide this line of text into, say: [<null,"foo
>> bar">,<null,"lambda beta">] for 2 threads to run in parallel? Can you
>> somehow provide an additional record reader to split the input to the
>> map task into sub-inputs for each thread?
>
> In MultithreadedMapper, the IO work is still single threaded, while
> the map() calling post-read is multithreaded. But yes you could use a
> mix of CombineFileInputFormat and some custom logic to have multiple
> local splits per map task, and divide readers of them among your
> threads. But why do all this when that's what slots at the TT are for?

I'm still unsure how the multi-threaded mapper knows how to split the input value into chunks, one chunk for each thread. There is only one example of it in the Hadoop 0.23 trunk:
hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/mapreduce/lib/map/TestMultithreadedMapper.java

And in that source code, there is no custom logic for local splits per map task at all. Again, going back to the word count example: say the input to a map is a line of text comprising 6 words, and I specify .setNumberOfThreads( 2 ). Ideally, I'd want 3 words analysed by one thread and the other 3 by the other. Is that what would happen? i.e.
- I'm unsure whether the MultithreadedMapper class does the splitting of inputs to map tasks...

Regards,