Ryan,

  Hadoop computes splits from the min split size (as Matt mentioned), the max 
split size, and dfs.block.size.  The resulting split size works out to 
max(minSplit, min(maxSplit, blockSize)).
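
Roughly, in Java terms (just a toy sketch of that formula, not the actual 
Hadoop source; the example sizes are made up):

    // Toy illustration of split size = max(minSplit, min(maxSplit, blockSize)).
    public class SplitSizeDemo {
        static long computeSplitSize(long minSplit, long maxSplit, long blockSize) {
            return Math.max(minSplit, Math.min(maxSplit, blockSize));
        }
        public static void main(String[] args) {
            // 1 MB min, 16 MB max, 64 MB block -> 16 MB splits
            System.out.println(computeSplitSize(1L << 20, 16L << 20, 64L << 20));
        }
    }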

I've found that for CPU-intensive jobs on smaller data sets, like the LDA work 
I was doing, this didn't split enough, or at all.  I was able to force a 
minimum number of map tasks by calculating what the split size should be and 
lowering the max (not min) split size to that value when necessary.
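
Something along these lines worked for me; the helper below is a made-up 
sketch, and the property key assumes the old-style mapred.max.split.size name 
from the Hadoop 0.20/1.x API:

    import org.apache.hadoop.conf.Configuration;

    public class ForceSplits {
        // Cut the input into at least desiredMaps pieces by lowering the max
        // split size to totalInputBytes / desiredMaps when it is larger.
        static void forceMinimumMaps(Configuration conf, long totalInputBytes, int desiredMaps) {
            long targetSplit = Math.max(1L, totalInputBytes / desiredMaps);
            long currentMax = conf.getLong("mapred.max.split.size", Long.MAX_VALUE);
            if (targetSplit < currentMax) {
                conf.setLong("mapred.max.split.size", targetSplit);
            }
        }
    }

You should get the same effect from the command line by passing 
-Dmapred.max.split.size=<bytes> (assuming the job goes through the generic 
options parser), analogous to the min-size flag Matt mentioned.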

Ryan

On Apr 19, 2013, at 16:13, Matt Molek <mpmo...@gmail.com> wrote:

> Instead of manually splitting your files, you should be able to pass
> -Dmapred.min.split.size=<split size in bytes> at the command line, or
> otherwise set the mapred.min.split.size property to get the number of
> mappers you want.
> 
> On Wed, Apr 17, 2013 at 7:55 PM, Ryan Compton <compton.r...@gmail.com> wrote:
>> Got it, thanks.
>> 
>> For some reason I had the impression that Mahout wanted one
>> (splittable) file per label. I ran a test on the 20 news groups where
>> I split soc.religion.christian.txt into several files with arbitrary
>> names before training. Mahout's trainclassifier launched as many
>> mappers as files, trained on all 599 documents, and performed well in
>> testing. Unless I'm missing something subtle, my question is answered
>> now.
>> 
>> On Wed, Apr 17, 2013 at 2:58 PM, Robin Anil <robin.a...@gmail.com> wrote:
>>> You won't, it's a tiny amount of data. Mappers are determined by the split
>>> size and the input shards. Either shard the input into more than 10 pieces
>>> or reduce the map split size.
>>> 
>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>> 
>>> 
>>> On Wed, Apr 17, 2013 at 3:32 PM, Ryan Compton <compton.r...@gmail.com> wrote:
>>> 
>>>> Any ideas where to look? Does anyone get more than 20 mappers when
>>>> running the 20 news groups data?
>>>> 
>>>> On Tue, Apr 16, 2013 at 9:04 PM, Robin Anil <robin.a...@gmail.com> wrote:
>>>>> Sounds like a config issue. The MR version should be able to parallelize
>>>>> based on the size of the input.
>>>> 
