Re: Does it make sense to use Mahout for text classification when I have a huge number of documents but a small number of labels?

2013-04-22 Thread Ryan Josal
Ryan, Hadoop computes input splits from the minimum split size (as Matt mentioned), the maximum split size, and dfs.block.size. You can calculate the split size as max(minSplit, min(maxSplit, blockSize)). I have found that for CPU-intensive operations on smaller data sets, like I was doing with
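
A minimal sketch of the formula above, in case it helps to see the arithmetic. The function names, the 64 MB block size, and the per-file ceiling estimate are assumptions for illustration; real Hadoop input formats apply extra rules (for example, non-splittable files and a small tolerance on the last split), so treat the mapper count as an estimate only.

```python
import math

MB = 1024 * 1024

def split_size(min_split, max_split, block_size):
    """Split size as described in the thread: max(minSplit, min(maxSplit, blockSize))."""
    return max(min_split, min(max_split, block_size))

def estimated_mappers(file_sizes, min_split, max_split, block_size):
    """Rough mapper estimate: each non-empty file is cut into ceil(size / split_size) splits.

    This is an approximation (hypothetical helper), not Hadoop's exact algorithm.
    """
    size = split_size(min_split, max_split, block_size)
    return sum(math.ceil(n / size) for n in file_sizes if n > 0)

if __name__ == "__main__":
    block = 64 * MB  # assumed dfs.block.size
    # With default-like min/max values the block size wins.
    print(split_size(1, 2**63 - 1, block) // MB, "MB per split")
    # Lowering the max split size below the block size yields more, smaller splits.
    print(estimated_mappers([200 * MB], 1, 16 * MB, block), "mappers (estimate)")
```

Under these assumptions, a small input that fits inside one block gets a single mapper per file unless the maximum split size is lowered below the block size.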

Re: Does it make sense to use Mahout for text classification when I have a huge number of documents but a small number of labels?

2013-04-19 Thread Matt Molek
Instead of manually splitting your files, you should be able to pass -Dmapred.min.split.size= at the command line, or otherwise set the mapred.min.split.size property to get the number of mappers you want. On Wed, Apr 17, 2013 at 7:55 PM, Ryan Compton wrote: > Got it, thanks. > > For some reason I h
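
A rough sketch of how one might pick a value for that property, assuming the max(minSplit, min(maxSplit, blockSize)) rule quoted in this thread. The helper name is hypothetical, and the use of mapred.max.split.size for the "more mappers" case is an assumption about the Hadoop version in use, not something stated in the thread; check your distribution's configuration reference before relying on it.

```python
import math

def settings_for_target_mappers(total_bytes, target_mappers, block_size):
    """Suggest split-size properties for a desired mapper count (illustrative only).

    Under max(minSplit, min(maxSplit, blockSize)):
      - raising the min split size above the block size gives fewer mappers;
      - lowering the max split size below the block size gives more mappers.
    """
    if total_bytes <= 0 or target_mappers <= 0:
        raise ValueError("total_bytes and target_mappers must be positive")
    desired = math.ceil(total_bytes / target_mappers)
    if desired > block_size:
        return {"mapred.min.split.size": desired}
    if desired < block_size:
        # Assumed property name; not mentioned in the thread.
        return {"mapred.max.split.size": desired}
    return {}  # the block size already gives the desired split size

if __name__ == "__main__":
    mb = 1024 * 1024
    # Example numbers are made up: 20 MB of input, 64 MB blocks, 40 mappers wanted.
    for key, value in settings_for_target_mappers(20 * mb, 40, 64 * mb).items():
        print(f"-D{key}={value}")
```

The point of the sketch is the direction of the adjustment: for a small, CPU-heavy input, the minimum split size alone cannot produce more mappers than there are blocks, so the maximum split size is the value that would need to shrink.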

Re: Does it make sense to use Mahout for text classification when I have a huge number of documents but a small number of labels?

2013-04-17 Thread Ryan Compton
Got it, thanks. For some reason I had the impression that Mahout wanted one (splittable) file per label. I ran a test on the 20 newsgroups data where I split soc.religion.christian.txt into several files with arbitrary names before training. Mahout's trainclassifier launched as many mappers as files,

Re: Does it make sense to use Mahout for text classification when I have a huge number of documents but a small number of labels?

2013-04-17 Thread Robin Anil
You won't; it's a tiny amount of data. The number of mappers is determined by the split size and the input shards. Either shard the input into more than 10 shards or reduce the map split size. Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Wed, Apr 17, 2013 at 3:32 PM, Ryan Compton wrote: > Any ideas where to

Re: Does it make sense to use Mahout for text classification when I have a huge number of documents but a small number of labels?

2013-04-17 Thread Ryan Compton
Any ideas where to look? Does anyone get more than 20 mappers when running the 20 newsgroups data? On Tue, Apr 16, 2013 at 9:04 PM, Robin Anil wrote: > Sounds like a config issue. The MR version should be able to parallelize > based on the size of the input.

Re: Does it make sense to use Mahout for text classification when I have a huge number of documents but a small number of labels?

2013-04-16 Thread Robin Anil
Sounds like a config issue. The MR version should be able to parallelize based on the size of the input.