I've found the problem: MahoutDriver uses a Map to organize the command-line arguments, and this reorders them so that the -D arguments may no longer come first. They are then treated as job-specific options, which causes the failures. I'm working on a fix.
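For anyone following along, here is a minimal sketch of the kind of reordering involved. It is not the MahoutDriver code and not the eventual fix: the class ArgOrderSketch and the method genericOptionsFirst are hypothetical names, and the sketch assumes that generic options can be recognized by their -D prefix (with or without a space before key=value). The idea is simply that, whatever order a Map hands the arguments back in, the -D options are moved to the front before the job sees them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not taken from MahoutDriver. */
final class ArgOrderSketch {

  /**
   * Returns a new list in which every -D generic option comes first.
   * Relative order is preserved within the generic options and within
   * the remaining (job-specific) arguments.
   */
  static List<String> genericOptionsFirst(List<String> args) {
    List<String> generic = new ArrayList<>();
    List<String> jobSpecific = new ArrayList<>();
    for (int i = 0; i < args.size(); i++) {
      String arg = args.get(i);
      if (arg.equals("-D") && i + 1 < args.size()) {
        // "-D key=value" form: keep the flag and its value together.
        generic.add(arg);
        generic.add(args.get(++i));
      } else if (arg.startsWith("-D")) {
        // "-Dkey=value" form (or a trailing bare "-D").
        generic.add(arg);
      } else {
        jobSpecific.add(arg);
      }
    }
    List<String> ordered = new ArrayList<>(generic);
    ordered.addAll(jobSpecific);
    return ordered;
  }

  public static void main(String[] unused) {
    // An order a Map-based driver might produce: -D is no longer first.
    List<String> scrambled = Arrays.asList(
        "--input", "in", "-Dmapred.reduce.tasks=10", "--output", "out");
    System.out.println(genericOptionsFirst(scrambled));
  }
}
```

Another option would be to keep the arguments in a java.util.LinkedHashMap, which preserves insertion order, but that only helps if the -D options were first on the original command line.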
Jeff

-----Original Message-----
From: Jeff Eastman [mailto:jeast...@narus.com]
Sent: Tuesday, December 28, 2010 5:19 PM
To: d...@mahout.apache.org
Subject: RE: where i can set -Dmapred.map.tasks=X

That's where I'm beginning to look too. It seems the driver code is working correctly (I thought I had tested that) but the CLI isn't. The original post was for -Dmapred.map.tasks, but I noticed that -Dmapred.reduce.tasks didn't work either.

-----Original Message-----
From: Dmitriy Lyubimov [mailto:dlie...@gmail.com]
Sent: Tuesday, December 28, 2010 5:15 PM
To: d...@mahout.apache.org
Subject: Re: where i can set -Dmapred.map.tasks=X

Oh, so you are trying to set the number of reduce tasks. i missed that; the original post was about the # of map tasks. sorry.

No, no idea why that error pops up in the mahout command line. i would need to dig into mahout's cli code -- i don't think i dug that deep there before.

On Tue, Dec 28, 2010 at 5:06 PM, Jeff Eastman <jeast...@narus.com> wrote:

> It's very odd: when I run k-means from Eclipse and add
> -Dmapred.reduce.tasks=10 as the first argument, the driver loves it and
> job.getNumReduceTasks() is set correctly to 10. When I run the same command
> line using bin/mahout, however, it fails with "Unexpected
> -Dmapred.reduce.tasks=10 while processing Job-Specific Options".
>
> The CLI invocation is: ./bin/mahout kmeans -Dmapred.reduce.tasks=10 -I ...
>
> -----Original Message-----
> From: Dmitriy Lyubimov [mailto:dlie...@gmail.com]
> Sent: Tuesday, December 28, 2010 4:55 PM
> To: d...@mahout.apache.org
> Subject: Re: where i can set -Dmapred.map.tasks=X
>
> PPS it doesn't tell you what FileInputFormat actually uses for it as a
> property, and i don't remember off the top of my head either. but i assume
> you could use them with -D as well.
>
> On Tue, Dec 28, 2010 at 4:54 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> > In particular, QJob is one of the drivers that uses that, in the following
> > way:
> >
> > if (minSplitSize > 0)
> >   SequenceFileInputFormat.setMinInputSplitSize(job, minSplitSize);
> >
> > An interesting peculiarity of that parameter is that in the current hadoop
> > release, for anything derived from FileInputFormat, it ensures that all
> > splits are at least that big and the last split is at least 1.1 times that
> > big. I am not quite sure why the last split gets special treatment, but
> > that's how it goes there.
> >
> > -Dmitriy
> >
> > On Tue, Dec 28, 2010 at 4:48 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >
> >> Jeff,
> >>
> >> it's the mahout-376 patch; i don't think it is committed. the driver class
> >> there is SSVDCli; for your convenience you can find it here:
> >> https://github.com/dlyubimov/ssvd-lsi/tree/givens-ssvd/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd
> >>
> >> but like i said, i did not try to use it with the -D option since i wanted
> >> to give an explicit option to increase split size if needed (and a help
> >> for it). Another reason is that the solver has a series of jobs, and only
> >> those reading the source matrix have anything to do with the split size.
> >>
> >> -d
> >>
> >> On Tue, Dec 28, 2010 at 4:39 PM, Jeff Eastman <jeast...@narus.com> wrote:
> >>
> >>> What's the driver class? If the -D parameters are working for you I want
> >>> to compare to the clustering drivers
> >>>
> >>> -----Original Message-----
> >>> From: Dmitriy Lyubimov [mailto:dlie...@gmail.com]
> >>> Sent: Tuesday, December 28, 2010 4:37 PM
> >>> To: d...@mahout.apache.org
> >>> Subject: Re: where i can set -Dmapred.map.tasks=X
> >>>
> >>> as far as i understand, this option is not forced. I suspect it actually
> >>> means 'minimum degree of parallelism'.
> >>> so if you expect to use that to reduce the number of mappers, i don't
> >>> think this is expected to work so much. The ones that do enforce anything
> >>> are min split size and max split size in file input, so i guess you can
> >>> try those. I rely on them (and open them up as a job-specific option) in
> >>> stochastic svd.
> >>>
> >>> but usually forcing split size to increase creates a 'supersplits'
> >>> problem, where a lot of data is moved around just to supply data to
> >>> mappers. which is perhaps why this option is meant to increase
> >>> parallelism only, but probably not to decrease it.
> >>>
> >>> -d
> >>>
> >>> On Tue, Dec 28, 2010 at 4:05 PM, Jeff Eastman <jeast...@narus.com> wrote:
> >>>
> >>> > This is supposed to be a generic option. You should be able to specify
> >>> > Hadoop options such as this on the command line invocation of your
> >>> > favorite Mahout routine, but I'm having a similar problem setting
> >>> > -Dmapred.reduce.tasks=10 with Canopy and k-Means. This is both with and
> >>> > without a space after the -D.
> >>> >
> >>> > Can someone point me to a Mahout command where this does work? Both
> >>> > drivers extend AbstractJob and do the usual option processing pushups. I
> >>> > don't have Hadoop source locally so I can't debug the generic options
> >>> > parsing.
> >>> >
> >>> > -----Original Message-----
> >>> > From: beneo_7 [mailto:bene...@163.com]
> >>> > Sent: Monday, December 27, 2010 10:45 PM
> >>> > To: d...@mahout.apache.org
> >>> > Subject: where i can set -Dmapred.map.tasks=X
> >>> >
> >>> > i read in Mahout in Action that I should set -Dmapred.map.tasks=X
> >>> > but it did not work for hadoop