ok, thank you, Jeff. Good to know. I actually expected to rely on this for a wide range of issues (most common being task jvm parameters override).
On Wed, Dec 29, 2010 at 11:29 AM, Jeff Eastman <[email protected]> wrote:

> I've found the problem: the MahoutDriver uses a Map to organize the command
> line arguments and this reorders them so that the -D arguments may not be
> first. This causes them to be treated as job-specific options, causing the
> failures. I'm working on a fix.
>
> Jeff
>
> -----Original Message-----
> From: Jeff Eastman [mailto:[email protected]]
> Sent: Tuesday, December 28, 2010 5:19 PM
> To: [email protected]
> Subject: RE: where i can set -Dmapred.map.tasks=X
>
> That's where I'm beginning to look too. It seems the driver code is working
> correctly (I thought I had tested that) but the CLI isn't.
>
> The original post was for -Dmapred.map.tasks but I noticed the reduce.tasks
> didn't work either.
>
> -----Original Message-----
> From: Dmitriy Lyubimov [mailto:[email protected]]
> Sent: Tuesday, December 28, 2010 5:15 PM
> To: [email protected]
> Subject: Re: where i can set -Dmapred.map.tasks=X
>
> Oh, so you are trying to set number of reduce tasks. i missed that, the
> original post was about # of map tasks. sorry.
>
> No, no idea why that error pops up in mahout command line. i would need to
> dig into the mahout's cli code -- i don't think i dug that deep there
> before.
>
> On Tue, Dec 28, 2010 at 5:06 PM, Jeff Eastman <[email protected]> wrote:
>
> > It's very odd: when I run k-means from Eclipse and add
> > -Dmapred.reduce.tasks=10 as the first argument the driver loves it and
> > job.getNumReduceTasks() is set correctly to 10. When I run the same
> > command line using bin/mahout, however, it fails with "Unexpected
> > -Dmapred.reduce.tasks=10 while processing Job-Specific Options."
> >
> > The CLI invocation is: ./bin/mahout kmeans -Dmapred.reduce.tasks=10 -I ...
> >
> > -----Original Message-----
> > From: Dmitriy Lyubimov [mailto:[email protected]]
> > Sent: Tuesday, December 28, 2010 4:55 PM
> > To: [email protected]
> > Subject: Re: where i can set -Dmapred.map.tasks=X
> >
> > PPS it doesn't tell you what FileInputFormat actually uses for it as a
> > property, and i don't remember off the top of my head either. but i assume
> > you could use them with -D as well.
> >
> > On Tue, Dec 28, 2010 at 4:54 PM, Dmitriy Lyubimov <[email protected]>
> > wrote:
> >
> > > In particular, QJob is one of the drivers that uses that, in the
> > > following way:
> > >
> > > if (minSplitSize > 0)
> > >     SequenceFileInputFormat.setMinInputSplitSize(job, minSplitSize);
> > >
> > > An interesting peculiarity of that parameter is that in the current
> > > hadoop release, for anything derived from FileInputFormat, it ensures
> > > that all splits are at least that big and the last split is at least
> > > 1.1 times that big. I am not quite sure why the special treatment for
> > > the last split but that's how it goes there.
> > >
> > > -Dmitriy
> > >
> > >
> > > On Tue, Dec 28, 2010 at 4:48 PM, Dmitriy Lyubimov <[email protected]>
> > > wrote:
> > >
> > >> Jeff,
> > >>
> > >> it's the mahout-376 patch, i don't think it is committed. the driver
> > >> class there is SSVDCli, for your convenience you can find it here:
> > >>
> > >> https://github.com/dlyubimov/ssvd-lsi/tree/givens-ssvd/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd
> > >>
> > >> but like i said, i did not try to use it with -D option since i wanted
> > >> to give an explicit option to increase split size if needed (and a help
> > >> for it). Another reason is that the solver has a series of jobs and only
> > >> those reading the source matrix have anything to do with the split size.
> > >>
> > >>
> > >> -d
> > >>
> > >>
> > >> On Tue, Dec 28, 2010 at 4:39 PM, Jeff Eastman <[email protected]>
> > >> wrote:
> > >>
> > >>> What's the driver class?
> > >>> If the -D parameters are working for you I want
> > >>> to compare to the clustering drivers
> > >>>
> > >>> -----Original Message-----
> > >>> From: Dmitriy Lyubimov [mailto:[email protected]]
> > >>> Sent: Tuesday, December 28, 2010 4:37 PM
> > >>> To: [email protected]
> > >>> Subject: Re: where i can set -Dmapred.map.tasks=X
> > >>>
> > >>> as far as i understand, this option is not forced. I suspect it
> > >>> actually means 'minimum degree of parallelism'. so if you expect to use
> > >>> that to reduce the number of mappers, i don't think this is expected to
> > >>> work so much. The ones that do enforce anything are min split size and
> > >>> max split size in file input so i guess you can try those. I rely on
> > >>> them (and open it up as a job-specific option) in stochastic svd.
> > >>>
> > >>> but usually forcing split size to increase creates a 'supersplits'
> > >>> problem, where a lot of data is moved around just to supply data to
> > >>> mappers, which is perhaps why this option is meant to increase
> > >>> parallelism only, but probably not to decrease it.
> > >>>
> > >>> -d
> > >>>
> > >>> On Tue, Dec 28, 2010 at 4:05 PM, Jeff Eastman <[email protected]>
> > >>> wrote:
> > >>>
> > >>> > This is supposed to be a generic option. You should be able to
> > >>> > specify Hadoop options such as this on the command line invocation of
> > >>> > your favorite Mahout routine, but I'm having a similar problem setting
> > >>> > -Dmapred.reduce.tasks=10 with Canopy and k-Means. This is both with
> > >>> > and without a space after the -D.
> > >>> >
> > >>> > Can someone point me to a Mahout command where this does work? Both
> > >>> > drivers extend AbstractJob and do the usual option processing pushups.
> > >>> > I don't have Hadoop source locally so I can't debug the generic
> > >>> > options parsing.
> > >>> >
> > >>> > -----Original Message-----
> > >>> > From: beneo_7 [mailto:[email protected]]
> > >>> > Sent: Monday, December 27, 2010 10:45 PM
> > >>> > To: [email protected]
> > >>> > Subject: where i can set -Dmapred.map.tasks=X
> > >>> >
> > >>> > i read in Mahout in Action that I should set -Dmapred.map.tasks=X
> > >>> > but it did not work for hadoop
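
For readers following the thread, the failure Jeff describes at the top (the -D arguments no longer coming first once the driver has reorganized them) can be illustrated with a small Java sketch. This is not MahoutDriver's code and not necessarily the fix that was made; the class and method names are hypothetical, and it assumes that generic options are exactly the arguments starting with -D, written either as -Dkey=value or as -D key=value. It only shows one way to keep -D arguments ahead of the job-specific ones while leaving everything else in its original order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only; not part of Mahout or Hadoop.
class GenericOptionsFirst {

    // Returns a new list with all -D options moved to the front.
    // The relative order inside each group is preserved.
    static List<String> genericOptionsFirst(List<String> args) {
        List<String> generic = new ArrayList<>();
        List<String> jobSpecific = new ArrayList<>();
        for (int i = 0; i < args.size(); i++) {
            String arg = args.get(i);
            if (arg.equals("-D") && i + 1 < args.size()) {
                // "-D key=value" form: keep the flag and its value together.
                generic.add(arg);
                generic.add(args.get(++i));
            } else if (arg.startsWith("-D")) {
                // "-Dkey=value" form.
                generic.add(arg);
            } else {
                jobSpecific.add(arg);
            }
        }
        List<String> ordered = new ArrayList<>(generic);
        ordered.addAll(jobSpecific);
        return ordered;
    }

    public static void main(String[] argv) {
        List<String> shuffled = Arrays.asList(
                "--input", "in", "-Dmapred.reduce.tasks=10", "--output", "out");
        // prints [-Dmapred.reduce.tasks=10, --input, in, --output, out]
        System.out.println(genericOptionsFirst(shuffled));
    }
}
```

An alternative with the same effect would be to store the arguments in an insertion-ordered collection such as LinkedHashMap, so that they are never reordered in the first place.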

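
On the split-size question raised in the thread, the following Java sketch models how file-based input formats in Hadoop are generally understood to size their splits. It is a standalone simulation rather than Hadoop code: the class and method names are hypothetical, the sizes are arbitrary small numbers, and the 1.1 factor is taken from the discussion above. Under these assumptions the factor behaves as a tolerance on the tail of the file: a remainder of up to 1.1 times the split size is emitted as one final split instead of a full split followed by a tiny one, which may be the reason for the special treatment of the last split.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation, for illustration only; not Hadoop's implementation.
class SplitSizeSketch {

    // Tolerance for the tail of the file (assumed value, from the thread).
    static final double SPLIT_SLOP = 1.1;

    // The effective split size is the block size clamped to [minSize, maxSize].
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Cuts full-size splits while more than SPLIT_SLOP split sizes remain,
    // then emits whatever is left as the last split.
    static List<Long> splitLengths(long fileLength, long splitSize) {
        List<Long> lengths = new ArrayList<>();
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            lengths.add(splitSize);
            remaining -= splitSize;
        }
        if (remaining != 0) {
            lengths.add(remaining);
        }
        return lengths;
    }

    public static void main(String[] argv) {
        // Raising the minimum above the block size raises the split size.
        long splitSize = computeSplitSize(64, 100, Long.MAX_VALUE);
        System.out.println(splitSize);
        System.out.println(splitLengths(205, splitSize));
        System.out.println(splitLengths(250, splitSize));
    }
}
```

In this model, raising the minimum split size yields fewer and larger splits, which matches the observation in the thread that the split-size settings are what actually constrain the number of mappers.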