In particular, QJob is one of the drivers that uses that option, in the
following way:

if (minSplitSize > 0)
  SequenceFileInputFormat.setMinInputSplitSize(job, minSplitSize);

An interesting peculiarity of that parameter is that in the current Hadoop
release, for anything derived from FileInputFormat, it ensures that all splits
are at least that big and the last split is at least 1.1 times that big. I
am not quite sure why the last split gets special treatment, but that's how it
goes there.
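
To make the split arithmetic concrete, here is a minimal sketch in plain Java. It is a simplified model, not Hadoop's code: the class and method names (SplitSketch, computeSplitSize, splitLengths) are hypothetical, and the 1.1 slack factor and the max/min formula are assumptions based on my reading of FileInputFormat. Under this model a tail of at most 10% of a split is folded into the last split rather than getting its own mapper, so the last split can reach about 1.1 times the split size; how closely a given release matches this is worth checking against its source.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of file splitting with a minimum split size.
// All names here are hypothetical; this is not the Hadoop implementation.
final class SplitSketch {
    // Assumed slack factor: a tail of up to 10% is merged into the last split.
    static final double SPLIT_SLOP = 1.1;

    // Assumed rule: the minimum size wins over the block size,
    // and the maximum size caps the block size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Cut a file of the given length into split lengths.
    static List<Long> splitLengths(long fileLength, long splitSize) {
        List<Long> lengths = new ArrayList<>();
        long remaining = fileLength;
        // Emit full-size splits while more than 1.1 splits' worth remains.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            lengths.add(splitSize);
            remaining -= splitSize;
        }
        // Whatever is left becomes the last split.
        if (remaining != 0) {
            lengths.add(remaining);
        }
        return lengths;
    }

    public static void main(String[] args) {
        // Raising the minimum above the block size raises the split size.
        System.out.println(computeSplitSize(64, 128, Long.MAX_VALUE));
        // A 5% tail is folded into the last split.
        System.out.println(splitLengths(205, 100));
        // A 50% tail gets a split of its own.
        System.out.println(splitLengths(250, 100));
    }
}
```

The sizes in main are unitless on purpose; only the ratios matter for the slack check.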

-Dmitriy


On Tue, Dec 28, 2010 at 4:48 PM, Dmitriy Lyubimov <[email protected]> wrote:

> Jeff,
>
> it's the MAHOUT-376 patch; i don't think it is committed. the driver class
> there is SSVDCli; for your convenience you can find it here:
> https://github.com/dlyubimov/ssvd-lsi/tree/givens-ssvd/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd
>
> but like i said, i did not try to use it with -D option since i wanted to
> give an explicit option to increase split size if needed (and a help for
> it). Another reason is that solver has a series of jobs and only those
> reading the source matrix have anything to do with the split size.
>
>
> -d
>
>
> On Tue, Dec 28, 2010 at 4:39 PM, Jeff Eastman <[email protected]> wrote:
>
>> What's the driver class? If the -D parameters are working for you I want
>> to compare to the clustering drivers
>>
>> -----Original Message-----
>> From: Dmitriy Lyubimov [mailto:[email protected]]
>> Sent: Tuesday, December 28, 2010 4:37 PM
>> To: [email protected]
>> Subject: Re: where i can set -Dmapred.map.tasks=X
>>
>> as far as i understand, this option is not forced. I suspect it actually
>> means 'minimum degree of parallelism'. so if you expect to use that to
>> reduce number of mappers, i don't think this is expected to work so much.
>> The ones that do enforce anything are min split size and max split size in
>> file input so i guess you can try those. I rely on them (and open it up as
>> a job-specific option) in stochastic svd.
>>
>> but usually forcing split size to increase creates a 'supersplits' problem,
>> where a lot of data is moved around just to supply data to mappers, which
>> is perhaps why this option is meant to increase parallelism only, but
>> probably not to decrease it.
>>
>> -d
>>
>> On Tue, Dec 28, 2010 at 4:05 PM, Jeff Eastman <[email protected]> wrote:
>>
>> > This is supposed to be a generic option. You should be able to specify
>> > Hadoop options such as this on the command line invocation of your
>> > favorite Mahout routine, but I'm having a similar problem setting
>> > -Dmapred.reduce.tasks=10 with Canopy and k-Means. This is both with and
>> > without a space after the -D.
>> >
>> > Can someone point me to a Mahout command where this does work? Both
>> > drivers extend AbstractJob and do the usual option processing pushups. I
>> > don't have Hadoop source locally so I can't debug the generic options
>> > parsing.
>> >
>> > -----Original Message-----
>> > From: beneo_7 [mailto:[email protected]]
>> > Sent: Monday, December 27, 2010 10:45 PM
>> > To: [email protected]
>> > Subject: where i can set -Dmapred.map.tasks=X
>> >
>> > i read in Mahout in Action that I should set -Dmapred.map.tasks=X
>> > but it did not work for hadoop
>> >
>>
>
>
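
The thread's point that mapred.map.tasks behaves like a 'minimum degree of parallelism' can be illustrated with a small model. This is a sketch under assumptions, not Hadoop's code: the names (MapTasksHintSketch, splitSize, approxSplits) are hypothetical, and the formula max(minSize, min(totalSize / requestedMaps, blockSize)) is my understanding of how the older file input format picks a split size. Under that formula, asking for fewer maps cannot push the split size above the block size, while raising the min split size can.

```java
// Simplified model of how a requested map count might influence split size.
// All names are hypothetical; this is not the Hadoop implementation.
final class MapTasksHintSketch {
    // Assumed rule: the requested map count only sets a goal size,
    // which is capped by the block size and floored by the min split size.
    static long splitSize(long totalSize, int requestedMaps, long minSize, long blockSize) {
        long goalSize = totalSize / Math.max(1, requestedMaps);
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Rough split count, ignoring per-file boundaries and any slack factor.
    static long approxSplits(long totalSize, long splitSize) {
        return (totalSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long total = 1024; // input size, in MB for readability
        long block = 64;   // block size, same unit

        // Requesting only 2 maps: the goal size is capped at the block size.
        long few = splitSize(total, 2, 1, block);
        System.out.println(few + " -> " + approxSplits(total, few) + " splits");

        // Requesting 64 maps: the goal size drops below the block size.
        long many = splitSize(total, 64, 1, block);
        System.out.println(many + " -> " + approxSplits(total, many) + " splits");

        // Raising the min split size is what actually reduces the split count.
        long forced = splitSize(total, 2, 256, block);
        System.out.println(forced + " -> " + approxSplits(total, forced) + " splits");
    }
}
```

In this model the first case still yields 16 splits despite the request for 2, which matches the observation that the option increases parallelism but does not decrease it.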
