Hi Ted,

Thanks for the response. I'll implement it, open a ticket, and post a patch
after I'm satisfied with the outcome.
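For concreteness, here's a rough sketch of what I'm picturing for the
two-stage approach you describe below (plain Java, no Hadoop plumbing; the
class and method names are made up). Each mapper does classic reservoir
sampling over its split and reports how many items it considered alongside
how many it retained. The reducer then weights each retained item by
considered/retained and does weighted sampling without replacement using
Efraimidis-Spirakis keys, which is just one way to make the final inclusion
probabilities come out roughly uniform:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

public class TwoStageSampleSketch {

  /** One mapper's output: the retained items plus how many items it saw. */
  static final class PartialSample<T> {
    final List<T> retained;
    final long considered;
    PartialSample(List<T> retained, long considered) {
      this.retained = retained;
      this.considered = considered;
    }
  }

  /** Mapper side: classic reservoir sampling over one input split. */
  static <T> PartialSample<T> sampleSplit(Iterable<T> split, int k, Random rng) {
    List<T> reservoir = new ArrayList<>(k);
    long considered = 0;
    for (T item : split) {
      considered++;
      if (reservoir.size() < k) {
        reservoir.add(item);
      } else {
        // Slot j is uniform on [0, considered); the new item survives
        // with probability k / considered, as in Algorithm R.
        long j = (long) (rng.nextDouble() * considered);
        if (j < k) {
          reservoir.set((int) j, item);
        }
      }
    }
    return new PartialSample<>(reservoir, considered);
  }

  /**
   * Reducer side: merge the per-mapper samples into one sample of size k.
   * An item retained by a mapper that considered n items and kept m of them
   * stands in for n/m originals, so it gets weight n/m. Keeping the k items
   * with the largest keys u^(1/w) (Efraimidis-Spirakis) is weighted sampling
   * without replacement, which makes the final inclusion probabilities of
   * the original items roughly uniform.
   */
  static <T> List<T> mergeSamples(List<PartialSample<T>> partials, int k, Random rng) {
    // Min-heap on key: the k entries with the largest keys survive.
    PriorityQueue<Keyed<T>> heap =
        new PriorityQueue<>(Comparator.comparingDouble((Keyed<T> e) -> e.key));
    for (PartialSample<T> p : partials) {
      if (p.retained.isEmpty()) {
        continue;
      }
      double weight = (double) p.considered / p.retained.size();
      for (T item : p.retained) {
        double key = Math.pow(rng.nextDouble(), 1.0 / weight);
        if (heap.size() < k) {
          heap.add(new Keyed<>(key, item));
        } else if (key > heap.peek().key) {
          heap.poll();
          heap.add(new Keyed<>(key, item));
        }
      }
    }
    List<T> result = new ArrayList<>(heap.size());
    for (Keyed<T> e : heap) {
      result.add(e.item);
    }
    return result;
  }

  static final class Keyed<T> {
    final double key;
    final T item;
    Keyed(double key, T item) {
      this.key = key;
      this.item = item;
    }
  }
}

In the real job the partial sample would of course be serialized as a
Writable with the two counts carried in the mapper output, but that's the
sampling math I plan to start from.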
Cheers,
Tim

On Mon, Aug 8, 2011 at 1:34 PM, Ted Dunning <[email protected]> wrote:

> There is no such thing now. It should be relatively easy to build. The
> simplest method is to have each mapper produce a full-sized sample, which
> is sent to a single reducer that produces another sample. For this to work
> correctly, the output of the mappers needs to include a count of the items
> retained and the items considered.
>
> This cuts down on the amount of data the reducer has to handle, but is
> otherwise similar to the single-reducer approach.
>
> On Mon, Aug 8, 2011 at 11:47 AM, Timothy Potter <[email protected]> wrote:
>
> > Is there a distributed Mahout job to produce a random sample from a
> > large collection of vectors stored in HDFS? For example, if I wanted
> > only 2M vectors randomly selected from the ASF mail archive vectors
> > (~6M total), is there a Mahout job to do this (I'm using trunk
> > 0.6-SNAPSHOT)? If not, can this be done in a distributed manner using
> > multiple reducers, or would I have to send all the vectors to one
> > reducer and then use RandomSampler in that single reducer?
> >
> > Cheers,
> > Tim
