Hi Ted,

Thanks for the response. I'll implement it, open a ticket, and post a patch
after I'm satisfied with the outcome.
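For concreteness, here's a rough sketch of what I'm picturing for the
two-stage approach you describe below (plain Java, no Hadoop plumbing; the
class and method names are made up). Each mapper does classic reservoir
sampling over its split and reports how many items it considered alongside
how many it retained. The reducer then weights each retained item by
considered/retained and does weighted sampling without replacement using
Efraimidis-Spirakis keys, which is just one way to make the final inclusion
probabilities come out roughly uniform:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

public class TwoStageSampleSketch {

  /** One mapper's output: the retained items plus how many items it saw. */
  static final class PartialSample<T> {
    final List<T> retained;
    final long considered;
    PartialSample(List<T> retained, long considered) {
      this.retained = retained;
      this.considered = considered;
    }
  }

  /** Mapper side: classic reservoir sampling over one input split. */
  static <T> PartialSample<T> sampleSplit(Iterable<T> split, int k, Random rng) {
    List<T> reservoir = new ArrayList<>(k);
    long considered = 0;
    for (T item : split) {
      considered++;
      if (reservoir.size() < k) {
        reservoir.add(item);
      } else {
        // Slot j is uniform on [0, considered); the new item survives
        // with probability k / considered, as in Algorithm R.
        long j = (long) (rng.nextDouble() * considered);
        if (j < k) {
          reservoir.set((int) j, item);
        }
      }
    }
    return new PartialSample<>(reservoir, considered);
  }

  /**
   * Reducer side: merge the per-mapper samples into one sample of size k.
   * An item retained by a mapper that considered n items and kept m of them
   * stands in for n/m originals, so it gets weight n/m. Keeping the k items
   * with the largest keys u^(1/w) (Efraimidis-Spirakis) is weighted sampling
   * without replacement, which makes the final inclusion probabilities of
   * the original items roughly uniform.
   */
  static <T> List<T> mergeSamples(List<PartialSample<T>> partials, int k, Random rng) {
    // Min-heap on key: the k entries with the largest keys survive.
    PriorityQueue<Keyed<T>> heap =
        new PriorityQueue<>(Comparator.comparingDouble((Keyed<T> e) -> e.key));
    for (PartialSample<T> p : partials) {
      if (p.retained.isEmpty()) {
        continue;
      }
      double weight = (double) p.considered / p.retained.size();
      for (T item : p.retained) {
        double key = Math.pow(rng.nextDouble(), 1.0 / weight);
        if (heap.size() < k) {
          heap.add(new Keyed<>(key, item));
        } else if (key > heap.peek().key) {
          heap.poll();
          heap.add(new Keyed<>(key, item));
        }
      }
    }
    List<T> result = new ArrayList<>(heap.size());
    for (Keyed<T> e : heap) {
      result.add(e.item);
    }
    return result;
  }

  static final class Keyed<T> {
    final double key;
    final T item;
    Keyed(double key, T item) {
      this.key = key;
      this.item = item;
    }
  }
}

In the real job the partial sample would of course be serialized as a
Writable with the two counts carried in the mapper output, but that's the
sampling math I plan to start from.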
Cheers,
Tim

On Mon, Aug 8, 2011 at 1:34 PM, Ted Dunning <[email protected]> wrote:

> There is no such thing now. It should be relatively easy to build. The
> simplest method is to have each mapper produce a full-sized sample, which
> is sent to a single reducer that produces another sample. For this to work
> correctly, the output of the mappers needs to include a count of the items
> retained and the items considered.
>
> This cuts down on the amount of data the reducer has to handle, but is
> otherwise similar to the single-reducer approach.
>
> On Mon, Aug 8, 2011 at 11:47 AM, Timothy Potter <[email protected]> wrote:
>
> > Is there a distributed Mahout job to produce a random sample from a
> > large collection of vectors stored in HDFS? For example, if I wanted
> > only 2M vectors randomly selected from the ASF mail archive vectors
> > (~6M total), is there a Mahout job to do this (I'm using trunk
> > 0.6-SNAPSHOT)? If not, can this be done in a distributed manner using
> > multiple reducers, or would I have to send all the vectors to one
> > reducer and then use RandomSampler in that single reducer?
> >
> > Cheers,
> > Tim
