To get random sampling and sorting:

Generate a hashcode from each of your "real" keys, then emit the hashcode
as the map output key instead of the real key. The shuffle sorts on the
hashcode, which gives a random sort. To sample, make the reducer take the
hashcode modulo N at the beginning of the method and return without writing
anything unless the result selects the record. Now make the reducer a
combiner also, so that only your desired subset of samples goes across the
wire. Each real reducer then only gets live samples, so it just saves
them. You now have a randomly sorted and sampled output. Use a Partitioner
or just one reducer, depending on size.
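
A rough sketch of the steps above, in plain Python rather than Hadoop, just
to show the data flow. The helper names (hash_key, keep, sample_and_shuffle),
the use of MD5 as the hash, and the "modulo equals zero" rule are all my
assumptions for illustration; none of them come from Mahout or SplitInput.

```python
import hashlib


def hash_key(real_key: str) -> int:
    """Stable 64-bit hashcode for a key (hypothetical helper)."""
    digest = hashlib.md5(real_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def keep(hashcode: int, modulus: int) -> bool:
    """Modulo test shared by the combiner and the reducer (assumed rule)."""
    return hashcode % modulus == 0


def sample_and_shuffle(records, modulus):
    # Map phase: key each record by the hashcode of its real key.
    mapped = [(hash_key(key), key, value) for key, value in records]
    # Combiner: drop non-samples before they would cross the wire.
    combined = [m for m in mapped if keep(m[0], modulus)]
    # Shuffle: sorting on the hashcode yields a pseudo-random order.
    combined.sort(key=lambda m: m[0])
    # Reducer: everything that arrives is a live sample, so just save it.
    return [(key, value) for _, key, value in combined]


records = [(f"doc-{i}", i) for i in range(1000)]
sample = sample_and_shuffle(records, modulus=10)
```

With modulus=10 this keeps roughly one record in ten. In a real job the
same modulo test would sit at the top of the reduce method, and the class
would be registered as the combiner too.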

This is deterministic: the same input gives the same order and the same
sample on every run. To get a different random set each time, munge each
hashcode with a random number.
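
One possible reading of "munge", sketched in plain Python: draw a single
random seed per job and mix it into the input of every hash. The name
salted_hash, the seed-prefix scheme, and the choice of one seed per job
(rather than one per record) are assumptions of mine, not something the
thread specifies.

```python
import hashlib
import random


def salted_hash(real_key: str, seed: int) -> int:
    """Hashcode that depends on both the key and a per-job seed."""
    data = f"{seed}:{real_key}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


# Assumption: the seed is drawn once per job and passed to every task,
# so that mappers, combiners and reducers all compute the same hashcode.
seed = random.getrandbits(64)

keys = [f"doc-{i}" for i in range(1000)]
order = sorted(keys, key=lambda k: salted_hash(k, seed))
```

Re-running with the same seed reproduces the same order, which is handy
for debugging; a fresh seed gives a fresh shuffle and sample.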

On Thu, Dec 8, 2011 at 10:24 AM, Raphael Cendrillon wrote:

> Hi Remi,
>
> I've started coding this up. One issue is how to generate the random
> permutation. This could be done in memory; however, for large data sets
> this is going to be an issue.
>
> Another possibility is to just generate random numbers and accept that
> repetitions will sometimes occur.
>
> A third approach is to generate a random permutation on the fly, which is
> a little more tricky.
>
> On Dec 8, 2011, at 9:49 AM, "Remi Melisson (Commented) (JIRA)" wrote:
>
> >
> >    [https://issues.apache.org/jira/browse/MAHOUT-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13165353#comment-13165353]
> >
> > Remi Melisson commented on MAHOUT-904:
> > --------------------------------------
> >
> > Hi,
> > I had a look at it too, and one question remains:
> > Do we need to randomize the whole set (training and test) or only the
> > training data?
> >
> > @Raphael Let me know if you have already started, because I planned to
> > begin dev soon.
> >
> >> SplitInput should support randomizing the input
> >> -----------------------------------------------
> >>
> >>                Key: MAHOUT-904
> >>                URL: https://issues.apache.org/jira/browse/MAHOUT-904
> >>            Project: Mahout
> >>         Issue Type: Improvement
> >>           Reporter: Grant Ingersoll
> >>           Assignee: Grant Ingersoll
> >>             Labels: MAHOUT_INTRO_CONTRIBUTE
> >>
> >> For some learning tasks, we need the input to be randomized (SGD)
> >> instead of blocks of labels all at once.  SplitInput is a useful tool
> >> for setting up train/test files but it currently doesn't support
> >> randomizing the input.
> >



-- 
Lance Norskog