To get random sampling and sorting in one job: generate a hashcode from each of your "real" keys, then emit the hashcode as the map output key instead of the real key. Because the framework sorts by key, this gives a (pseudo-)random sort. For the sampling, make the reducer do a modulo on the hashcode at the beginning of its reduce method and return without writing anything when the key is not in the sample. Now use the same class as the combiner too, so only your desired subset of samples goes across the wire. Each real reduce call then gets only one live sample (barring hash collisions), so just save it. You now have a randomly sorted and sampled output. Use a Partitioner or just one reducer, depending on the size of the data.
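A minimal single-process sketch of the steps above, in plain Python rather than Hadoop. It assumes SHA-256 as the hash; the names hash_key, map_phase, combine and sample_and_shuffle, and the keep_one_in and salt parameters, are made up for illustration and are not Mahout or Hadoop APIs. The salt is mixed into the hash so that different values give different samples and orderings.

```python
import hashlib


def hash_key(key: str, salt: int = 0) -> int:
    """Stable 64-bit hash of a key; a different salt gives a different hash."""
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def map_phase(records, salt=0):
    """Mapper: emit (hashcode, record) instead of (real key, record)."""
    for key, value in records:
        yield hash_key(key, salt), (key, value)


def combine(pairs, keep_one_in):
    """Combiner/reducer filter: keep roughly one key in `keep_one_in`."""
    for h, record in pairs:
        if h % keep_one_in != 0:
            continue  # not in the sample: return without writing anything
        yield h, record


def sample_and_shuffle(records, keep_one_in, salt=0):
    """Simulate map -> combine -> sort by key -> reduce on one machine."""
    survivors = combine(map_phase(records, salt), keep_one_in)
    # Sorting by hashcode stands in for the framework's sort on map output keys.
    return [record for _, record in sorted(survivors, key=lambda p: p[0])]
```

For example, sample_and_shuffle([("doc%d" % i, i) for i in range(1000)], keep_one_in=10) keeps roughly a tenth of the records, ordered by hashcode. In a real job the sort would come from the shuffle, and with more than one reducer a Partitioner would decide which reducer sees which range of hashcodes.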
This is deterministic. To get a different random set each time, munge each hashcode with a random number.

On Thu, Dec 8, 2011 at 10:24 AM, Raphael Cendrillon <[email protected]> wrote:

> Hi Remi,
>
> I've started coding this up. One issue is how to generate the random
> permutation. This could be done in memory however for large data sets this
> is going to be an issue.
>
> Another possibility is to just generate random numbers and accept that
> repetitions will sometimes occur.
>
> Third approach is to generate a random permutation on the fly, which is a
> little more tricky.
>
> On Dec 8, 2011, at 9:49 AM, "Remi Melisson (Commented) (JIRA)" <[email protected]> wrote:
>
> > [ https://issues.apache.org/jira/browse/MAHOUT-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13165353#comment-13165353 ]
> >
> > Remi Melisson commented on MAHOUT-904:
> > --------------------------------------
> >
> > Hi,
> > I had a look on it too, and one question remains :
> > Do we need to randomize all the set (training and test) or only the training data ?
> >
> > @Raphael Let me know if you already started, because I planned to begin dev soon.
> >
> >> SplitInput should support randomizing the input
> >> -----------------------------------------------
> >>
> >>          Key: MAHOUT-904
> >>          URL: https://issues.apache.org/jira/browse/MAHOUT-904
> >>      Project: Mahout
> >>   Issue Type: Improvement
> >>     Reporter: Grant Ingersoll
> >>     Assignee: Grant Ingersoll
> >>       Labels: MAHOUT_INTRO_CONTRIBUTE
> >>
> >> For some learning tasks, we need the input to be randomized (SGD) instead of blocks of labels all at once. SplitInput is a useful tool for setting up train/test files but it currently doesn't support randomizing the input.
> >
> > --
> > This message is automatically generated by JIRA.

-- 
Lance Norskog
[email protected]
