[ https://issues.apache.org/jira/browse/MAHOUT-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168453#comment-13168453 ]
jirapos...@reviews.apache.org commented on MAHOUT-904:
------------------------------------------------------

bq. On 2011-12-13 13:19:13, Grant Ingersoll wrote:
bq. > Thoughts:
bq. > This class is often run from the command line, so we should add CLI support for telling it to randomly permute.
bq. >
bq. > I wonder if we should make this a map-reduce job. Perhaps we split out the existing version and leave it as is, and then add a new MR one that can do the permutation. One idea there would be to generate random keys (by appending onto the existing key) and let the shuffle effectively do the permutation. Then, during the reduce phase, we simply strip off the random part of the key and output. I don't know how badly this would hurt the shuffle, but it seems like it would work functionally anyway.
bq. >
bq. > Otherwise, the approach seems reasonable. I don't know offhand if there is a better way of doing it (even though I wish there were).
bq.
bq. Ted Dunning wrote:
bq.     Separating the randomization sounds like a nice idea. I still think that the SGD jobs need to be able to randomize within a single map as well.
bq.
bq.     Permuting in the shuffle should work fine. Lance had a similar suggestion.

I think there are two tasks required here. One is to randomize the training examples within a split, and the other is to randomize the order of different splits. I'll update this to use map reduce to randomize the splits as well. Lance had a good suggestion for this based on hashing/randomizing the key.

Given that we will be parallelizing this, I guess each split should fit comfortably into memory? If that's the case, randomization of the lines within a split can be done much more efficiently.

- Raphael

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3092/#review3876
-----------------------------------------------------------

On 2011-12-09 08:57:18, Raphael Cendrillon wrote:
bq.
bq. -----------------------------------------------------------
bq. This is an automatically generated e-mail. To reply, visit:
bq. https://reviews.apache.org/r/3092/
bq. -----------------------------------------------------------
bq.
bq. (Updated 2011-12-09 08:57:18)
bq.
bq. Review request for Grant Ingersoll.
bq.
bq. Summary
bq. -------
bq.
bq. Early support for randomizing input in SplitInput class. This is an early start but I've posted it up just to check if I'm on the right track. A couple of comments:
bq.
bq. - currently the code runs through the entire file looking for the line corresponding to the random index. This has to be repeated for every line, which is slow and somewhat ugly.
bq. - the permutation indices are stored in an array. This could lead to scaling issues if the number of input lines is large. This problem may also exist with ridx in the existing code. One option is to use a linear feedback shift register to generate a permutation sequence on the fly.
bq.
bq. Any suggestions would be very welcome!
bq.
bq. This addresses bug MAHOUT-904.
bq.     https://issues.apache.org/jira/browse/MAHOUT-904
bq.
bq. Diffs
bq. -----
bq.
bq.   /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInput.java 1212249
bq.   /trunk/examples/src/test/java/org/apache/mahout/classifier/bayes/SplitBayesInputTest.java 1212249
bq.
bq. Diff: https://reviews.apache.org/r/3092/diff
bq.
bq. Testing
bq. -------
bq.
bq. Thanks,
bq.
bq. Raphael
bq.

> SplitInput should support randomizing the input
> -----------------------------------------------
>
> Key: MAHOUT-904
> URL: https://issues.apache.org/jira/browse/MAHOUT-904
> Project: Mahout
> Issue Type: Improvement
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Labels: MAHOUT_INTRO_CONTRIBUTE
> Attachments: MAHOUT-904.patch
>
> For some learning tasks, we need the input to be randomized (SGD) instead of
> blocks of labels all at once.
> SplitInput is a useful tool for setting up train/test files but it currently doesn't support randomizing the input.
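
The review summary above mentions using a linear feedback shift register to generate a permutation sequence on the fly instead of storing the indices in an array. A minimal sketch of that idea follows. It is not taken from the MAHOUT-904 patch: the class name LfsrPermutation, the fixed 16-bit register width, and the resulting limit of 65535 lines are all assumptions made for illustration. The register steps through every non-zero 16-bit state exactly once per cycle, and states larger than the line count are skipped, so each index in 0..n-1 comes out exactly once.

```java
import java.util.NoSuchElementException;

/**
 * Hypothetical helper (not part of Mahout): yields every index in 0..n-1
 * exactly once, in a scrambled order, using constant memory.
 */
final class LfsrPermutation {
    // Feedback mask of a maximal-length 16-bit Galois LFSR
    // (x^16 + x^14 + x^13 + x^11 + 1): it cycles through all 65535 non-zero states.
    private static final int MASK = 0xB400;
    // Assumption of this sketch: the register is fixed at 16 bits,
    // so at most 65535 lines can be indexed.
    private static final int MAX_LINES = 0xFFFF;

    private final int n;   // number of input lines
    private int state;     // current register state, always in 1..65535
    private int emitted;   // how many indices have been returned so far

    LfsrPermutation(int n, int seed) {
        if (n < 1 || n > MAX_LINES) {
            throw new IllegalArgumentException("n must be in 1..65535");
        }
        if (seed < 1 || seed > MAX_LINES) {
            throw new IllegalArgumentException("seed must be in 1..65535");
        }
        this.n = n;
        this.state = seed;
    }

    boolean hasNext() {
        return emitted < n;
    }

    int next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        // Advance the register, skipping states that do not map to a line.
        do {
            int lsb = state & 1;
            state >>>= 1;
            if (lsb != 0) {
                state ^= MASK;
            }
        } while (state > n);
        emitted++;
        return state - 1; // states 1..n map to indices 0..n-1
    }
}
```

A caller would construct new LfsrPermutation(lineCount, seed) and call next() until hasNext() returns false. The order is deterministic for a given seed and only pseudo-random: a 16-bit register can produce at most 65535 distinct orderings, so this trades shuffle quality for constant memory. It also does not remove the cost of seeking to each selected line in the file.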