[ https://issues.apache.org/jira/browse/MAHOUT-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174845#comment-13174845 ]
Raphael Cendrillon commented on MAHOUT-904:
-------------------------------------------

Thanks Grant. I was wondering the same thing, for example supporting randomSelectionSize in addition to randomSelectionPct. However, supporting size-based splits may not be straightforward, since the total size is generally unknown if the SequenceFile is large, and the file is split across mappers.

I would also have liked the training and test outputs to go to different directories (instead of just using different filename prefixes), but this is not straightforward due to issues with the new API (unless I write to the SequenceFile by hand in the reducer, which raises its own issues). I think this can be made a little neater once we move to Hadoop 0.21.

Is there something else that you had in mind?

> SplitInput should support randomizing the input
> -----------------------------------------------
>
>                 Key: MAHOUT-904
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-904
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Grant Ingersoll
>            Assignee: Raphael Cendrillon
>              Labels: MAHOUT_INTRO_CONTRIBUTE
>         Attachments: MAHOUT-904.patch, MAHOUT-904.patch, MAHOUT-904.patch, MAHOUT-904.patch, MAHOUT-904.patch
>
>
> For some learning tasks, we need the input to be randomized (SGD) instead of
> blocks of labels all at once. SplitInput is a useful tool for setting up
> train/test files but it currently doesn't support randomizing the input.
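
To make the percentage-versus-size point in the comment above concrete, here is a minimal single-process sketch in Java. It is not the code in the attached patches: the class RandomSplitSketch and the methods splitByPct and reservoirSample are hypothetical names chosen for illustration, and reservoir sampling is a technique swapped in here to show one way a fixed-size selection could work, not something proposed in the issue. A percentage-based split can decide each record independently, so it never needs the total record count. A fixed-size selection over a stream of unknown length can be done with a reservoir, but only within one stream; combining reservoirs from several mappers is a separate problem that this sketch does not address.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not part of Mahout or the MAHOUT-904 patch.
final class RandomSplitSketch {

    /** Holds the two outputs of a train/test split. */
    static final class Split<T> {
        final List<T> training = new ArrayList<>();
        final List<T> test = new ArrayList<>();
    }

    /**
     * Percentage-based split: each record is sent to the test set with
     * probability testPct/100, independently of all other records.
     * The total number of records never has to be known.
     */
    static <T> Split<T> splitByPct(Iterable<T> records, int testPct, long seed) {
        if (testPct < 0 || testPct > 100) {
            throw new IllegalArgumentException("testPct must be in [0, 100]");
        }
        Random rng = new Random(seed);
        Split<T> out = new Split<>();
        for (T record : records) {
            if (rng.nextInt(100) < testPct) {
                out.test.add(record);
            } else {
                out.training.add(record);
            }
        }
        return out;
    }

    /**
     * Size-based selection over a single stream of unknown length using
     * reservoir sampling. Returns min(size, number of records) records,
     * each record having the same chance of being kept.
     */
    static <T> List<T> reservoirSample(Iterable<T> records, int size, long seed) {
        if (size < 0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        Random rng = new Random(seed);
        List<T> reservoir = new ArrayList<>();
        long seen = 0;
        for (T record : records) {
            seen++;
            if (reservoir.size() < size) {
                // Fill the reservoir with the first `size` records.
                reservoir.add(record);
            } else {
                // Keep the new record with probability size/seen.
                long j = (long) (rng.nextDouble() * seen); // uniform in [0, seen)
                if (j < size) {
                    reservoir.set((int) j, record);
                }
            }
        }
        return reservoir;
    }
}
```

With splitByPct the size of the test set is only approximately testPct percent of the input, which is the price of not knowing the total in advance. reservoirSample gives an exact size but has to hold the whole selection in memory and sees only one stream.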