[ 
https://issues.apache.org/jira/browse/MAHOUT-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174845#comment-13174845
 ] 

Raphael Cendrillon commented on MAHOUT-904:
-------------------------------------------

Thanks Grant. I was wondering the same thing, for example supporting 
randomSelectionSize in addition to randomSelectionPct. However, supporting 
size-based splits may not be so straightforward, since the total size is 
generally unknown when the SequenceFile is large, and it's split across mappers.
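
To illustrate why the percentage form is the easier one, here is a rough 
sketch (an illustration only, not the code in the attached patch; the class 
and method names are hypothetical, and only randomSelectionPct comes from the 
discussion above). It assumes each record is assigned independently, so a 
mapper never needs to know the total record count:

```java
import java.util.Random;

// Hypothetical illustration; not the SplitInput code from the patch.
class PctSplitSketch {

    // Decide one record's destination without knowing the total record count.
    // randomSelectionPct is assumed to be an integer percentage in [0, 100].
    static boolean goesToTest(Random rng, int randomSelectionPct) {
        return rng.nextInt(100) < randomSelectionPct;
    }

    // Simulate one mapper streaming over numRecords records.
    // Returns {trainingCount, testCount}.
    static int[] split(int numRecords, int randomSelectionPct, long seed) {
        Random rng = new Random(seed);
        int[] counts = new int[2];
        for (int i = 0; i < numRecords; i++) {
            counts[goesToTest(rng, randomSelectionPct) ? 1 : 0]++;
        }
        return counts;
    }
}
```

With this approach the test set is only approximately randomSelectionPct 
percent of the input. An exact randomSelectionSize would presumably need a 
global record count or some coordination between mappers.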

I would also have liked the training and test outputs to go to different 
directories (instead of just using different filename prefixes), but this is 
awkward due to issues with the new Hadoop API (unless I just write to the 
SequenceFile by hand in the reducer, which raises its own issues). 
I think this can be made a little neater once we move to Hadoop 0.21.

Is there something else that you had in mind?




                
> SplitInput should support randomizing the input
> -----------------------------------------------
>
>                 Key: MAHOUT-904
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-904
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Grant Ingersoll
>            Assignee: Raphael Cendrillon
>              Labels: MAHOUT_INTRO_CONTRIBUTE
>         Attachments: MAHOUT-904.patch, MAHOUT-904.patch, MAHOUT-904.patch, 
> MAHOUT-904.patch, MAHOUT-904.patch
>
>
> For some learning tasks, we need the input to be randomized (SGD) instead of 
> blocks of labels all at once.  SplitInput is a useful tool for setting up 
> train/test files but it currently doesn't support randomizing the input.

