[
https://issues.apache.org/jira/browse/MAHOUT-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13165948#comment-13165948
]
Raphael Cendrillon commented on MAHOUT-904:
-------------------------------------------
This is an early draft, but I've posted it to check whether I'm on the right
track.
A couple of comments:
- currently the code scans the entire file to find the line corresponding to
each random index. The scan is repeated for every line, so the total work grows
quadratically with the number of lines, which is slow and somewhat ugly.
- the permutation indices are stored in an array, so memory use grows linearly
with the number of input lines. This could become a scaling issue for large
inputs. The same problem may also exist with ridx in the existing code. One
option is to use a linear feedback shift register (LFSR) to generate the
permutation sequence on the fly.
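To make the linear feedback shift register (LFSR) idea concrete, here is a
rough sketch. It is not taken from the attached patch, and the class name
LfsrPermutation and its constructor arguments are hypothetical. It assumes a
Galois LFSR with standard maximal-length feedback masks for register widths of
2 to 24 bits, so it handles at most 2^24 - 1 lines. A maximal-length LFSR of
width w visits every nonzero w-bit state exactly once per cycle, so skipping
states larger than n yields each index in 0..n-1 exactly once using only a few
ints of memory.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical helper (not part of Mahout): yields 0..n-1 once each, in LFSR order. */
final class LfsrPermutation implements Iterator<Integer> {
  // Galois feedback masks of maximal-length LFSRs for widths 2..24 bits.
  private static final int[] MASKS = {
    0x3, 0x6, 0xC, 0x14, 0x30, 0x60, 0xB8, 0x110, 0x240, 0x500, 0xE08, 0x1C80,
    0x3802, 0x6000, 0xD008, 0x12000, 0x20400, 0x72000, 0x90000, 0x140000,
    0x300000, 0x420000, 0xE10000
  };

  private final int n;     // number of indices to produce
  private final int mask;  // feedback mask for the chosen width
  private int state;       // current register contents, never zero
  private int emitted;     // how many indices have been returned so far

  LfsrPermutation(int n, int seed) {
    if (n < 1 || n >= (1 << 24)) {
      throw new IllegalArgumentException("n must be in [1, 2^24 - 1]");
    }
    // Smallest width whose cycle of 2^width - 1 states covers n indices.
    int width = 2;
    while ((1 << width) - 1 < n) {
      width++;
    }
    this.n = n;
    this.mask = MASKS[width - 2];
    // The seed only selects where in the cycle we start (any nonzero state).
    int period = (1 << width) - 1;
    this.state = 1 + Math.floorMod(seed, period);
  }

  @Override
  public boolean hasNext() {
    return emitted < n;
  }

  @Override
  public Integer next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    int current;
    do {
      current = state;
      // One Galois LFSR step: shift right, apply the mask if a 1 fell out.
      int lsb = state & 1;
      state >>>= 1;
      if (lsb != 0) {
        state ^= mask;
      }
    } while (current > n);  // skip states that map outside 0..n-1
    emitted++;
    return current - 1;     // states are 1-based, indices are 0-based
  }
}
```

A caller would construct new LfsrPermutation(numLines, seed) and pull indices
until hasNext() returns false. Two caveats: the order is a fixed scrambled
sequence rather than a uniformly random shuffle (the seed only changes the
starting point, and consecutive values are correlated), and this does nothing
about the cost of locating each line in the file.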
Any suggestions would be very welcome!
> SplitInput should support randomizing the input
> -----------------------------------------------
>
> Key: MAHOUT-904
> URL: https://issues.apache.org/jira/browse/MAHOUT-904
> Project: Mahout
> Issue Type: Improvement
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Labels: MAHOUT_INTRO_CONTRIBUTE
> Attachments: MAHOUT-904.patch
>
>
> For some learning tasks (e.g. SGD), we need the input to be randomized
> instead of arriving in blocks that share the same label. SplitInput is a
> useful tool for setting up train/test files, but it currently doesn't support
> randomizing the input.