Hi Remi,

I've started coding this up. One issue is how to generate the random 
permutation. Doing it in memory is simple, but that will not scale to large 
data sets. 
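
A minimal sketch of the in-memory option, assuming the records can be 
addressed by an index 0..n-1. The class and method names are illustrative 
only and are not taken from SplitInput:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Mahout.
final class InMemoryShuffle {
    // Returns a random permutation of 0..n-1.
    // Needs O(n) memory, which is the concern for large inputs.
    static List<Integer> permutation(int n, long seed) {
        List<Integer> indices = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        // Collections.shuffle performs an in-place shuffle of the list.
        Collections.shuffle(indices, new Random(seed));
        return indices;
    }
}
```

The fixed seed is only there to make a run reproducible.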

Another possibility is to just generate random numbers and accept that 
repetitions will sometimes occur. 
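
One way to read this option is to draw each index independently, i.e. 
sampling with replacement. A rough sketch under that assumption (names are 
illustrative, not from SplitInput):

```java
import java.util.Random;

// Hypothetical helper, not part of Mahout.
final class SampleWithReplacement {
    // Draws `count` indices uniformly from 0..n-1.
    // Needs no bookkeeping of used indices, so the same index can
    // appear more than once and some indices may never appear.
    static int[] draw(int count, int n, long seed) {
        Random rnd = new Random(seed);
        int[] picks = new int[count];
        for (int k = 0; k < count; k++) {
            picks[k] = rnd.nextInt(n);
        }
        return picks;
    }
}
```

The trade-off is that when n indices are drawn from n records, only about 
63% of the records (1 - 1/e) are expected to be picked at least once.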

A third approach is to generate a random permutation on the fly, which is a 
little trickier.
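
To illustrate what "on the fly" could look like, here is a sketch using an 
affine map i -> (a*i + b) mod n with gcd(a, n) = 1, which is a bijection on 
0..n-1 and needs only constant memory. This is one simple technique chosen 
for illustration, not necessarily what the patch will use, and the names are 
hypothetical. The resulting order is far from uniformly random; a stronger 
construction (for example a keyed block cipher over the index range) would 
mix better.

```java
import java.util.Random;

// Hypothetical helper, not part of Mahout.
final class AffinePermutation {
    private final long n;
    private final long a;
    private final long b;

    AffinePermutation(int n, long seed) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        Random rnd = new Random(seed);
        this.n = n;
        long candidate;
        do {
            candidate = 1 + rnd.nextInt(n); // value in 1..n
        } while (gcd(candidate, n) != 1);   // keep only multipliers coprime to n
        this.a = candidate;
        this.b = rnd.nextInt(n);
    }

    // Maps position i (0 <= i < n) to its permuted position.
    // Long arithmetic avoids overflow, since a and i are both below 2^31.
    int map(int i) {
        return (int) ((a * i + b) % n);
    }

    private static long gcd(long x, long y) {
        while (y != 0) {
            long t = x % y;
            x = y;
            y = t;
        }
        return x;
    }
}
```

Each record's target position is computed when the record is read, so no 
index list is held in memory.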

On Dec 8, 2011, at 9:49 AM, "Remi Melisson (Commented) (JIRA)" 
<[email protected]> wrote:

> 
>    [ 
> https://issues.apache.org/jira/browse/MAHOUT-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13165353#comment-13165353
>  ] 
> 
> Remi Melisson commented on MAHOUT-904:
> --------------------------------------
> 
> Hi,
> I had a look on it too, and one question remains :
> Do we need to randomize all the set (training and test) or only the training 
> data ?
> 
> @Raphael Let me know if you already started, because I planned to begin dev 
> soon.
> 
>> SplitInput should support randomizing the input
>> -----------------------------------------------
>> 
>>                Key: MAHOUT-904
>>                URL: https://issues.apache.org/jira/browse/MAHOUT-904
>>            Project: Mahout
>>         Issue Type: Improvement
>>           Reporter: Grant Ingersoll
>>           Assignee: Grant Ingersoll
>>             Labels: MAHOUT_INTRO_CONTRIBUTE
>> 
>> For some learning tasks, we need the input to be randomized (SGD) instead of 
>> blocks of labels all at once.  SplitInput is a useful tool for setting up 
>> train/test files but it currently doesn't support randomizing the input.
> 
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA 
> administrators: 
> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
> 
> 
