[ https://issues.apache.org/jira/browse/MAHOUT-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168453#comment-13168453 ]

jirapos...@reviews.apache.org commented on MAHOUT-904:
------------------------------------------------------



bq.  On 2011-12-13 13:19:13, Grant Ingersoll wrote:
bq.  > Thoughts:
bq.  > this class is often run from the command line, so we should add CLI support for telling it to randomly permute.
bq.  > 
bq.  > I wonder if we should make this a map-reduce job.  Perhaps we split out the existing version and leave as is and then add a new MR one that can do the permutation.  One idea there would be to generate random keys (by appending onto the existing key) and letting the shuffle effectively do the permutations.  Then, during reduce phase we simply strip off the random part of the key and output.  I don't know how bad this would hurt the shuffle, but it seems like it would work functionally anyway.
bq.  > 
bq.  > Otherwise, the approach seems reasonable.  I don't know off hand if there is a better way of doing it (even though I wish there were).
bq.  
bq.  Ted Dunning wrote:
bq.      Separating the randomization sounds like a nice idea.  I still think that the SGD jobs need to be able to randomize within a single map as well.
bq.      
bq.      Permuting in the shuffle should work fine.

Lance had a similar suggestion. I think there are two tasks here: one is to 
randomize the training examples within a split, and the other is to randomize 
the order of the splits themselves. I'll update this to use MapReduce to 
randomize the splits as well. Lance had a good suggestion for this based on 
hashing/randomizing the key.
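
To make the key-based idea concrete, here is a rough sketch of what I have in 
mind. It is not the patch itself and does not use Hadoop: a TreeMap stands in 
for the sort done by the shuffle, the line index stands in for the existing 
key, and the class and method names are made up for illustration. I put the 
random part in front of the existing key (rather than after it) so that it 
dominates the sort order; that is my assumption about how it would have to 
work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical illustration only; not part of SplitInput or the attached patch.
public class RandomKeyShuffleSketch {

    public static List<String> permute(List<String> lines, long seed) {
        Random rng = new Random(seed);

        // "Map" step: build a composite key of <random part> TAB <existing key>.
        // The random part is zero-padded so that string order matches numeric
        // order; the line index keeps composite keys unique.
        // The TreeMap plays the role of the shuffle, which sorts by key.
        TreeMap<String, String> shuffled = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            long randomPart = rng.nextLong() & Long.MAX_VALUE; // non-negative
            String compositeKey =
                String.format(Locale.ROOT, "%019d\t%010d", randomPart, i);
            shuffled.put(compositeKey, lines.get(i));
        }

        // "Reduce" step: walk the keys in sorted order and emit only the
        // records, which drops the random part of the key.
        return new ArrayList<>(shuffled.values());
    }
}
```

The output is the same set of lines in an order determined by the random key 
parts. In a real job the cost would be in the shuffle, since every record is 
sorted and moved across the network.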

Given that we will be parallelizing this, I guess each split should fit 
comfortably into memory? If that's the case, randomization of the lines within 
a split can be done much more efficiently, since the lines can be shuffled in 
memory instead of rescanning the file for every random index.
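
If a split does fit in memory, something along these lines would do. This is 
only a sketch under that assumption; the class and method names are 
hypothetical and it reads from a plain java.io.Reader rather than from HDFS.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; assumes one split fits comfortably in memory.
public class InMemorySplitShuffle {

    public static List<String> shuffleLines(Reader input, long seed) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(input)) {
            String line;
            // Single pass over the split: read every line once.
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Collections.shuffle permutes the list in place in linear time.
        Collections.shuffle(lines, new Random(seed));
        return lines;
    }
}
```

This reads the input once and shuffles in linear time, compared with scanning 
the whole file once per output line.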


- Raphael


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3092/#review3876
-----------------------------------------------------------


On 2011-12-09 08:57:18, Raphael Cendrillon wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3092/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-12-09 08:57:18)
bq.  
bq.  
bq.  Review request for Grant Ingersoll.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Early support for randomizing input in SplitInput class. This is an early start but I've posted it up just to check if I'm on the right track.  A couple of comments:
bq.  
bq.    - currently the code runs through the entire file looking for the line corresponding to the random index. This has to be repeated for every line, which is slow and somewhat ugly.
bq.    - the permutation indices are stored in an array. This could lead to scaling issues if the number of input lines is large. This problem may also exist with ridx in the existing code. One option is to use a linear feedback shift register to generate a permutation sequence on the fly.
bq.  
bq.  Any suggestions would be very welcome!
bq.  
bq.  
bq.  This addresses bug MAHOUT-904.
bq.      https://issues.apache.org/jira/browse/MAHOUT-904
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInput.java 1212249 
bq.    /trunk/examples/src/test/java/org/apache/mahout/classifier/bayes/SplitBayesInputTest.java 1212249 
bq.  
bq.  Diff: https://reviews.apache.org/r/3092/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Raphael
bq.  
bq.

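For reference, the linear feedback shift register option mentioned in the 
review request above could look roughly like the sketch below. It is an 
illustration, not code from the patch: the class and method names are 
hypothetical, and it assumes a fixed 16-bit Galois LFSR with tap mask 0xB400 
(a maximal-length configuration with period 65535), so it only handles up to 
65535 lines. Larger inputs would need a wider register with a suitable tap 
mask. States outside the wanted range are simply skipped.

```java
import java.util.function.IntConsumer;

// Hypothetical illustration only; not part of SplitInput or the attached patch.
public class LfsrPermutationSketch {

    // Tap mask for x^16 + x^14 + x^13 + x^11 + 1 (maximal length, period 65535).
    private static final int TAPS = 0xB400;

    // Calls sink once for each index in 0..n-1, in LFSR order, without
    // storing the permutation in an array.
    public static void forEachIndex(int n, int seed, IntConsumer sink) {
        if (n < 1 || n > 0xFFFF) {
            throw new IllegalArgumentException("n must be in 1..65535");
        }
        int start = seed & 0xFFFF;
        if (start == 0) {
            start = 1; // the all-zero state never changes, so avoid it
        }
        int state = start;
        do {
            // States cover 1..65535 exactly once per period; map to 0..65534
            // and skip anything outside the requested range.
            int index = state - 1;
            if (index < n) {
                sink.accept(index);
            }
            // One Galois LFSR step: shift right, apply taps if a 1 fell off.
            int lsb = state & 1;
            state >>>= 1;
            if (lsb != 0) {
                state ^= TAPS;
            }
        } while (state != start);
    }
}
```

Because the register visits every non-zero state exactly once per period, each 
index comes out exactly once. The ordering is fixed by the seed and is only 
weakly random, so whether it is good enough for SGD would need checking.
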

                
> SplitInput should support randomizing the input
> -----------------------------------------------
>
>                 Key: MAHOUT-904
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-904
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>              Labels: MAHOUT_INTRO_CONTRIBUTE
>         Attachments: MAHOUT-904.patch
>
>
> For some learning tasks, we need the input to be randomized (SGD) instead of 
> blocks of labels all at once.  SplitInput is a useful tool for setting up 
> train/test files but it currently doesn't support randomizing the input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
