Thanks, Sean. My current plan is to read the key class from the input SequenceFile and simply propagate it through. Do you think that's reasonable?
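Roughly, a minimal sketch of that idea, assuming the Hadoop SequenceFile API (SequenceFile.Reader.getKeyClass() and the createWriter overload that takes the key and value classes). The names KeyClassPassThrough and copyPreservingKeyClass are hypothetical, and this is not the code in the attached patch:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical helper: copies a SequenceFile, reusing whatever key/value
// classes the input file declares in its header instead of hard-coding them.
public class KeyClassPassThrough {

  public static Class<?> copyPreservingKeyClass(Configuration conf, Path in, Path out)
      throws IOException {
    FileSystem fs = in.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
    try {
      // The key and value classes are recorded in the SequenceFile header.
      Class<?> keyClass = reader.getKeyClass();
      Class<?> valueClass = reader.getValueClass();

      // Create the output with the same classes as the input.
      SequenceFile.Writer writer =
          SequenceFile.createWriter(fs, conf, out, keyClass, valueClass);
      try {
        // Assumes both classes implement Writable (true for the usual Mahout inputs).
        Writable key = (Writable) ReflectionUtils.newInstance(keyClass, conf);
        Writable value = (Writable) ReflectionUtils.newInstance(valueClass, conf);
        while (reader.next(key, value)) {
          writer.append(key, value);
        }
      } finally {
        writer.close();
      }
      return keyClass;
    } finally {
      reader.close();
    }
  }
}
```

With this approach the tool never needs to know whether the keys are IntWritable, VarIntWritable, Text, or something else; it just writes back whatever the input declares.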

On Dec 23, 2011, at 4:52 AM, "Sean Owen (Commented) (JIRA)" <j...@apache.org> wrote:

>
> [ https://issues.apache.org/jira/browse/MAHOUT-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175408#comment-13175408 ]
>
> Sean Owen commented on MAHOUT-904:
> ----------------------------------
>
> (I don't know if this is a relevant comment, but we ought to be using
> VarIntWritable and VarLongWritable, not IntWritable and LongWritable, for
> better space savings.)
>
>> SplitInput should support randomizing the input
>> -----------------------------------------------
>>
>> Key: MAHOUT-904
>> URL: https://issues.apache.org/jira/browse/MAHOUT-904
>> Project: Mahout
>> Issue Type: Improvement
>> Reporter: Grant Ingersoll
>> Assignee: Raphael Cendrillon
>> Labels: MAHOUT_INTRO_CONTRIBUTE
>> Attachments: MAHOUT-904.patch, MAHOUT-904.patch, MAHOUT-904.patch,
>> MAHOUT-904.patch, MAHOUT-904.patch, MAHOUT-904.patch
>>
>> For some learning tasks, we need the input to be randomized (SGD) instead of
>> blocks of labels all at once. SplitInput is a useful tool for setting up
>> train/test files but it currently doesn't support randomizing the input.