But the Writables in each tuple include a vector, which could hold hundreds of doubles. Next to that, the per-record class name overhead is not a big deal.
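To put rough numbers on that, here is a minimal sketch that compares the size of one serialized class name with the size of a dense vector payload. It is only an estimate under stated assumptions: it writes the class name with DataOutputStream.writeUTF as a stand-in (TupleWritable's actual encoding may differ), it counts 8 bytes per double and ignores any vector header, and the class name and the vector size of 300 are illustrative choices, not measurements from the patch.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Rough size comparison: per-record class name vs. a vector of doubles.
// Assumption: writeUTF stands in for however the class name is really encoded.
class RecordOverhead {

    // Bytes needed to write the class name once (2-byte length prefix + characters).
    static int classNameBytes(String className) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(className);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    // Bytes needed to write n doubles back to back (8 bytes each, no header).
    static int vectorBytes(int n) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            for (int i = 0; i < n; i++) {
                out.writeDouble(i * 0.5);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    public static void main(String[] args) {
        // Illustrative values only.
        int name = classNameBytes("org.apache.mahout.math.VectorWritable");
        int vector = vectorBytes(300);
        System.out.printf("class name: %d bytes, vector: %d bytes, overhead: %.2f%%%n",
                name, vector, 100.0 * name / vector);
    }
}
```

With a few hundred doubles per record the class name comes out at a low single-digit percentage of the payload under these assumptions. The picture would change for very short vectors or sparse encodings.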

On Sun, Dec 18, 2011 at 9:29 PM, Raphael Cendrillon <cendrillon1...@gmail.com> wrote:
> Yes, but TupleWritable is pretty inefficient since it stores the classname
> with every record. This seems wasteful given that the class is always the
> same.
>
> On 18 Dec, 2011, at 9:19 PM, Lance Norskog wrote:
>
>> JIRA is acting up, so posting here instead.
>>
>> You have already made RandomPermuteJob extend AbstractJob. Never mind.
>>
>> bq. Does this seem like a reasonable approach? It would require that a
>> class be created for each object type of interest which is somewhat
>> painful. However I can't see a simpler approach since
>> setMapOutputValueClass() needs to take a class that has a default
>> constructor (and PairWritable doesn't have a default constructor since
>> it doesn't know how to call new for first and second since it doesn't
>> know what class first and second belong to).
>>
>> TupleWritable handles this by writing the classname. Looking at this
>> again, can't this just use TupleWritable?
>>
>> http://grepcode.com/file/repo1.maven.org/maven2/org.jvnet.hudson.hadoop/hadoop-core/0.19.1-hudson-3/org/apache/hadoop/mapred/join/TupleWritable.java
>>
>> On Sun, Dec 18, 2011 at 7:48 PM, Raphael Cendrillon (Commented) (JIRA)
>> <j...@apache.org> wrote:
>>>
>>> [
>>> https://issues.apache.org/jira/browse/MAHOUT-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172021#comment-13172021
>>> ]
>>>
>>> Raphael Cendrillon commented on MAHOUT-904:
>>> -------------------------------------------
>>>
>>> Hi Lance. Is that a general comment, or specifically for the issue
>>> regarding PairWritable/IntVectorWritable?
>>>
>>>> SplitInput should support randomizing the input
>>>> -----------------------------------------------
>>>>
>>>>                 Key: MAHOUT-904
>>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-904
>>>>             Project: Mahout
>>>>          Issue Type: Improvement
>>>>            Reporter: Grant Ingersoll
>>>>            Assignee: Raphael Cendrillon
>>>>              Labels: MAHOUT_INTRO_CONTRIBUTE
>>>>         Attachments: MAHOUT-904.patch, MAHOUT-904.patch, MAHOUT-904.patch
>>>>
>>>>
>>>> For some learning tasks, we need the input to be randomized (SGD) instead
>>>> of blocks of labels all at once. SplitInput is a useful tool for setting
>>>> up train/test files but it currently doesn't support randomizing the input.
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>

--
Lance Norskog
goks...@gmail.com