I'd suggest we stick with what you have for now; we can generalize later if needed.
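
For reference, here is a minimal sketch of the type-specific approach discussed in the quoted thread: a pair class whose field types are fixed, so it can offer a default constructor and never needs to write a class name per record. This is not the code from the MAHOUT-904 patch. The names IntDoublesPair, toBytes and fromBytes are hypothetical, and the class only mirrors the write/readFields shape of Hadoop's Writable interface using java.io so it compiles on its own. A real version would presumably implement org.apache.hadoop.io.Writable and hold a Mahout vector rather than a double[].

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical fixed-type pair: an int plus a dense vector of doubles.
// Mirrors the write/readFields shape of a Hadoop Writable, but depends
// only on java.io (an assumption made to keep the sketch standalone).
class IntDoublesPair {
    private int first;
    private double[] second = new double[0];

    // Default constructor: a framework can create an empty instance
    // reflectively and then populate it through readFields().
    IntDoublesPair() {}

    IntDoublesPair(int first, double[] second) {
        this.first = first;
        this.second = second.clone();
    }

    int getFirst() { return first; }

    double[] getSecond() { return second.clone(); }

    // Field types are known at compile time, so no class name is written:
    // 4 bytes (int) + 4 bytes (length) + 8 bytes per double.
    void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second.length);
        for (double d : second) {
            out.writeDouble(d);
        }
    }

    void readFields(DataInput in) throws IOException {
        first = in.readInt();
        int n = in.readInt();
        second = new double[n];
        for (int i = 0; i < n; i++) {
            second[i] = in.readDouble();
        }
    }

    // Convenience helper (hypothetical): serialize to a byte array.
    static byte[] toBytes(IntDoublesPair p) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bos)) {
            p.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Convenience helper (hypothetical): rebuild an instance from bytes,
    // starting from the default constructor as a framework would.
    static IntDoublesPair fromBytes(byte[] bytes) {
        IntDoublesPair p = new IntDoublesPair();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            p.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }
}
```

With this layout a record costs 8 bytes of framing plus 8 bytes per double. That is the trade-off raised in the thread: a per-record class name is pure overhead for a fixed type, but for a vector of hundreds of doubles it would be small next to the payload.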

On Dec 19, 2011, at 3:08 AM, Raphael Cendrillon wrote:

> That's a very good point. Using this type of framework will make things much
> cleaner.
>
> This comment (from the top of the TupleWritable file) is what makes me a
> little concerned:
>
> This is *not* a general-purpose tuple type. In almost all cases, users are
> encouraged to implement their own serializable types, which can perform
> better validation and provide more efficient encodings than this class is
> capable. TupleWritable relies on the join framework for type safety and
> assumes its instances will rarely be persisted, assumptions not only
> incompatible with, but contrary to the general case.
>
> If we don't mind storing the class name, would it be better to use
> ObjectWritable for the vector, or whatever else happens to be there?
>
> On 18 Dec, 2011, at 11:26 PM, Lance Norskog wrote:
>
>> But the Writables in each tuple include a vector which could be
>> hundreds of doubles. It's not a big deal.
>>
>> On Sun, Dec 18, 2011 at 9:29 PM, Raphael Cendrillon
>> <cendrillon1...@gmail.com> wrote:
>>> Yes, but TupleWritable is pretty inefficient since it stores the class name
>>> with every record. This seems wasteful given that the class is always the
>>> same.
>>>
>>> On 18 Dec, 2011, at 9:19 PM, Lance Norskog wrote:
>>>
>>>> JIRA is acting up, so posting here instead.
>>>>
>>>> You have already made RandomPermuteJob extend AbstractJob. Never mind.
>>>>
>>>> bq. Does this seem like a reasonable approach? It would require that a
>>>> class be created for each object type of interest, which is somewhat
>>>> painful. However, I can't see a simpler approach, since
>>>> setMapOutputValueClass() needs to take a class that has a default
>>>> constructor (and PairWritable doesn't have a default constructor since
>>>> it doesn't know how to call new for first and second since it doesn't
>>>> know what class first and second belong to).
>>>>
>>>> TupleWritable handles this by writing the class name. Looking at this
>>>> again, can't this just use TupleWritable?
>>>>
>>>> http://grepcode.com/file/repo1.maven.org/maven2/org.jvnet.hudson.hadoop/hadoop-core/0.19.1-hudson-3/org/apache/hadoop/mapred/join/TupleWritable.java
>>>>
>>>> On Sun, Dec 18, 2011 at 7:48 PM, Raphael Cendrillon (Commented) (JIRA)
>>>> <j...@apache.org> wrote:
>>>>>
>>>>> [
>>>>> https://issues.apache.org/jira/browse/MAHOUT-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172021#comment-13172021
>>>>> ]
>>>>>
>>>>> Raphael Cendrillon commented on MAHOUT-904:
>>>>> -------------------------------------------
>>>>>
>>>>> Hi Lance. Is that a general comment, or specifically for the issue
>>>>> regarding PairWritable/IntVectorWritable?
>>>>>
>>>>>> SplitInput should support randomizing the input
>>>>>> -----------------------------------------------
>>>>>>
>>>>>> Key: MAHOUT-904
>>>>>> URL: https://issues.apache.org/jira/browse/MAHOUT-904
>>>>>> Project: Mahout
>>>>>> Issue Type: Improvement
>>>>>> Reporter: Grant Ingersoll
>>>>>> Assignee: Raphael Cendrillon
>>>>>> Labels: MAHOUT_INTRO_CONTRIBUTE
>>>>>> Attachments: MAHOUT-904.patch, MAHOUT-904.patch, MAHOUT-904.patch
>>>>>>
>>>>>> For some learning tasks, we need the input to be randomized (SGD)
>>>>>> instead of blocks of labels all at once. SplitInput is a useful tool
>>>>>> for setting up train/test files but it currently doesn't support
>>>>>> randomizing the input.
>>>>
>>>> --
>>>> Lance Norskog
>>>> goks...@gmail.com

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com