Thanks Lance and Grant. My concern is that the current approach I've adopted is 
a bit of a hack, which is why I'm interested to hear your suggestions. 

I've thought about this a bit more, and I think there's a cleaner 
implementation (I'll post a new patch shortly). Basically, we can leave the 
keys and values as they are and do the random permutation with a custom 
comparator that simply returns a random number from its compare operation. 

The advantage is that, since the key and value don't need to be packed 
together, no new value class is needed for the mapper output, which avoids the 
ObjectWritable issue.
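
To make the comparator idea concrete, here is a minimal sketch in plain Java. It 
is not the patch and not Hadoop code: the names RandomCompareSketch, 
randomComparator and permute are made up for illustration, and a hand-written 
insertion sort stands in for the framework's sort. The records themselves are 
left untouched and only the ordering decision is random.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

class RandomCompareSketch {

    // Hypothetical comparator: ignores both arguments and returns a random sign.
    static <T> Comparator<T> randomComparator(Random rng) {
        return (a, b) -> rng.nextBoolean() ? -1 : 1;
    }

    // Plain insertion sort driven by the comparator. It is hand-written because
    // java.util's sort may throw IllegalArgumentException when it detects a
    // comparator that violates the Comparator contract, which this one does.
    static <T> List<T> permute(List<T> records, Random rng) {
        Comparator<T> cmp = randomComparator(rng);
        List<T> out = new ArrayList<>(records); // copy, input is not modified
        for (int i = 1; i < out.size(); i++) {
            T current = out.get(i);
            int j = i - 1;
            // Shift elements right while the (random) comparison says "greater".
            while (j >= 0 && cmp.compare(out.get(j), current) > 0) {
                out.set(j + 1, out.get(j));
                j--;
            }
            out.set(j + 1, current);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(permute(records, new Random()));
    }
}
```

The output always contains exactly the input records, but with this particular 
sort the order is not necessarily a uniform random permutation, so that would 
be worth checking against whatever sort the job actually uses.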

The only painful part at the moment is that full control over the output 
directories for the test and training data requires MultipleOutputFormat, 
which explicitly checks that the new API is not being used. So a little 
rewriting is needed to move to the old API. 

Thanks for your help!

On Dec 21, 2011, at 5:45 AM, Grant Ingersoll <gsing...@apache.org> wrote:

> I'd suggest we stick w/ what you have for now and we can generalize later if 
> needed.
> 
> On Dec 19, 2011, at 3:08 AM, Raphael Cendrillon wrote:
> 
>> That's a very good point.  Using this type of framework will make things 
>> much cleaner.
>> 
>> This comment (from the top of the TupleWritable file) is what makes me a 
>> little concerned:
>> 
>> This is *not* a general-purpose tuple type. In almost all cases, users are 
>> encouraged to implement their own serializable types, which can perform 
>> better validation and provide more efficient encodings than this class is 
>> capable. TupleWritable relies on the join framework for type safety and 
>> assumes its instances will rarely be persisted, assumptions not only 
>> incompatible with, but contrary to the general case.
>> 
>> If we don't mind storing the class name, would it be better to use 
>> ObjectWritable for the vector, or whatever else happens to be there?
>> 
>> 
>> On 18 Dec, 2011, at 11:26 PM, Lance Norskog wrote:
>> 
>>> But the Writables in each tuple include a vector which could be
>>> hundreds of doubles. It's not a big deal.
>>> 
>>> On Sun, Dec 18, 2011 at 9:29 PM, Raphael Cendrillon
>>> <cendrillon1...@gmail.com> wrote:
>>>> Yes, but TupleWritable is pretty inefficient since it stores the class 
>>>> name with every record.  This seems wasteful given that the class is 
>>>> always the same.
>>>> 
>>>> On 18 Dec, 2011, at 9:19 PM, Lance Norskog wrote:
>>>> 
>>>>> JIRA is acting up, so posting here instead.
>>>>> 
>>>>> You have already made RandomPermuteJob extend AbstractJob. Never mind.
>>>>> 
>>>>> bq. Does this seem like a reasonable approach? It would require that a
>>>>> class be created for each object type of interest which is somewhat
>>>>> painful. However I can't see a simpler approach since
>>>>> setMapOutputValueClass() needs to take a class that has a default
>>>>> constructor (and PairWritable doesn't have a default constructor since
>>>>> it doesn't know how to call new for first and second since it doesn't
>>>>> know what class first and second belong to).
>>>>> 
>>>>> TupleWritable handles this by writing the class name. Looking at this
>>>>> again, can't this just use TupleWritable?
>>>>> 
>>>>> http://grepcode.com/file/repo1.maven.org/maven2/org.jvnet.hudson.hadoop/hadoop-core/0.19.1-hudson-3/org/apache/hadoop/mapred/join/TupleWritable.java
>>>>> 
>>>>> On Sun, Dec 18, 2011 at 7:48 PM, Raphael Cendrillon (Commented) (JIRA)
>>>>> <j...@apache.org> wrote:
>>>>>> 
>>>>>>  [ 
>>>>>> https://issues.apache.org/jira/browse/MAHOUT-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172021#comment-13172021
>>>>>>  ]
>>>>>> 
>>>>>> Raphael Cendrillon commented on MAHOUT-904:
>>>>>> -------------------------------------------
>>>>>> 
>>>>>> Hi Lance. Is that a general comment, or specifically for the issue 
>>>>>> regarding PairWritable/IntVectorWritable?
>>>>>> 
>>>>>>> SplitInput should support randomizing the input
>>>>>>> -----------------------------------------------
>>>>>>> 
>>>>>>>               Key: MAHOUT-904
>>>>>>>               URL: https://issues.apache.org/jira/browse/MAHOUT-904
>>>>>>>           Project: Mahout
>>>>>>>        Issue Type: Improvement
>>>>>>>          Reporter: Grant Ingersoll
>>>>>>>          Assignee: Raphael Cendrillon
>>>>>>>            Labels: MAHOUT_INTRO_CONTRIBUTE
>>>>>>>       Attachments: MAHOUT-904.patch, MAHOUT-904.patch, MAHOUT-904.patch
>>>>>>> 
>>>>>>> 
>>>>>>> For some learning tasks, we need the input to be randomized (SGD) 
>>>>>>> instead of blocks of labels all at once.  SplitInput is a useful tool 
>>>>>>> for setting up train/test files but it currently doesn't support 
>>>>>>> randomizing the input.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Lance Norskog
>>>>> goks...@gmail.com
>>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Lance Norskog
>>> goks...@gmail.com
>> 
> 
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> 
> 
> 
