Re: [jira] [Commented] (MAHOUT-904) SplitInput should support randomizing the input

Grant Ingersoll Wed, 21 Dec 2011 05:46:00 -0800

I'd suggest we stick w/ what you have for now and we can generalize later if 
needed.


On Dec 19, 2011, at 3:08 AM, Raphael Cendrillon wrote:

> That's a very good point.  Using this type of framework will make things much 
> cleaner.
> 
> This comment (from the top of the TupleWritable file) is what makes me a 
> little concerned:
> 
> This is *not* a general-purpose tuple type. In almost all cases, users are 
> encouraged to implement their own serializable types, which can perform 
> better validation and provide more efficient encodings than this class is 
> capable. TupleWritable relies on the join framework for type safety and 
> assumes its instances will rarely be persisted, assumptions not only 
> incompatible with, but contrary to the general case.
> 
> If we don't mind storing the class name, would it be better to use 
> ObjectWritable for the vector, or whatever else happens to be there?
> 
> 
> On 18 Dec, 2011, at 11:26 PM, Lance Norskog wrote:
> 
>> But the Writables in each tuple include a vector which could be
>> hundreds of doubles. It's not a big deal.
>> 
>> On Sun, Dec 18, 2011 at 9:29 PM, Raphael Cendrillon
>> <cendrillon1...@gmail.com> wrote:
>>> Yes, but TupleWritable is pretty inefficient since it stores the class name 
>>> with every record.  This seems wasteful given that the class is always the 
>>> same.
>>> 
>>> On 18 Dec, 2011, at 9:19 PM, Lance Norskog wrote:
>>> 
>>>> JIRA is acting up, so posting here instead.
>>>> 
>>>> You have already made RandomPermuteJob extend AbstractJob. Never mind.
>>>> 
>>>> bq. Does this seem like a reasonable approach? It would require that a
>>>> class be created for each object type of interest which is somewhat
>>>> painful. However, I can't see a simpler approach since
>>>> setMapOutputValueClass() needs to take a class that has a default
>>>> constructor (and PairWritable doesn't have a default constructor since
>>>> it doesn't know how to call new for first and second since it doesn't
>>>> know what class first and second belong to).
>>>> 
>>>> TupleWritable handles this by writing the class name. Looking at this
>>>> again, can't this just use TupleWritable?
>>>> 
>>>> http://grepcode.com/file/repo1.maven.org/maven2/org.jvnet.hudson.hadoop/hadoop-core/0.19.1-hudson-3/org/apache/hadoop/mapred/join/TupleWritable.java
>>>> 
>>>> On Sun, Dec 18, 2011 at 7:48 PM, Raphael Cendrillon (Commented) (JIRA)
>>>> <j...@apache.org> wrote:
>>>>> 
>>>>>   [ 
>>>>> https://issues.apache.org/jira/browse/MAHOUT-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172021#comment-13172021
>>>>>  ]
>>>>> 
>>>>> Raphael Cendrillon commented on MAHOUT-904:
>>>>> -------------------------------------------
>>>>> 
>>>>> Hi Lance. Is that a general comment, or specifically for the issue 
>>>>> regarding PairWritable/IntVectorWritable?
>>>>> 
>>>>>> SplitInput should support randomizing the input
>>>>>> -----------------------------------------------
>>>>>> 
>>>>>>                Key: MAHOUT-904
>>>>>>                URL: https://issues.apache.org/jira/browse/MAHOUT-904
>>>>>>            Project: Mahout
>>>>>>         Issue Type: Improvement
>>>>>>           Reporter: Grant Ingersoll
>>>>>>           Assignee: Raphael Cendrillon
>>>>>>             Labels: MAHOUT_INTRO_CONTRIBUTE
>>>>>>        Attachments: MAHOUT-904.patch, MAHOUT-904.patch, MAHOUT-904.patch
>>>>>> 
>>>>>> 
>>>>>> For some learning tasks, we need the input to be randomized (SGD) 
>>>>>> instead of blocks of labels all at once.  SplitInput is a useful tool 
>>>>>> for setting up train/test files but it currently doesn't support 
>>>>>> randomizing the input.
>>>>> 
>>>>> --
>>>>> This message is automatically generated by JIRA.
>>>>> If you think it was sent incorrectly, please contact your JIRA 
>>>>> administrators: 
>>>>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>>>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Lance Norskog
>>>> goks...@gmail.com
>>> 
>> 
>> 
>> 
>> -- 
>> Lance Norskog
>> goks...@gmail.com
> 
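
For reference, the trade-off discussed in the quoted thread (a fixed-type pair with a no-arg constructor versus an encoding that writes class names with every record) can be sketched roughly as below. This is only a minimal illustration under stated assumptions, not the code in the MAHOUT-904 patch: it uses plain java.io streams instead of Hadoop's Writable interface, IntVectorPair and the encode/decode helpers are hypothetical names, and writeUTF stands in for however TupleWritable or ObjectWritable actually encode the class name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class PairEncodingSketch {

    /** Hypothetical fixed-type pair: an int plus a dense vector of doubles. */
    static final class IntVectorPair {
        int first;
        double[] second = new double[0];

        // No-arg constructor, so a framework could create instances reflectively.
        IntVectorPair() {}

        IntVectorPair(int first, double[] second) {
            this.first = first;
            this.second = second;
        }

        // Writes only the payload: the int, the vector length, then each double.
        void write(DataOutputStream out) throws IOException {
            out.writeInt(first);
            out.writeInt(second.length);
            for (double d : second) {
                out.writeDouble(d);
            }
        }

        // Reads the payload back in the same order it was written.
        void readFields(DataInputStream in) throws IOException {
            first = in.readInt();
            second = new double[in.readInt()];
            for (int i = 0; i < second.length; i++) {
                second[i] = in.readDouble();
            }
        }
    }

    /** Encodes only the payload; the reader must already know the types. */
    static byte[] encodeFixed(IntVectorPair pair) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            pair.write(out);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Prefixes the record with two class names, then writes the same payload. */
    static byte[] encodeTagged(IntVectorPair pair, String firstClass, String secondClass) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(firstClass);   // assumption: stand-in for the real class-name encoding
            out.writeUTF(secondClass);
            pair.write(out);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Decodes a record produced by encodeFixed. */
    static IntVectorPair decodeFixed(byte[] data) {
        try {
            IntVectorPair pair = new IntVectorPair();
            pair.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return pair;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        IntVectorPair pair = new IntVectorPair(42, new double[200]);
        byte[] fixed = encodeFixed(pair);
        byte[] tagged = encodeTagged(pair,
                "org.apache.hadoop.io.IntWritable",
                "org.apache.mahout.math.VectorWritable");
        System.out.println("fixed-type record:   " + fixed.length + " bytes");
        System.out.println("class-tagged record: " + tagged.length + " bytes");
        System.out.println("per-record overhead: " + (tagged.length - fixed.length) + " bytes");
    }
}
```

With a vector of a few hundred doubles the payload dominates and the class-name prefix is a small fraction of each record, which is Lance's point. For short vectors the prefix would be a larger share of each record, which is Raphael's concern.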

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
