Re: [jira] [Commented] (MAHOUT-923) Row mean job for PCA

Dmitriy Lyubimov Sun, 18 Dec 2011 14:28:29 -0800

PS: you can save a lot of pull/push time by just cloning the
github.com:dlyubimov/mahout-commits repo. Its trunk is already
up to date with Apache's.


-d

On Sun, Dec 18, 2011 at 2:25 PM, Dmitriy Lyubimov <[email protected]> wrote:
> PS: if it is not terribly difficult, it would be awesome if you could
> post your patch on GitHub (with the complete Mahout history based on
> git.apache.org/mahout).
>
> Then we can merge it more easily in case it gets out of sync with the
> trunk HEAD.
>
> Thank you for doing this.
>
>
> On Sun, Dec 18, 2011 at 2:24 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> If I had to guess, the mapper's reported time should be under 1 minute
>> regardless of the input size on any __non-vm__ machine (unless it is
>> an IBM XT :) even with -Xmx200m, which is the Hadoop default.
>>
>> The reducer time depends on the input size, but unless you manage to
>> generate 1000 mappers, I don't think it will go over 1 min either.
>>
>> Thanks.
>> -Dmitriy
>>
>> On Sun, Dec 18, 2011 at 2:04 PM, Raphael Cendrillon
>> <[email protected]> wrote:
>>> Thanks Dmitriy. I tend to agree. Let's pull out the generic and just make it
>>> dense.
>>>
>>> Let me try out some larger data sets and see how it runs. Do you have any 
>>> suggestions / expectations on performance that I should aim for? E.g. Given 
>>> x nodes and a y by y matrix the job should take around z minutes?
>>>
>>> As a follow up, would it be worth starting work on the 'brute force' job 
>>> for subtracting the average from each of the rows?
>>>
>>> On Dec 18, 2011, at 1:56 PM, "Dmitriy Lyubimov (Commented) (JIRA)" 
>>> <[email protected]> wrote:
>>>
>>>>
>>>>    [ 
>>>> https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171946#comment-13171946
>>>>  ]
>>>>
>>>> Dmitriy Lyubimov commented on MAHOUT-923:
>>>> -----------------------------------------
>>>>
>>>> Raphael, thank you for seeing this through.
>>>>
>>>> Q:
>>>> 1) -- Why do you need a vector class for the accumulator now? The mean is
>>>> expected to be dense in the end, if not in the mappers then at least in
>>>> the reducer for sure. And secondly, if you do want this, why wouldn't your
>>>> API accept a class instance rather than a "short" name? That would be
>>>> consistent with the Hadoop Job and file format APIs, which take
>>>> classes, not strings.
>>>>
>>>> 2) -- I know you have a unit test, but did you test it on a simulated
>>>> input, say, 2 GB in size? If not, I will have to test it before you proceed.
>>>>
>>>> As a next step, I guess I need to try it out to see if it works on various
>>>> kinds of input.
>>>>
>>>>> Row mean job for PCA
>>>>> --------------------
>>>>>
>>>>>                Key: MAHOUT-923
>>>>>                URL: https://issues.apache.org/jira/browse/MAHOUT-923
>>>>>            Project: Mahout
>>>>>         Issue Type: Improvement
>>>>>         Components: Math
>>>>>   Affects Versions: 0.6
>>>>>           Reporter: Raphael Cendrillon
>>>>>           Assignee: Raphael Cendrillon
>>>>>            Fix For: Backlog
>>>>>
>>>>>        Attachments: MAHOUT-923.patch, MAHOUT-923.patch, MAHOUT-923.patch
>>>>>
>>>>>
>>>>> Add map reduce job for calculating mean row (column-wise mean) of a 
>>>>> Distributed Row Matrix for use in PCA.
>>>>
>>>> --
>>>> This message is automatically generated by JIRA.
>>>> If you think it was sent incorrectly, please contact your JIRA 
>>>> administrators: 
>>>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>>
>>>>
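
For readers following the thread, here is a rough sketch of the computation
being discussed: a column-wise mean built from dense per-split partial sums,
plus the "brute force" follow-up of subtracting that mean from each row. It is
not the MAHOUT-923 patch. Plain double arrays stand in for Mahout vectors, the
two static methods only imitate a map phase and a reduce phase, and every name
in it (RowMeanSketch, partialSum, meanRow, subtractMean) is made up for
illustration.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: no Hadoop or Mahout classes are used here.
class RowMeanSketch {

    // "Mapper" side: sum the rows of one input split into a dense accumulator.
    // The returned array has numCols + 1 slots; the last slot is the row count.
    static double[] partialSum(double[][] rows, int numCols) {
        double[] acc = new double[numCols + 1];
        for (double[] row : rows) {
            for (int j = 0; j < numCols; j++) {
                acc[j] += row[j];
            }
            acc[numCols] += 1.0;
        }
        return acc;
    }

    // "Reducer" side: add up the partial sums, then divide by the total row count.
    static double[] meanRow(List<double[]> partials, int numCols) {
        double[] total = new double[numCols + 1];
        for (double[] p : partials) {
            for (int j = 0; j <= numCols; j++) {
                total[j] += p[j];
            }
        }
        double count = total[numCols];
        if (count == 0.0) {
            throw new IllegalArgumentException("no rows to average");
        }
        double[] mean = new double[numCols];
        for (int j = 0; j < numCols; j++) {
            mean[j] = total[j] / count;
        }
        return mean;
    }

    // Follow-up step: return a copy of the rows with the mean row subtracted.
    static double[][] subtractMean(double[][] rows, double[] mean) {
        double[][] centered = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            centered[i] = new double[mean.length];
            for (int j = 0; j < mean.length; j++) {
                centered[i][j] = rows[i][j] - mean[j];
            }
        }
        return centered;
    }

    public static void main(String[] args) {
        double[][] split1 = {{1, 2}, {3, 4}};   // rows seen by one "mapper"
        double[][] split2 = {{5, 6}};           // rows seen by another
        double[] mean = meanRow(
            Arrays.asList(partialSum(split1, 2), partialSum(split2, 2)), 2);
        System.out.println(Arrays.toString(mean));   // prints [3.0, 4.0]
        System.out.println(Arrays.deepToString(subtractMean(split2, mean)));
    }
}
```

Each partial result is one dense array of numCols + 1 doubles, whatever the
size of the split, which fits the point made in the thread that the
accumulator may as well be dense.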
