Re: [jira] [Commented] (MAHOUT-923) Row mean job for PCA

Dmitriy Lyubimov Sun, 18 Dec 2011 14:29:42 -0800

PPPS
you can also clone github's own mirror of git.apache.com but be
careful: they seem to be out of date pretty badly from time to time.
so better either use my branch or clone from apache directly (longer)
if github's mirror is out of date.


-d

On Sun, Dec 18, 2011 at 2:28 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> ps you can save time for pull/push tremendously by just cloning
> github.com:dlyubimov/mahout-commits repo. its trunk is already
> up-to-date with apache's.
>
> -d
>
> On Sun, Dec 18, 2011 at 2:25 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>> PS if it is not terribly difficult, if you could post your patch on
>> github, it would be awesome (with complete mahout history based on
>> git.apache.org/mahout)
>>
>> Then we can merge it more easily in case it gets out of sync with the
>> trunk HEAD.
>>
>> Thank you for doing this.
>>
>>
>> On Sun, Dec 18, 2011 at 2:24 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>> If i had to guess, the mapper reported time should be under 1 minute
>>> regardless of the input size on any __non-vm__ machine (unless it is
>>> IBM XT :) even with -Xmx200m which is hadoop default.
>>>
>>> The reducer depends on the input size, but unless you manage to
>>> generate 1000 mappers, i don't think it will jump out of 1 min either.
>>>
>>> Thanks.
>>> -Dmitriy
>>>
>>> On Sun, Dec 18, 2011 at 2:04 PM, Raphael Cendrillon
>>> <cendrillon1...@gmail.com> wrote:
>>>> Thanks Dmitry. I tend to agree. Let's pull out the generic and just set it 
>>>> dense.
>>>>
>>>> Let me try out some larger data sets and see how it runs. Do you have any 
>>>> suggestions / expectations on performance that I should aim for? E.g. 
>>>> Given x nodes and a y by y matrix the job should take around z minutes?
>>>>
>>>> As a follow up, would it be worth starting work on the 'brute force' job 
>>>> for subtracting the average from each of the rows?
>>>>
>>>> On Dec 18, 2011, at 1:56 PM, "Dmitriy Lyubimov (Commented) (JIRA)" 
>>>> <j...@apache.org> wrote:
>>>>
>>>>>
>>>>>    [ 
>>>>> https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171946#comment-13171946
>>>>>  ]
>>>>>
>>>>> Dmitriy Lyubimov commented on MAHOUT-923:
>>>>> -----------------------------------------
>>>>>
>>>>> Raphael, thank you for seeing this thru.
>>>>>
>>>>> Q:
>>>>> 1) -- why do you need vector class for the accumulator now? mean is kind 
>>>>> of expected to be dense in the end, if not in the mappers then at least 
>>>>> in the reducer for sure. And secondly, if you want to do this, why don't 
>>>>> your api would accept a class instance, not a "short" name? that would be 
>>>>> consistent with the Hadoop Job and file format apis which kind of take 
>>>>> classes, not strings.
>>>>>
>>>>> 2) --  I know you have a unit test, but did you test it on a simulated 
>>>>> input, like say 2G big? if not, i will have to test it before you proceed.
>>>>>
>>>>> As a next step, i guess i need to try it out to see if it works on 
>>>>> various kind of inputs.
>>>>>
>>>>>> Row mean job for PCA
>>>>>> --------------------
>>>>>>
>>>>>>                Key: MAHOUT-923
>>>>>>                URL: https://issues.apache.org/jira/browse/MAHOUT-923
>>>>>>            Project: Mahout
>>>>>>         Issue Type: Improvement
>>>>>>         Components: Math
>>>>>>   Affects Versions: 0.6
>>>>>>           Reporter: Raphael Cendrillon
>>>>>>           Assignee: Raphael Cendrillon
>>>>>>            Fix For: Backlog
>>>>>>
>>>>>>        Attachments: MAHOUT-923.patch, MAHOUT-923.patch, MAHOUT-923.patch
>>>>>>
>>>>>>
>>>>>> Add map reduce job for calculating mean row (column-wise mean) of a 
>>>>>> Distributed Row Matrix for use in PCA.
>>>>>
>>>>> --
>>>>> This message is automatically generated by JIRA.
>>>>> If you think it was sent incorrectly, please contact your JIRA 
>>>>> administrators: 
>>>>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>>>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>>>
>>>>>

Re: [jira] [Commented] (MAHOUT-923) Row mean job for PCA

Reply via email to