PS: if it is not too difficult, it would be awesome if you could post
your patch on GitHub (with the complete Mahout history based on
git.apache.org/mahout).

Then we can merge it more easily in case it gets out of sync with the
trunk HEAD.

Thank you for doing this.


On Sun, Dec 18, 2011 at 2:24 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> If I had to guess, the reported mapper time should be under 1 minute
> regardless of the input size on any __non-VM__ machine (unless it is
> an IBM XT :) even with -Xmx200m, which is the Hadoop default.
>
> The reducer time depends on the input size, but unless you manage to
> generate 1000 mappers, I don't think it will exceed 1 minute either.
>
> Thanks.
> -Dmitriy
>
> On Sun, Dec 18, 2011 at 2:04 PM, Raphael Cendrillon
> <cendrillon1...@gmail.com> wrote:
>> Thanks Dmitriy. I tend to agree. Let's pull out the generic and just set it 
>> dense.
>>
>> Let me try out some larger data sets and see how it runs. Do you have any 
>> suggestions / expectations on performance that I should aim for? E.g., given 
>> x nodes and a y-by-y matrix, the job should take around z minutes?
>>
>> As a follow-up, would it be worth starting work on the 'brute force' job for 
>> subtracting the average from each of the rows?
>>
>> On Dec 18, 2011, at 1:56 PM, "Dmitriy Lyubimov (Commented) (JIRA)" 
>> <j...@apache.org> wrote:
>>
>>>
>>>    [ 
>>> https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171946#comment-13171946
>>>  ]
>>>
>>> Dmitriy Lyubimov commented on MAHOUT-923:
>>> -----------------------------------------
>>>
>>> Raphael, thank you for seeing this through.
>>>
>>> Q:
>>> 1) -- Why do you need a vector class for the accumulator now? The mean is 
>>> expected to be dense in the end, if not in the mappers then at least in the 
>>> reducer for sure. Secondly, if you want to do this, why doesn't your API 
>>> accept a class instance rather than a "short" name? That would be consistent 
>>> with the Hadoop Job and file format APIs, which take classes, not 
>>> strings.
>>>
>>> 2) -- I know you have a unit test, but did you test it on simulated 
>>> input, say 2 GB in size? If not, I will have to test it before you proceed.
>>>
>>> As a next step, I guess I need to try it out to see if it works on various 
>>> kinds of input.
>>>
>>>> Row mean job for PCA
>>>> --------------------
>>>>
>>>>                Key: MAHOUT-923
>>>>                URL: https://issues.apache.org/jira/browse/MAHOUT-923
>>>>            Project: Mahout
>>>>         Issue Type: Improvement
>>>>         Components: Math
>>>>   Affects Versions: 0.6
>>>>           Reporter: Raphael Cendrillon
>>>>           Assignee: Raphael Cendrillon
>>>>            Fix For: Backlog
>>>>
>>>>        Attachments: MAHOUT-923.patch, MAHOUT-923.patch, MAHOUT-923.patch
>>>>
>>>>
>>>> Add map reduce job for calculating mean row (column-wise mean) of a 
>>>> Distributed Row Matrix for use in PCA.
>>>

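For readers following the thread, the "mean row" idea in MAHOUT-923 can be
sketched roughly as below. This is only an illustration in plain Java, not the
attached patch and not Mahout's API: the class RowMeanSketch, its Partial
holder, and the methods mapSplit, reduce and center are all hypothetical names.
It assumes each mapper keeps a dense per-column sum plus a row count for its
split, and that a single reducer merges those partials and divides by the
total row count. The center method illustrates the 'brute force' follow-up of
subtracting the mean from each row.

```java
import java.util.List;

/** Hypothetical sketch of a column-wise mean; not the MAHOUT-923 patch. */
public class RowMeanSketch {

    /** Partial result from one "mapper": per-column sums and rows seen. */
    static final class Partial {
        final double[] sums;
        final long rows;

        Partial(double[] sums, long rows) {
            this.sums = sums;
            this.rows = rows;
        }
    }

    /** Map side: accumulate a dense per-column sum over one split of rows. */
    static Partial mapSplit(List<double[]> rows, int numCols) {
        double[] sums = new double[numCols];
        for (double[] row : rows) {
            for (int j = 0; j < numCols; j++) {
                sums[j] += row[j];
            }
        }
        return new Partial(sums, rows.size());
    }

    /** Reduce side: merge all partials, then divide by the total row count. */
    static double[] reduce(List<Partial> partials, int numCols) {
        double[] mean = new double[numCols];
        long total = 0;
        for (Partial p : partials) {
            for (int j = 0; j < numCols; j++) {
                mean[j] += p.sums[j];
            }
            total += p.rows;
        }
        if (total == 0) {
            return mean; // no rows at all: return zeros rather than NaN
        }
        for (int j = 0; j < numCols; j++) {
            mean[j] /= total;
        }
        return mean;
    }

    /** Follow-up step: subtract the mean row from every row, in place. */
    static void center(List<double[]> rows, double[] mean) {
        for (double[] row : rows) {
            for (int j = 0; j < mean.length; j++) {
                row[j] -= mean[j];
            }
        }
    }
}
```

Carrying the row count alongside the sums is what lets partials from splits
of different sizes be merged correctly; averaging per-split means directly
would weight small splits too heavily.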