PS: If it is not terribly difficult, it would be awesome if you could post your patch on GitHub (with complete Mahout history based on git.apache.org/mahout).
Then we can merge it more easily in case it gets out of sync with the trunk HEAD. Thank you for doing this.

On Sun, Dec 18, 2011 at 2:24 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> If I had to guess, the mapper's reported time should be under 1 minute
> regardless of the input size on any __non-vm__ machine (unless it is
> an IBM XT :) even with -Xmx200m, which is the Hadoop default.
>
> The reducer depends on the input size, but unless you manage to
> generate 1000 mappers, I don't think it will go over 1 minute either.
>
> Thanks.
> -Dmitriy
>
> On Sun, Dec 18, 2011 at 2:04 PM, Raphael Cendrillon
> <cendrillon1...@gmail.com> wrote:
>> Thanks Dmitriy. I tend to agree. Let's pull out the generic and just set it
>> dense.
>>
>> Let me try out some larger data sets and see how it runs. Do you have any
>> suggestions / expectations on performance that I should aim for? E.g. given
>> x nodes and a y by y matrix, the job should take around z minutes?
>>
>> As a follow-up, would it be worth starting work on the 'brute force' job for
>> subtracting the average from each of the rows?
>>
>> On Dec 18, 2011, at 1:56 PM, "Dmitriy Lyubimov (Commented) (JIRA)"
>> <j...@apache.org> wrote:
>>
>>>
>>> [
>>> https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171946#comment-13171946
>>> ]
>>>
>>> Dmitriy Lyubimov commented on MAHOUT-923:
>>> -----------------------------------------
>>>
>>> Raphael, thank you for seeing this through.
>>>
>>> Q:
>>> 1) -- Why do you need a vector class for the accumulator now? The mean is
>>> expected to be dense in the end, if not in the mappers then at least in the
>>> reducer for sure. And secondly, if you want to do this, why doesn't your API
>>> accept a class instance instead of a "short" name? That would be consistent
>>> with the Hadoop Job and file format APIs, which take classes, not
>>> strings.
>>>
>>> 2) -- I know you have a unit test, but did you test it on a simulated
>>> input, say 2G big? If not, I will have to test it before you proceed.
>>>
>>> As a next step, I guess I need to try it out to see if it works on various
>>> kinds of inputs.
>>>
>>>> Row mean job for PCA
>>>> --------------------
>>>>
>>>> Key: MAHOUT-923
>>>> URL: https://issues.apache.org/jira/browse/MAHOUT-923
>>>> Project: Mahout
>>>> Issue Type: Improvement
>>>> Components: Math
>>>> Affects Versions: 0.6
>>>> Reporter: Raphael Cendrillon
>>>> Assignee: Raphael Cendrillon
>>>> Fix For: Backlog
>>>>
>>>> Attachments: MAHOUT-923.patch, MAHOUT-923.patch, MAHOUT-923.patch
>>>>
>>>>
>>>> Add map reduce job for calculating mean row (column-wise mean) of a
>>>> Distributed Row Matrix for use in PCA.
>>>
>>> --
>>> This message is automatically generated by JIRA.
>>> If you think it was sent incorrectly, please contact your JIRA
>>> administrators:
>>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
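
For reference, here is a rough sketch of the computation discussed in the thread: each mapper keeps one dense accumulator of column sums plus a row count, and the reducer adds the partial results and divides. This is not the code from MAHOUT-923.patch and it does not use the Mahout or Hadoop APIs; it is plain Python, and the row format (a dict from column index to value) and all function names are made up for illustration. The last function sketches the 'brute force' follow-up of subtracting the mean from a row.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical row format: column index -> value (sparse row).
SparseRow = Dict[int, float]


def map_partial_sum(rows: Iterable[SparseRow], num_cols: int) -> Tuple[List[float], int]:
    """Mapper side: fold one input split into a dense column-sum vector and a row count."""
    sums = [0.0] * num_cols
    count = 0
    for row in rows:
        for col, value in row.items():
            sums[col] += value
        count += 1
    return sums, count


def reduce_mean(partials: Iterable[Tuple[List[float], int]], num_cols: int) -> List[float]:
    """Reducer side: add the partial sums and counts, then divide to get the mean row."""
    total = [0.0] * num_cols
    total_rows = 0
    for sums, count in partials:
        for col in range(num_cols):
            total[col] += sums[col]
        total_rows += count
    if total_rows == 0:
        raise ValueError("cannot take the mean of zero rows")
    return [s / total_rows for s in total]


def center_row(row: SparseRow, mean: List[float]) -> List[float]:
    """Follow-up step: subtract the mean row from one row (the result is dense)."""
    centered = [-m for m in mean]
    for col, value in row.items():
        centered[col] += value
    return centered


# Two simulated input splits of a 3-column matrix.
splits = [
    [{0: 1.0, 2: 3.0}, {1: 2.0}],
    [{0: 3.0, 1: 4.0, 2: 3.0}],
]
partials = [map_partial_sum(split, 3) for split in splits]
mean_row = reduce_mean(partials, 3)
```

Since every mapper emits a single vector of length num_cols, the reducer's work grows with the number of mappers rather than with the number of input rows, and centering a sparse row always yields a dense one.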