PS: you can save a tremendous amount of pull/push time by just cloning the github.com:dlyubimov/mahout-commits repo. Its trunk is already up to date with Apache's.
-d

On Sun, Dec 18, 2011 at 2:25 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> PS if it is not terribly difficult, if you could post your patch on
> github, it would be awesome (with complete mahout history based on
> git.apache.org/mahout)
>
> Then we can merge it more easily in case it gets out of sync with the
> trunk HEAD.
>
> Thank you for doing this.
>
>
> On Sun, Dec 18, 2011 at 2:24 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>> If i had to guess, the mapper reported time should be under 1 minute
>> regardless of the input size on any __non-vm__ machine (unless it is
>> IBM XT :) even with -Xmx200m which is hadoop default.
>>
>> The reducer depends on the input size, but unless you manage to
>> generate 1000 mappers, i don't think it will jump out of 1 min either.
>>
>> Thanks.
>> -Dmitriy
>>
>> On Sun, Dec 18, 2011 at 2:04 PM, Raphael Cendrillon
>> <cendrillon1...@gmail.com> wrote:
>>> Thanks Dmitry. I tend to agree. Let's pull out the generic and just set it
>>> dense.
>>>
>>> Let me try out some larger data sets and see how it runs. Do you have any
>>> suggestions / expectations on performance that I should aim for? E.g. Given
>>> x nodes and a y by y matrix the job should take around z minutes?
>>>
>>> As a follow up, would it be worth starting work on the 'brute force' job
>>> for subtracting the average from each of the rows?
>>>
>>> On Dec 18, 2011, at 1:56 PM, "Dmitriy Lyubimov (Commented) (JIRA)"
>>> <j...@apache.org> wrote:
>>>
>>>>
>>>> [
>>>> https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171946#comment-13171946
>>>> ]
>>>>
>>>> Dmitriy Lyubimov commented on MAHOUT-923:
>>>> -----------------------------------------
>>>>
>>>> Raphael, thank you for seeing this thru.
>>>>
>>>> Q:
>>>> 1) -- why do you need vector class for the accumulator now? mean is kind
>>>> of expected to be dense in the end, if not in the mappers then at least in
>>>> the reducer for sure. And secondly, if you want to do this, why don't your
>>>> api would accept a class instance, not a "short" name? that would be
>>>> consistent with the Hadoop Job and file format apis which kind of take
>>>> classes, not strings.
>>>>
>>>> 2) -- I know you have a unit test, but did you test it on a simulated
>>>> input, like say 2G big? if not, i will have to test it before you proceed.
>>>>
>>>> As a next step, i guess i need to try it out to see if it works on various
>>>> kind of inputs.
>>>>
>>>>> Row mean job for PCA
>>>>> --------------------
>>>>>
>>>>>                 Key: MAHOUT-923
>>>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>>>>>             Project: Mahout
>>>>>          Issue Type: Improvement
>>>>>          Components: Math
>>>>>    Affects Versions: 0.6
>>>>>            Reporter: Raphael Cendrillon
>>>>>            Assignee: Raphael Cendrillon
>>>>>             Fix For: Backlog
>>>>>
>>>>>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch, MAHOUT-923.patch
>>>>>
>>>>>
>>>>> Add map reduce job for calculating mean row (column-wise mean) of a
>>>>> Distributed Row Matrix for use in PCA.
>>>>
>>>> --
>>>> This message is automatically generated by JIRA.
>>>> If you think it was sent incorrectly, please contact your JIRA
>>>> administrators:
>>>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>>
>>>>
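
For anyone following the thread without the patch at hand, here is a rough sketch of the idea behind the "mean row" (column-wise mean) job and the follow-up mean-subtraction step. It is plain Python, not the Mahout or Hadoop API and not the code in MAHOUT-923.patch; the names `map_rows`, `reduce_partials` and `subtract_mean` are made up for illustration. It assumes each mapper keeps a dense accumulator of column sums plus a row count for its split, and a single reducer combines those partials.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical row representation: column index -> value (missing means 0.0).
SparseRow = Dict[int, float]
Partial = Tuple[List[float], int]  # (dense column sums, number of rows seen)


def map_rows(rows: Iterable[SparseRow], num_cols: int) -> Partial:
    """Mapper side: dense column sums and a row count for one input split."""
    sums = [0.0] * num_cols
    count = 0
    for row in rows:
        for col, value in row.items():
            sums[col] += value
        count += 1
    return sums, count


def reduce_partials(partials: Iterable[Partial], num_cols: int) -> List[float]:
    """Reducer side: add up the partial sums and divide by the total row count."""
    total = [0.0] * num_cols
    n = 0
    for sums, count in partials:
        for col in range(num_cols):
            total[col] += sums[col]
        n += count
    if n == 0:
        raise ValueError("cannot compute the mean row of an empty matrix")
    return [s / n for s in total]


def subtract_mean(row: SparseRow, mean: List[float]) -> List[float]:
    """'Brute force' centering: the result is dense even if the row was sparse."""
    return [row.get(col, 0.0) - m for col, m in enumerate(mean)]


if __name__ == "__main__":
    # Two pretend input splits of a 4 x 3 matrix.
    split_a = [{0: 2.0}, {0: 4.0, 1: 6.0}]
    split_b = [{1: 3.0, 2: 9.0}, {}]
    partials = [map_rows(split_a, 3), map_rows(split_b, 3)]
    mean_row = reduce_partials(partials, 3)
    print(mean_row)
    print(subtract_mean(split_a[0], mean_row))
```

Carrying the row count alongside the sums is what lets the reducer produce a correct mean when the splits hold different numbers of rows. The sketch also shows why the accumulator ends up dense: any column touched by at least one row gets a nonzero sum.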