[ https://issues.apache.org/jira/browse/MAHOUT-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520216#comment-14520216 ]
ASF GitHub Bot commented on MAHOUT-1626:
----------------------------------------

Github user gcapan commented on the pull request:

    https://github.com/apache/mahout/pull/62#issuecomment-97582966

Under some conditions, which are satisfied in the case of linear and logistic regression, the solution of a statistical optimization problem (parameter estimation) over i.i.d. data, computed in a distributed fashion as either

1. the average of the local estimates, or
2. a combination of the average of the local estimates and the average of the estimates on subsamples of the local sample sets,

converges in mean to the optimal risk minimizer, as described in [1].

Given that, these methods are not only a way to distribute machine learning; they also provide a _justification for machine learning on Big Data_ (that is, these algorithms converge to the true risk minimizer as if the whole data were processed on a single computer).

With this motivation, I propose to add these two distribution schemes for machine learning: averaging and bootstrap-averaging. These would be abstracted away from the actual loss minimization algorithms, and the backend engines would only provide these two simple functions. Users can plug in their favourite (in-core) optimization algorithm, and of course we would want to provide some of them out of the box.

Very soon, I am hoping to submit a patch for that. The current patch would be obsolete then, so there is no need to replicate this. Once I submit it, I'll close the current PR.
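
To make the two schemes concrete, here is a minimal single-process Python sketch, not the proposed Mahout API. All names (`fit_line`, `average_mixture`, `bootstrap_average_mixture`) are hypothetical, the in-core solver is a toy least-squares line fit, and the partitions are plain lists standing in for the local sample sets. The bias-corrected combination `(avg_full - r * avg_sub) / (1 - r)` is my reading of the subsampled average mixture in [1] and should be checked against the paper.

```python
import math
import random
from statistics import fmean
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, float]                      # one (x, y) observation
Estimator = Callable[[Sequence[Sample]], List[float]]


def fit_line(samples: Sequence[Sample]) -> List[float]:
    """Toy in-core solver: least squares for y = a + b*x, returns [a, b]."""
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in samples)
    b = sxy / sxx
    return [my - b * mx, b]


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Coordinate-wise mean of equally sized parameter vectors."""
    return [fmean(column) for column in zip(*vectors)]


def average_mixture(partitions: Sequence[Sequence[Sample]],
                    fit: Estimator) -> List[float]:
    """Scheme 1: fit on every partition, then average the local estimates."""
    return mean_vector([fit(part) for part in partitions])


def bootstrap_average_mixture(partitions: Sequence[Sequence[Sample]],
                              fit: Estimator,
                              r: float,
                              rng: random.Random) -> List[float]:
    """Scheme 2: combine the full-sample average with a subsample average.

    Each partition is additionally fitted on a random subsample holding a
    fraction r of its rows (drawn without replacement). The combination
    below is an assumption based on [1], not taken from the patch.
    """
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie strictly between 0 and 1")
    avg_full = average_mixture(partitions, fit)
    sub_estimates = []
    for part in partitions:
        k = math.ceil(r * len(part))
        sub_estimates.append(fit(rng.sample(list(part), k)))
    avg_sub = mean_vector(sub_estimates)
    return [(f - r * s) / (1.0 - r) for f, s in zip(avg_full, avg_sub)]
```

A backend would only need to supply the "map over partitions, then average" step; `fit` can be any in-core optimizer that returns a fixed-length parameter vector.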

[1] http://arxiv.org/abs/1209.4129 (The short version in NIPS: http://stanford.edu/~jduchi/projects/ZhangDuWa12_nips.pdf)

> Support for required quasi-algebraic operations and starting with aggregating rows/blocks
> -----------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1626
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1626
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>    Affects Versions: 1.0.0
>            Reporter: Gokhan Capan
>            Assignee: Gokhan Capan
>              Labels: DSL, scala, spark
>             Fix For: 0.11.0
>
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)