[
https://issues.apache.org/jira/browse/MAHOUT-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520216#comment-14520216
]
ASF GitHub Bot commented on MAHOUT-1626:
----------------------------------------
GitHub user gcapan commented on the pull request:
https://github.com/apache/mahout/pull/62#issuecomment-97582966
Under some conditions, which are satisfied in the case of linear and
logistic regression, a statistical optimization problem (parameter
estimation) over i.i.d. data can be distributed by computing either:
1. the average of the local estimates, or
2. a combination of the average of the local estimates and the average of
the estimates on subsamples of the local sample sets.
In both cases the result converges in mean to the optimal risk minimizer, as
described in [1].
Given that, these methods are not only a way to distribute machine learning;
they also provide a _justification for machine learning on Big Data_ (that is,
these algorithms converge to the true risk minimizer as if the whole data set
were processed on a single computer).
With this motivation, I propose to add the two distribution schemes for
machine learning: averaging and bootstrap-averaging. These would be abstracted
away from the actual loss minimization algorithms, and the backend engines
would only need to provide these two simple functions. Users can plug in their
favourite (in-core) optimization algorithm, and of course we would want to
provide some of them out of the box.
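
To make the proposal concrete, here is a minimal single-process sketch of the
two schemes, written in Python for brevity; it is not the planned patch and
does not use any Mahout API. All names (fit_slope, average_estimate,
bootstrap_average_estimate) are hypothetical, the partitions are plain lists
standing in for distributed blocks, and the in-core estimator is a toy
least-squares slope through the origin. The subsample ratio r and the
combination (avg_full - r * avg_sub) / (1 - r) are my reading of the
subsampled averaging estimator in [1], so treat them as an assumption.

```python
import random
from statistics import fmean


def fit_slope(sample):
    """Hypothetical in-core estimator: least-squares slope through the origin."""
    sxy = sum(x * y for x, y in sample)
    sxx = sum(x * x for x, _ in sample)
    return sxy / sxx


def average_estimate(partitions, fit):
    """Scheme 1: fit each partition locally, then average the local estimates."""
    return fmean([fit(p) for p in partitions])


def bootstrap_average_estimate(partitions, fit, r=0.5, seed=0):
    """Scheme 2: combine local estimates with estimates on local subsamples.

    Each partition is subsampled without replacement at ratio r (0 < r < 1).
    The combination below is assumed from [1], not taken from any patch.
    """
    if not 0.0 < r < 1.0:
        raise ValueError("r must be in (0, 1)")
    rng = random.Random(seed)
    avg_full = average_estimate(partitions, fit)
    # Draw one subsample per partition, with at least one point each.
    subsamples = [rng.sample(p, max(1, round(r * len(p)))) for p in partitions]
    avg_sub = average_estimate(subsamples, fit)
    return (avg_full - r * avg_sub) / (1.0 - r)


if __name__ == "__main__":
    rng = random.Random(42)
    # Four partitions of noisy points around y = 2x (illustrative data only).
    partitions = [
        [(x, 2.0 * x + rng.gauss(0.0, 0.1)) for x in range(1, 26)]
        for _ in range(4)
    ]
    print(average_estimate(partitions, fit_slope))
    print(bootstrap_average_estimate(partitions, fit_slope, r=0.5))
```

The point of the sketch is the shape of the abstraction: both schemes take the
estimator as an argument, so any in-core optimizer with the same signature
could be swapped in without touching the distribution logic.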
Very soon, I am hoping to submit a patch for that. The current patch would
be obsolete then, so there is no need to replicate this. Once I submit it, I'll
close the current PR.
[1] http://arxiv.org/abs/1209.4129
(The short version in NIPS:
http://stanford.edu/~jduchi/projects/ZhangDuWa12_nips.pdf)
> Support for required quasi-algebraic operations and starting with aggregating
> rows/blocks
> -----------------------------------------------------------------------------------------
>
> Key: MAHOUT-1626
> URL: https://issues.apache.org/jira/browse/MAHOUT-1626
> Project: Mahout
> Issue Type: New Feature
> Components: Math
> Affects Versions: 1.0.0
> Reporter: Gokhan Capan
> Assignee: Gokhan Capan
> Labels: DSL, scala, spark
> Fix For: 0.11.0
>
>