[ 
https://issues.apache.org/jira/browse/MAHOUT-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714801#comment-13714801
 ] 

Kun Yang commented on MAHOUT-1273:
----------------------------------

Hi,

Thanks for the feedback.

The algorithm does not solve the normal equations as in ordinary linear 
regression. I did not detail in the paper how the penalized optimization is 
solved. To solve the penalized version, I will use coordinate descent, which 
is well documented in other papers (see Friedman's paper: for 1000 variables, 
cross validation takes ~1 min in the R package) and is very efficient.
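
For readers who have not seen the method, here is a minimal sketch of cyclic 
coordinate descent for the lasso. It is only an illustration under stated 
assumptions, not the implementation proposed in this issue: the objective is 
assumed to be (1/(2n)) * ||y - X*b||^2 + penalty * ||b||_1, there is no 
intercept, the data fit in memory as plain Python lists, and the names 
soft_threshold and lasso_coordinate_descent are made up for this example.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso shrinkage step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(rows, targets, penalty, sweeps=200, tol=1e-10):
    """Minimize (1/(2n)) * ||y - X*b||^2 + penalty * ||b||_1.

    rows is a list of n feature lists (X), targets is a list of n values (y).
    Hypothetical helper for illustration; assumes no intercept term.
    """
    n = len(rows)
    p = len(rows[0])
    coef = [0.0] * p
    residual = list(targets)  # equals y - X*coef because coef starts at zero
    # Mean squared value of each column, used to scale the update.
    col_scale = [sum(row[j] ** 2 for row in rows) / n for j in range(p)]

    for _ in range(sweeps):
        largest_change = 0.0
        for j in range(p):
            if col_scale[j] == 0.0:
                continue  # an all-zero column keeps a zero coefficient
            old = coef[j]
            # Correlation of column j with the residual that excludes
            # column j's own current contribution.
            rho = sum(rows[i][j] * residual[i] for i in range(n)) / n
            rho += col_scale[j] * old
            new = soft_threshold(rho, penalty) / col_scale[j]
            if new != old:
                delta = new - old
                # Keep the residual consistent with the updated coefficient.
                for i in range(n):
                    residual[i] -= rows[i][j] * delta
                coef[j] = new
                largest_change = max(largest_change, abs(delta))
        if largest_change < tol:
            break  # no coefficient moved noticeably in a full sweep
    return coef
```

Calling lasso_coordinate_descent(rows, targets, 0.0) gives an unpenalized 
least-squares fit for a full-rank design, and raising the penalty drives weak 
coefficients to exactly zero. Each coordinate update is cheap, which is why 
the method is attractive when a whole path of penalty values has to be 
evaluated for cross validation.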

As I discussed in the conclusion section, solving the problem with a large 
number of predictors is still a challenge, even in the single-machine or MPI 
version, but the proposed algorithm can handle on the order of 5000 
variables, which covers many applications.

My plan is to implement a working version first, then add refinements such 
as sparse vectors.

Any feedback?
Best
-Kun


                
> Single Pass Algorithm for Penalized Linear Regression on MapReduce
> ------------------------------------------------------------------
>
>                 Key: MAHOUT-1273
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1273
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Kun Yang
>         Attachments: PenalizedLinear.pdf
>
>   Original Estimate: 720h
>  Remaining Estimate: 720h
>
> Penalized linear regression methods such as Lasso and Elastic-net are widely 
> used in machine learning, but there are no very efficient, scalable 
> implementations on MapReduce.
> The published distributed algorithms for solving this problem are either 
> iterative (which is not good for MapReduce, see Stephen Boyd's paper) or 
> approximate (what if we need exact solutions? see Parallelized Stochastic 
> Gradient Descent); another disadvantage of these algorithms is that they 
> cannot do cross validation in the training phase, so they require a 
> user-specified penalty parameter in advance. 
> My ideas can train the model with cross validation in a single pass. They are 
> based on some simple observations.
> I have implemented a primitive version of this algorithm at Alpine Data 
> Labs. Advanced features such as the in-mapper combiner are employed to reduce 
> the network traffic in the shuffle phase.  
