LSQR is essentially a Krylov subspace technique, since it is based on the Lanczos algorithm. As such, it is inherently iterative and requires many passes through the data.
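To make that concrete, here is a toy in-memory NumPy sketch (my own illustration, not Mahout code) of the Golub-Kahan bidiagonalization recurrence that LSQR is built on. Each step multiplies by both A and A', so every Krylov iteration costs roughly a full pass over the data:

```python
import numpy as np

def bidiag_steps(A, b, k):
    """Run k Golub-Kahan bidiagonalization steps starting from b.

    Returns U (m, k+1) and V (n, k) with orthonormal columns; LSQR
    builds its iterates from exactly this recurrence.
    """
    m, n = A.shape
    U = np.zeros((m, k + 1))
    V = np.zeros((n, k))
    # beta_1 u_1 = b
    beta = np.linalg.norm(b)
    U[:, 0] = b / beta
    # alpha_1 v_1 = A' u_1   (one product with A')
    v = A.T @ U[:, 0]
    alpha = np.linalg.norm(v)
    V[:, 0] = v / alpha
    for j in range(k):
        # beta_{j+1} u_{j+1} = A v_j - alpha_j u_j   (one product with A)
        u = A @ V[:, j] - alpha * U[:, j]
        beta = np.linalg.norm(u)
        U[:, j + 1] = u / beta
        if j + 1 < k:
            # alpha_{j+1} v_{j+1} = A' u_{j+1} - beta_{j+1} v_j
            v = A.T @ U[:, j + 1] - beta * V[:, j]
            alpha = np.linalg.norm(v)
            V[:, j + 1] = v / alpha
    return U, V
```

Each trip around the loop touches every nonzero of A twice, which is where the many-passes cost comes from.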
It might be more efficient if you could use a random projection technique. These algorithms can generally do the work of many Krylov iterations in a single pass over the data. I haven't looked into this in detail, but it seems to me that one pass could get you the least-squares fit for the first k < 100 singular values, and that subsequent passes could be applied to the residual to get the effect of another 100 or so singular values per pass.

The crux of the idea is that if you right-multiply your large matrix A by a tall, skinny random matrix R, you can generate an orthonormal matrix Q whose columns span the range of Y = A R. The exciting part is that Q Q' A will then be a close approximation of A in terms of Frobenius norm. Minimizing || Q Q' A x - b ||_2 should be possible fairly quickly without actually constructing Q Q' A, because (I think) you should be able to minimize || Q' A x - Q' b ||_2 instead. Moreover, since Q' A is small and of limited rank, I think you can use the normal equations (Q' A)' (Q' A) to find the answer you want without the horrible numerical problems that approach usually entails. Less directly, you can form the partial SVD of Q' A and use that to solve for x.

Solving the L_2 regularized system would proceed essentially the same way, with a fancier value for A. I don't know how this would extend to solving the L_1 regularized system.

Since just reading A is the dominant cost, and since one determination of Q gets you something proportional to k singular vectors while each Krylov iteration gets you something proportional to one, this could get you two or three orders of magnitude improvement in speed.

As usual, I would refer you to Halko, Martinsson, and Tropp's great survey article at http://arxiv.org/abs/0909.4061v1 . They refer to some approaches for least-squares solutions, but I haven't looked at the references that they cite.

On Tue, Aug 24, 2010 at 10:21 PM, David G.
Boney <[email protected]>wrote: > I assume that the proper etiquette is to send all correspondences to the > list, that side conversations are discouraged. > > I apologize in advance if I appear ignorant of all that Mahout has to > offer. I am currently reading though the book by Mahout in Action and > looking at the code. > > The regression algorithm (real inputs, real outputs) that I am working to > implement is LSQR. It is an iterative algorithm based on the conjugate > gradient method. It has a nice property that is only uses calculations from > Ax and A'y where A is a matrix, A' is the transpose, and x & y are vectors. > I would assume that with big data, one wants to avoid calculating A inverse > since this is O(n^3). Also, with sparse matrices, one wants to avoid > calculating AA' because the result can be dense. Below is a link to the > algorithm. > > http://www.stanford.edu/group/SOL/software/lsqr.html > > One issue I am working on is calculating Ax and A'y with only one pass > through A but perhaps multiple passes through x and y. In general, for > regression problems A is m rows and n columns where m >> n. The vector x has > the dimension n and the vector y has the dimension m. It might be the case > the x will fit in memory but I think the design should not assume this. I > also want to look at the issues where A is dense and A is sparse. > > LSQR can be used for maximum likelihood estimation of parameters of models > that are linear in their coefficients. I will first work on linear > regression then investigate the issue of using other basis sets. There may > already be work in Mahout to draw on for this issue. > > I think this will keep me busy for a while. > > Future issues would include: > 1. subset selection using the lasso or ridge regression > 2. simple bayesian regression, which their may already be work to draw on > in Mahout > 3. 
special basis sets > > My reference for least squares is: > Bjorck, Ake, (1996) Numerical Methods for Least Squares Problems > > http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=numerical+least+squares&x=0&y=0&ih=12_7_1_1_0_1_0_0_1_1.96_104&fsc=-1 > > My reference for numerical linear algebra is: > Saad, Yousef, (2003), Iterative Methods for Sparse Linear Systems > http://www-users.cs.umn.edu/~saad/books.html > > ------------- > Sincerely, > David G. Boney > [email protected] > http://www.sonicartifacts.com > > > > > On Aug 24, 2010, at 9:29 PM, Ted Dunning wrote: > > > On Tue, Aug 24, 2010 at 1:28 PM, David G. Boney <[email protected] > >wrote: > > > >> > >> I would like to contribute to Mahout. Who would be the point person on > the > >> following topics: linear algebra routines, regression (real inputs, real > >> outputs), subset selection for regression (lasso or ridge regression), > and > >> spectral graph methods? > >> > > > > Several of us can help on linear algebra. > > > > We have no linear regression to speak of at this point. > > > > I have done a fair bit of work on gradient descent for regularization. > > > > We have the beginnings of a spectral clustering model (not the same as > > general spectral graph meethods) and we have an OK, but not widely used > > large-scale eigenvector decomposer. > > > > I am in the process of implementing a least squares linear regression > >> algorithm, LSQR. In my quick review of Mahout, and I make no claims of > >> digging into the code at this point, there appears to be extensive work > in > >> the area of classifiers, discrete outputs, but not regression, real > output. > >> I have an interest in building up a library of regression techniques > (real > >> inputs, real outputs). > > > > > > As far as Mahout is concerned, scalability is the key question. 
> > For many regression problems, especially those with sparse inputs,
> > gradient descent is very effective, and the current SGD for logistic
> > regression could probably be leveraged. My guess is that for
> > non-gradient-descent methods, the SVD decompositions would be a better
> > starting point.
> >
> > I believe that Vowpal Wabbit can be used for linear regression, which
> > probably implies that Mahout's SGD solver could be as well with small
> > changes to the gradient computation.
> >
> >> I am also interested in the implementation of the numerical linear
> >> algebra routines, as these algorithms are at the crux of most
> >> regression problems.
> >
> > We would love more help with this, especially in distributed cases. Raw
> > numerical speed is rarely the bottleneck for Mahout code because
> > large-scale systems are typically I/O and network bound. That said, much
> > of our matrix code is still as we inherited it, and while the quality is
> > surprisingly high for code without unit tests, I know from direct
> > experience that there are problems. I am testing and correcting things as
> > I need them, but you would likely have a broader reach and thus might
> > have substantially more impact than I have had on the testing side.
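Coming back to the one-pass question quoted above: both products can be accumulated in a single sweep over the rows of A, at least when x and y fit in memory. A toy NumPy sketch of that accumulation (my own illustration; the function name and the (index, row) streaming interface are made up for this example):

```python
import numpy as np

def one_pass_products(rows, x, y):
    """Accumulate u = A x and v = A' y in a single sweep over A's rows.

    rows: iterable of (i, a_i) pairs, e.g. streamed from disk;
    x (length n) and y (length m) are assumed to fit in memory.
    """
    u = np.zeros(y.shape[0])   # will hold A x
    v = np.zeros(x.shape[0])   # will hold A' y
    for i, a_i in rows:
        u[i] = a_i @ x         # i-th component of A x
        v += y[i] * a_i        # A' y is the y-weighted sum of the rows
    return u, v
```

For the case where x does not fit in memory, I would guess the same accumulation would have to be blocked over column ranges as well.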

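And to make the sketch-and-solve idea from the top of this message concrete, here is a rough in-memory NumPy illustration (the function name, the Gaussian test matrix, and the oversampling parameter are my choices; a real implementation would stream A from disk for the two passes):

```python
import numpy as np

def sketched_lstsq(A, b, k=100, oversample=10):
    """Approximately minimize ||A x - b||_2 via a randomized range basis."""
    m, n = A.shape
    # Pass 1 over A: Y = A R for a tall, skinny Gaussian test matrix R.
    R = np.random.randn(n, k + oversample)
    Y = A @ R
    # Orthonormal basis Q for the range of Y, so that Q Q' A ~ A.
    Q, _ = np.linalg.qr(Y)
    # Pass 2 over A: the small (k + oversample) x n matrix B = Q' A.
    B = Q.T @ A
    # Minimize ||Q' A x - Q' b||_2, never forming Q Q' A explicitly.
    x, *_ = np.linalg.lstsq(B, Q.T @ b, rcond=None)
    return x
```

If k plus the oversampling covers the numerical rank of A, this recovers the ordinary least-squares solution; below that, it solves the rank-limited approximation.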