I did put a reference to Halko's dissertation in the footnotes, but didn't explicitly reference it in the text (I haven't looked at it closely, so I wasn't sure where to reference it). The parallelization strategy, right?
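While putting the page together I also sanity-checked the algorithm outline numerically. This is just a rough NumPy sketch of the nine steps against a made-up low-rank test matrix (dimensions and seed invented for illustration; not the R simulation, which we could still add for the site):

```python
import numpy as np

def ssvd_sim(A, k, p=15, q=0, seed=1234):
    # Sketch of the modified SSVD outline; defaults mirror dssvd(...),
    # everything else here is made up for illustration.
    m, n = A.shape
    r = k + p
    rng = np.random.default_rng(seed)

    # Steps 1-2: Y = A * Omega, with Omega an n x (k+p) Gaussian matrix
    Omega = rng.standard_normal((n, r))
    Y = A @ Omega

    # Step 3: column-orthonormalize Y via thin QR
    Q, _ = np.linalg.qr(Y)

    # Step 4: B_0 = Q' A
    B = Q.T @ A

    # Step 5: q additional power iterations
    for _ in range(q):
        Q, _ = np.linalg.qr(A @ B.T)
        B = Q.T @ A

    # Step 6: eigensolution of the small (k+p) x (k+p) matrix B B'
    lam, Uhat = np.linalg.eigh(B @ B.T)
    lam, Uhat = lam[::-1], Uhat[:, ::-1]   # eigh sorts ascending; flip

    # Step 7: singular values sigma_i = sqrt(lambda_i)
    s = np.sqrt(np.maximum(lam[:k], 0.0))

    # Steps 8-9: U = Q Uhat, V = B' Uhat Sigma^{-1}
    U = (Q @ Uhat)[:, :k]
    V = (B.T @ Uhat)[:, :k] / s
    return U, V, s

# Compare against the exact SVD on a random low-rank matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 100))
U, V, s = ssvd_sim(A, k=10, q=1)
print(np.allclose(s, np.linalg.svd(A, compute_uv=False)[:10]))
```

Since the test matrix has exact rank 10 and k + p exceeds it, the random projection captures the full range and the recovered singular values match the exact SVD.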
The R code could be good to show for comparison. Will try to update it over the weekend.

Sent from my Verizon Wireless 4G LTE smartphone

-------- Original message --------
From: Dmitriy Lyubimov <[email protected]>
Date: 03/27/2015 8:00 PM (GMT-05:00)
To: [email protected]
Subject: Re: math-scala dssvd docs

and R simulation sources perhaps ...

On Fri, Mar 27, 2015 at 4:57 PM, Dmitriy Lyubimov <[email protected]> wrote:
> Andrew, thanks a lot!
>
> I think an acknowledgement and reference to N. Halko's dissertation from the
> MR page is also worthy of mention on this page as well.
>
> On Fri, Mar 27, 2015 at 4:41 PM, Andrew Palumbo <[email protected]> wrote:
>
>> Ok, I think everything should be good. I'll try to get (d)spca and
>> possibly d-als up. I think it does make sense to put the in-core algos on
>> that page as well. Will do that soon.
>>
>> Yeah, showing that it is almost line for line (a few more, maybe ~1.25
>> lines per step) to follow the algorithms step by step is, I think, good to
>> have up on the site.
>>
>> Please let me know if you see any other changes that need to be made.
>>
>> On 03/27/2015 07:06 PM, Dmitriy Lyubimov wrote:
>>
>>> In fact, the algorithm just executes the outlined formulas. Not always
>>> line for line, but step for step for sure.
>>>
>>> On Fri, Mar 27, 2015 at 4:05 PM, Andrew Palumbo <[email protected]> wrote:
>>>
>>>> Just committed and published it. It's pretty cool to see that there are
>>>> not a whole lot more lines in the implementation than there are steps
>>>> in the algorithm.
>>>>
>>>> On 03/27/2015 06:58 PM, Andrew Palumbo wrote:
>>>>
>>>>> On 03/27/2015 06:46 PM, Dmitriy Lyubimov wrote:
>>>>>
>>>>>> Note also that all these related beasts come in pairs (in-core input
>>>>>> <-> distributed input):
>>>>>>
>>>>>> ssvd - dssvd
>>>>>> spca - dspca
>>>>>
>>>>> Yeah, I've been thinking that I'd give a less detailed description of
>>>>> them (the in-core algos) on an overview page (which would include the
>>>>> basics, operators, etc.). It probably makes sense to reference them
>>>>> here as well. I'd like to get most of the manual easily viewable on
>>>>> different pages.
>>>>>
>>>>>> Yes. Except it doesn't follow the same parallel reordered Givens QR,
>>>>>> but uses Cholesky QR (which we call "thin QR") as an easy-to-implement
>>>>>> shortcut. But this page makes no mention of QR specifics, I think.
>>>>>
>>>>> Right, no QR specifics, just the dqrThin(...) call in the code. I'm
>>>>> going to put the link directly below the Cholesky QR link so that will
>>>>> tie together well.
>>>>>
>>>>>> On Fri, Mar 27, 2015 at 3:45 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>
>>>>>>> But the MR version of SSVD is more stable because of the QR
>>>>>>> differences.
>>>>>>>
>>>>>>> On Fri, Mar 27, 2015 at 3:44 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>
>>>>>>>> Yes. Except it doesn't follow the same parallel reordered Givens QR,
>>>>>>>> but uses Cholesky QR (which we call "thin QR") as an
>>>>>>>> easy-to-implement shortcut. But this page makes no mention of QR
>>>>>>>> specifics, I think.
>>>>>>>>
>>>>>>>> On Fri, Mar 27, 2015 at 12:57 PM, Andrew Palumbo <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> math-scala dssvd follows the same algorithm as MR ssvd, correct?
>>>>>>>>> Looking at the code against the algorithm outlined at the bottom of
>>>>>>>>> http://mahout.apache.org/users/dim-reduction/ssvd.html it seems the
>>>>>>>>> same, but I wanted to make sure I'm not missing anything before I
>>>>>>>>> put the following doc up (with the algorithm taken from the MR ssvd
>>>>>>>>> page):
>>>>>>>>>
>>>>>>>>> # Distributed Stochastic Singular Value Decomposition
>>>>>>>>>
>>>>>>>>> ## Intro
>>>>>>>>>
>>>>>>>>> Mahout has a distributed implementation of Stochastic Singular
>>>>>>>>> Value Decomposition [1].
>>>>>>>>>
>>>>>>>>> ## Modified SSVD Algorithm
>>>>>>>>>
>>>>>>>>> Given an `\(m\times n\)` matrix `\(\mathbf{A}\)`, a target rank
>>>>>>>>> `\(k\in\mathbb{N}_{1}\)`, an oversampling parameter
>>>>>>>>> `\(p\in\mathbb{N}_{1}\)`, and the number of additional power
>>>>>>>>> iterations `\(q\in\mathbb{N}_{0}\)`, this procedure computes an
>>>>>>>>> `\(m\times\left(k+p\right)\)` SVD
>>>>>>>>> `\(\mathbf{A}\approx\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}\)`:
>>>>>>>>>
>>>>>>>>> 1. Create a seed for the random `\(n\times\left(k+p\right)\)`
>>>>>>>>> matrix `\(\boldsymbol{\Omega}\)`. The seed defines matrix
>>>>>>>>> `\(\boldsymbol{\Omega}\)` using Gaussian unit vectors per one of
>>>>>>>>> the suggestions in [Halko, Martinsson, Tropp].
>>>>>>>>>
>>>>>>>>> 2. `\(\mathbf{Y}=\mathbf{A}\boldsymbol{\Omega},\,\mathbf{Y}\in\mathbb{R}^{m\times\left(k+p\right)}\)`.
>>>>>>>>>
>>>>>>>>> 3. Column-orthonormalize `\(\mathbf{Y}\rightarrow\mathbf{Q}\)` by
>>>>>>>>> computing the thin decomposition `\(\mathbf{Y}=\mathbf{Q}\mathbf{R}\)`.
>>>>>>>>> Also, `\(\mathbf{Q}\in\mathbb{R}^{m\times\left(k+p\right)},\,\mathbf{R}\in\mathbb{R}^{\left(k+p\right)\times\left(k+p\right)}\)`;
>>>>>>>>> denoted as `\(\mathbf{Q}=\mbox{qr}\left(\mathbf{Y}\right).\mathbf{Q}\)`.
>>>>>>>>>
>>>>>>>>> 4. `\(\mathbf{B}_{0}=\mathbf{Q}^{\top}\mathbf{A}:\,\,\mathbf{B}\in\mathbb{R}^{\left(k+p\right)\times n}\)`.
>>>>>>>>>
>>>>>>>>> 5. If `\(q>0\)`, repeat for `\(i=1..q\)`:
>>>>>>>>> `\(\mathbf{B}_{i}^{\top}=\mathbf{A}^{\top}\mbox{qr}\left(\mathbf{A}\mathbf{B}_{i-1}^{\top}\right).\mathbf{Q}\)`
>>>>>>>>> (power iteration step).
>>>>>>>>>
>>>>>>>>> 6. Compute the eigensolution of the small Hermitian matrix
>>>>>>>>> `\(\mathbf{B}_{q}\mathbf{B}_{q}^{\top}=\hat{\mathbf{U}}\boldsymbol{\Lambda}\hat{\mathbf{U}}^{\top}\)`,
>>>>>>>>> `\(\mathbf{B}_{q}\mathbf{B}_{q}^{\top}\in\mathbb{R}^{\left(k+p\right)\times\left(k+p\right)}\)`.
>>>>>>>>>
>>>>>>>>> 7. Singular values `\(\boldsymbol{\Sigma}=\boldsymbol{\Lambda}^{0.5}\)`,
>>>>>>>>> or, in other words, `\(\sigma_{i}=\sqrt{\lambda_{i}}\)`.
>>>>>>>>>
>>>>>>>>> 8. If needed, compute `\(\mathbf{U}=\mathbf{Q}\hat{\mathbf{U}}\)`.
>>>>>>>>>
>>>>>>>>> 9. If needed, compute `\(\mathbf{V}=\mathbf{B}_{q}^{\top}\hat{\mathbf{U}}\boldsymbol{\Sigma}^{-1}\)`.
>>>>>>>>> Another way is `\(\mathbf{V}=\mathbf{A}^{\top}\mathbf{U}\boldsymbol{\Sigma}^{-1}\)`.
>>>>>>>>>
>>>>>>>>> ## Implementation
>>>>>>>>>
>>>>>>>>> Mahout `dssvd(...)` is implemented in the Mahout `math-scala`
>>>>>>>>> algebraic optimizer, which translates Mahout's R-like linear
>>>>>>>>> algebra operators into a physical plan for both the Spark and H2O
>>>>>>>>> distributed engines.
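(Aside, since the page itself stays away from QR specifics: the dqrThin(...) call in the implementation is the Cholesky "thin QR" shortcut rather than the MR version's parallel reordered Givens QR. A minimal NumPy illustration of the Cholesky-QR idea, not Mahout's actual implementation:)

```python
import numpy as np

def thin_qr(Y):
    # Cholesky ("thin") QR: factor the small (k+p) x (k+p) Gramian Y'Y
    # and back out Q. Only the Gramian needs to be centralized, which is
    # what makes it an easy distributed shortcut; it is less numerically
    # stable than Givens/Householder QR for ill-conditioned Y.
    L = np.linalg.cholesky(Y.T @ Y)        # Y'Y = L L', L lower-triangular
    R = L.T                                # so Y'Y = R'R, R upper-triangular
    Q = np.linalg.solve(R.T, Y.T).T        # Y = Q R  =>  Q = Y R^{-1}
    return Q, R

# Tall-and-skinny Y, as produced by Y = A * Omega in the algorithm.
Y = np.random.default_rng(42).standard_normal((1000, 25))
Q, R = thin_qr(Y)
print(np.allclose(Q.T @ Q, np.eye(25)), np.allclose(Q @ R, Y))
```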
>>>>>>>>>
>>>>>>>>>     def dssvd[K: ClassTag](drmA: DrmLike[K], k: Int, p: Int = 15, q: Int = 0):
>>>>>>>>>         (DrmLike[K], DrmLike[Int], Vector) = {
>>>>>>>>>
>>>>>>>>>       val drmAcp = drmA.checkpoint()
>>>>>>>>>
>>>>>>>>>       val m = drmAcp.nrow
>>>>>>>>>       val n = drmAcp.ncol
>>>>>>>>>       assert(k <= (m min n), "k cannot be greater than smaller of m, n.")
>>>>>>>>>       val pfxed = safeToNonNegInt((m min n) - k min p)
>>>>>>>>>
>>>>>>>>>       // Actual decomposition rank
>>>>>>>>>       val r = k + pfxed
>>>>>>>>>
>>>>>>>>>       // We represent Omega by its seed.
>>>>>>>>>       val omegaSeed = RandomUtils.getRandom().nextInt()
>>>>>>>>>
>>>>>>>>>       // Compute Y = A * Omega.
>>>>>>>>>       var drmY = drmAcp.mapBlock(ncol = r) {
>>>>>>>>>         case (keys, blockA) =>
>>>>>>>>>           val blockY = blockA %*% Matrices.symmetricUniformView(n, r, omegaSeed)
>>>>>>>>>           keys -> blockY
>>>>>>>>>       }
>>>>>>>>>
>>>>>>>>>       var drmQ = dqrThin(drmY.checkpoint())._1
>>>>>>>>>
>>>>>>>>>       // Checkpoint Q if last iteration
>>>>>>>>>       if (q == 0) drmQ = drmQ.checkpoint()
>>>>>>>>>
>>>>>>>>>       var drmBt = drmAcp.t %*% drmQ
>>>>>>>>>
>>>>>>>>>       // Checkpoint B' if last iteration
>>>>>>>>>       if (q == 0) drmBt = drmBt.checkpoint()
>>>>>>>>>
>>>>>>>>>       for (i <- 0 until q) {
>>>>>>>>>         drmY = drmAcp %*% drmBt
>>>>>>>>>         drmQ = dqrThin(drmY.checkpoint())._1
>>>>>>>>>
>>>>>>>>>         // Checkpoint Q if last iteration
>>>>>>>>>         if (i == q - 1) drmQ = drmQ.checkpoint()
>>>>>>>>>
>>>>>>>>>         drmBt = drmAcp.t %*% drmQ
>>>>>>>>>
>>>>>>>>>         // Checkpoint B' if last iteration
>>>>>>>>>         if (i == q - 1) drmBt = drmBt.checkpoint()
>>>>>>>>>       }
>>>>>>>>>
>>>>>>>>>       val (inCoreUHat, d) = eigen(drmBt.t %*% drmBt)
>>>>>>>>>       val s = d.sqrt
>>>>>>>>>
>>>>>>>>>       // Since neither drmU nor drmV is actually computed until used,
>>>>>>>>>       // we don't need flags instructing whether or not to compute
>>>>>>>>>       // either of the U, V outputs.
>>>>>>>>>       val drmU = drmQ %*% inCoreUHat
>>>>>>>>>       val drmV = drmBt %*% (inCoreUHat %*% diagv(1 /: s))
>>>>>>>>>
>>>>>>>>>       (drmU(::, 0 until k), drmV(::, 0 until k), s(0 until k))
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>> Note: as a side effect of checkpointing, the U and V values are
>>>>>>>>> returned as logical operators (i.e. they are neither checkpointed
>>>>>>>>> nor computed). Therefore no physical work is actually done to
>>>>>>>>> compute `\(\mathbf{U}\)` or `\(\mathbf{V}\)` until they are used in
>>>>>>>>> a subsequent expression.
>>>>>>>>>
>>>>>>>>> ## Usage
>>>>>>>>>
>>>>>>>>> The Scala `dssvd(...)` method can easily be called in any Spark or
>>>>>>>>> H2O application built with the `math-scala` library and the
>>>>>>>>> corresponding `Spark` or `H2O` engine module as follows:
>>>>>>>>>
>>>>>>>>>     import org.apache.mahout.math._
>>>>>>>>>     import org.apache.mahout.math.decompositions._
>>>>>>>>>
>>>>>>>>>     val (drmU, drmV, s) = dssvd(drmA, k = 40, q = 1)
>>>>>>>>>
>>>>>>>>> ## References
>>>>>>>>>
>>>>>>>>> [1]: [Mahout Scala and Mahout Spark Bindings for Linear Algebra
>>>>>>>>> Subroutines](http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf)
>>>>>>>>>
>>>>>>>>> [2]: [Halko, Martinsson, Tropp](http://arxiv.org/abs/0909.4061)
>>>>>>>>>
>>>>>>>>> [3]: [Mahout Spark and Scala
>>>>>>>>> Bindings](http://mahout.apache.org/users/sparkbindings/home.html)
>>>>>>>>>
>>>>>>>>> [4]: [Randomized methods for computing low-rank approximations of
>>>>>>>>> matrices](http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf)
