Note also that all these related beasts come in pairs (in-core input <-> distributed input):
ssvd - dssvd spca - dspca On Fri, Mar 27, 2015 at 3:45 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote: > But MR version of SSVD is more stable because of the QR differences. > > On Fri, Mar 27, 2015 at 3:44 PM, Dmitriy Lyubimov <dlie...@gmail.com> > wrote: > >> Yes. Except it doesn't follow same parallel reordered Givens QR but uses >> Cholesky QR (which we call "thin QR") as an easy-to-implement shortcut. But >> this page makes no mention of QR specifics i think >> >> On Fri, Mar 27, 2015 at 12:57 PM, Andrew Palumbo <ap....@outlook.com> >> wrote: >> >>> math-scala dssvd follows the same algorithm as MR ssvd correct? Looking >>> at the code against the algorithm outlined at the bottom of >>> http://mahout.apache.org/users/dim-reduction/ssvd.html it seems the >>> same, but I wanted to make I'm not missing anything before I put the >>> following doc up (with the algorithm taken from the MR ssvd page): >>> >>> # Distributed Stochastic Singular Value Decomposition >>> >>> >>> ## Intro >>> >>> Mahout has a distributed implementation of Stochastic Singular Value >>> Decomposition [1]. >>> >>> ## Modified SSVD Algorithm >>> >>> Given an `\(m\times n\)` >>> matrix `\(\mathbf{A}\)`, a target rank `\(k\in\mathbb{N}_{1}\)` >>> , an oversampling parameter `\(p\in\mathbb{N}_{1}\)`, >>> and the number of additional power iterations `\(q\in\mathbb{N}_{0}\)`, >>> this procedure computes an `\(m\times\left(k+p\right)\)` >>> SVD `\(\mathbf{A\approx U}\boldsymbol{\Sigma}\mathbf{V}^{\top}\)`: >>> >>> 1. Create seed for random `\(n\times\left(k+p\right)\)` >>> matrix `\(\boldsymbol{\Omega}\)`. The seed defines matrix >>> `\(\mathbf{\Omega}\)` >>> using Gaussian unit vectors per one of suggestions in [Halko, >>> Martinsson, Tropp]. >>> >>> 2. `\(\mathbf{Y=A\boldsymbol{\Omega}},\,\mathbf{Y}\in\ >>> mathbb{R}^{m\times\left(k+p\right)}\)` >>> >>> 3. Column-orthonormalize `\(\mathbf{Y}\rightarrow\mathbf{Q}\)` >>> by computing thin decomposition `\(\mathbf{Y}=\mathbf{Q}\mathbf{R}\)`. >>> Also, `\(\mathbf{Q}\in\mathbb{R}^{m\times\left(k+p\right)},\,\ >>> mathbf{R}\in\mathbb{R}^{\left(k+p\right)\times\left(k+p\right)}\)`; >>> denoted as `\(\mathbf{Q}=\mbox{qr}\left(\mathbf{Y}\right).\mathbf{Q}\)` >>> >>> 4. `\(\mathbf{B}_{0}=\mathbf{Q}^{\top}\mathbf{A}:\,\,\mathbf{B} >>> \in\mathbb{R}^{\left(k+p\right)\times n}\)`. >>> >>> 5. If `\(q>0\)` >>> repeat: for `\(i=1..q\)`: >>> `\(\mathbf{B}_{i}^{\top}=\mathbf{A}^{\top}\mbox{qr}\ >>> left(\mathbf{A}\mathbf{B}_{i-1}^{\top}\right).\mathbf{Q}\)` >>> (power iterations step). >>> >>> 6. Compute Eigensolution of a small Hermitian >>> `\(\mathbf{B}_{q}\mathbf{B}_{q}^{\top}=\mathbf{\hat{U}}\ >>> boldsymbol{\Lambda}\mathbf{\hat{U}}^{\top}\)`, >>> `\(\mathbf{B}_{q}\mathbf{B}_{q}^{\top}\in\mathbb{R}^{\left( >>> k+p\right)\times\left(k+p\right)}\)`. >>> >>> 7. Singular values `\(\mathbf{\boldsymbol{\Sigma} >>> }=\boldsymbol{\Lambda}^{0.5}\)`, >>> or, in other words, `\(s_{i}=\sqrt{\sigma_{i}}\)`. >>> >>> 8. If needed, compute `\(\mathbf{U}=\mathbf{Q}\hat{\mathbf{U}}\)`. >>> >>> 9. If needed, compute `\(\mathbf{V}=\mathbf{B}_{q}^{ >>> \top}\hat{\mathbf{U}}\boldsymbol{\Sigma}^{-1}\)`. >>> Another way is `\(\mathbf{V}=\mathbf{A}^{\top}\mathbf{U}\boldsymbol{\ >>> Sigma}^{-1}\)`. >>> >>> >>> >>> >>> ## Implementation >>> >>> Mahout `dssvd(...)` is implemented in the mahout `math-scala` algebraic >>> optimizer which translates Mahout's R-like linear algebra operators into a >>> physical plan for both Spark and H2O distributed engines. >>> >>> def dssvd[K: ClassTag](drmA: DrmLike[K], k: Int, p: Int = 15, q: Int >>> = 0): >>> (DrmLike[K], DrmLike[Int], Vector) = { >>> >>> val drmAcp = drmA.checkpoint() >>> >>> val m = drmAcp.nrow >>> val n = drmAcp.ncol >>> assert(k <= (m min n), "k cannot be greater than smaller of m, >>> n.") >>> val pfxed = safeToNonNegInt((m min n) - k min p) >>> >>> // Actual decomposition rank >>> val r = k + pfxed >>> >>> // We represent Omega by its seed. >>> val omegaSeed = RandomUtils.getRandom().nextInt() >>> >>> // Compute Y = A*Omega. >>> var drmY = drmAcp.mapBlock(ncol = r) { >>> case (keys, blockA) => >>> val blockY = blockA %*% Matrices.symmetricUniformView(n, >>> r, omegaSeed) >>> keys -> blockY >>> } >>> >>> var drmQ = dqrThin(drmY.checkpoint())._1 >>> >>> // Checkpoint Q if last iteration >>> if (q == 0) drmQ = drmQ.checkpoint() >>> >>> var drmBt = drmAcp.t %*% drmQ >>> >>> // Checkpoint B' if last iteration >>> if (q == 0) drmBt = drmBt.checkpoint() >>> >>> for (i <- 0 until q) { >>> drmY = drmAcp %*% drmBt >>> drmQ = dqrThin(drmY.checkpoint())._1 >>> >>> // Checkpoint Q if last iteration >>> if (i == q - 1) drmQ = drmQ.checkpoint() >>> >>> drmBt = drmAcp.t %*% drmQ >>> >>> // Checkpoint B' if last iteration >>> if (i == q - 1) drmBt = drmBt.checkpoint() >>> } >>> >>> val (inCoreUHat, d) = eigen(drmBt.t %*% drmBt) >>> val s = d.sqrt >>> >>> // Since neither drmU nor drmV are actually computed until >>> actually used >>> // we don't need the flags instructing compute (or not compute) >>> either of the U,V outputs >>> val drmU = drmQ %*% inCoreUHat >>> val drmV = drmBt %*% (inCoreUHat %*%: diagv(1 /: s)) >>> >>> (drmU(::, 0 until k), drmV(::, 0 until k), s(0 until k)) >>> } >>> >>> Note: As a side effect of checkpointing, U and V values are returned as >>> logical operators (i.e. they are neither checkpointed nor computed). >>> Therefore there is no physical work actually done to compute >>> `\(\mathbf{U}\)` or `\(\mathbf{V}\)` until they are used in a subsequent >>> expression. >>> >>> >>> ## Usage >>> >>> The scala `dssvd(...)` method can easily be called in any Spark or H2O >>> application built with the `math-scala` library and the corresponding >>> `Spark` or `H2O` engine module as follows: >>> >>> import org.apache.mahout.math._ >>> import org.decompsitions._ >>> >>> >>> val(drmU, drmV, s) = dssvd(drma, k = 40, q = 1) >>> >>> >>> ## References >>> >>> [1]: [Mahout Scala and Mahout Spark Bindings for Linear Algebra >>> Subroutines](http://mahout.apache.org/users/sparkbindings/ >>> ScalaSparkBindings.pdf) >>> >>> [2]: [Halko, Martinsson, Tropp](http://arxiv.org/abs/0909.4061) >>> >>> [3]: [Mahout Spark and Scala Bindings](http://mahout.apache.org/users/ >>> sparkbindings/home.html) >>> >>> [4]: [Randomized methods for computing low-rank >>> approximations of matrices](http://amath.colorado.edu/faculty/martinss/ >>> Pubs/2012_halko_dissertation.pdf) >>> >>> >>> >>> >> >