On Wed, Nov 2, 2011 at 6:04 PM, Olivier Grisel <[email protected]> wrote: > 2011/11/2 Radim Rehurek <[email protected]>: >> If you decide to implement the randomized PCA, I can offer some observations: >> >> 1. oversampling does little, accuracy comes mostly from the extra power >> iteration steps >> 2. no power iterations result in miserable accuracy >> 3. extra power iteration steps quickly lead to numerical overflows; but QR >> is pretty fast, so in gensim, I orthonormalize the intermediate matrices H >> after each power iteration step. That's exactly the same method that remark >> 3.3 refers to. > > Interesting. The current implementation in scikit-learn (which is > neither streamed nor parallel) does quite a bit of oversampling (if k > components are rrequired, 2 * k random vectors are used) and uses 3 > power iterations by default but does not do qr inside the power > iteration steps: > > https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/pca.py#L346 > https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/extmath.py#L126
I played with this a bit a while back. Trivial patch attached. How much the power iteration helps apparently depends on how quickly the singular spectrum decays. The matrices I was playing with had slowly decaying spectra; I found q=10 got acceptable results (judging by whether the singular values matched the non-randomized results). I didn't find that the orthonormalization step helped much for my case, and the QR did slow things down significantly, so I didn't pursue this patch further. Perhaps orthonormalizing every few iterations would be a good balance. In contrast to the comments in fast_svd, I did not find (k+p) > rank(M) to be necessary. The Halko paper suggests the same thing iirc. Also, I found the randomized range finder (algorithm 4.3) to be more generally useful, as Halko et al suggests, so I pulled it into a separate function. Patch also attached. Let me know if you want either of these as a pull request, though both are trivial. -Ken
randomizedsvd_orthonormalize_power_iterations.patch
Description: Binary data
fast_svd_factor_out_randomized_range_finder.patch
Description: Binary data
------------------------------------------------------------------------------ RSA(R) Conference 2012 Save $700 by Nov 18 Register now http://p.sf.net/sfu/rsa-sfdev2dev1
_______________________________________________ Scikit-learn-general mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
