On Wed, Nov 2, 2011 at 6:04 PM, Olivier Grisel <[email protected]> wrote:
> 2011/11/2 Radim Rehurek <[email protected]>:
>> If you decide to implement the randomized PCA, I can offer some observations:
>>
>> 1. oversampling does little, accuracy comes mostly from the extra power 
>> iteration steps
>> 2. no power iterations result in miserable accuracy
>> 3. extra power iteration steps quickly lead to numerical overflows; but QR 
>> is pretty fast, so in gensim, I orthonormalize the intermediate matrices H 
>> after each power iteration step. That's exactly the same method that remark 
>> 3.3 refers to.
>
> Interesting. The current implementation in scikit-learn (which is
> neither streamed nor parallel) does quite a bit of oversampling (if k
> components are rrequired,  2 * k random vectors are used) and uses 3
> power iterations by default but does not do qr inside the power
> iteration steps:
>
> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/pca.py#L346
> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/extmath.py#L126

I played with this a bit a while back. Trivial patch attached.

How much the power iteration helps apparently depends on how quickly
the singular spectrum decays. The matrices I was playing with had
slowly decaying spectra; I found q=10 got acceptable results (judging
by whether the singular values matched the non-randomized results). I
didn't find that the orthonormalization step helped much for my case,
and the QR did slow things down significantly, so I didn't pursue this
patch further. Perhaps orthonormalizing every few iterations would be
a good balance.

In contrast to the comments in fast_svd, I did not find (k+p) >
rank(M) to be necessary. The Halko paper suggests the same thing iirc.

Also, I found the randomized range finder (algorithm 4.3) to be more
generally useful, as Halko et al suggests, so I pulled it into a
separate function. Patch also attached. Let me know if you want either
of these as a pull request, though both are trivial.

-Ken

Attachment: randomizedsvd_orthonormalize_power_iterations.patch
Description: Binary data

Attachment: fast_svd_factor_out_randomized_range_finder.patch
Description: Binary data

------------------------------------------------------------------------------
RSA(R) Conference 2012
Save $700 by Nov 18
Register now
http://p.sf.net/sfu/rsa-sfdev2dev1
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Reply via email to