[ 
https://issues.apache.org/jira/browse/MAHOUT-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965608#action_12965608
 ] 

Dmitriy Lyubimov commented on MAHOUT-376:
-----------------------------------------

Yes, it is 100% streaming in terms of A and Y rows. The assumption is that we 
are OK to load one A row into memory at a time, and we optimize for tall 
matrices (such as a billion by a million). Even if it is dense, one such row 
vector would take 8 MB of memory at a time (a million doubles at 8 bytes 
each). Sparse sequential vectors should be OK too, but that will probably 
require a little tweak in the Y computation to scan the row once sequentially 
instead of k+p times, as I think it is done now under the assumption that the 
vector supports random access.
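
A minimal sketch of the single-scan idea for one sparse row, in plain Python 
rather than Mahout's Java. It assumes the row arrives as (index, value) pairs 
in index order and that Omega is regenerated from a seed instead of being 
stored. The names omega_row, y_row_single_pass and y_row_per_column are 
hypothetical and are not part of the patch.

```python
import random

def omega_row(i, kp, seed=0):
    """Row i of a hypothetical Gaussian projection matrix Omega (n x kp).

    Regenerated on demand from the seed, so Omega is never held in memory.
    """
    rng = random.Random(seed * 1_000_003 + i)
    return [rng.gauss(0.0, 1.0) for _ in range(kp)]

def y_row_single_pass(sparse_row, kp, seed=0):
    """y = a * Omega, visiting each nonzero of the row exactly once."""
    y = [0.0] * kp
    for i, a_i in sparse_row:            # one sequential scan of the row
        w = omega_row(i, kp, seed)
        for j in range(kp):
            y[j] += a_i * w[j]
    return y

def y_row_per_column(sparse_row, kp, seed=0):
    """Same result, but rescans the row once per output column (kp scans)."""
    return [
        sum(a_i * omega_row(i, kp, seed)[j] for i, a_i in sparse_row)
        for j in range(kp)
    ]

row = [(2, 1.5), (7, -0.5), (40, 2.0)]   # (column index, value) pairs
print(y_row_single_pass(row, kp=5, seed=1))
```

Both functions produce the same Y row up to rounding. The first one only 
needs a forward iterator over the nonzeros, which is what a sequential sparse 
vector provides cheaply.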

For memory, the concern is the random-access Q blocks, which can be no smaller 
than k+p by k+p (that is, for k+p=500, a block takes 2 MB). But that is all 
as far as memory is concerned. (Well, actually 2 times that, plus there's a Y 
lookahead buffer to make sure we can safely form the next block, plus there's 
a packed R. So for k+p=500 the minimum memory requirement looks to be roughly 
in the area of 7-8 MB, which is well below anything of concern.)
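
The arithmetic behind these figures can be sketched as follows. This is only 
a rough estimate under stated assumptions: 8-byte doubles, a packed R that 
stores just the upper triangle, and a lookahead buffer whose size is not 
given in the comment and is therefore left as a parameter. The function name 
is hypothetical.

```python
def qr_memory_bytes(kp, lookahead_bytes=0, bytes_per_double=8):
    """Rough minimum working memory for the blocked QR step.

    Assumes two (k+p) x (k+p) Q blocks, one packed upper-triangular R,
    and a Y lookahead buffer of caller-supplied size (not specified here).
    """
    q_block = kp * kp * bytes_per_double
    packed_r = kp * (kp + 1) // 2 * bytes_per_double
    return 2 * q_block + packed_r + lookahead_bytes

kp = 500
print(kp * kp * 8)          # one Q block: 2000000 bytes, i.e. 2 MB
print(qr_memory_bytes(kp))  # 5002000 bytes before the lookahead buffer
```

Under these assumptions the Q blocks and R account for about 5 MB, so the 
quoted 7-8 MB would leave roughly 2-3 MB for the Y lookahead buffer.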

CPU may be more of a problem, but I am actually not sure whether a Givens 
series would produce more crunching than e.g. Householder. Givens is 
certainly as numerically stable as Householder and better than Gram-Schmidt. 
In my tests on a 100k-row tall matrix the orthonormality residuals seem to 
hold at about 10e-13, and surprisingly I did not notice any degradation at 
all compared to smaller sizes. (Actually, I happened to read about LAPACK 
methods that prefer Givens for the possibility of re-ordering and thus easier 
parallelization.)
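
For readers who want to try the orthonormality check themselves, here is a 
small sketch of a Givens-based thin QR and the residual max|Q'Q - I|. It is a 
textbook, in-memory version in plain Python and not the streaming blocked 
variant discussed here. The names givens_qr and orthonormality_residual are 
hypothetical.

```python
import math
import random

def givens_qr(a):
    """Thin QR of an m x n matrix (m >= n) using Givens rotations.

    Returns (q, r) with q of size m x n and r upper triangular, n x n.
    """
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]
    # qt accumulates Q^T, starting from the m x m identity.
    qt = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Zero column j from the bottom up by rotating adjacent rows.
        for i in range(m - 1, j, -1):
            x, y = r[i - 1][j], r[i][j]
            h = math.hypot(x, y)
            if h == 0.0:
                continue
            c, s = x / h, y / h
            for mat in (r, qt):
                top, bot = mat[i - 1], mat[i]
                mat[i - 1] = [c * t + s * b for t, b in zip(top, bot)]
                mat[i] = [-s * t + c * b for t, b in zip(top, bot)]
    q = [[qt[k][i] for k in range(n)] for i in range(m)]  # first n columns of Q
    return q, [row[:n] for row in r[:n]]

def orthonormality_residual(q):
    """Largest absolute entry of Q^T Q - I."""
    n = len(q[0])
    worst = 0.0
    for a in range(n):
        for b in range(n):
            dot = sum(row[a] * row[b] for row in q)
            worst = max(worst, abs(dot - (1.0 if a == b else 0.0)))
    return worst

rng = random.Random(0)
a = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(30)]
q, r = givens_qr(a)
print(orthonormality_residual(q))
```

Rotations that touch disjoint row pairs are independent of each other, which 
is the re-ordering freedom mentioned above.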

Anyway, speaking of numerical stability, whatever degradation occurs, I think 
it would be dwarfed by the stochastic inaccuracy, which grows quite 
significantly in my low-rank tests. Perhaps for k+p=500 it should degrade 
much less than for 20-30.

> Implement Map-reduce version of stochastic SVD
> ----------------------------------------------
>
>                 Key: MAHOUT-376
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-376
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>             Fix For: 0.5
>
>         Attachments: MAHOUT-376.patch, Modified stochastic svd algorithm for 
> mapreduce.pdf, QR decomposition for Map.pdf, QR decomposition for Map.pdf, QR 
> decomposition for Map.pdf, sd-bib.bib, sd.pdf, sd.pdf, sd.pdf, sd.pdf, 
> sd.tex, sd.tex, sd.tex, sd.tex, SSVD working notes.pdf, SSVD working 
> notes.pdf, SSVD working notes.pdf, ssvd-CDH3-or-0.21.patch.gz, 
> ssvd-m1.patch.gz, ssvd-m2.patch.gz, ssvd-m3.patch.gz, Stochastic SVD using 
> eigensolver trick.pdf
>
>
> See attached pdf for outline of proposed method.
> All comments are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
