[ 
https://issues.apache.org/jira/browse/MAHOUT-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966699#action_12966699
 ] 

Dmitriy Lyubimov commented on MAHOUT-376:
-----------------------------------------

Actually I think the biggest issue here is not scaling for memory but what I call 
'supersplits'. 

If we have a row-wise matrix format, and by virtue of the SSVD algorithm we have 
to consider blocks no smaller than 500x500, then even with the block size setting 
of the Terasort 40 TB, 2000-node cluster (128 MB) we are constrained to dense 
matrices roughly 30-50k columns wide (and even then, the expectation is that half 
of each mapper's data would have to be downloaded from some other node). That 
kind of defeats one of the main pitches of MR, code-data collocation. So in the 
case of dense matrices 1 mln columns wide, on a big cluster, we'd be downloading 
something like 99% of the data from somewhere else. But we already paid in I/O 
bandwidth when we created the input matrix file in the first place, so why should 
it be a giant, inefficient model of a supercomputer in a cloud? A custom batching 
approach would be way more efficient.

I've dubbed the problem above the 'supersplits problem' in my head. 
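To put rough numbers on the constraint above, here is a back-of-envelope sketch. All figures are the illustrative ones from this comment (128 MB splits, 500-row blocks, dense doubles), not measurements:

```python
# Back-of-envelope sketch of the 'supersplits' constraint.
# All figures are illustrative assumptions from the comment, not measurements.

BLOCK_SIZE_BYTES = 128 * 1024 * 1024  # HDFS block size, as in the Terasort setup
MIN_ROWS = 500                        # minimum rows per SSVD processing block
BYTES_PER_ELEMENT = 8                 # dense double-precision entries

# Widest dense matrix whose 500 rows still fit inside one local split.
max_local_width = BLOCK_SIZE_BYTES // (MIN_ROWS * BYTES_PER_ELEMENT)
print(max_local_width)  # on the order of ~33k columns

# For a hypothetical 1 mln column dense matrix, the fraction of each
# 500-row block that would have to come from other nodes:
remote_fraction = 1 - max_local_width / 1_000_000
print(f"{remote_fraction:.0%}")
```

The ~33k figure lands inside the 30-50k range quoted above, and the remote fraction comes out near the "99% downloaded from somewhere else" estimate.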

-------
I believe I am largely done with this Mahout issue as far as method and code 
are concerned. We, of course, need to test it on something sizable. Benchmarks 
are thus a pending matter; I expect they will be net I/O-bound, but reasonably 
scaled for memory per the discussion (less the issue of deficient prebuffering 
in VectorWritable on wide matrices), and additional remedies are clear if 
needed. There might be some minor tweaks required for outputting U and V. Maybe 
add one or two more map-only passes over the Q data to get additional scale for 
m. Maybe backport to Hadoop 0.20 if Mahout decides to release this code.

The next problem I am going to ponder as a side project is devising an SSVD MR 
method on block-wise serialized matrices. I think I can devise an SSVD method 
that efficiently addresses the "supersplits" problem (with more shuffle-and-sort 
I/O, though, but it would be much more MR-like). Since Mahout supports neither 
block-wise formatted matrices nor, respectively, any BLAS ops for such inputs, 
an alternative approach to matrix (de-)serialization would have to be created. A 
conceivable scenario would be to reprocess Mahout's row-wise matrices into such 
SSVD block-wise input at additional expense, but single-purposed data may well 
just be vectorized block-wise directly.
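As a local illustration of that repacking idea, here is a sketch of re-chunking a row-wise matrix into keyed square blocks. This is purely illustrative: the function name, block size, and shapes are made up, and a real MR job would emit these keyed blocks into something like a SequenceFile rather than a Python generator:

```python
import numpy as np

def rows_to_blocks(matrix, block_size=500):
    """Re-chunk a row-wise matrix into square-ish blocks (hypothetical
    local stand-in for a block-wise repacking pass). Yields
    ((block_row, block_col), block) pairs, i.e. the kind of keyed
    records a block-wise serialized format might hold."""
    m, n = matrix.shape
    for bi in range(0, m, block_size):
        for bj in range(0, n, block_size):
            yield ((bi // block_size, bj // block_size),
                   matrix[bi:bi + block_size, bj:bj + block_size])

# Usage: a 1200x1300 row-wise matrix becomes a 3x3 grid of blocks,
# with ragged blocks on the bottom and right edges.
A = np.arange(1200 * 1300, dtype=float).reshape(1200, 1300)
blocks = dict(rows_to_blocks(A))
print(len(blocks))  # 9
```

The extra expense the comment mentions shows up here as one full pass over the row-wise data plus a shuffle keyed on the block coordinates.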

> Implement Map-reduce version of stochastic SVD
> ----------------------------------------------
>
>                 Key: MAHOUT-376
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-376
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>             Fix For: 0.5
>
>         Attachments: MAHOUT-376.patch, Modified stochastic svd algorithm for 
> mapreduce.pdf, QR decomposition for Map.pdf, QR decomposition for Map.pdf, QR 
> decomposition for Map.pdf, sd-bib.bib, sd.pdf, sd.pdf, sd.pdf, sd.pdf, 
> sd.tex, sd.tex, sd.tex, sd.tex, SSVD working notes.pdf, SSVD working 
> notes.pdf, SSVD working notes.pdf, ssvd-CDH3-or-0.21.patch.gz, 
> ssvd-CDH3-or-0.21.patch.gz, ssvd-CDH3-or-0.21.patch.gz, ssvd-m1.patch.gz, 
> ssvd-m2.patch.gz, ssvd-m3.patch.gz, Stochastic SVD using eigensolver trick.pdf
>
>
> See attached pdf for outline of proposed method.
> All comments are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
