[
https://issues.apache.org/jira/browse/MAHOUT-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966699#action_12966699
]
Dmitriy Lyubimov commented on MAHOUT-376:
-----------------------------------------
Actually, I think the biggest issue here is not scaling for memory but what I call
'supersplits'.
If we have a row-wise matrix format, and by virtue of the SSVD algorithm we have
to consider blocks of no less than 500x500, then even with the HDFS block size
setting used for the 40 TB terasort on a 2000-node cluster (128 MB) we are
constrained to dense matrices roughly 30-50k wide (and even then the expectation
is that half of each mapper's data would have to be downloaded from some other
node). That kind of defeats one of the main pitches of MR, code-data
collocation. So for dense matrices 1 million wide, on a big cluster we'd be
downloading something like 99% of the data from elsewhere. But we already paid
in I/O bandwidth when we created the input matrix file in the first place, so
why should this turn into a giant, inefficient model of a supercomputer in a
cloud? A custom batching approach would be far more efficient.
In my head I've dubbed the problem above the 'supersplits' problem.
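As a back-of-envelope check of the numbers above, here is a minimal sketch; the 8-bytes-per-element dense encoding, the 500-row minimum block height, and the 128 MB split size are the assumptions stated earlier, nothing more:
{code:java}
public class SupersplitMath {
  public static void main(String[] args) {
    final long splitBytes = 128L * 1024 * 1024; // 128 MB HDFS block/split
    final int minBlockRows = 500;               // minimum rows per SSVD block
    final int bytesPerElement = 8;              // dense double entries

    // Widest dense row-wise matrix whose 500-row block still fits one split:
    long maxWidth = splitBytes / ((long) minBlockRows * bytesPerElement);
    System.out.println("max dense width per split: " + maxWidth); // ~33k

    // At width n = 1,000,000, a single 500-row block spans this many splits,
    // so nearly all of its data is read from remote nodes:
    long n = 1_000_000;
    long splitsPerBlock =
        (n * bytesPerElement * minBlockRows + splitBytes - 1) / splitBytes;
    System.out.println("splits per 500-row block at n=1M: " + splitsPerBlock); // ~30
  }
}
{code}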
-------
I believe I am largely done with this Mahout issue as far as method and code
are concerned. We, of course, need to test it on something sizable, so
benchmarks are still a pending matter. I expect them to be net I/O-bound, but
memory use should scale reasonably per the discussion above (apart from the
issue of deficient prebuffering in VectorWritable on wide matrices), and
additional remedies are clear if needed. Some minor tweaks may be required for
outputting U and V. Maybe add one or two more map-only passes over the Q data
to get additional scale for m. Maybe backport to Hadoop 0.20 if Mahout decides
to release this code.
The next problem I am going to ponder, as a side project, is devising an SSVD
MR method on block-wise serialized matrices. I think I can devise an SSVD
method that efficiently addresses the 'supersplits' problem (with more
shuffle-and-sort I/O, though, but it would be much more MR-like). Since, as far
as I know, Mahout supports neither block-wise formatted matrices nor,
correspondingly, any BLAS ops on such inputs, an alternative approach to matrix
(de-)serialization would have to be created. A conceivable scenario would be to
reprocess Mahout's row-wise matrices into such block-wise SSVD input at
additional expense, but single-purpose data may well just be vectorized
block-wise directly. A minimal record sketch follows below.
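For illustration only, here is a minimal sketch of what such a block-wise record could look like under Hadoop's Writable contract; the class name and field layout are hypothetical, not an existing Mahout format:
{code:java}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical dense matrix block: block coordinates plus a row-major payload.
public class MatrixBlockWritable implements Writable {
  private int blockRow;    // block coordinate along m
  private int blockCol;    // block coordinate along n
  private int rows, cols;  // actual dimensions (edge blocks may be smaller)
  private double[] data;   // dense row-major payload, rows * cols entries

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(blockRow);
    out.writeInt(blockCol);
    out.writeInt(rows);
    out.writeInt(cols);
    for (double d : data) {
      out.writeDouble(d);
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    blockRow = in.readInt();
    blockCol = in.readInt();
    rows = in.readInt();
    cols = in.readInt();
    data = new double[rows * cols];
    for (int i = 0; i < data.length; i++) {
      data[i] = in.readDouble();
    }
  }
}
{code}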
> Implement Map-reduce version of stochastic SVD
> ----------------------------------------------
>
> Key: MAHOUT-376
> URL: https://issues.apache.org/jira/browse/MAHOUT-376
> Project: Mahout
> Issue Type: Improvement
> Components: Math
> Reporter: Ted Dunning
> Assignee: Ted Dunning
> Fix For: 0.5
>
> Attachments: MAHOUT-376.patch, Modified stochastic svd algorithm for
> mapreduce.pdf, QR decomposition for Map.pdf, QR decomposition for Map.pdf, QR
> decomposition for Map.pdf, sd-bib.bib, sd.pdf, sd.pdf, sd.pdf, sd.pdf,
> sd.tex, sd.tex, sd.tex, sd.tex, SSVD working notes.pdf, SSVD working
> notes.pdf, SSVD working notes.pdf, ssvd-CDH3-or-0.21.patch.gz,
> ssvd-CDH3-or-0.21.patch.gz, ssvd-CDH3-or-0.21.patch.gz, ssvd-m1.patch.gz,
> ssvd-m2.patch.gz, ssvd-m3.patch.gz, Stochastic SVD using eigensolver trick.pdf
>
>
> See attached pdf for outline of proposed method.
> All comments are welcome.