[ 
https://issues.apache.org/jira/browse/MAHOUT-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965742#action_12965742
 ] 

Dmitriy Lyubimov commented on MAHOUT-376:
-----------------------------------------

There is a catch though. With this implementation, m is memory bound. It is not 
in mapper though but in combiners and mapper of the next step. 

But my reasoning was, with 1G memory and k+p it seems to exist rather wide 
spectrum of admissible combinations of r (block height) and minSplitSize 
(governing essentually # of mappers needed) that would cover billion rows, and 
the sweet spot of this combination seem to exceed 1 bln rows. 

In addition, there are 2 remedies to consider. First is rather straightforward 
application of compression on R sequence. 

Second remedy results from the fact that our QR merge process is hierarchical. 
Right now it's two-level hierarchy. I.e if processes at each stage merge 1000 q 
blocks, then at the next level we can merge another 1000 q blocks, so total 
number of q blocks is thus approx. 1 mln. (for 1G memory the number of some 
600-800k blocks is more plausible). Assuming Q blocks are k+p rows high, that 
gives us aprroximately 300 mln rows for m. But the trick is that if we have 1G 
memory in the mapper, then Q blocks don't have to be 500 rows high, then can 
easily be 500k rows high. Which immediately puts us in th range of 30 bln rows 
or so. 

But if we add another map-only pass over blocked Q data, then we can have 1 mln 
blocks with all the considerations above and that should put us in the range of 
30 trillion rows of A. This number grows exponentially with every added MR pass 
which is why i am saying m is virtually unbound. 

Adding these remedies seems to be pretty straighforward, but for a first stab 
at the problem my estimates for m bound seem to be adequate. With these kind of 
numbers, this may easily become a technology in a search of a problem. We may 
add some analysis on optimality of combination of block size and minSplitSize. 
My initial thought is that finding maximum here is pretty straigtoward, seems 
to be a task of finding maximum on a second degree polynomial function.

It's more likely that much more memory would go into precision effort rather 
than maximizing m bound though. E.g. instead of covering the scale, the 
resources may go into increasing precision and oversampling. In which case 
additional map-only passes over Q will be tremendously useful. (imagine this 
could do k+p=50000 with just one additional map-only pass over Q data.) If this 
is the case, then the next low hanging fruit step is to add map-only 
hierarchical merge of Rs on Q blocks.

However, it's a stochastic algorithm and as such it is probably not good enough 
for processes that would require such precision (e.g certainly not to do math 
work on rocket boosters, i think). That said, k+p=50000 probably doesn't make 
sense. I think applications sharply divide into 2 categories, where precision 
requirements are either much higher than that, or much lower. I can't think of 
much in between. 



> Implement Map-reduce version of stochastic SVD
> ----------------------------------------------
>
>                 Key: MAHOUT-376
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-376
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>             Fix For: 0.5
>
>         Attachments: MAHOUT-376.patch, Modified stochastic svd algorithm for 
> mapreduce.pdf, QR decomposition for Map.pdf, QR decomposition for Map.pdf, QR 
> decomposition for Map.pdf, sd-bib.bib, sd.pdf, sd.pdf, sd.pdf, sd.pdf, 
> sd.tex, sd.tex, sd.tex, sd.tex, SSVD working notes.pdf, SSVD working 
> notes.pdf, SSVD working notes.pdf, ssvd-CDH3-or-0.21.patch.gz, 
> ssvd-m1.patch.gz, ssvd-m2.patch.gz, ssvd-m3.patch.gz, Stochastic SVD using 
> eigensolver trick.pdf
>
>
> See attached pdf for outline of proposed method.
> All comments are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to