[jira] [Commented] (MAHOUT-922) SSVD: ABt Job tweaks for extra sparse inputs

[email protected] (Commented) (JIRA) Mon, 19 Dec 2011 12:31:57 -0800

    [ https://issues.apache.org/jira/browse/MAHOUT-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172589#comment-13172589 ]


[email protected] commented on MAHOUT-922:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3265/
-----------------------------------------------------------

Review request for mahout.


Summary
-------

MAHOUT-922-2: add DistributedCache broadcast of B' files for the AB' job and 
of R-hat files for the B' job. Broadcast is on by default and is governed by 
the -br option. 

Notes: Performance: I did not notice a difference between using the 
distributed cache and opening direct streams, which is understandable since 
the jobs are CPU-bound.
I did have to add some functionality to the multi-file sequence file 
iterators so that they accept an explicit list of files coming from the 
distributed cache, which is neither a glob nor a directory. I also fixed some 
corner-case NPEs there.

Sorry, Eclipse's style reformatting differs a bit from Sean's original 
formatting in IntelliJ, and it is hard to match it exactly. 


This addresses bug MAHOUT-922.
    https://issues.apache.org/jira/browse/MAHOUT-922


Diffs
-----


Diff: https://reviews.apache.org/r/3265/diff


Testing
-------


Thanks,

Dmitriy


                
> SSVD: ABt Job tweaks for extra sparse inputs
> --------------------------------------------
>
>                 Key: MAHOUT-922
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-922
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 0.6
>
>         Attachments: MAHOUT-922.patch, MAHOUT-922.patch, MAHOUT-922.patch
>
>
> Per tests on Sebastian's extremely sparse large inputs (4.5 mln x 4.5 mln), 
> AB' performance is still a bottleneck if one uses power iterations. For 
> sufficiently sparse inputs it may turn out that mappers cannot form the 
> entire blocked product Y_i in memory. The Y_i block is of size s x (k+p), 
> where s is the number of A rows read in a given split. When A is extra 
> sparse, such blocks may actually take more space than the A input itself. 
> When this happens, s is constrained by the -oh parameter, and combiners and 
> reducers get flooded by partial oh x (k+p) outer products and seem to have a 
> hard time sorting and shuffling them (especially high pressure on combiners 
> has been seen). 
> So, several improvements in this patch: 
> -- Present Y_i blocks as dense. They are believed to be dense anyway, so 
> keeping them sparse just eats up RAM on sparse encoding; with dense storage, 
> blocks at least twice as high can actually be formed.
> -- Eliminate combining completely. Instead of persisting, sorting and 
> summing up partial products in the combiner, sum them up map-side. If block 
> height is still insufficient and cannot be extended due to RAM constraints 
> (unlikely for Sebastian's 4.5 x 4.5 mln case), just perform additional 
> passes over B'. Since the computation is CPU-bound, additional passes over 
> B' should not register. However, eliminating the combiner phase for 
> high-load cases is probably going to have a dramatic effect.
> -- Set max block height for Q'A and AB' separately instead of using the 
> single -oh option. Their scaling seems to be quite different in terms of OOM 
> danger. In my experiments Q'A blocking enters the red zone at ~150,000 
> already, whereas AB' block height can easily roam over a million for the 
> same RAM. I provide 200,000 (~160 MB for k+p=100) as a default for AB' 
> blocks, which should be enough for Sebastian's 4.5 x 4.5 mln sparse case 
> without causing more than one block. 
> Miscellanea: 
> -- Test run time: removed redundant tests and checks for SSVD; reduced test 
> input size.
> -- Per Nathan's suggestion, the p parameter is now optional; the default is 
> 15 (computation time is cubic in it, so I want to be careful not to set it 
> too high by default).
> Current patch branch work is here: 
> https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-922
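
For readers following the block-size reasoning in the description, here is a 
minimal sketch of the arithmetic and of map-side accumulation into a dense 
block. It is not the code from the patch. The class and method names 
(DenseBlockSketch, denseBlockBytes, passesOverBt, multiplyBlock) are 
hypothetical, the sparse rows of A are modeled as plain index/value arrays, 
and B' is assumed to fit in an in-memory array purely for illustration.

```java
public class DenseBlockSketch {

    /** Bytes taken by a dense height x width block of doubles (JVM array overhead ignored). */
    static long denseBlockBytes(long height, long width) {
        return height * width * Double.BYTES;
    }

    /** Passes over B' needed when s rows of A are processed in blocks of at most maxHeight rows. */
    static long passesOverBt(long s, long maxHeight) {
        return (s + maxHeight - 1) / maxHeight; // ceiling division
    }

    /**
     * Accumulates the dense block Y = A * B' directly, with no intermediate
     * partial products to sort or combine.
     * aCols[i] and aVals[i] hold the column indices and values of the nonzeros
     * in row i of A; bt[j] is row j of B' and has k+p entries.
     */
    static double[][] multiplyBlock(int[][] aCols, double[][] aVals, double[][] bt) {
        int width = bt[0].length;
        double[][] y = new double[aCols.length][width];
        for (int i = 0; i < aCols.length; i++) {
            for (int n = 0; n < aCols[i].length; n++) {
                double a = aVals[i][n];
                double[] btRow = bt[aCols[i][n]];
                for (int c = 0; c < width; c++) {
                    y[i][c] += a * btRow[c]; // sum up in place, "map-side"
                }
            }
        }
        return y;
    }

    public static void main(String[] args) {
        // Default AB' block height from the description: 200,000 rows, k+p = 100.
        System.out.println(denseBlockBytes(200_000, 100) + " bytes");
        // Hypothetical split of 500,000 rows with the same block height.
        System.out.println(passesOverBt(500_000, 200_000) + " passes over B'");
    }
}
```

With 8-byte doubles, a 200,000 x 100 block comes to 160,000,000 bytes, which 
matches the ~160 MB figure quoted in the description. The split size of 
500,000 rows in main is an illustrative assumption, not a measured value.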

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
