[
https://issues.apache.org/jira/browse/MAHOUT-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167205#comment-13167205
]
Sebastian Schelter commented on MAHOUT-922:
-------------------------------------------
A few details on the test case: I'm trying to compute the first 10-20
eigenvalues of the symmetric adjacency matrix of the Wikipedia page link graph
from http://users.on.net/~henry/home/wikipedia.htm.
> SSVD: ABt Job tweaks for extra sparse inputs
> --------------------------------------------
>
> Key: MAHOUT-922
> URL: https://issues.apache.org/jira/browse/MAHOUT-922
> Project: Mahout
> Issue Type: Improvement
> Components: Math
> Affects Versions: 0.6
> Reporter: Dmitriy Lyubimov
> Assignee: Dmitriy Lyubimov
> Fix For: 0.6
>
>
> Tests on Sebastian's extremely sparse large inputs (4.5m x 4.5m) show that
> AB' performance is still a bottleneck when power iterations are used. For
> sufficiently sparse inputs, mappers may be unable to form the entire blocked
> product Y_i in memory. The Y_i block has size s x (k+p), where s is the
> number of A rows read in a given split. When A is extra sparse, such blocks
> may actually take more space than the A input itself. When this happens, s
> is capped by the -oh parameter, and combiners and reducers get flooded with
> partial oh x (k+p) outer products and seem to have a hard time sorting and
> shuffling them (especially high pressure on combiners has been seen).
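
To put rough numbers on the block-size argument above, here is a minimal sketch that compares a dense s x (k+p) block with the sparse A rows it is computed from. The helper names, the 12-bytes-per-sparse-entry estimate and the average of 20 non-zeros per row are illustrative assumptions, not values taken from the patch or from the Wikipedia data set.

```python
# Rough memory estimate for one Y_i block versus the A rows that produce it.
# Assumptions (not from the patch): 8-byte doubles, and 12 bytes per sparse
# entry (4-byte column index + 8-byte value), ignoring object overhead.

def dense_block_bytes(rows, k_plus_p, bytes_per_value=8):
    """Size of a dense s x (k+p) block of doubles."""
    return rows * k_plus_p * bytes_per_value

def sparse_rows_bytes(rows, avg_nnz_per_row, bytes_per_entry=12):
    """Approximate size of s sparse rows of A."""
    return rows * avg_nnz_per_row * bytes_per_entry

s = 200_000      # block height (the AB' default proposed in this issue)
k_plus_p = 100   # k+p used for the ~160 MB figure in this issue
avg_nnz = 20     # hypothetical average number of non-zeros per row of A

y_bytes = dense_block_bytes(s, k_plus_p)
a_bytes = sparse_rows_bytes(s, avg_nnz)
print(f"Y_i block: {y_bytes / 1e6:.0f} MB")
print(f"A rows:    {a_bytes / 1e6:.0f} MB")
print(f"ratio:     {y_bytes / a_bytes:.1f}x")
```

Under these assumptions the dense block is several times larger than the input rows, which is the situation the description refers to; the sparser A is, the larger the ratio.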
> So, several improvements in this patch:
> -- present Y_i blocks as dense (they are believed to be dense anyway, so
> keeping them sparse just eats up RAM on sparse encoding; with dense storage,
> blocks at least twice as high can be formed);
> -- eliminate combining completely. Instead of persisting, sorting and
> summing up partial products in the combiner, sum up map-side. If block height
> is still insufficient and cannot be extended due to RAM constraints (unlikely
> for Sebastian's 4.5 x 4.5 mln case), just perform additional passes over B'.
> Since the computation is CPU-bound, additional passes over B' should not
> register. However, eliminating the combiner phase for high-load cases is
> probably going to have a dramatic effect.
> -- set max block height for Q'A and AB' separately instead of using the
> single -oh option. Their scaling seems to be quite different in terms of OOM
> danger. In my experiments Q'A blocking enters the red zone at ~150,000
> already, whereas AB' block height can easily go over a million for the
> same RAM. I provide 200,000 (~160 MB for k+p=100) as a default for AB'
> blocks, which should be enough for Sebastian's 4.5 x 4.5 mln sparse case
> without causing more than one block.
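
The map-side summation idea from the list above can be sketched as follows. This is a plain-Python illustration, not the Mahout implementation: the function name `abt_map_side`, the dict-based sparse rows and the `read_bt` callable are all hypothetical. Each block of A rows gets one dense accumulator that is updated in place while B' is streamed, and a block height smaller than the split simply costs an extra pass over B' rather than emitting partial products for a combiner.

```python
def abt_map_side(a_rows, read_bt, width, max_block_height):
    """Compute Y = A * B' for the rows of one split, summing map-side.

    a_rows           -- list of sparse rows of A, each a dict {column: value}
    read_bt          -- callable returning a fresh iterator over B', yielding
                        (j, row) pairs where row is a list of `width` values
    width            -- k+p, the number of columns of the dense Y block
    max_block_height -- maximum number of Y rows held in memory at once
    """
    result = []
    for start in range(0, len(a_rows), max_block_height):
        block = a_rows[start:start + max_block_height]

        # Index the block by column so each streamed row of B' can find
        # the A entries it has to be multiplied with.
        by_col = {}
        for r, row in enumerate(block):
            for j, value in row.items():
                by_col.setdefault(j, []).append((r, value))

        # Dense accumulator for this block: height x (k+p).
        y = [[0.0] * width for _ in block]

        # One pass over B' per block; partial products are added in place
        # instead of being emitted, sorted and combined.
        for j, bt_row in read_bt():
            for r, value in by_col.get(j, ()):
                y_r = y[r]
                for c in range(width):
                    y_r[c] += value * bt_row[c]

        result.extend(y)
    return result


# Tiny usage example with made-up data: a 3 x 3 sparse A and k+p = 2.
bt = [(0, [1.0, 0.0]), (1, [0.0, 1.0]), (2, [1.0, 1.0])]
a = [{0: 1.0, 2: 2.0}, {1: 3.0}, {}]
print(abt_map_side(a, lambda: iter(bt), width=2, max_block_height=2))
```

With `max_block_height=2` the three rows are handled in two blocks, so B' is read twice; raising the block height to 3 or more gives the same result in a single pass.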
> Miscellanea:
> Test run time: removed redundant tests and checks for SSVD and reduced the
> test input size.
> Current patch branch work is here:
> https://github.com/dlyubimov/mahout-commits/tree/ABt-tweaks