[ https://issues.apache.org/jira/browse/SPARK-26881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16783526#comment-16783526 ]
Rafael RENAUDIN-AVINO commented on SPARK-26881:
-----------------------------------------------

Sure, I just started working on it. I was thinking about splitting it into two commits:

1- Add the heuristic, something like:
{code:java}
def getTreeAggregateIdealDepth(aggregatedObjectSizeInMb: Int): Int
{code}
to the RDD file, next to the treeAggregate method.

2- Use this heuristic to compute the depth of the tree aggregation in the `computeGramianMatrix` method of RowMatrix.

Anyway, since this will be my first contribution to Spark, I guess we'll have to iterate at least once on the PR (even though I'm trying my best to follow the guidelines, I'm pretty sure I'll miss something ^^), so I'll push as soon as I can.

> Scaling issue with Gramian computation for RowMatrix: too many results sent to driver
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-26881
>                 URL: https://issues.apache.org/jira/browse/SPARK-26881
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.2.0
>            Reporter: Rafael RENAUDIN-AVINO
>            Priority: Minor
>
> This issue hit me when running PCA on a large dataset (~1 billion rows, ~30k columns).
> Computing the Gramian of a big RowMatrix is enough to reproduce the issue.
>
> The problem arises in the treeAggregate phase of the Gramian matrix computation: the results sent to the driver are enormous.
> A potential solution is to replace the hard-coded depth (2) of the tree aggregation with a heuristic computed from the number of partitions, the driver max result size, and the memory size of the dense vectors being aggregated, such that:
> (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size
> I have a potential fix ready (currently testing it at scale), but I'd like to hear the community's opinion on such a fix to know whether it's worth investing my time in a clean pull request.
>
> Note that I only hit this issue with Spark 2.2, but I suspect it affects later versions as well.
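For illustration, here is a minimal sketch of what such a heuristic could look like, derived directly from the inequality quoted above: solving (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size for depth gives depth >= log(nb_partitions) / log(driver_max_result_size / dense_vector_size). This is an assumption-laden sketch, not the proposed patch: the extra numPartitions and maxDriverResultSizeInMb parameters, the bounds on the returned depth, and every name other than getTreeAggregateIdealDepth are invented for this example.

{code:java}
// Sketch only: pick the smallest treeAggregate depth such that
//   numPartitions^(1/depth) * aggregatedObjectSizeInMb <= maxDriverResultSizeInMb,
// i.e. depth >= log(numPartitions) / log(maxDriverResultSizeInMb / aggregatedObjectSizeInMb).
def getTreeAggregateIdealDepth(
    numPartitions: Int,
    aggregatedObjectSizeInMb: Int,
    maxDriverResultSizeInMb: Int): Int = {
  require(numPartitions > 0, "the RDD must have at least one partition")
  require(aggregatedObjectSizeInMb > 0, "aggregated object size must be positive")
  require(maxDriverResultSizeInMb > aggregatedObjectSizeInMb,
    "a single aggregated object must fit within the driver max result size")
  // How many partial results of this size can be sent to one node at one level.
  val maxResultsPerLevel = maxDriverResultSizeInMb.toDouble / aggregatedObjectSizeInMb
  // Smallest integer depth satisfying the inequality above.
  val idealDepth = math.ceil(math.log(numPartitions) / math.log(maxResultsPerLevel)).toInt
  // Never go below treeAggregate's default depth of 2, and cap the result
  // to avoid a degenerate, very deep aggregation tree.
  math.min(math.max(idealDepth, 2), 10)
}
{code}

The value returned by such a function would then replace the hard-coded depth in the treeAggregate call of computeGramianMatrix, with aggregatedObjectSizeInMb derived from the number of columns of the RowMatrix and maxDriverResultSizeInMb read from spark.driver.maxResultSize.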