[ https://issues.apache.org/jira/browse/SPARK-26881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771094#comment-16771094 ]
Sean Owen commented on SPARK-26881: ----------------------------------- It's related but not quite the same issue. (You should try 2.3.3 at least with then other fix for a perf improvement.) To be clear this proposes a deeper treeAggregate in some cases? Makes sense. The cost is more overall wall clock time and a little more network traffic. But I agree that it needs to be capped by the max result set size as you say. Go ahead with that. > Scaling issue with Gramian computation for RowMatrix: too many results sent > to driver > ------------------------------------------------------------------------------------- > > Key: SPARK-26881 > URL: https://issues.apache.org/jira/browse/SPARK-26881 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 2.2.0 > Reporter: Rafael RENAUDIN-AVINO > Priority: Minor > > This issue hit me when running PCA on large dataset (~1Billion rows, ~30k > columns). > Computing Gramian of a big RowMatrix allows to reproduce the issue. > > The problem arises in the treeAggregate phase of the gramian matrix > computation: results sent to driver are enormous. > A potential solution to this could be to replace the hard coded depth (2) of > the tree aggregation by a heuristic computed based on the number of > partitions, driver max result size, and memory size of the dense vectors that > are being aggregated, cf below for more detail: > (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size > I have a potential fix ready (currently testing it at scale), but I'd like to > hear the community opinion about such a fix to know if it's worth investing > my time into a clean pull request. > > Note that I only faced this issue with spark 2.2 but I suspect it affects > later versions aswell. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org