[ https://issues.apache.org/jira/browse/SPARK-26881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771138#comment-16771138 ]
Rafael RENAUDIN-AVINO commented on SPARK-26881: ----------------------------------------------- Basically I see two improvements that could be made: 1- Allow the depth of the aggregation to be configurable by the user 2- If not specified by the user, the depth of the treeAggregate can be computed with the heuristic in the ticket description. I'm not sure about the value of 2- without integrating 1-: if a framework uses a heuristic to determine a parameter without giving the user the option of setting the parameter himself, it might be annoying... > Scaling issue with Gramian computation for RowMatrix: too many results sent > to driver > ------------------------------------------------------------------------------------- > > Key: SPARK-26881 > URL: https://issues.apache.org/jira/browse/SPARK-26881 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 2.2.0 > Reporter: Rafael RENAUDIN-AVINO > Priority: Minor > > This issue hit me when running PCA on large dataset (~1Billion rows, ~30k > columns). > Computing Gramian of a big RowMatrix allows to reproduce the issue. > > The problem arises in the treeAggregate phase of the gramian matrix > computation: results sent to driver are enormous. > A potential solution to this could be to replace the hard coded depth (2) of > the tree aggregation by a heuristic computed based on the number of > partitions, driver max result size, and memory size of the dense vectors that > are being aggregated, cf below for more detail: > (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size > I have a potential fix ready (currently testing it at scale), but I'd like to > hear the community opinion about such a fix to know if it's worth investing > my time into a clean pull request. > > Note that I only faced this issue with spark 2.2 but I suspect it affects > later versions aswell. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org