[ https://issues.apache.org/jira/browse/SPARK-26881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774750#comment-16774750 ]

Sean Owen commented on SPARK-26881:
-----------------------------------

I wouldn't make it configurable. I don't think it makes sense to set it so 
that the collect size is either smaller or larger than the max driver result 
size. It's already indirectly configurable via the max result size, although 
that setting's meaning is only loosely related.

> Scaling issue with Gramian computation for RowMatrix: too many results sent 
> to driver
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-26881
>                 URL: https://issues.apache.org/jira/browse/SPARK-26881
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.2.0
>            Reporter: Rafael RENAUDIN-AVINO
>            Priority: Minor
>
> This issue hit me when running PCA on a large dataset (~1 billion rows, 
> ~30k columns).
> Computing the Gramian of a large RowMatrix is enough to reproduce the issue.
>  
> The problem arises in the treeAggregate phase of the Gramian matrix 
> computation: the partial results sent to the driver are enormous.
> A potential solution would be to replace the hard-coded depth (2) of the 
> tree aggregation with a heuristic computed from the number of partitions, 
> the driver max result size, and the memory size of the dense vectors being 
> aggregated, per the inequality below (a sketch of such a heuristic follows 
> this quoted description):
> (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size
> I have a potential fix ready (currently testing it at scale), but I'd like 
> to hear the community's opinion on such a fix to know whether it's worth 
> investing my time in a clean pull request.
>  
> Note that I have only hit this issue with Spark 2.2, but I suspect it 
> affects later versions as well.
>  
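
For context, a minimal sketch of the depth heuristic described in the quoted 
report, in Scala since that is where RowMatrix lives. It searches for the 
smallest treeAggregate depth satisfying the reporter's inequality; the name 
suggestedTreeDepth and its parameters are illustrative only, not existing 
Spark API.

// Sketch only: picks the smallest depth such that
//   numPartitions^(1/depth) * denseVectorBytes <= maxDriverResultBytes,
// i.e. the last aggregation wave's partial results fit in the driver limit.
// All names here are hypothetical; per the report, Spark currently
// hard-codes depth = 2 for the Gramian's treeAggregate.
def suggestedTreeDepth(numPartitions: Int,
                       denseVectorBytes: Long,
                       maxDriverResultBytes: Long): Int = {
  require(numPartitions > 0 && denseVectorBytes > 0 && maxDriverResultBytes > 0)
  var depth = 2 // current hard-coded default
  while (depth < 10 && // cap the search; deeper trees add scheduling overhead
         math.pow(numPartitions.toDouble, 1.0 / depth) * denseVectorBytes >
           maxDriverResultBytes) {
    depth += 1
  }
  depth
}

For scale: the Gramian aggregation buffer packs the upper triangle of an 
n x n matrix, about n(n+1)/2 doubles, so with n ~ 30k each partial result is 
roughly 3.6 GB, which is why even a shallow tree can overwhelm the driver.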


