[ 
https://issues.apache.org/jira/browse/SPARK-21958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-21958.
------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0

Issue resolved by pull request 19191
[https://github.com/apache/spark/pull/19191]

> Attempting to save large Word2Vec model hangs driver in constant GC.
> --------------------------------------------------------------------
>
>                 Key: SPARK-21958
>                 URL: https://issues.apache.org/jira/browse/SPARK-21958
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.2.0
>         Environment: Running spark on yarn, hadoop 2.7.2 provided by the 
> cluster
>            Reporter: Travis Hegner
>              Labels: easyfix, patch, performance
>             Fix For: 2.3.0
>
>
> In the new version of Word2Vec, the model saving was modified to estimate an 
> appropriate number of partitions based on the kryo buffer size. This is a 
> great improvement, but there is a caveat for very large models.
> The {{(word, vector)}} tuple goes through a transformation to a local case 
> class of {{Data(word, vector)}}; I can only assume this is for the kryo 
> serialization process. The new version of the code iterates over the entire 
> vocabulary to do this transformation (the old version wrapped the entire 
> datum) in the driver's heap, only to have the result then distributed to the 
> cluster to be written into its parquet files.
> With extremely large vocabularies (~2 million docs, with uni-grams, bi-grams, 
> and tri-grams), that local driver transformation causes the driver to hang 
> indefinitely in GC; I can only assume it is generating millions of 
> short-lived objects which can't be collected fast enough.
> Perhaps I'm overlooking something, but it seems to me that since the result 
> is distributed over the cluster to be saved _after_ the transformation 
> anyway, we may as well distribute it _first_, allowing the cluster resources 
> to do the transformation more efficiently, and then write the parquet file 
> from there.
> I have a patch implemented, and am in the process of testing it at scale. I 
> will open a pull request when I feel that the patch is successfully resolving 
> the issue, and after making sure that it passes unit tests.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
