[ https://issues.apache.org/jira/browse/SPARK-21958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nick Pentreath resolved SPARK-21958.
------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0

Issue resolved by pull request 19191
[https://github.com/apache/spark/pull/19191]

> Attempting to save large Word2Vec model hangs driver in constant GC.
> --------------------------------------------------------------------
>
>                 Key: SPARK-21958
>                 URL: https://issues.apache.org/jira/browse/SPARK-21958
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.2.0
>         Environment: Running Spark on YARN, Hadoop 2.7.2 provided by the
> cluster
>            Reporter: Travis Hegner
>              Labels: easyfix, patch, performance
>             Fix For: 2.3.0
>
>
> In the new version of Word2Vec, the model saving was modified to estimate an
> appropriate number of partitions based on the Kryo buffer size. This is a
> great improvement, but there is a caveat for very large models.
> The {{(word, vector)}} tuple goes through a transformation to a local case
> class of {{Data(word, vector)}}... I can only assume this is for the Kryo
> serialization process. The new version of the code iterates over the entire
> vocabulary to do this transformation in the driver's heap (the old version
> wrapped the entire datum), only to have the result then distributed to the
> cluster to be written into its parquet files.
> With extremely large vocabularies (~2 million docs, with uni-grams, bi-grams,
> and tri-grams), that local driver transformation causes the driver to hang
> indefinitely in GC. I can only assume that it generates millions of
> short-lived objects which can't be evicted fast enough.
> Perhaps I'm overlooking something, but it seems to me that since the result
> is distributed over the cluster to be saved _after_ the transformation
> anyway, we may as well distribute it _first_, allowing the cluster resources
> to do the transformation more efficiently, and then write the parquet file
> from there.
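
For illustration only: the "distribute first, transform second" idea in the description can be sketched roughly as below. This is not the code from pull request 19191. The object name, the {{saveDistributed}} helper, its parameters, and the shape of the {{Data}} case class are all assumptions made for the example; only the Spark calls themselves (parallelize, map, toDF, write.parquet) are standard API.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for the local Data(word, vector) case class
// mentioned in the report; the real field types may differ.
case class Data(word: String, vector: Array[Float])

object DistributeFirstSketch {
  // Hypothetical helper: ship the raw (word, vector) pairs to the cluster
  // first, then build the Data rows on the executors instead of the driver.
  def saveDistributed(
      spark: SparkSession,
      wordVectors: Map[String, Array[Float]],
      numPartitions: Int,
      path: String): Unit = {
    import spark.implicits._
    spark.sparkContext
      .parallelize(wordVectors.toSeq, numPartitions)     // distribute first
      .map { case (word, vector) => Data(word, vector) } // transform on executors
      .toDF()
      .write
      .parquet(path)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("w2v-save-sketch")
      .getOrCreate()
    // Toy vocabulary; a real model would hold millions of entries.
    val toy = Map("spark" -> Array(0.1f, 0.2f), "word2vec" -> Array(0.3f, 0.4f))
    val out = java.nio.file.Files.createTempDirectory("w2v").resolve("data").toString
    saveDistributed(spark, toy, numPartitions = 2, path = out)
    println(spark.read.parquet(out).count())
    spark.stop()
  }
}
```

In this sketch the driver only hands the existing pairs to parallelize; the per-word Data objects are allocated on the executors, which is the point the description is making. The partition count is passed in rather than estimated from the Kryo buffer size.
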
> I have a patch implemented, and am in the process of testing it at scale. I
> will open a pull request when I feel that the patch is successfully resolving
> the issue, and after making sure that it passes unit tests.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)