Travis Hegner created SPARK-21958:
-------------------------------------

             Summary: Attempting to save large Word2Vec model hangs driver in 
constant GC.
                 Key: SPARK-21958
                 URL: https://issues.apache.org/jira/browse/SPARK-21958
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.2.0
         Environment: Running Spark on YARN; Hadoop 2.7.2 provided by the cluster
            Reporter: Travis Hegner


In the new version of Word2Vec, model saving was modified to estimate an appropriate number of partitions based on the Kryo buffer size. This is a great improvement, but there is a caveat for very large models.
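
For reference, the estimate is roughly of this shape (a sketch only; the method name and constants here are my own approximations, not the exact code):

{code:scala}
// Rough sketch of the partition estimate (approximate names/constants):
def estimateNumPartitions(bufferSizeInBytes: Long, numWords: Int, vectorSize: Int): Int = {
  val floatSize = 4L                 // bytes per Float in the vector
  val averageWordSize = 15L          // assumed average serialized word size
  val approximateSizeInBytes = (floatSize * vectorSize + averageWordSize) * numWords
  // each partition's serialized form has to fit within the max Kryo buffer
  ((approximateSizeInBytes / bufferSizeInBytes) + 1).toInt
}
{code}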

The {{(word, vector)}} tuple goes through a transformation to a local case class, {{Data(word, vector)}}... I can only assume this is for the Kryo serialization process. The new version of the code iterates over the entire vocabulary to do this transformation in the driver's heap (the old version wrapped the entire datum), only to have the result then distributed to the cluster to be written into its Parquet files.
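
To illustrate, the save path looks roughly like this (paraphrased from memory; the names are approximate):

{code:scala}
// Sketch of the current flow (approximate names); assume these are in scope:
//   spark: SparkSession, wordVectors: Map[String, Array[Float]],
//   numPartitions: Int, dataPath: String
case class Data(word: String, vector: Array[Float])

// this map materializes one Data object per vocabulary entry, entirely in the driver heap
val dataSeq: Seq[Data] = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) }

// only afterwards is the result distributed and written out
spark.createDataFrame(dataSeq)
  .repartition(numPartitions)
  .write
  .parquet(dataPath)
{code}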

With extremely large vocabularies (~2 million docs, with uni-grams, bi-grams, and tri-grams), that local driver transformation causes the driver to hang indefinitely in GC. I can only assume that it is generating millions of short-lived objects which can't be collected fast enough.

Perhaps I'm overlooking something, but it seems to me that since the result is distributed over the cluster to be saved _after_ the transformation anyway, we may as well distribute it _first_, allowing the cluster resources to do the transformation more efficiently, and then write the Parquet files from there.
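
Concretely, the reordering I have in mind is something like this (an untested sketch, using the same assumed names as above):

{code:scala}
// Sketch of the proposed reordering (same assumed names as above)
import spark.implicits._

spark.createDataset(wordVectors.toSeq)                // distribute the raw (word, vector) pairs first
  .repartition(numPartitions)
  .map { case (word, vector) => Data(word, vector) }  // transformation now runs on the executors
  .toDF()
  .write
  .parquet(dataPath)
{code}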

I have a patch implemented and am in the process of testing it at scale. I will open a pull request once I'm confident that the patch resolves the issue and passes the unit tests.


