Jonas Amrich created SPARK-18274:
------------------------------------

             Summary: Memory leak in PySpark StringIndexer
                 Key: SPARK-18274
                 URL: https://issues.apache.org/jira/browse/SPARK-18274
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.0.1
            Reporter: Jonas Amrich


StringIndexerModel is not collected by the GC in Java even when it is deleted
in Python. This can be reproduced with the following code, which fails after a
couple of iterations (around 7 if you set driver memory to 600MB):

{code}
import random, string
from pyspark.ml.feature import StringIndexer

# 700000 random strings of 10 characters
l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), )
     for _ in range(int(7e5))]
df = spark.createDataFrame(l, ['string'])

for i in range(50):
    indexer = StringIndexer(inputCol='string', outputCol='index')
    indexer.fit(df)
{code}

An explicit call to the Python GC fixes the issue; the following code runs fine:

{code}
import gc

for i in range(50):
    indexer = StringIndexer(inputCol='string', outputCol='index')
    indexer.fit(df)
    gc.collect()
{code}

The issue is similar to SPARK-6194 and can probably be fixed by calling the JVM
detach in the model's destructor. This is implemented in
pyspark.mllib.common.JavaModelWrapper but missing in
pyspark.ml.wrapper.JavaWrapper. Other models in the ml package may also be
affected by this memory leak.
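
The destructor idea can be illustrated without a running Spark cluster. The
following is only a minimal sketch of the pattern, not the actual PySpark code:
FakeGateway and JavaWrapperSketch are hypothetical names made up for this
example, and FakeGateway merely stands in for a gateway object that exposes a
detach() method.

```python
class FakeGateway:
    """Hypothetical stand-in for the JVM gateway; records detached objects."""

    def __init__(self):
        self.detached = []

    def detach(self, java_obj):
        # A real gateway would release the JVM-side reference here.
        self.detached.append(java_obj)


class JavaWrapperSketch:
    """Illustrative wrapper only, not the real pyspark.ml.wrapper.JavaWrapper."""

    def __init__(self, gateway, java_obj):
        self._gateway = gateway
        self._java_obj = java_obj

    def __del__(self):
        # Detach the wrapped Java object when the Python wrapper is destroyed.
        # getattr guards against a partially initialised instance.
        gateway = getattr(self, "_gateway", None)
        java_obj = getattr(self, "_java_obj", None)
        if gateway is not None and java_obj is not None:
            gateway.detach(java_obj)


gateway = FakeGateway()
model = JavaWrapperSketch(gateway, "StringIndexerModel@1")
del model  # in CPython, dropping the last reference runs __del__ right away
print(gateway.detached)
```

With a destructor like this, the Java-side object would be released as soon as
the Python wrapper goes away, instead of waiting for a later garbage
collection pass.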



