[ https://issues.apache.org/jira/browse/SPARK-30443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072857#comment-17072857 ]
Ben Roling commented on SPARK-30443:
------------------------------------

Thanks [~xiaojuwu]. Using explain() on the test program [~ltrichter] provided does show SortMergeJoin, and the same code does not exhibit the issue in Spark 2.4.5. It does seem SPARK-21492 is likely to be the cause.

> "Managed memory leak detected" even with no calls to take() or limit()
> ----------------------------------------------------------------------
>
>                 Key: SPARK-30443
>                 URL: https://issues.apache.org/jira/browse/SPARK-30443
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.2, 2.4.4, 3.0.0
>            Reporter: Luke Richter
>            Priority: Major
>         Attachments: a.csv.zip, b.csv.zip, c.csv.zip
>
>
> Our Spark code is causing a "Managed memory leak detected" warning to appear, even though we are not calling take() or limit().
> According to SPARK-14168 (https://issues.apache.org/jira/browse/SPARK-14168), managed memory leaks should only be caused by not reading an iterator to completion, i.e. by take() or limit().
> Our exact warning text is: "2020-01-06 14:54:59 WARN Executor:66 - Managed memory leak detected; size = 2097152 bytes, TID = 118"
> The size of the managed memory leak is always 2MB.
> I have created a minimal test program that reproduces the warning:
> {code:python}
> import pyspark.sql
> import pyspark.sql.functions as fx
>
>
> def main():
>     builder = pyspark.sql.SparkSession.builder
>     builder = builder.appName("spark-jira")
>     spark = builder.getOrCreate()
>
>     reader = spark.read
>     reader = reader.format("csv")
>     reader = reader.option("inferSchema", "true")
>     reader = reader.option("header", "true")
>
>     table_c = reader.load("c.csv")
>     table_a = reader.load("a.csv")
>     table_b = reader.load("b.csv")
>
>     primary_filter = fx.col("some_code").isNull()
>     new_primary_data = table_a.filter(primary_filter)
>     new_ids = new_primary_data.select("some_id")
>
>     new_data = table_b.join(new_ids, "some_id")
>     new_data = new_data.select("some_id")
>
>     result = table_c.join(new_data, "some_id", "left")
>     result.repartition(1).write.json("results.json", mode="overwrite")
>     spark.stop()
>
>
> if __name__ == "__main__":
>     main()
> {code}
> Our code isn't anything out of the ordinary, just some filters, selects, and joins.
> The input data is made up of 3 CSV files. The input data files are quite large, roughly 2.6GB in total uncompressed. I attempted to reduce the number of rows in the CSV input files, but this caused the warning to no longer appear. After compressing the files I was able to attach them below.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org