[ https://issues.apache.org/jira/browse/SPARK-30443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072857#comment-17072857 ]
Ben Roling commented on SPARK-30443:
------------------------------------

Thanks [~xiaojuwu]. Using explain() on the test program [~ltrichter] provided does show SortMergeJoin, and the same code does not exhibit the issue in Spark 2.4.5. It does seem SPARK-21492 is likely to be the cause.

> "Managed memory leak detected" even with no calls to take() or limit()
> ----------------------------------------------------------------------
>
>                 Key: SPARK-30443
>                 URL: https://issues.apache.org/jira/browse/SPARK-30443
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.2, 2.4.4, 3.0.0
>            Reporter: Luke Richter
>            Priority: Major
>         Attachments: a.csv.zip, b.csv.zip, c.csv.zip
>
>
> Our Spark code is causing a "Managed memory leak detected" warning to appear, even though we are not calling take() or limit().
> According to SPARK-14168 (https://issues.apache.org/jira/browse/SPARK-14168), managed memory leaks should only be caused by not reading an iterator to completion, i.e. by take() or limit().
> Our exact warning text is: "2020-01-06 14:54:59 WARN Executor:66 - Managed memory leak detected; size = 2097152 bytes, TID = 118"
> The size of the managed memory leak is always 2MB.
> I have created a minimal test program that reproduces the warning:
> {code:python}
> import pyspark.sql
> import pyspark.sql.functions as fx
>
>
> def main():
>     builder = pyspark.sql.SparkSession.builder
>     builder = builder.appName("spark-jira")
>     spark = builder.getOrCreate()
>
>     reader = spark.read
>     reader = reader.format("csv")
>     reader = reader.option("inferSchema", "true")
>     reader = reader.option("header", "true")
>
>     table_c = reader.load("c.csv")
>     table_a = reader.load("a.csv")
>     table_b = reader.load("b.csv")
>
>     primary_filter = fx.col("some_code").isNull()
>     new_primary_data = table_a.filter(primary_filter)
>     new_ids = new_primary_data.select("some_id")
>
>     new_data = table_b.join(new_ids, "some_id")
>     new_data = new_data.select("some_id")
>
>     result = table_c.join(new_data, "some_id", "left")
>     result.repartition(1).write.json("results.json", mode="overwrite")
>     spark.stop()
>
>
> if __name__ == "__main__":
>     main()
> {code}
> Our code isn't anything out of the ordinary, just some filters, selects, and joins.
> The input data is made up of 3 CSV files. The input data files are quite large, roughly 2.6GB in total uncompressed. I attempted to reduce the number of rows in the CSV input files, but this caused the warning to no longer appear. After compressing the files I was able to attach them below.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org