[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769725#comment-16769725 ] t oo commented on SPARK-20760: -- also facing this issue on 2.3 thrift server standalone mode > Memory Leak of RDD blocks > -- > > Key: SPARK-20760 > URL: https://issues.apache.org/jira/browse/SPARK-20760 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 >Reporter: Binzi Cao >Priority: Major > Attachments: RDD Blocks .png, RDD blocks in spark 2.1.1.png, Storage > in spark 2.1.1.png > > > Memory leak for RDD blocks for a long time running rdd process. > We have a long term running application, which is doing computations of > RDDs. and we found the RDD blocks are keep increasing in the spark ui page. > The rdd blocks and memory usage do not mach the cached rdds and memory. It > looks like spark keeps old rdd in memory and never released it or never got a > chance to release it. The job will eventually die of out of memory. > In addition, I'm not seeing this issue in spark 1.6. We are seeing the same > issue in Yarn Cluster mode both in kafka streaming and batch applications. > The issue in streaming is similar, however, it seems the rdd blocks grows a > bit slower than batch jobs. > The below is the sample code and it is reproducible by justing running it in > local mode. > Scala file: > {code} > import scala.concurrent.duration.Duration > import scala.util.{Try, Failure, Success} > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.rdd.RDD > import scala.concurrent._ > import ExecutionContext.Implicits.global > case class Person(id: String, name: String) > object RDDApp { > def run(sc: SparkContext) = { > while (true) { > val r = scala.util.Random > val data = (1 to r.nextInt(100)).toList.map { a => > Person(a.toString, a.toString) > } > val rdd = sc.parallelize(data) > rdd.cache > println("running") > val a = (1 to 100).toList.map { x => > Future(rdd.filter(_.id == x.toString).collect) > } > a.foreach { f => > println(Await.ready(f, Duration.Inf).value.get) > } > rdd.unpersist() > } > } > def main(args: Array[String]): Unit = { >val conf = new SparkConf().setAppName("test") > val sc = new SparkContext(conf) > run(sc) > } > } > {code} > build sbt file: > {code} > name := "RDDTest" > version := "0.1.1" > scalaVersion := "2.11.5" > libraryDependencies ++= Seq ( > "org.scalaz" %% "scalaz-core" % "7.2.0", > "org.scalaz" %% "scalaz-concurrent" % "7.2.0", > "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided", > "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided" > ) > addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1") > mainClass in assembly := Some("RDDApp") > test in assembly := {} > {code} > To reproduce it: > Just > {code} > spark-2.1.0-bin-hadoop2.7/bin/spark-submit --driver-memory 4G \ > --executor-memory 4G \ > --executor-cores 1 \ > --num-executors 1 \ > --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655369#comment-16655369 ] Wenchen Fan commented on SPARK-20760: - has this been fixed by https://issues.apache.org/jira/browse/SPARK-24889 ? > Memory Leak of RDD blocks > -- > > Key: SPARK-20760 > URL: https://issues.apache.org/jira/browse/SPARK-20760 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 >Reporter: Binzi Cao >Priority: Major > Attachments: RDD Blocks .png, RDD blocks in spark 2.1.1.png, Storage > in spark 2.1.1.png > > > Memory leak for RDD blocks for a long time running rdd process. > We have a long term running application, which is doing computations of > RDDs. and we found the RDD blocks are keep increasing in the spark ui page. > The rdd blocks and memory usage do not mach the cached rdds and memory. It > looks like spark keeps old rdd in memory and never released it or never got a > chance to release it. The job will eventually die of out of memory. > In addition, I'm not seeing this issue in spark 1.6. We are seeing the same > issue in Yarn Cluster mode both in kafka streaming and batch applications. > The issue in streaming is similar, however, it seems the rdd blocks grows a > bit slower than batch jobs. > The below is the sample code and it is reproducible by justing running it in > local mode. > Scala file: > {code} > import scala.concurrent.duration.Duration > import scala.util.{Try, Failure, Success} > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.rdd.RDD > import scala.concurrent._ > import ExecutionContext.Implicits.global > case class Person(id: String, name: String) > object RDDApp { > def run(sc: SparkContext) = { > while (true) { > val r = scala.util.Random > val data = (1 to r.nextInt(100)).toList.map { a => > Person(a.toString, a.toString) > } > val rdd = sc.parallelize(data) > rdd.cache > println("running") > val a = (1 to 100).toList.map { x => > Future(rdd.filter(_.id == x.toString).collect) > } > a.foreach { f => > println(Await.ready(f, Duration.Inf).value.get) > } > rdd.unpersist() > } > } > def main(args: Array[String]): Unit = { >val conf = new SparkConf().setAppName("test") > val sc = new SparkContext(conf) > run(sc) > } > } > {code} > build sbt file: > {code} > name := "RDDTest" > version := "0.1.1" > scalaVersion := "2.11.5" > libraryDependencies ++= Seq ( > "org.scalaz" %% "scalaz-core" % "7.2.0", > "org.scalaz" %% "scalaz-concurrent" % "7.2.0", > "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided", > "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided" > ) > addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1") > mainClass in assembly := Some("RDDApp") > test in assembly := {} > {code} > To reproduce it: > Just > {code} > spark-2.1.0-bin-hadoop2.7/bin/spark-submit --driver-memory 4G \ > --executor-memory 4G \ > --executor-cores 1 \ > --num-executors 1 \ > --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618098#comment-16618098 ] Sandish Kumar HN commented on SPARK-20760: -- I do see the issue in spark 2.1.1 & 2.2.0 and I was able to replicate the issue with above code snippets. > Memory Leak of RDD blocks > -- > > Key: SPARK-20760 > URL: https://issues.apache.org/jira/browse/SPARK-20760 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 >Reporter: Binzi Cao >Priority: Major > Attachments: RDD Blocks .png, RDD blocks in spark 2.1.1.png, Storage > in spark 2.1.1.png > > > Memory leak for RDD blocks for a long time running rdd process. > We have a long term running application, which is doing computations of > RDDs. and we found the RDD blocks are keep increasing in the spark ui page. > The rdd blocks and memory usage do not mach the cached rdds and memory. It > looks like spark keeps old rdd in memory and never released it or never got a > chance to release it. The job will eventually die of out of memory. > In addition, I'm not seeing this issue in spark 1.6. We are seeing the same > issue in Yarn Cluster mode both in kafka streaming and batch applications. > The issue in streaming is similar, however, it seems the rdd blocks grows a > bit slower than batch jobs. > The below is the sample code and it is reproducible by justing running it in > local mode. > Scala file: > {code} > import scala.concurrent.duration.Duration > import scala.util.{Try, Failure, Success} > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.rdd.RDD > import scala.concurrent._ > import ExecutionContext.Implicits.global > case class Person(id: String, name: String) > object RDDApp { > def run(sc: SparkContext) = { > while (true) { > val r = scala.util.Random > val data = (1 to r.nextInt(100)).toList.map { a => > Person(a.toString, a.toString) > } > val rdd = sc.parallelize(data) > rdd.cache > println("running") > val a = (1 to 100).toList.map { x => > Future(rdd.filter(_.id == x.toString).collect) > } > a.foreach { f => > println(Await.ready(f, Duration.Inf).value.get) > } > rdd.unpersist() > } > } > def main(args: Array[String]): Unit = { >val conf = new SparkConf().setAppName("test") > val sc = new SparkContext(conf) > run(sc) > } > } > {code} > build sbt file: > {code} > name := "RDDTest" > version := "0.1.1" > scalaVersion := "2.11.5" > libraryDependencies ++= Seq ( > "org.scalaz" %% "scalaz-core" % "7.2.0", > "org.scalaz" %% "scalaz-concurrent" % "7.2.0", > "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided", > "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided" > ) > addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1") > mainClass in assembly := Some("RDDApp") > test in assembly := {} > {code} > To reproduce it: > Just > {code} > spark-2.1.0-bin-hadoop2.7/bin/spark-submit --driver-memory 4G \ > --executor-memory 4G \ > --executor-cores 1 \ > --num-executors 1 \ > --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16051942#comment-16051942 ] Jose Soltren commented on SPARK-20760: -- Hello Ravi - current thinking is that this issue is a duplicate of SPARK-18991. I'll have a look at your JIRA and comment there if I have any further information. Thanks. > Memory Leak of RDD blocks > -- > > Key: SPARK-20760 > URL: https://issues.apache.org/jira/browse/SPARK-20760 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 >Reporter: Binzi Cao > Attachments: RDD blocks in spark 2.1.1.png, RDD Blocks .png, Storage > in spark 2.1.1.png > > > Memory leak for RDD blocks for a long time running rdd process. > We have a long term running application, which is doing computations of > RDDs. and we found the RDD blocks are keep increasing in the spark ui page. > The rdd blocks and memory usage do not mach the cached rdds and memory. It > looks like spark keeps old rdd in memory and never released it or never got a > chance to release it. The job will eventually die of out of memory. > In addition, I'm not seeing this issue in spark 1.6. We are seeing the same > issue in Yarn Cluster mode both in kafka streaming and batch applications. > The issue in streaming is similar, however, it seems the rdd blocks grows a > bit slower than batch jobs. > The below is the sample code and it is reproducible by justing running it in > local mode. > Scala file: > {code} > import scala.concurrent.duration.Duration > import scala.util.{Try, Failure, Success} > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.rdd.RDD > import scala.concurrent._ > import ExecutionContext.Implicits.global > case class Person(id: String, name: String) > object RDDApp { > def run(sc: SparkContext) = { > while (true) { > val r = scala.util.Random > val data = (1 to r.nextInt(100)).toList.map { a => > Person(a.toString, a.toString) > } > val rdd = sc.parallelize(data) > rdd.cache > println("running") > val a = (1 to 100).toList.map { x => > Future(rdd.filter(_.id == x.toString).collect) > } > a.foreach { f => > println(Await.ready(f, Duration.Inf).value.get) > } > rdd.unpersist() > } > } > def main(args: Array[String]): Unit = { >val conf = new SparkConf().setAppName("test") > val sc = new SparkContext(conf) > run(sc) > } > } > {code} > build sbt file: > {code} > name := "RDDTest" > version := "0.1.1" > scalaVersion := "2.11.5" > libraryDependencies ++= Seq ( > "org.scalaz" %% "scalaz-core" % "7.2.0", > "org.scalaz" %% "scalaz-concurrent" % "7.2.0", > "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided", > "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided" > ) > addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1") > mainClass in assembly := Some("RDDApp") > test in assembly := {} > {code} > To reproduce it: > Just > {code} > spark-2.1.0-bin-hadoop2.7/bin/spark-submit --driver-memory 4G \ > --executor-memory 4G \ > --executor-cores 1 \ > --num-executors 1 \ > --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16051897#comment-16051897 ] Ravi Teja commented on SPARK-20760: --- We are also facing similar issue but in spark streaming. Please see this issue [SPARK-19644|https://issues.apache.org/jira/browse/SPARK-19644] thread. Can you please tell me if you found the solution to this issue? > Memory Leak of RDD blocks > -- > > Key: SPARK-20760 > URL: https://issues.apache.org/jira/browse/SPARK-20760 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 >Reporter: Binzi Cao > Attachments: RDD blocks in spark 2.1.1.png, RDD Blocks .png, Storage > in spark 2.1.1.png > > > Memory leak for RDD blocks for a long time running rdd process. > We have a long term running application, which is doing computations of > RDDs. and we found the RDD blocks are keep increasing in the spark ui page. > The rdd blocks and memory usage do not mach the cached rdds and memory. It > looks like spark keeps old rdd in memory and never released it or never got a > chance to release it. The job will eventually die of out of memory. > In addition, I'm not seeing this issue in spark 1.6. We are seeing the same > issue in Yarn Cluster mode both in kafka streaming and batch applications. > The issue in streaming is similar, however, it seems the rdd blocks grows a > bit slower than batch jobs. > The below is the sample code and it is reproducible by justing running it in > local mode. > Scala file: > {code} > import scala.concurrent.duration.Duration > import scala.util.{Try, Failure, Success} > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.rdd.RDD > import scala.concurrent._ > import ExecutionContext.Implicits.global > case class Person(id: String, name: String) > object RDDApp { > def run(sc: SparkContext) = { > while (true) { > val r = scala.util.Random > val data = (1 to r.nextInt(100)).toList.map { a => > Person(a.toString, a.toString) > } > val rdd = sc.parallelize(data) > rdd.cache > println("running") > val a = (1 to 100).toList.map { x => > Future(rdd.filter(_.id == x.toString).collect) > } > a.foreach { f => > println(Await.ready(f, Duration.Inf).value.get) > } > rdd.unpersist() > } > } > def main(args: Array[String]): Unit = { >val conf = new SparkConf().setAppName("test") > val sc = new SparkContext(conf) > run(sc) > } > } > {code} > build sbt file: > {code} > name := "RDDTest" > version := "0.1.1" > scalaVersion := "2.11.5" > libraryDependencies ++= Seq ( > "org.scalaz" %% "scalaz-core" % "7.2.0", > "org.scalaz" %% "scalaz-concurrent" % "7.2.0", > "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided", > "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided" > ) > addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1") > mainClass in assembly := Some("RDDApp") > test in assembly := {} > {code} > To reproduce it: > Just > {code} > spark-2.1.0-bin-hadoop2.7/bin/spark-submit --driver-memory 4G \ > --executor-memory 4G \ > --executor-cores 1 \ > --num-executors 1 \ > --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042177#comment-16042177 ] Jose Soltren commented on SPARK-20760: -- Hi Binzi: 1. If this is so then there is some error in my understanding. 2. I don't have a good answer here yet, sorry. 3. Though these RDD operations appear serially in your test program, they are all asynchronous behind the scenes. In particular, unpersisting RDDs really amounts to the driver sending the executors a bunch of messages to do unpersisting. Since they are asynchronous, decoupled operations, it is possible for one to be faster than another. I'm not entirely sure precisely what changed since Spark 1.6 here. My investigation was focused on where all the memory was going, not in digesting all the changes across a major version of Spark. Cheers. > Memory Leak of RDD blocks > -- > > Key: SPARK-20760 > URL: https://issues.apache.org/jira/browse/SPARK-20760 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 >Reporter: Binzi Cao > Attachments: RDD blocks in spark 2.1.1.png, RDD Blocks .png, Storage > in spark 2.1.1.png > > > Memory leak for RDD blocks for a long time running rdd process. > We have a long term running application, which is doing computations of > RDDs. and we found the RDD blocks are keep increasing in the spark ui page. > The rdd blocks and memory usage do not mach the cached rdds and memory. It > looks like spark keeps old rdd in memory and never released it or never got a > chance to release it. The job will eventually die of out of memory. > In addition, I'm not seeing this issue in spark 1.6. We are seeing the same > issue in Yarn Cluster mode both in kafka streaming and batch applications. > The issue in streaming is similar, however, it seems the rdd blocks grows a > bit slower than batch jobs. > The below is the sample code and it is reproducible by justing running it in > local mode. > Scala file: > {code} > import scala.concurrent.duration.Duration > import scala.util.{Try, Failure, Success} > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.rdd.RDD > import scala.concurrent._ > import ExecutionContext.Implicits.global > case class Person(id: String, name: String) > object RDDApp { > def run(sc: SparkContext) = { > while (true) { > val r = scala.util.Random > val data = (1 to r.nextInt(100)).toList.map { a => > Person(a.toString, a.toString) > } > val rdd = sc.parallelize(data) > rdd.cache > println("running") > val a = (1 to 100).toList.map { x => > Future(rdd.filter(_.id == x.toString).collect) > } > a.foreach { f => > println(Await.ready(f, Duration.Inf).value.get) > } > rdd.unpersist() > } > } > def main(args: Array[String]): Unit = { >val conf = new SparkConf().setAppName("test") > val sc = new SparkContext(conf) > run(sc) > } > } > {code} > build sbt file: > {code} > name := "RDDTest" > version := "0.1.1" > scalaVersion := "2.11.5" > libraryDependencies ++= Seq ( > "org.scalaz" %% "scalaz-core" % "7.2.0", > "org.scalaz" %% "scalaz-concurrent" % "7.2.0", > "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided", > "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided" > ) > addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1") > mainClass in assembly := Some("RDDApp") > test in assembly := {} > {code} > To reproduce it: > Just > {code} > spark-2.1.0-bin-hadoop2.7/bin/spark-submit --driver-memory 4G \ > --executor-memory 4G \ > --executor-cores 1 \ > --num-executors 1 \ > --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042009#comment-16042009 ] Binzi Cao commented on SPARK-20760: --- Hi Jose, Thanks very much for your detailed explanation. I got some questions as below 1. If the rdd blocks in ui just reflects how many rdds have been created, why does the number of rdd blocks go up and down during the test? 2. How about the storage page? I can see there is a list of RDDs in that page, does that page also show something wrong? 3. In this case, the code flow is : create rdd, cache rdd, do computations and unpersist , and they are in serial. I don't quite understand how the RDD blocks are created faster than they can be unpersisted. In addition, the issue did not happen in spark 1.6. Regards Binzi > Memory Leak of RDD blocks > -- > > Key: SPARK-20760 > URL: https://issues.apache.org/jira/browse/SPARK-20760 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 >Reporter: Binzi Cao > Attachments: RDD blocks in spark 2.1.1.png, RDD Blocks .png, Storage > in spark 2.1.1.png > > > Memory leak for RDD blocks for a long time running rdd process. > We have a long term running application, which is doing computations of > RDDs. and we found the RDD blocks are keep increasing in the spark ui page. > The rdd blocks and memory usage do not mach the cached rdds and memory. It > looks like spark keeps old rdd in memory and never released it or never got a > chance to release it. The job will eventually die of out of memory. > In addition, I'm not seeing this issue in spark 1.6. We are seeing the same > issue in Yarn Cluster mode both in kafka streaming and batch applications. > The issue in streaming is similar, however, it seems the rdd blocks grows a > bit slower than batch jobs. > The below is the sample code and it is reproducible by justing running it in > local mode. > Scala file: > {code} > import scala.concurrent.duration.Duration > import scala.util.{Try, Failure, Success} > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.rdd.RDD > import scala.concurrent._ > import ExecutionContext.Implicits.global > case class Person(id: String, name: String) > object RDDApp { > def run(sc: SparkContext) = { > while (true) { > val r = scala.util.Random > val data = (1 to r.nextInt(100)).toList.map { a => > Person(a.toString, a.toString) > } > val rdd = sc.parallelize(data) > rdd.cache > println("running") > val a = (1 to 100).toList.map { x => > Future(rdd.filter(_.id == x.toString).collect) > } > a.foreach { f => > println(Await.ready(f, Duration.Inf).value.get) > } > rdd.unpersist() > } > } > def main(args: Array[String]): Unit = { >val conf = new SparkConf().setAppName("test") > val sc = new SparkContext(conf) > run(sc) > } > } > {code} > build sbt file: > {code} > name := "RDDTest" > version := "0.1.1" > scalaVersion := "2.11.5" > libraryDependencies ++= Seq ( > "org.scalaz" %% "scalaz-core" % "7.2.0", > "org.scalaz" %% "scalaz-concurrent" % "7.2.0", > "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided", > "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided" > ) > addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1") > mainClass in assembly := Some("RDDApp") > test in assembly := {} > {code} > To reproduce it: > Just > {code} > spark-2.1.0-bin-hadoop2.7/bin/spark-submit --driver-memory 4G \ > --executor-memory 4G \ > --executor-cores 1 \ > --num-executors 1 \ > --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16041027#comment-16041027 ] Jose Soltren commented on SPARK-20760: -- Hi - I've been researching this issue, and I agree with David above. I'm nearly certain SPARK-18991 fixes this issue. I ran the application until it was killed with OOM, inspected the heap dump, and saw that the ConcurrentLinkedQueue in ContextCleaner was using several GB of RAM. This suggested to me that RDDs were being persisted faster than they could be unpersisted. I wouldn't call this a memory leak so much as a bottleneck. I'm running this again with SPARK-18991 and will have a look at a heap dump in a bit. Binzi, the Spark UI's reporting of RDD Blocks is misleading. This reported number is non-decreasing, and does not accurately reflect the number of RDDs that are currently active, only how many have been created. See executorspage.js. This is eventually backed by numBlocks in StorageUtils.scala, which, AFAICT, can never be negative, so this quantity in the UI is strictly monotonically increasing. I don't think you can use UI reporting of RDD Blocks as an indicator of correct behavior. Bounded memory usage and not being killed by YARN or by an OOM is a better way to validate this. It would be good if there were a unit test for SPARK-18991 validating that RDD blocks cannot be created faster than they can be unpersisted. > Memory Leak of RDD blocks > -- > > Key: SPARK-20760 > URL: https://issues.apache.org/jira/browse/SPARK-20760 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 >Reporter: Binzi Cao > Attachments: RDD blocks in spark 2.1.1.png, RDD Blocks .png, Storage > in spark 2.1.1.png > > > Memory leak for RDD blocks for a long time running rdd process. > We have a long term running application, which is doing computations of > RDDs. and we found the RDD blocks are keep increasing in the spark ui page. > The rdd blocks and memory usage do not mach the cached rdds and memory. It > looks like spark keeps old rdd in memory and never released it or never got a > chance to release it. The job will eventually die of out of memory. > In addition, I'm not seeing this issue in spark 1.6. We are seeing the same > issue in Yarn Cluster mode both in kafka streaming and batch applications. > The issue in streaming is similar, however, it seems the rdd blocks grows a > bit slower than batch jobs. > The below is the sample code and it is reproducible by justing running it in > local mode. > Scala file: > {code} > import scala.concurrent.duration.Duration > import scala.util.{Try, Failure, Success} > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.rdd.RDD > import scala.concurrent._ > import ExecutionContext.Implicits.global > case class Person(id: String, name: String) > object RDDApp { > def run(sc: SparkContext) = { > while (true) { > val r = scala.util.Random > val data = (1 to r.nextInt(100)).toList.map { a => > Person(a.toString, a.toString) > } > val rdd = sc.parallelize(data) > rdd.cache > println("running") > val a = (1 to 100).toList.map { x => > Future(rdd.filter(_.id == x.toString).collect) > } > a.foreach { f => > println(Await.ready(f, Duration.Inf).value.get) > } > rdd.unpersist() > } > } > def main(args: Array[String]): Unit = { >val conf = new SparkConf().setAppName("test") > val sc = new SparkContext(conf) > run(sc) > } > } > {code} > build sbt file: > {code} > name := "RDDTest" > version := "0.1.1" > scalaVersion := "2.11.5" > libraryDependencies ++= Seq ( > "org.scalaz" %% "scalaz-core" % "7.2.0", > "org.scalaz" %% "scalaz-concurrent" % "7.2.0", > "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided", > "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided" > ) > addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1") > mainClass in assembly := Some("RDDApp") > test in assembly := {} > {code} > To reproduce it: > Just > {code} > spark-2.1.0-bin-hadoop2.7/bin/spark-submit --driver-memory 4G \ > --executor-memory 4G \ > --executor-cores 1 \ > --num-executors 1 \ > --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040112#comment-16040112 ] Binzi Cao commented on SPARK-20760: --- Hi David, Thanks very much for the message, I did a test with spark 2.1.1 in local mode. The issue seems still happening, while it seems much better than spark 2.0 as the RDD blocks grows much slower. After running the task for 2 hours, I got around 6000 rdd blocks in memory. I attached the screenshots for the 2.1.1 Binzi > Memory Leak of RDD blocks > -- > > Key: SPARK-20760 > URL: https://issues.apache.org/jira/browse/SPARK-20760 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 >Reporter: Binzi Cao > Attachments: RDD Blocks .png > > > Memory leak for RDD blocks for a long time running rdd process. > We have a long term running application, which is doing computations of > RDDs. and we found the RDD blocks are keep increasing in the spark ui page. > The rdd blocks and memory usage do not mach the cached rdds and memory. It > looks like spark keeps old rdd in memory and never released it or never got a > chance to release it. The job will eventually die of out of memory. > In addition, I'm not seeing this issue in spark 1.6. We are seeing the same > issue in Yarn Cluster mode both in kafka streaming and batch applications. > The issue in streaming is similar, however, it seems the rdd blocks grows a > bit slower than batch jobs. > The below is the sample code and it is reproducible by justing running it in > local mode. > Scala file: > {code} > import scala.concurrent.duration.Duration > import scala.util.{Try, Failure, Success} > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.rdd.RDD > import scala.concurrent._ > import ExecutionContext.Implicits.global > case class Person(id: String, name: String) > object RDDApp { > def run(sc: SparkContext) = { > while (true) { > val r = scala.util.Random > val data = (1 to r.nextInt(100)).toList.map { a => > Person(a.toString, a.toString) > } > val rdd = sc.parallelize(data) > rdd.cache > println("running") > val a = (1 to 100).toList.map { x => > Future(rdd.filter(_.id == x.toString).collect) > } > a.foreach { f => > println(Await.ready(f, Duration.Inf).value.get) > } > rdd.unpersist() > } > } > def main(args: Array[String]): Unit = { >val conf = new SparkConf().setAppName("test") > val sc = new SparkContext(conf) > run(sc) > } > } > {code} > build sbt file: > {code} > name := "RDDTest" > version := "0.1.1" > scalaVersion := "2.11.5" > libraryDependencies ++= Seq ( > "org.scalaz" %% "scalaz-core" % "7.2.0", > "org.scalaz" %% "scalaz-concurrent" % "7.2.0", > "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided", > "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided" > ) > addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1") > mainClass in assembly := Some("RDDApp") > test in assembly := {} > {code} > To reproduce it: > Just > {code} > spark-2.1.0-bin-hadoop2.7/bin/spark-submit --driver-memory 4G \ > --executor-memory 4G \ > --executor-cores 1 \ > --num-executors 1 \ > --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16039375#comment-16039375 ] David Lewis commented on SPARK-20760: - I believe this is fixed by this issue: https://issues.apache.org/jira/browse/SPARK-18991 > Memory Leak of RDD blocks > -- > > Key: SPARK-20760 > URL: https://issues.apache.org/jira/browse/SPARK-20760 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 >Reporter: Binzi Cao > Attachments: RDD Blocks .png > > > Memory leak for RDD blocks for a long time running rdd process. > We have a long term running application, which is doing computations of > RDDs. and we found the RDD blocks are keep increasing in the spark ui page. > The rdd blocks and memory usage do not mach the cached rdds and memory. It > looks like spark keeps old rdd in memory and never released it or never got a > chance to release it. The job will eventually die of out of memory. > In addition, I'm not seeing this issue in spark 1.6. We are seeing the same > issue in Yarn Cluster mode both in kafka streaming and batch applications. > The issue in streaming is similar, however, it seems the rdd blocks grows a > bit slower than batch jobs. > The below is the sample code and it is reproducible by justing running it in > local mode. > Scala file: > {code} > import scala.concurrent.duration.Duration > import scala.util.{Try, Failure, Success} > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.rdd.RDD > import scala.concurrent._ > import ExecutionContext.Implicits.global > case class Person(id: String, name: String) > object RDDApp { > def run(sc: SparkContext) = { > while (true) { > val r = scala.util.Random > val data = (1 to r.nextInt(100)).toList.map { a => > Person(a.toString, a.toString) > } > val rdd = sc.parallelize(data) > rdd.cache > println("running") > val a = (1 to 100).toList.map { x => > Future(rdd.filter(_.id == x.toString).collect) > } > a.foreach { f => > println(Await.ready(f, Duration.Inf).value.get) > } > rdd.unpersist() > } > } > def main(args: Array[String]): Unit = { >val conf = new SparkConf().setAppName("test") > val sc = new SparkContext(conf) > run(sc) > } > } > {code} > build sbt file: > {code} > name := "RDDTest" > version := "0.1.1" > scalaVersion := "2.11.5" > libraryDependencies ++= Seq ( > "org.scalaz" %% "scalaz-core" % "7.2.0", > "org.scalaz" %% "scalaz-concurrent" % "7.2.0", > "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided", > "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided" > ) > addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1") > mainClass in assembly := Some("RDDApp") > test in assembly := {} > {code} > To reproduce it: > Just > {code} > spark-2.1.0-bin-hadoop2.7/bin/spark-submit --driver-memory 4G \ > --executor-memory 4G \ > --executor-cores 1 \ > --num-executors 1 \ > --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034206#comment-16034206 ] Patrick Brown commented on SPARK-20760: --- We have been having the same issue in a spark 2.0.2 cluster on yarn for a while now. > Memory Leak of RDD blocks > -- > > Key: SPARK-20760 > URL: https://issues.apache.org/jira/browse/SPARK-20760 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 >Reporter: Binzi Cao > Attachments: RDD Blocks .png > > > Memory leak for RDD blocks for a long time running rdd process. > We have a long term running application, which is doing computations of > RDDs. and we found the RDD blocks are keep increasing in the spark ui page. > The rdd blocks and memory usage do not mach the cached rdds and memory. It > looks like spark keeps old rdd in memory and never released it or never got a > chance to release it. The job will eventually die of out of memory. > In addition, I'm not seeing this issue in spark 1.6. We are seeing the same > issue in Yarn Cluster mode both in kafka streaming and batch applications. > The issue in streaming is similar, however, it seems the rdd blocks grows a > bit slower than batch jobs. > The below is the sample code and it is reproducible by justing running it in > local mode. > Scala file: > {code} > import scala.concurrent.duration.Duration > import scala.util.{Try, Failure, Success} > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.rdd.RDD > import scala.concurrent._ > import ExecutionContext.Implicits.global > case class Person(id: String, name: String) > object RDDApp { > def run(sc: SparkContext) = { > while (true) { > val r = scala.util.Random > val data = (1 to r.nextInt(100)).toList.map { a => > Person(a.toString, a.toString) > } > val rdd = sc.parallelize(data) > rdd.cache > println("running") > val a = (1 to 100).toList.map { x => > Future(rdd.filter(_.id == x.toString).collect) > } > a.foreach { f => > println(Await.ready(f, Duration.Inf).value.get) > } > rdd.unpersist() > } > } > def main(args: Array[String]): Unit = { >val conf = new SparkConf().setAppName("test") > val sc = new SparkContext(conf) > run(sc) > } > } > {code} > build sbt file: > {code} > name := "RDDTest" > version := "0.1.1" > scalaVersion := "2.11.5" > libraryDependencies ++= Seq ( > "org.scalaz" %% "scalaz-core" % "7.2.0", > "org.scalaz" %% "scalaz-concurrent" % "7.2.0", > "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided", > "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided" > ) > addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1") > mainClass in assembly := Some("RDDApp") > test in assembly := {} > {code} > To reproduce it: > Just > {code} > spark-2.1.0-bin-hadoop2.7/bin/spark-submit --driver-memory 4G \ > --executor-memory 4G \ > --executor-cores 1 \ > --num-executors 1 \ > --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012213#comment-16012213 ] Binzi Cao commented on SPARK-20760: --- You might need to wait for a bit more time to reproduce it. This time in my test, the rdd blocks is 64 for about 10 mins, and later it is over 100, after that it will increase quickly. Same code in the previous comments without any multi thread lines > Memory Leak of RDD blocks > -- > > Key: SPARK-20760 > URL: https://issues.apache.org/jira/browse/SPARK-20760 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 >Reporter: Binzi Cao > Attachments: RDD Blocks .png > > > Memory leak for RDD blocks for a long time running rdd process. > We have a long term running application, which is doing computations of > RDDs. and we found the RDD blocks are keep increasing in the spark ui page. > The rdd blocks and memory usage do not mach the cached rdds and memory. It > looks like spark keeps old rdd in memory and never released it or never got a > chance to release it. The job will eventually die of out of memory. > In addition, I'm not seeing this issue in spark 1.6. We are seeing the same > issue in Yarn Cluster mode both in kafka streaming and batch applications. > The issue in streaming is similar, however, it seems the rdd blocks grows a > bit slower than batch jobs. > The below is the sample code and it is reproducible by justing running it in > local mode. > Scala file: > {code} > import scala.concurrent.duration.Duration > import scala.util.{Try, Failure, Success} > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.rdd.RDD > import scala.concurrent._ > import ExecutionContext.Implicits.global > case class Person(id: String, name: String) > object RDDApp { > def run(sc: SparkContext) = { > while (true) { > val r = scala.util.Random > val data = (1 to r.nextInt(100)).toList.map { a => > Person(a.toString, a.toString) > } > val rdd = sc.parallelize(data) > rdd.cache > println("running") > val a = (1 to 100).toList.map { x => > Future(rdd.filter(_.id == x.toString).collect) > } > a.foreach { f => > println(Await.ready(f, Duration.Inf).value.get) > } > rdd.unpersist() > } > } > def main(args: Array[String]): Unit = { >val conf = new SparkConf().setAppName("test") > val sc = new SparkContext(conf) > run(sc) > } > } > {code} > build sbt file: > {code} > name := "RDDTest" > version := "0.1.1" > scalaVersion := "2.11.5" > libraryDependencies ++= Seq ( > "org.scalaz" %% "scalaz-core" % "7.2.0", > "org.scalaz" %% "scalaz-concurrent" % "7.2.0", > "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided", > "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided" > ) > addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1") > mainClass in assembly := Some("RDDApp") > test in assembly := {} > {code} > To reproduce it: > Just > {code} > spark-2.1.0-bin-hadoop2.7/bin/spark-submit --driver-memory 4G \ > --executor-memory 4G \ > --executor-cores 1 \ > --num-executors 1 \ > --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012195#comment-16012195 ] Sean Owen commented on SPARK-20760: --- I can't reproduce that locally. Watch the block count; it goes down over time. There may also be something do with the event bus getting clogged. This isn't a common usage pattern, but possible. You'd probably have to look into just how the blocks are unpersisted and see if you can come up with a theory of the problem. > Memory Leak of RDD blocks > -- > > Key: SPARK-20760 > URL: https://issues.apache.org/jira/browse/SPARK-20760 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 >Reporter: Binzi Cao > Attachments: RDD Blocks .png > > > Memory leak for RDD blocks for a long time running rdd process. > We have a long term running application, which is doing computations of > RDDs. and we found the RDD blocks are keep increasing in the spark ui page. > The rdd blocks and memory usage do not mach the cached rdds and memory. It > looks like spark keeps old rdd in memory and never released it or never got a > chance to release it. The job will eventually die of out of memory. > In addition, I'm not seeing this issue in spark 1.6. We are seeing the same > issue in Yarn Cluster mode both in kafka streaming and batch applications. > The issue in streaming is similar, however, it seems the rdd blocks grows a > bit slower than batch jobs. > The below is the sample code and it is reproducible by justing running it in > local mode. > Scala file: > {code} > import scala.concurrent.duration.Duration > import scala.util.{Try, Failure, Success} > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.rdd.RDD > import scala.concurrent._ > import ExecutionContext.Implicits.global > case class Person(id: String, name: String) > object RDDApp { > def run(sc: SparkContext) = { > while (true) { > val r = scala.util.Random > val data = (1 to r.nextInt(100)).toList.map { a => > Person(a.toString, a.toString) > } > val rdd = sc.parallelize(data) > rdd.cache > println("running") > val a = (1 to 100).toList.map { x => > Future(rdd.filter(_.id == x.toString).collect) > } > a.foreach { f => > println(Await.ready(f, Duration.Inf).value.get) > } > rdd.unpersist() > } > } > def main(args: Array[String]): Unit = { >val conf = new SparkConf().setAppName("test") > val sc = new SparkContext(conf) > run(sc) > } > } > {code} > build sbt file: > {code} > name := "RDDTest" > version := "0.1.1" > scalaVersion := "2.11.5" > libraryDependencies ++= Seq ( > "org.scalaz" %% "scalaz-core" % "7.2.0", > "org.scalaz" %% "scalaz-concurrent" % "7.2.0", > "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided", > "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided" > ) > addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1") > mainClass in assembly := Some("RDDApp") > test in assembly := {} > {code} > To reproduce it: > Just > {code} > spark-2.1.0-bin-hadoop2.7/bin/spark-submit --driver-memory 4G \ > --executor-memory 4G \ > --executor-cores 1 \ > --num-executors 1 \ > --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012191#comment-16012191 ] Binzi Cao commented on SPARK-20760: --- More information, just did a quick test by removing the Future/Await lines, the issue is still happening. RDD blocks just increased to over 14 thousands after 8 minutes, it is still increasing. Code as below: {code} def run(sc: SparkContext) = { while (true) { val r = scala.util.Random val data = (1 to r.nextInt(100)).toList.map { a => Person(a.toString, a.toString) } val rdd = sc.parallelize(data) rdd.cache println("running") val a = (1 to 1000).toList.map { x => rdd.filter(_.id == x.toString).collect } rdd.unpersist() } } {code} > Memory Leak of RDD blocks > -- > > Key: SPARK-20760 > URL: https://issues.apache.org/jira/browse/SPARK-20760 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 >Reporter: Binzi Cao > Attachments: RDD Blocks .png > > > Memory leak for RDD blocks for a long time running rdd process. > We have a long term running application, which is doing computations of > RDDs. and we found the RDD blocks are keep increasing in the spark ui page. > The rdd blocks and memory usage do not mach the cached rdds and memory. It > looks like spark keeps old rdd in memory and never released it or never got a > chance to release it. The job will eventually die of out of memory. > In addition, I'm not seeing this issue in spark 1.6. We are seeing the same > issue in Yarn Cluster mode both in kafka streaming and batch applications. > The issue in streaming is similar, however, it seems the rdd blocks grows a > bit slower than batch jobs. > The below is the sample code and it is reproducible by justing running it in > local mode. > Scala file: > {code} > import scala.concurrent.duration.Duration > import scala.util.{Try, Failure, Success} > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.rdd.RDD > import scala.concurrent._ > import ExecutionContext.Implicits.global > case class Person(id: String, name: String) > object RDDApp { > def run(sc: SparkContext) = { > while (true) { > val r = scala.util.Random > val data = (1 to r.nextInt(100)).toList.map { a => > Person(a.toString, a.toString) > } > val rdd = sc.parallelize(data) > rdd.cache > println("running") > val a = (1 to 100).toList.map { x => > Future(rdd.filter(_.id == x.toString).collect) > } > a.foreach { f => > println(Await.ready(f, Duration.Inf).value.get) > } > rdd.unpersist() > } > } > def main(args: Array[String]): Unit = { >val conf = new SparkConf().setAppName("test") > val sc = new SparkContext(conf) > run(sc) > } > } > {code} > build sbt file: > {code} > name := "RDDTest" > version := "0.1.1" > scalaVersion := "2.11.5" > libraryDependencies ++= Seq ( > "org.scalaz" %% "scalaz-core" % "7.2.0", > "org.scalaz" %% "scalaz-concurrent" % "7.2.0", > "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided", > "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided" > ) > addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1") > mainClass in assembly := Some("RDDApp") > test in assembly := {} > {code} > To reproduce it: > Just > {code} > spark-2.1.0-bin-hadoop2.7/bin/spark-submit --driver-memory 4G \ > --executor-memory 4G \ > --executor-cores 1 \ > --num-executors 1 \ > --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012176#comment-16012176 ] Binzi Cao commented on SPARK-20760: --- Ah, I see. I did something like rdd.count to trigger the cache first before the computations, however, that did not work. Is there a workaround for this? > Memory Leak of RDD blocks > -- > > Key: SPARK-20760 > URL: https://issues.apache.org/jira/browse/SPARK-20760 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 >Reporter: Binzi Cao > Attachments: RDD Blocks .png > > > Memory leak for RDD blocks for a long time running rdd process. > We have a long term running application, which is doing computations of > RDDs. and we found the RDD blocks are keep increasing in the spark ui page. > The rdd blocks and memory usage do not mach the cached rdds and memory. It > looks like spark keeps old rdd in memory and never released it or never got a > chance to release it. The job will eventually die of out of memory. > In addition, I'm not seeing this issue in spark 1.6. We are seeing the same > issue in Yarn Cluster mode both in kafka streaming and batch applications. > The issue in streaming is similar, however, it seems the rdd blocks grows a > bit slower than batch jobs. > The below is the sample code and it is reproducible by justing running it in > local mode. > Scala file: > {code} > import scala.concurrent.duration.Duration > import scala.util.{Try, Failure, Success} > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.rdd.RDD > import scala.concurrent._ > import ExecutionContext.Implicits.global > case class Person(id: String, name: String) > object RDDApp { > def run(sc: SparkContext) = { > while (true) { > val r = scala.util.Random > val data = (1 to r.nextInt(100)).toList.map { a => > Person(a.toString, a.toString) > } > val rdd = sc.parallelize(data) > rdd.cache > println("running") > val a = (1 to 100).toList.map { x => > Future(rdd.filter(_.id == x.toString).collect) > } > a.foreach { f => > println(Await.ready(f, Duration.Inf).value.get) > } > rdd.unpersist() > } > } > def main(args: Array[String]): Unit = { >val conf = new SparkConf().setAppName("test") > val sc = new SparkContext(conf) > run(sc) > } > } > {code} > build sbt file: > {code} > name := "RDDTest" > version := "0.1.1" > scalaVersion := "2.11.5" > libraryDependencies ++= Seq ( > "org.scalaz" %% "scalaz-core" % "7.2.0", > "org.scalaz" %% "scalaz-concurrent" % "7.2.0", > "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided", > "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided" > ) > addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1") > mainClass in assembly := Some("RDDApp") > test in assembly := {} > {code} > To reproduce it: > Just > {code} > spark-2.1.0-bin-hadoop2.7/bin/spark-submit --driver-memory 4G \ > --executor-memory 4G \ > --executor-cores 1 \ > --num-executors 1 \ > --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012169#comment-16012169 ] Sean Owen commented on SPARK-20760: --- I'm not saying different threads unpersist, but that multiple threads all encounter the RDD at once with uncached partitions and need to cache them. > Memory Leak of RDD blocks > -- > > Key: SPARK-20760 > URL: https://issues.apache.org/jira/browse/SPARK-20760 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 >Reporter: Binzi Cao > Attachments: RDD Blocks .png > > > Memory leak for RDD blocks for a long time running rdd process. > We have a long term running application, which is doing computations of > RDDs. and we found the RDD blocks are keep increasing in the spark ui page. > The rdd blocks and memory usage do not mach the cached rdds and memory. It > looks like spark keeps old rdd in memory and never released it or never got a > chance to release it. The job will eventually die of out of memory. > In addition, I'm not seeing this issue in spark 1.6. We are seeing the same > issue in Yarn Cluster mode both in kafka streaming and batch applications. > The issue in streaming is similar, however, it seems the rdd blocks grows a > bit slower than batch jobs. > The below is the sample code and it is reproducible by justing running it in > local mode. > Scala file: > {code} > import scala.concurrent.duration.Duration > import scala.util.{Try, Failure, Success} > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.rdd.RDD > import scala.concurrent._ > import ExecutionContext.Implicits.global > case class Person(id: String, name: String) > object RDDApp { > def run(sc: SparkContext) = { > while (true) { > val r = scala.util.Random > val data = (1 to r.nextInt(100)).toList.map { a => > Person(a.toString, a.toString) > } > val rdd = sc.parallelize(data) > rdd.cache > println("running") > val a = (1 to 100).toList.map { x => > Future(rdd.filter(_.id == x.toString).collect) > } > a.foreach { f => > println(Await.ready(f, Duration.Inf).value.get) > } > rdd.unpersist() > } > } > def main(args: Array[String]): Unit = { >val conf = new SparkConf().setAppName("test") > val sc = new SparkContext(conf) > run(sc) > } > } > {code} > build sbt file: > {code} > name := "RDDTest" > version := "0.1.1" > scalaVersion := "2.11.5" > libraryDependencies ++= Seq ( > "org.scalaz" %% "scalaz-core" % "7.2.0", > "org.scalaz" %% "scalaz-concurrent" % "7.2.0", > "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided", > "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided" > ) > addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1") > mainClass in assembly := Some("RDDApp") > test in assembly := {} > {code} > To reproduce it: > Just > {code} > spark-2.1.0-bin-hadoop2.7/bin/spark-submit --driver-memory 4G \ > --executor-memory 4G \ > --executor-cores 1 \ > --num-executors 1 \ > --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012161#comment-16012161 ] Binzi Cao commented on SPARK-20760: --- [~srowen] Thank you for the update. Some more informations 1. The issue did not happen in spark 1.6. 2. The rdd cache and unpersist are in the same thread, only the computations are in other threads. The future/await is just to simulate a production application. The unpersist call is after all the computations done, so there are no multiple cache/unpersist running in multiple threads 3. Each cached and unpersist are for different rdds with different ids, there are not multithread running to cache or unpersist same rdd. > Memory Leak of RDD blocks > -- > > Key: SPARK-20760 > URL: https://issues.apache.org/jira/browse/SPARK-20760 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 >Reporter: Binzi Cao > Attachments: RDD Blocks .png > > > Memory leak for RDD blocks for a long time running rdd process. > We have a long term running application, which is doing computations of > RDDs. and we found the RDD blocks are keep increasing in the spark ui page. > The rdd blocks and memory usage do not mach the cached rdds and memory. It > looks like spark keeps old rdd in memory and never released it or never got a > chance to release it. The job will eventually die of out of memory. > In addition, I'm not seeing this issue in spark 1.6. We are seeing the same > issue in Yarn Cluster mode both in kafka streaming and batch applications. > The issue in streaming is similar, however, it seems the rdd blocks grows a > bit slower than batch jobs. > The below is the sample code and it is reproducible by justing running it in > local mode. > Scala file: > {code} > import scala.concurrent.duration.Duration > import scala.util.{Try, Failure, Success} > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.rdd.RDD > import scala.concurrent._ > import ExecutionContext.Implicits.global > case class Person(id: String, name: String) > object RDDApp { > def run(sc: SparkContext) = { > while (true) { > val r = scala.util.Random > val data = (1 to r.nextInt(100)).toList.map { a => > Person(a.toString, a.toString) > } > val rdd = sc.parallelize(data) > rdd.cache > println("running") > val a = (1 to 100).toList.map { x => > Future(rdd.filter(_.id == x.toString).collect) > } > a.foreach { f => > println(Await.ready(f, Duration.Inf).value.get) > } > rdd.unpersist() > } > } > def main(args: Array[String]): Unit = { >val conf = new SparkConf().setAppName("test") > val sc = new SparkContext(conf) > run(sc) > } > } > {code} > build sbt file: > {code} > name := "RDDTest" > version := "0.1.1" > scalaVersion := "2.11.5" > libraryDependencies ++= Seq ( > "org.scalaz" %% "scalaz-core" % "7.2.0", > "org.scalaz" %% "scalaz-concurrent" % "7.2.0", > "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided", > "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided" > ) > addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1") > mainClass in assembly := Some("RDDApp") > test in assembly := {} > {code} > To reproduce it: > Just > {code} > spark-2.1.0-bin-hadoop2.7/bin/spark-submit --driver-memory 4G \ > --executor-memory 4G \ > --executor-cores 1 \ > --num-executors 1 \ > --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org