[ https://issues.apache.org/jira/browse/SPARK-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135036#comment-15135036 ]
leo wu commented on SPARK-13198: -------------------------------- Hi, Sean I am trying to do this similar in an IPython/Jupyter notebook by stopping a sparkContext and then start a new one with new sparkconf over a remote Spark standalone cluster, instead of local master which is originally initialized, like : import sys from random import random from operator import add import atexit import os import platform import py4j import pyspark from pyspark import SparkContext, SparkConf from pyspark.sql import SQLContext, HiveContext from pyspark.storagelevel import StorageLevel os.environ["SPARK_HOME"] = "/home/notebook/spark-1.6.0-bin-hadoop2.6" os.environ["PYSPARK_SUBMIT_ARGS"] = "--master spark://10.115.89.219:7077" os.environ["SPARK_LOCAL_HOSTNAME"] = "wzymaster2011" SparkContext.setSystemProperty("spark.master", "spark://10.115.89.219:7077") SparkContext.setSystemProperty("spark.cores.max", "4") SparkContext.setSystemProperty("spark.driver.host", "wzymaster2011") SparkContext.setSystemProperty("spark.driver.port", "9000") SparkContext.setSystemProperty("spark.blockManager.port", "9001") SparkContext.setSystemProperty("spark.fileserver.port", "9002") conf = SparkConf().setAppName("Leo-Python-Test") sc = SparkContext(conf=conf) However, I always get error on executor due to failing to load BlockManager info from driver but "localhost" , instead of setting in "spark.driver.host" like "wzymaster2011": 16/02/05 14:37:32 DEBUG BlockManager: Getting remote block broadcast_0_piece0 from BlockManagerId(driver, localhost, 9002) 16/02/05 14:37:32 DEBUG TransportClientFactory: Creating new connection to localhost/127.0.0.1:9002 16/02/05 14:37:32 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks java.io.IOException: Failed to connect to localhost/127.0.0.1:9002 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167) at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90) at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) So, I strongly suspect if there is a bug in SparkContext.stop() to clean up all data and resetting SparkConf() doesn't work well within one app. Please advise it. Millions of thanks > sc.stop() does not clean up on driver, causes Java heap OOM. > ------------------------------------------------------------ > > Key: SPARK-13198 > URL: https://issues.apache.org/jira/browse/SPARK-13198 > Project: Spark > Issue Type: Bug > Components: Mesos > Affects Versions: 1.6.0 > Reporter: Herman Schistad > Attachments: Screen Shot 2016-02-04 at 16.31.28.png, Screen Shot > 2016-02-04 at 16.31.40.png, Screen Shot 2016-02-04 at 16.31.51.png > > > When starting and stopping multiple SparkContext's linearly eventually the > driver stops working with a "io.netty.handler.codec.EncoderException: > java.lang.OutOfMemoryError: Java heap space" error. > Reproduce by running the following code and loading in ~7MB parquet data each > time. The driver heap space is not changed and thus defaults to 1GB: > {code:java} > def main(args: Array[String]) { > val conf = new SparkConf().setMaster("MASTER_URL").setAppName("") > conf.set("spark.mesos.coarse", "true") > conf.set("spark.cores.max", "10") > for (i <- 1 until 100) { > val sc = new SparkContext(conf) > val sqlContext = new SQLContext(sc) > val events = sqlContext.read.parquet("hdfs://locahost/tmp/something") > println(s"Context ($i), number of events: " + events.count) > sc.stop() > } > } > {code} > The heap space fills up within 20 loops on my cluster. Increasing the number > of cores to 50 in the above example results in heap space error after 12 > contexts. > Dumping the heap reveals many equally sized "CoarseMesosSchedulerBackend" > objects (see attachments). Digging into the inner objects tells me that the > `executorDataMap` is where 99% of the data in said object is stored. I do > believe though that this is beside the point as I'd expect this whole object > to be garbage collected or freed on sc.stop(). > Additionally I can see in the Spark web UI that each time a new context is > created the number of the "SQL" tab increments by one (i.e. last iteration > would have SQL99). After doing stop and creating a completely new context I > was expecting this number to be reset to 1 ("SQL"). > I'm submitting the jar file with `spark-submit` and no special flags. The > cluster is running Mesos 0.23. I'm running Spark 1.6.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org