I'm running a Jupyter-Spark setup and I want to benchmark my cluster with different input parameters. To make sure the environment stays the same across runs, I'm trying to reset (restart) the SparkContext between iterations. Here is some code:
```python
import os
import shutil
import time

import pyspark

temp_result_parquet = os.path.normpath('/home/spark_tmp_parquet')
i = 0
while i < max_i:
    i += 1
    if os.path.exists(temp_result_parquet):
        shutil.rmtree(temp_result_parquet)  # I know I could simply overwrite the parquet
    My_DF = do_something(i)
    My_DF.write.parquet(temp_result_parquet)

    sc.stop()
    time.sleep(10)
    sc = pyspark.SparkContext(master='spark://ip:here', appName='PySparkShell')
```

The first iteration runs fine, but in the second I get the following error:

```
Py4JJavaError: An error occurred while calling o1876.parquet.
: org.apache.spark.SparkException: Job aborted.
[...]
Caused by: java.lang.IllegalStateException: SparkContext has been shutdown
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2014)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:188)
```

I tried running the code without the SparkContext restart, but that leads to memory issues, so I restart the context to wipe the slate clean before every iteration. The strange result is that the parquet write still "thinks" the SparkContext is down.
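For reference, one variation I've been considering (not what produced the error above) is to create the SparkContext *and* a matching SparkSession at the top of each iteration and stop them only after the write has finished, so that nothing still points at a dead context. This is just a rough sketch: `max_i`, `do_something(i)`, and the master URL are the same placeholders as above, the per-iteration `spark = SparkSession(sc)` is my own assumption, and it only works if `do_something(i)` actually builds its DataFrame from the session created in that iteration rather than from one cached elsewhere.

```python
import os
import shutil
import time

import pyspark
from pyspark.sql import SparkSession

temp_result_parquet = os.path.normpath('/home/spark_tmp_parquet')

for i in range(1, max_i + 1):
    # Fresh context *and* a fresh SparkSession per iteration, so every
    # DataFrame built below is bound to a context that is still alive.
    sc = pyspark.SparkContext(master='spark://ip:here', appName='PySparkShell')
    spark = SparkSession(sc)

    if os.path.exists(temp_result_parquet):
        shutil.rmtree(temp_result_parquet)

    # Assumption: do_something(i) uses the `spark`/`sc` created in this
    # iteration, not a session cached from a previous one.
    My_DF = do_something(i)
    My_DF.write.parquet(temp_result_parquet)

    # Tear down only after the write has completed, then give the cluster
    # a moment before the next iteration starts.
    sc.stop()
    time.sleep(10)
```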