Hi, I'm working on a Spark job that frequently iterates over huge RDDs and matches their elements against some Maps that easily fit into memory. So what I do is broadcast such a Map and reference it from my RDD.
Works like a charm, until at some point it doesn't, and I can't figure out why. Please have a look at this:

    def fun(sc: SparkContext, someRDD: RDD[String], someMap: RDD[(String, Double)]) = {
      // I want to access the Map multiple times, so I broadcast it
      val broadcast = sc.broadcast(someMap.collectAsMap())

      // the next line creates one job per element and executes collectAsMap() over and over again
      println(someRDD.take(100).map(s => broadcast.value.getOrElse(s, 0.0)).toList.mkString("\n"))

      // the next line creates a new spark context and crashes (only one spark context per JVM...)
      println(someRDD.map(s => broadcast.value.getOrElse(s, 0.0)).collect().mkString("\n"))
    }

Here I'm doing just what I described above: broadcasting a Map and accessing the broadcast value while iterating over another RDD.

Now when I take a subset of the RDD (`take(100)`), Spark creates one job per ELEMENT (that's 100 jobs), and `collectAsMap()` is executed over and over again. Obviously this takes quite a lot of time (~500 ms per element). And when I actually map over the entire RDD, Spark tries to launch a second SparkContext and crashes the whole application:

    org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 37.0
    failed 1 times, most recent failure: Lost task 2.0 in stage 37.0 (TID 106, localhost):
    org.apache.spark.SparkException: Only one SparkContext may be running in this JVM
    (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true.

I couldn't reproduce this error in a minimal working example, so there must be something in my pipeline that is messing things up. The error is 100% reproducible in my environment, and the application runs fine as soon as I don't access this specific Map from this specific RDD.

Any idea what might cause this problem? Is there any other information I can provide (besides posting >500 lines of code)?
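For what it's worth, one classic way a broadcast Map ends up silently recomputing work on every lookup is broadcasting a lazy view instead of a materialized Map: Scala's `Map#mapValues` returns a view whose function is re-run on each access, and serializing such a view drags its whole closure along with it. Whether that is what happens in the pipeline above is only a guess; the names below (`base`, `evaluations`) are made up for illustration:

```scala
object LazyViewPitfall {
  def main(args: Array[String]): Unit = {
    var evaluations = 0
    val base = Map("a" -> 1, "b" -> 2)

    // mapValues returns a lazy view: the function runs again on EVERY lookup
    val lazyView = base.mapValues { v => evaluations += 1; v * 2.0 }
    lazyView.get("a")
    lazyView.get("a")
    println(s"after two lookups into the view: evaluations = $evaluations") // re-run per access

    // forcing a strict Map first means no user function runs at lookup time
    val strict = base.map { case (k, v) => (k, v * 2.0) }
    evaluations = 0
    strict.get("a")
    strict.get("a")
    println(s"after two lookups into the strict map: evaluations = $evaluations")
  }
}
```

If something like this is in play, materializing the Map (e.g. with `.map { case (k, v) => ... }` or `.toMap` on a concrete collection) before calling `sc.broadcast` would be the fix.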
cheers
Sebastian

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Access-a-Broadcast-variable-causes-Spark-to-launch-a-second-context-tp24595.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.