Hi, I'm working on a Spark job that frequently iterates over huge RDDs and matches their elements against some Maps that easily fit into memory. So what I do is broadcast such a Map and reference it from my RDD.
Works like a charm, until at some point it doesn't, and I can't figure out why. Please have a look at this:

    def fun(sc: SparkContext, someRDD: RDD[String], someMap: RDD[(String, Double)]) = {
      // I want to access the Map multiple times, so I broadcast it
      val broadcast = sc.broadcast(someMap.collectAsMap())

      // the next line creates one job per element and executes collectAsMap() over and over again
      println(someRDD.take(100).map(s => broadcast.value.getOrElse(s, 0.0)).toList.mkString("\n"))

      // the next line creates a new spark context and crashes (only one spark context per JVM...)
      println(someRDD.map(s => broadcast.value.getOrElse(s, 0.0)).collect().mkString("\n"))
    }

Here I'm doing just what I described above: broadcasting a Map and accessing the broadcast value while iterating over another RDD.

Now when I take a subset of the RDD (`take(100)`), Spark creates one job per ELEMENT (that's 100 jobs), and `collectAsMap()` is executed over and over again. Obviously this takes quite a lot of time (~500 ms per element). And when I actually map over the entire RDD, Spark tries to launch a second SparkContext and crashes the whole application:

    org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 37.0
    failed 1 times, most recent failure: Lost task 2.0 in stage 37.0 (TID 106, localhost):
    org.apache.spark.SparkException: Only one SparkContext may be running in this JVM
    (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true.

I couldn't reproduce this error in a minimal working example, so there must be something in my pipeline that is messing things up. The error is 100% reproducible in my environment, and the application runs fine as soon as I don't access this specific Map from this specific RDD.

Any idea what might cause this problem? Is there any other information I can provide (besides posting >500 lines of code)?
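For what it's worth, one classic way a broadcast Map ends up silently recomputing work on every lookup is broadcasting a lazy view instead of a materialized Map: Scala's `Map#mapValues` returns a view whose function is re-run on each access, and serializing such a view drags its whole closure along with it. Whether that is what happens in the pipeline above is only a guess; the names below (`base`, `evaluations`) are made up for illustration:

```scala
object LazyViewPitfall {
  def main(args: Array[String]): Unit = {
    var evaluations = 0
    val base = Map("a" -> 1, "b" -> 2)

    // mapValues returns a lazy view: the function runs again on EVERY lookup
    val lazyView = base.mapValues { v => evaluations += 1; v * 2.0 }
    lazyView.get("a")
    lazyView.get("a")
    println(s"after two lookups into the view: evaluations = $evaluations") // re-run per access

    // forcing a strict Map first means no user function runs at lookup time
    val strict = base.map { case (k, v) => (k, v * 2.0) }
    evaluations = 0
    strict.get("a")
    strict.get("a")
    println(s"after two lookups into the strict map: evaluations = $evaluations")
  }
}
```

If something like this is in play, materializing the Map (e.g. with `.map { case (k, v) => ... }` or `.toMap` on a concrete collection) before calling `sc.broadcast` would be the fix.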
cheers
Sebastian

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Access-a-Broadcast-variable-causes-Spark-to-launch-a-second-context-tp24595.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.