Hi all,

I'm having trouble finding any best practices for writing Spark applications
that launch jobs from a thread pool.

Initially I had issues passing the SparkContext to other threads as it is
not serializable.  Eventually I found that adding the @transient annotation
prevents a NotSerializableException.

```
import org.apache.spark.SparkContext
import com.datastax.spark.connector.cql.CassandraConnector
import org.joda.time.DateTime

class SecondRollup(@transient sc: SparkContext,
                   connector: CassandraConnector,
                   scanPoint: DateTime) extends Runnable with Serializable {
  ...
}
```
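For reference, the driver side looks roughly like this (a simplified sketch; the pool size, scan points, SparkConf setup, and object/method names here are stand-ins rather than my exact code):

```
import java.util.concurrent.{ExecutorService, Executors, TimeUnit}

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector.cql.CassandraConnector
import org.joda.time.DateTime

object RollupDriver {
  def main(args: Array[String]): Unit = {
    // One SparkContext shared by all worker threads.
    val sc = new SparkContext(new SparkConf().setAppName("rollups"))
    // Reuse the Spark conf for the Cassandra connection settings.
    val connector = CassandraConnector(sc.getConf)

    // Fixed-size pool; each Runnable submits its own Spark job(s)
    // from its worker thread.
    val pool: ExecutorService = Executors.newFixedThreadPool(4)
    val scanPoints = (0 until 24).map(h => new DateTime().withHourOfDay(h))
    scanPoints.foreach(p => pool.submit(new SecondRollup(sc, connector, p)))

    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.HOURS)
    sc.stop()
  }
}
```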

However, now I am running into a different exception:

```
15/06/05 11:35:32 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NoSuchMethodError: org.apache.spark.executor.TaskMetrics.inputMetrics_$eq(Lscala/Option;)V
  at com.datastax.spark.connector.metrics.InputMetricsUpdater$.apply(InputMetricsUpdater.scala:61)
  at com.datastax.spark.connector.rdd.CassandraTableScanRDD.compute(CassandraTableScanRDD.scala:196)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
  at org.apache.spark.scheduler.Task.run(Task.scala:64)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
```

The documentation (https://spark.apache.org/docs/latest/job-scheduling.html)
explicitly states that jobs can be submitted by multiple threads, but I seem
to be doing *something* incorrectly and haven't found any docs to point me
in the right direction.

Does anyone have any advice on how to submit jobs from multiple threads?
The jobs are fairly simple and work when I run them serially, so I'm not
sure what I'm doing wrong.

Thanks,

Lee
