Hi all, I'm having some issues finding any kind of best practices when attempting to create Spark applications which launch jobs from a thread pool.
Initially I had issues passing the SparkContext to other threads as it is not serializable. Eventually I found that adding the @transient annotation prevents a NotSerializableException. ``` class SecondRollup(@transient sc: SparkContext, connector: CassandraConnector, scanPoint: DateTime) extends Runnable with Serializable { ... } ``` However, now I am running into a different exception: ``` 15/06/05 11:35:32 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NoSuchMethodError: org.apache.spark.executor.TaskMetrics.inputMetrics_$eq(Lscala/Option;)V at com.datastax.spark.connector.metrics.InputMetricsUpdater$.apply(InputMetricsUpdater.scala:61) at com.datastax.spark.connector.rdd.CassandraTableScanRDD.compute(CassandraTableScanRDD.scala:196) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70) at org.apache.spark.rdd.RDD.iterator(RDD.scala:242) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ``` The documentation (https://spark.apache.org/docs/latest/job-scheduling.html) explicitly states that jobs can be submitted by multiple threads but I seem to be doing *something* incorrectly and haven't found any docs to point me in the right direction. Does anyone have any advice on how to get jobs submitted by multiple threads? The jobs are fairly simple and work when I run them serially, so I'm not exactly sure what I'm doing wrong. Thanks, Lee