Richard,

That's exactly the strategy I've been trying: a wrapper singleton class. But I was seeing the inner object get created multiple times.
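For reference, the wrapper I've been trying looks roughly like this (a minimal sketch; ThirdPartyClient is a stand-in for the actual non-serializable 3rd party class):

    public class ThirdPartyHolder {
        // Static, so in theory one instance per JVM, i.e. per executor.
        private static ThirdPartyClient instance;

        private ThirdPartyHolder() {}

        public static synchronized ThirdPartyClient get() {
            if (instance == null) {
                // Expensive: loads the static dictionary/map state.
                instance = new ThirdPartyClient();
            }
            return instance;
        }
    }

The workers only ever touch it through the static get(), so the 3rd party object itself never has to be serialized. Yet I'm still seeing multiple instantiations.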
I wonder if the problem has to do with the way I'm processing the RDDs. I'm using JavaDStream to stream data (from Kafka), then processing the RDDs like so:

    JavaPairInputDStream<String, String> messages =
        KafkaUtils.createDirectStream(...);
    JavaDStream<String> messageBodies = messages.map(...);
    messageBodies.foreachRDD(new MyFunction());

where MyFunction implements Function<JavaRDD<String>, Void> and, inside its call method, does some rdd.map / rdd.filter transformations followed by rdd.foreach(... perform final action ...).

Perhaps the multiple singletons I'm seeing are the per-executor instances? Judging by the streaming programming guide, perhaps I should follow the connection-sharing example:

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        val connection = createNewConnection()
        partitionOfRecords.foreach(record => connection.send(record))
        connection.close()
      }
    }

So I'd pre-create my singletons in the foreachPartition call, which would let them be per-JVM singletons, and pass them into MyFunction, which would now be a partition-processing function rather than an RDD-processing function (rough sketch at the end of this message). I wonder whether these singletons would still be created on every call as the master sends RDD data over to the workers? I also wonder whether using foreachPartition would be more efficient anyway, and whether it would prevent some of the over-the-network data shuffling that I imagine may happen with just a foreachRDD?

On Tue, Jul 7, 2015 at 11:27 AM, Richard Marscher <rmarsc...@localytics.com> wrote:

> Would it be possible to have a wrapper class that just represents a
> reference to a singleton holding the 3rd party object? It could proxy
> calls over to the singleton object, which would instantiate a private
> instance of the 3rd party object lazily. I think something like this
> might work if the workers have the singleton object in their classpath.
>
> Here's a rough sketch of what I was thinking:
>
>     object ThirdPartySingleton {
>       private lazy val thirdPartyObj = ...
>
>       def someProxyFunction() = thirdPartyObj.someFunction()
>     }
>
>     class ThirdPartyReference extends Serializable {
>       def someProxyFunction() = ThirdPartySingleton.someProxyFunction()
>     }
>
> Also found this SO post:
> http://stackoverflow.com/questions/26369916/what-is-the-right-way-to-have-a-static-object-on-all-workers
>
>
> On Tue, Jul 7, 2015 at 11:04 AM, dgoldenberg <dgoldenberg...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am seeing a lot of posts on singletons vs. broadcast variables, such as:
>>
>> * http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-have-some-singleton-per-worker-tt20277.html
>> * http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-a-NonSerializable-variable-among-tasks-in-the-same-worker-node-tt11048.html#a21219
>>
>> What's the best approach to instantiating an object once and having it
>> be reused by the worker(s)?
>>
>> E.g. I have an object that loads some static state, such as a
>> dictionary/map, is part of a 3rd party API, and is not serializable. I
>> can't seem to get it to be a singleton on the worker side, as the JVM
>> appears to be wiped on every request, so I get a new instance and the
>> singleton doesn't stick.
>>
>> Is there an approach where I could have this object, or a wrapper of
>> it, be a broadcast var? Can Kryo get me there? Would that basically
>> mean writing a custom serializer? However, the 3rd party object may
>> have a bunch of member vars hanging off it, so serializing it properly
>> may be non-trivial...
>>
>> Any pointers/hints greatly appreciated.
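For concreteness, here's a rough Java sketch of the foreachPartition variant I have in mind (ThirdPartyHolder is the illustrative wrapper from above; process() is a made-up stand-in for the final action):

    // Needs java.util.Iterator, org.apache.spark.api.java.JavaRDD, and the
    // Function/VoidFunction interfaces from org.apache.spark.api.java.function.
    messageBodies.foreachRDD(new Function<JavaRDD<String>, Void>() {
        @Override
        public Void call(JavaRDD<String> rdd) {
            rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
                @Override
                public void call(Iterator<String> records) {
                    // Runs on the executor: the holder should resolve to the
                    // per-JVM instance, so the 3rd party object gets created
                    // at most once per executor rather than once per record.
                    ThirdPartyClient client = ThirdPartyHolder.get();
                    while (records.hasNext()) {
                        client.process(records.next()); // hypothetical final action
                    }
                }
            });
            return null;
        }
    });

If that holds up, the setup cost would amortize across all the records in a partition, which is the same reasoning the programming guide uses for its connection-pooling example.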