Considering the memory footprint of param, as mentioned by Dmitry, option (b) seems better.
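Option (b) could be sketched as a lazily initialized holder that lives once per executor JVM. This is only an illustration, not code from the thread: `Param` here is a hypothetical stand-in for the real class, and the holder is plain Java so it demonstrates the pattern without a Spark dependency. On a cluster, each executor JVM would build its own instance the first time a task calls `get()`, so the large object is never serialized with the closure.

```java
public class Main {
    // Hypothetical stand-in for the thread's Param class.
    static class Param {
        boolean initialized;
        void initialize() { initialized = true; } // e.g. load params from a file on HDFS
    }

    // One instance per JVM, created lazily on first use.
    private static volatile Param instance;

    static Param get() {
        if (instance == null) {                 // fast path: already initialized
            synchronized (Main.class) {
                if (instance == null) {         // double-checked locking
                    Param p = new Param();
                    p.initialize();
                    instance = p;
                }
            }
        }
        return instance;
    }

    public static void main(String[] args) {
        Param a = get();
        Param b = get();
        System.out.println(a == b);        // same instance within this JVM
        System.out.println(a.initialized); // initialize() ran exactly once
    }
}
```

Inside a `foreachPartition` function one would call `Main.get()` (or the equivalent holder) instead of capturing `param` from the enclosing scope, so nothing heavyweight rides along in the task closure.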
Cheers

On Wed, Jun 3, 2015 at 6:27 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:

> Hmm, Spark Streaming app code doesn't execute in the linear fashion
> assumed in your previous code snippet - to achieve your objectives you
> should do something like the following.
>
> In terms of your second objective - saving the initialization and
> serialization of the params - you can:
>
> a) broadcast them
> b) have them as a Singleton (initialized from e.g. params in a file on
>    HDFS) on each Executor
>
> messageBodies.foreachRDD(new Function<JavaRDD<String>, Void>() {
>
>     Param param = new Param();
>     { param.initialize(); }
>
>     @Override
>     public Void call(JavaRDD<String> rdd) throws Exception {
>         ProcessPartitionFunction func = new ProcessPartitionFunction(param);
>         rdd.foreachPartition(func);
>         return null;
>     }
> });
>
> // put this in e.g. the object destructor
> param.deinitialize();
>
> -----Original Message-----
> From: dgoldenberg [mailto:dgoldenberg...@gmail.com]
> Sent: Wednesday, June 3, 2015 1:56 PM
> To: user@spark.apache.org
> Subject: Objects serialized before foreachRDD/foreachPartition ?
>
> I'm looking at https://spark.apache.org/docs/latest/tuning.html. Basically
> the takeaway is that all objects passed into the code processing RDDs must
> be serializable. So if I've got a few objects that I'd rather initialize
> once and deinitialize once outside of the logic processing the RDDs, it
> would seem I need to think twice about the costs of serializing such
> objects.
>
> In the below, does the Spark serialization happen before calling
> foreachRDD or before calling foreachPartition?
>
> Param param = new Param();
> param.initialize();
>
> messageBodies.foreachRDD(new Function<JavaRDD<String>, Void>() {
>     @Override
>     public Void call(JavaRDD<String> rdd) throws Exception {
>         ProcessPartitionFunction func = new ProcessPartitionFunction(param);
>         rdd.foreachPartition(func);
>         return null;
>     }
> });
>
> param.deinitialize();
>
> If param gets initialized to a significant memory footprint, are we better
> off creating/initializing it before calling new ProcessPartitionFunction()
> or perhaps in the 'call' method within that function?
>
> I'm trying to avoid calling expensive init()/deinit() methods while
> balancing against the serialization costs. Thanks.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Objects-serialized-before-foreachRDD-foreachPartition-tp23134.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
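On the timing question: the function passed to foreachRDD runs on the driver each batch; it is the function handed to foreachPartition - together with everything it captures, such as param - that gets serialized and shipped to the executors per task. A quick, Spark-free way to gauge that per-task cost is to measure the object's size under plain Java serialization. This is only a sketch; `ParamLike` is a hypothetical stand-in for the real Param, and Spark's actual framing (and Kryo, if enabled) will differ somewhat in overhead.

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class Main {
    // Hypothetical stand-in for a Param with a noticeable footprint.
    static class ParamLike implements Serializable {
        double[] table = new double[100_000]; // roughly 800 KB of payload
    }

    // Serialize the object into memory and return the byte count.
    static int serializedSize(Serializable obj) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.size();
    }

    public static void main(String[] args) throws Exception {
        int size = serializedSize(new ParamLike());
        // This many bytes would travel with every task closure that captures it.
        System.out.println(size > 800_000);
    }
}
```

If the measured size is large, that argues for option (b): keep the object out of the closure entirely and materialize it once per executor instead.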