I'm looking at https://spark.apache.org/docs/latest/tuning.html. The takeaway, as I read it, is that all objects passed into the code processing RDDs must be serializable. So if I've got a few objects that I'd rather initialize once and deinitialize once, outside of the logic that processes the RDDs, it seems I need to think twice about the cost of serializing those objects.
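To make that concrete, here's roughly what I mean (a minimal sketch; the class and field names are hypothetical, not my actual code). Anything captured by the RDD-processing code has to implement Serializable, or Spark fails with a NotSerializableException ("Task not serializable") when it ships the closure; marking heavy state transient keeps it out of the serialized bytes, but then it has to be rebuilt on the executors:

    import java.io.Serializable;

    public class Param implements Serializable {

        // Stand-in for an expensive, non-serializable resource
        // (e.g. a connection pool or native client).
        public static class HeavyClient {
            void close() { /* release resources */ }
        }

        // transient: excluded from serialization, so it must be
        // re-created after the object is deserialized on an executor.
        private transient HeavyClient client;

        public void initialize()   { client = new HeavyClient(); }
        public void deinitialize() { if (client != null) client.close(); }
    }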
In the below, does the Spark serialization happen before calling foreachRDD or before calling foreachPartition?

    Param param = new Param();
    param.initialize();

    messageBodies.foreachRDD(new Function<JavaRDD<String>, Void>() {
        @Override
        public Void call(JavaRDD<String> rdd) throws Exception {
            ProcessPartitionFunction func = new ProcessPartitionFunction(param);
            rdd.foreachPartition(func);
            return null;
        }
    });

    param.deinitialize();

If param gets initialized to a significant memory footprint, are we better off creating/initializing it before calling new ProcessPartitionFunction(), or perhaps in the 'call' method within that function (rough sketch of that second option below)? I'm trying to avoid calling expensive init()/deinit() methods while balancing against the serialization costs. Thanks.
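Here is the sketch of that second option I had in mind (the static lazy-init pattern is just an assumption on my part, not something from the tuning guide): param would be built once per executor JVM the first time a partition is processed there, so nothing heavy gets serialized from the driver, but I lose an obvious place to call deinitialize().

    import java.util.Iterator;
    import org.apache.spark.api.java.function.VoidFunction;

    public class ProcessPartitionFunction implements VoidFunction<Iterator<String>> {

        // Static fields are not serialized with the closure; this lives
        // once per executor JVM and is built lazily on first use.
        private static Param param;

        private static synchronized Param getParam() {
            if (param == null) {
                param = new Param();
                param.initialize();
                // No clean place to call deinitialize() from here; a JVM
                // shutdown hook is the only option I can think of.
            }
            return param;
        }

        @Override
        public void call(Iterator<String> partition) throws Exception {
            Param p = getParam();
            while (partition.hasNext()) {
                String message = partition.next();
                // process message using p ...
            }
        }
    }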