You want to persist state between the execution of two RDDs. So I believe what you need is serialization of your model, not of the JavaRDD. If you can serialize your model, you can persist it in HDFS or some other datastore and reuse it when the next batch of RDDs is processed.
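Something along these lines is what I have in mind. This is only a rough sketch: it assumes an MLlib release whose LDA models implement save/load (added in the 1.5 line, if I remember correctly), the HDFS paths are placeholders, and buildCorpus is a hypothetical helper standing in for your own corpus-construction code.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.DistributedLDAModel;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.linalg.Vector;

public class PersistLdaModel {
  public static void main(String[] args) {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("lda-persist"));

    // (docId, termCountVector) pairs built from the Batch-1 documents
    JavaPairRDD<Long, Vector> corpus = buildCorpus(sc, "hdfs:///data/batch-1");

    // With the default EM optimizer, run() returns a DistributedLDAModel
    DistributedLDAModel model =
        (DistributedLDAModel) new LDA().setK(20).run(corpus);

    // Persist the fitted model so a later job can pick it up
    model.save(sc.sc(), "hdfs:///models/lda-after-batch-1");

    // In the next job (possibly a different SparkContext/configuration):
    DistributedLDAModel restored =
        DistributedLDAModel.load(sc.sc(), "hdfs:///models/lda-after-batch-1");
  }

  // Hypothetical helper: tokenize documents and build term-count vectors
  private static JavaPairRDD<Long, Vector> buildCorpus(JavaSparkContext sc,
                                                       String path) {
    throw new UnsupportedOperationException("plug in your own corpus code");
  }
}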
If you are using Spark Streaming, doing this would be easy.

On Wed, Sep 2, 2015 at 4:54 PM, Raja Reddy <klnrajare...@gmail.com> wrote:
> Hi All,
>
> *Context:*
> I am exploring topic modelling with LDA in Spark MLlib. However, I need
> my model to improve as more batches of documents come in.
>
> As of now I see no way of doing something like this, which gensim
> <https://radimrehurek.com/gensim/models/ldamodel.html> does:
>
> lda.update(other_corpus)
>
> The only way I can enhance my model is essentially to recompute the
> LDAModel over all the documents accumulated after a new batch arrives.
>
> *Question:*
> One of the time-consuming steps before performing topic modelling is
> constructing the corpus as a JavaRDD object while reading through the
> actual documents.
>
> The ability to serialize a JavaRDD instance and reconstruct a JavaRDD
> from the serialized snapshot would be helpful in this case. Say I
> construct and serialize a JavaRDD after reading Batch-1 of documents.
> When Batch-2 arrives, I would like to deserialize the previously
> serialized RDD and mutate it with the contents of the new batch of
> documents. Could someone please let me know whether serialization and
> deserialization of a JavaRDD instance is possible? I will have more
> questions if serialization is possible, mostly to do with changing the
> Spark configuration between a serialization operation and a
> deserialization operation.
>
> Thanks and Regards,
> Raja.
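P.S. On the narrower question in the quoted mail: you cannot serialize a live JavaRDD itself (it is only a lazy description of a computation), but you can write its contents out and rebuild an equivalent RDD later, even from a different SparkContext with a different configuration. A sketch, assuming the corpus is a JavaPairRDD<Long, Vector> of (docId, term counts); the paths and the corpusBatch1/corpusBatch2 names are placeholders for your own variables.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.linalg.Vector;
import scala.Tuple2;

// Job 1: write the Batch-1 corpus out as object files (Java-serialized records)
corpusBatch1.saveAsObjectFile("hdfs:///corpus/batch-1");

// Job 2 (possibly a different SparkConf): read it back and merge with Batch-2.
// RDDs are immutable, so "mutating" really means building a new RDD via union.
JavaRDD<Tuple2<Long, Vector>> reloaded = sc.objectFile("hdfs:///corpus/batch-1");
JavaPairRDD<Long, Vector> previous = JavaPairRDD.fromJavaRDD(reloaded);
JavaPairRDD<Long, Vector> combined = previous.union(corpusBatch2);

Keep in mind that saveAsObjectFile uses plain Java serialization, so it is not particularly compact; if the corpus is large, writing the tokenized counts to text or sequence files and rebuilding the vectors on load may be cheaper.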