Hi All,

*Context:* I am exploring topic modelling with LDA in Spark MLlib. However, I need the model to be updated incrementally as new batches of documents come in. My current per-batch pipeline is sketched below.
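This is only a minimal sketch of what I am doing today: a toy inline corpus stands in for my real document-reading and vectorisation step, and the k value and master setting are arbitrary choices for illustration.

    import java.util.Arrays;

    import scala.Tuple2;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.clustering.LDA;
    import org.apache.spark.mllib.clustering.LDAModel;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;

    public class BatchLda {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("BatchLda").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Toy stand-in for the expensive step: reading the raw documents
        // and turning each one into a vector of term counts.
        JavaRDD<Vector> termCounts = sc.parallelize(Arrays.asList(
            Vectors.dense(1.0, 2.0, 0.0),
            Vectors.dense(0.0, 3.0, 1.0),
            Vectors.dense(2.0, 0.0, 4.0)));

        // LDA expects (documentId, termCountVector) pairs.
        JavaPairRDD<Long, Vector> corpus = termCounts.zipWithIndex()
            .mapToPair(t -> new Tuple2<>(t._2(), t._1()));
        corpus.cache();

        // Retrained from scratch every time a new batch of documents arrives.
        LDAModel model = new LDA().setK(3).run(corpus);
        System.out.println("Vocabulary size: " + model.vocabSize());

        sc.stop();
      }
    }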
As of now I see no way of doing something like what gensim <https://radimrehurek.com/gensim/models/ldamodel.html> offers:

    lda.update(other_corpus)

The only way to improve the model at present is essentially to recompute the LDAModel over all the documents accumulated so far each time a new batch arrives.

*Question:* One of the time-consuming steps before the topic modelling itself is constructing the corpus as a JavaRDD while reading through the actual documents. The ability to serialize a JavaRDD instance and later reconstruct it from the serialized snapshot would be helpful here. Suppose I construct and serialize the JavaRDD after reading Batch-1 of documents. When Batch-2 arrives, I would like to deserialize the previously serialized RDD and combine it with the contents of the new batch of documents (since RDDs are immutable, presumably via a union rather than an in-place mutation). A sketch of what I have in mind follows below my signature.

Could someone please let me know whether serializing and deserializing a JavaRDD instance is possible? If it is, I will have further questions, mostly about changing the Spark configuration between the serialization and the deserialization.

Thanks and Regards,
Raja.
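P.S. For concreteness, here is roughly the flow I have in mind. This is only a sketch of my intent, not working code: I am assuming saveAsObjectFile/objectFile are the right primitives for snapshotting an RDD (as I understand it, they persist the RDD's elements to storage rather than the RDD object itself), and the buildCorpus helper and HDFS paths below are placeholders for my own application code.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.linalg.Vector;

    public class CorpusSnapshot {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CorpusSnapshot");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Batch-1: after the expensive corpus construction, snapshot the
        // RDD's elements to durable storage.
        JavaRDD<Vector> batch1 = buildCorpus(sc, "hdfs:///docs/batch-1");
        batch1.saveAsObjectFile("hdfs:///corpus-snapshots/batch-1");

        // Later, possibly in a different application with a different
        // SparkConf: restore the snapshot instead of re-reading Batch-1.
        JavaRDD<Vector> restored = sc.objectFile("hdfs:///corpus-snapshots/batch-1");

        // RDDs are immutable, so instead of mutating the restored RDD,
        // union it with the freshly vectorised Batch-2 and retrain on that.
        JavaRDD<Vector> batch2 = buildCorpus(sc, "hdfs:///docs/batch-2");
        JavaRDD<Vector> combined = restored.union(batch2);
        combined.saveAsObjectFile("hdfs:///corpus-snapshots/batch-1-and-2");

        sc.stop();
      }

      // Placeholder for my actual document-reading / vectorisation code.
      private static JavaRDD<Vector> buildCorpus(JavaSparkContext sc, String path) {
        throw new UnsupportedOperationException("application-specific");
      }
    }

The union at the end is what I meant by extending the corpus: each new batch would only pay the vectorisation cost for its own documents, while the earlier batches come from the restored snapshot.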