Hi All,

*Context:*
I am exploring topic modelling with LDA in Spark MLlib. However, I need
my model to improve incrementally as new batches of documents come in.

As of now I see no way of doing what gensim
<https://radimrehurek.com/gensim/models/ldamodel.html> offers:

lda.update(other_corpus)

The only way I can improve my model at present is essentially to recompute
the LDAModel over all the documents accumulated so far, every time a new
batch arrives.
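
For concreteness, that recompute-everything approach looks roughly like the
sketch below (previousCorpus, newBatchCorpus, the topic count, and the
iteration count are all placeholders, not my actual settings):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.clustering.LDAModel;
import org.apache.spark.mllib.linalg.Vector;

// allDocs holds (docId, termCountVector) pairs for every document seen so
// far; each new batch forces a full retraining over the union.
JavaPairRDD<Long, Vector> allDocs = previousCorpus.union(newBatchCorpus);
LDAModel model = new LDA()
    .setK(20)               // number of topics (placeholder)
    .setMaxIterations(50)   // placeholder
    .run(allDocs);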

*Question:*
One of the time-consuming steps before performing topic modelling is
constructing the corpus as a JavaRDD while reading through the actual
documents.
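
Roughly, that construction step looks like the sketch below (jsc is a
JavaSparkContext, the input path is a placeholder, and vectorize() is a
hypothetical stand-in for whatever tokenization and term counting is used):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;
import scala.Tuple2;

// Read the raw documents and turn each into a (docId, termCountVector) pair.
JavaRDD<String> docs = jsc.textFile("hdfs:///docs/batch1/*");  // placeholder path
JavaPairRDD<Long, Vector> corpus = docs
    .zipWithIndex()   // assign a stable Long id to each document
    .mapToPair(t -> new Tuple2<>(t._2(), vectorize(t._1())));  // hypothetical vectorize()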

The ability to serialize a JavaRDD instance and reconstruct a JavaRDD from
the serialized snapshot would be helpful in this case. Suppose I construct
and serialize the corpus JavaRDD after reading Batch-1 of documents. When
Batch-2 arrives, I would like to deserialize the previously serialized RDD
and extend it with the contents of the new batch of documents. Could someone
please let me know whether serialization and deserialization of a JavaRDD
instance is possible? If it is, I will have more questions, mostly to do
with changing the Spark configuration between the serialization operation
and the deserialization operation.
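
To make the intended workflow concrete, the sketch below is what I have in
mind, assuming saveAsObjectFile/objectFile is an appropriate mechanism for
snapshotting an RDD's contents (paths and variable names are placeholders;
I am not sure this is the right approach, hence the question):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;
import scala.Tuple2;

// After Batch-1: write the corpus contents out as Java-serialized objects.
batch1Corpus.saveAsObjectFile("hdfs:///corpora/batch1");  // placeholder path

// Later, possibly under a different Spark configuration: reload the
// snapshot and append Batch-2 (RDDs are immutable, so this builds a new
// RDD rather than mutating the old one).
JavaRDD<Tuple2<Long, Vector>> restored = jsc.objectFile("hdfs:///corpora/batch1");
JavaPairRDD<Long, Vector> combined =
    JavaPairRDD.fromJavaRDD(restored).union(batch2Corpus);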

Thanks and Regards,
Raja.
