I am hoping someone can confirm this is a bug and/or provide a solution. I am trying to serialize an LDA model to disk for later use, but after deserialization the model is no longer fully functional: in particular, transforming data throws a NullPointerException. Here is a minimal example (just run it in spark-shell) that exercises the behavior:
https://gist.github.com/bjedwards/14e9bb876381910bc525063bee342b41

The problem is here:

https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala#L456

The issue is that sparkSession is declared @transient, so it is not serialized, and nothing checks that it exists before it is used. Weirdly, I can't find where it is set in the first place. I think that line should read

    val transformer = oldLocalModel.getTopicDistributionMethod(dataset.sparkSession.sparkContext)

i.e. sparkSession -> dataset.sparkSession, as in all the other places where the dataset's SparkSession is used. As a workaround, the model is functional again if I patch up the class via reflection (the last bit of the gist; a sketch follows my signature).

Is this a bug? Or are LocalLDAModel instances not really meant to be serializable?

Ben Edwards
Postdoctoral Researcher
IBM Research
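P.S. In case the gist link rots, here is roughly what it does. This is a sketch rather than the exact code (the toy corpus and the /tmp path are placeholders), but the shape is the same: fit, round-trip through plain Java serialization, then transform.

    import java.io._

    import org.apache.spark.ml.clustering.{LDA, LocalLDAModel}
    import org.apache.spark.ml.linalg.Vectors

    // Toy corpus: (id, term-count vector) -- placeholder data.
    val data = spark.createDataFrame(Seq(
      (0L, Vectors.dense(1.0, 2.0, 0.0)),
      (1L, Vectors.dense(0.0, 1.0, 3.0)),
      (2L, Vectors.dense(2.0, 0.0, 1.0))
    )).toDF("id", "features")

    // The default (online) optimizer yields a LocalLDAModel.
    val model = new LDA().setK(2).setMaxIter(5).fit(data)
    model.transform(data).show()  // works fine here

    // Round-trip the model through Java serialization.
    val oos = new ObjectOutputStream(new FileOutputStream("/tmp/lda.model"))
    oos.writeObject(model)
    oos.close()
    val ois = new ObjectInputStream(new FileInputStream("/tmp/lda.model"))
    val loaded = ois.readObject().asInstanceOf[LocalLDAModel]
    ois.close()

    // NullPointerException: the @transient sparkSession was not restored.
    loaded.transform(data).show()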
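And the reflection workaround, approximately. The field name "sparkSession" is an assumption about how the private[ml] val is compiled; if getDeclaredField throws NoSuchFieldException, list getDeclaredFields on LDAModel to find the actual name for your Spark/Scala version.

    // Restore the transient field by hand so transform works again.
    // Assumes the bytecode field is literally named "sparkSession".
    val f = classOf[org.apache.spark.ml.clustering.LDAModel]
      .getDeclaredField("sparkSession")
    f.setAccessible(true)
    f.set(loaded, spark)

    loaded.transform(data).show()  // functional again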