This may be a bug. The source can be found at:
https://github.com/purijatin/spark-retrain-bug
*Issue:*
The program takes as input a set of documents, where each document is in a
separate file.
The Spark program computes the tf-idf of the terms (Tokenizer -> stop-word
removal -> stemming -> TF -> IDF).
Once the training is complete, it unpersists the data and restarts the
operation again.
def main(args: Array[String]): Unit = {
  for {
    iter <- Stream.from(1, 1)
  } {
    val data = new DataReader(spark).data
    val pipelineData: Pipeline =
      new Pipeline().setStages(english("reviewtext") ++ english("summary"))
    val pipelineModel = pipelineData.fit(data)
    val all: DataFrame = pipelineModel.transform(data)
      .withColumn("rowid", functions.monotonically_increasing_id())
    // evaluate the pipeline
    all.rdd.foreach(x => x)
    println(s"$iter - ${all.count()}. ${new Date()}")
    data.unpersist()
  }
}
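For context, the driver loop above is intentionally unbounded: `Stream.from(1, 1)` lazily yields 1, 2, 3, ..., so the fit/transform cycle is meant to repeat forever, and the OOME cuts short a loop that should never exit. A minimal, Spark-free sketch of that iteration pattern (taking only the first five values rather than looping):

```scala
object StreamDemo {
  def main(args: Array[String]): Unit = {
    // Stream.from(start, step) lazily produces start, start + step, ...
    // The bug-repro loop never terminates; here we take just five values.
    val iters = Stream.from(1, 1).take(5).toList
    println(iters.mkString(","))  // prints 1,2,3,4,5
  }
}
```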
When run with `-Xmx1500M`, it fails with an OutOfMemoryError after about 5
iterations.
*Temporary Fix:*
When all the input documents are merged into a single file, the issue no
longer occurs.
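As a sketch of that workaround (the `docs/` directory, `*.txt` pattern, and sample contents are assumptions, not from the repo; adjust to the real layout), the per-document files can be concatenated into one input file before handing it to the reader:

```shell
# Create two hypothetical per-document files, then merge them.
mkdir -p docs
printf 'doc one\n' > docs/a.txt
printf 'doc two\n' > docs/b.txt

# One file containing all documents, in glob (sorted) order.
cat docs/*.txt > merged.txt
cat merged.txt
```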
Spark version: 2.3.0. Dependency information can be found here:
https://github.com/purijatin/spark-retrain-bug/blob/master/project/Dependencies.scala
Thanks,
Jatin