I am using a Spark ML Pipeline to classify text documents with the following stages:

    Tokenizer -> CountVectorizer -> LogisticRegression

I want to print the words with the highest weights. Can this be done? So far I have been able to extract the LR coefficients, but can they be tied back to the actual words?

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
    import org.apache.spark.ml.feature.{CountVectorizer, Tokenizer}

    // Split the raw "text" column into lowercase tokens.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")

    // Turn each token list into a sparse term-count vector.
    val countVectorizer = new CountVectorizer()
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")

    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)

    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, countVectorizer, lr))

    // Fit the pipeline to training documents.
    // training and test are DataFrames with a "text" column (training also has "label"); not shown.
    val model = pipeline.fit(training)
    val results = model.transform(test)

    // The fitted LogisticRegressionModel is the last stage of the PipelineModel.
    val lrm: LogisticRegressionModel = model.stages.last.asInstanceOf[LogisticRegressionModel]

    // PRINT COEFFICIENTS
    println(s"LR Model coefficients:\n${lrm.coefficients.toArray.mkString("\n")}")
    (lrm.intercept, lrm.coefficients)
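To make the goal concrete, here is a minimal sketch of the output I am after. It relies on the fitted CountVectorizerModel exposing its terms through `vocabulary`, where the term at index i corresponds to feature index i and therefore to coefficient i. The helper name `topWeightedWords` is made up for illustration, and the wiring in the trailing comments assumes binary classification (a multinomial model would need `coefficientMatrix` instead) and that CountVectorizer is stage 1 of the pipeline above.

```scala
// Pair each vocabulary term with the coefficient at the same feature index
// and return the n terms with the largest weights (most positive first).
// Hypothetical helper, not part of Spark.
def topWeightedWords(
    vocabulary: Array[String],
    coefficients: Array[Double],
    n: Int): Seq[(String, Double)] = {
  require(vocabulary.length == coefficients.length,
    "vocabulary and coefficients must have the same length")
  vocabulary.zip(coefficients)
    .sortBy { case (_, weight) => -weight } // descending by weight
    .take(n)
    .toSeq
}

// Possible wiring against the fitted pipeline (assumptions noted above):
//   import org.apache.spark.ml.feature.CountVectorizerModel
//   val cvModel = model.stages(1).asInstanceOf[CountVectorizerModel]
//   topWeightedWords(cvModel.vocabulary, lrm.coefficients.toArray, 20)
//     .foreach { case (word, weight) => println(s"$word\t$weight") }
```

Sorting by `-math.abs(weight)` instead would surface the strongest words in either direction rather than only the most positive ones.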