Hello,

I have a model which uses CountVectorizer and LogisticRegression. *Everything seems to work fine, except that when I run the last step to get results and predictions, the document ids (doc_id) are changed completely. Do you know why that is? Am I doing something wrong?*

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{CountVectorizer, Tokenizer}

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

val countVectorizer = new CountVectorizer()
  //.setVocabSize(50000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, countVectorizer, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

// Apply the fitted model to the test documents.
val results = model.transform(test)

training and test are two DataFrames with the following structure:

root
 |-- doc_id: string (nullable = true)
 |-- text: string (nullable = true)
 |-- label: integer (nullable = false)

Thanks in advance!