Hello, I have the following code that trains a mapping from review text to ratings. I use a Tokenizer to split each review into words, and a CountVectorizer to turn those words into feature vectors. However, when I train the classifier I get a MatchError. Any pointers will be very helpful.
The code is below (imports included for completeness):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{CountVectorizer, Tokenizer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("Logistic Regression").master("local").getOrCreate()
import spark.implicits._

val df = spark.read.option("header", "true").option("inferSchema", "true").csv("data/amazon_baby.csv")
val tk = new Tokenizer().setInputCol("review").setOutputCol("words")
val cv = new CountVectorizer().setInputCol("words").setOutputCol("features")
val isGood = udf((x: Int) => if (x >= 4) 1 else 0)
val words = tk.transform(df.withColumn("label", isGood('rating)))
val Array(training, test) = cv.fit(words).transform(words).randomSplit(Array(0.8, 0.2), 1)
val classifier = new LogisticRegression()
training.show(10)
val simpleModel = classifier.fit(training)
simpleModel.evaluate(test).predictions.select("words", "label", "prediction", "probability").show(10)

And the error I get is below:

16/11/19 22:06:45 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 9)
scala.MatchError: [null,1.0,(257358,[0,1,2,3,4,5,6,7,8,9,10,13,15,16,20,25,27,29,34,37,40,42,45,48,49,52,58,68,71,76,77,86,89,93,98,99,100,108,109,116,122,124,129,169,208,219,221,235,249,255,260,353,355,371,431,442,641,711,972,1065,1411,1663,1776,1925,2596,2957,3355,3828,4860,6288,7294,8951,9758,12203,18319,21779,48525,72732,75420,146476,192184],[3.0,8.0,1.0,1.0,4.0,2.0,7.0,4.0,2.0,1.0,1.0,2.0,1.0,4.0,3.0,1.0,1.0,1.0,1.0,1.0,5.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
at org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
at org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)

Many thanks

Meeraj Kunnumpurath
Director and Executive Principal
Service Symphony Ltd
00 44 7702 693597
00 971 50 409 0169
mee...@servicesymphony.com
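PS: One guess on my side (an assumption, not verified): the MatchError row starts with null, so some rows may have a null rating or review, which would make the isGood UDF emit a null label that trips the Row pattern match inside LogisticRegression.fit. A minimal sketch of the workaround I am considering, dropping null rows before building the label column (same df, tk, cv, and isGood definitions as above):

```scala
// Sketch only: filter out rows with a null review or rating before
// tokenizing, so the label UDF never sees a null input.
val cleaned = df.na.drop(Seq("review", "rating"))
val words = tk.transform(cleaned.withColumn("label", isGood('rating)))
val Array(training, test) =
  cv.fit(words).transform(words).randomSplit(Array(0.8, 0.2), 1)
```

Does that sound like the right direction, or is something else going on?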