Hello,

I have the following code that trains a classifier mapping review text to ratings.
I use a tokenizer to split each review into words, and a count vectorizer to turn
those words into feature vectors. However, when I train the classifier I get a
MatchError. Any pointers would be very helpful.

The code is below:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{CountVectorizer, Tokenizer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder()
  .appName("Logistic Regression")
  .master("local")
  .getOrCreate()
import spark.implicits._

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/amazon_baby.csv")

// Split each review into words, then turn the words into count vectors.
val tk = new Tokenizer().setInputCol("review").setOutputCol("words")
val cv = new CountVectorizer().setInputCol("words").setOutputCol("features")

// Ratings of 4 or above are labelled 1 (good), everything else 0.
val isGood = udf((x: Int) => if (x >= 4) 1 else 0)

val words = tk.transform(df.withColumn("label", isGood('rating)))
val Array(training, test) =
  cv.fit(words).transform(words).randomSplit(Array(0.8, 0.2), 1)

val classifier = new LogisticRegression()

training.show(10)

val simpleModel = classifier.fit(training)
simpleModel.evaluate(test).predictions
  .select("words", "label", "prediction", "probability")
  .show(10)


And the error I get is below.

16/11/19 22:06:45 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 9)
scala.MatchError:
[null,1.0,(257358,[0,1,2,3,4,5,6,7,8,9,10,13,15,16,20,25,27,29,34,37,40,42,45,48,49,52,58,68,71,76,77,86,89,93,98,99,100,108,109,116,122,124,129,169,208,219,221,235,249,255,260,353,355,371,431,442,641,711,972,1065,1411,1663,1776,1925,2596,2957,3355,3828,4860,6288,7294,8951,9758,12203,18319,21779,48525,72732,75420,146476,192184],[3.0,8.0,1.0,1.0,4.0,2.0,7.0,4.0,2.0,1.0,1.0,2.0,1.0,4.0,3.0,1.0,1.0,1.0,1.0,1.0,5.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])]
(of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
at org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
at org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
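
Looking at the row in the MatchError, the first field is null, and I believe that
field is the label column, so I suspect some rating values come back as null from
the CSV read, which would leave a null label for LogisticRegression to match on.
A minimal sketch of the workaround I am considering, assuming the same column
names as above:

// Drop rows with a missing rating (or review) before deriving the label,
// so the udf never produces a null label.
val cleaned = df.na.drop(Seq("rating", "review"))
val words = tk.transform(cleaned.withColumn("label", isGood('rating)))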

Many thanks
-- 
Meeraj Kunnumpurath
Director and Executive Principal
Service Symphony Ltd
00 44 7702 693597
00 971 50 409 0169
mee...@servicesymphony.com
