Digging through, it looks like an issue with reading the CSV. Some of the
fields have embedded commas, and these fields are correctly quoted. However,
the CSV reader seems to get into a pickle when a record contains both quoted
and unquoted fields. Fields are quoted only when they contain commas;
otherwise they are unquoted.
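
To make the format concrete, here is a small sketch in plain Scala of how a
record that mixes quoted and unquoted fields would be expected to split. This
is only an illustration of the quoting convention, not Spark's CSV reader;
`splitCsvLine` and the sample record are made up for the example, and it
assumes embedded quotes are escaped by doubling them and that a record does
not span multiple lines.

```scala
object CsvLineSketch {
  // Splits one CSV record into fields. Commas inside double quotes are kept,
  // and a doubled quote ("") inside a quoted field is read as a literal quote.
  def splitCsvLine(line: String): Vector[String] = {
    val fields = Vector.newBuilder[String]
    val current = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuotes) {
        if (c == '"' && i + 1 < line.length && line.charAt(i + 1) == '"') {
          current.append('"') // escaped quote inside a quoted field
          i += 1              // skip the second quote of the pair
        } else if (c == '"') {
          inQuotes = false    // closing quote
        } else {
          current.append(c)   // ordinary character, including commas
        }
      } else if (c == '"') {
        inQuotes = true       // opening quote
      } else if (c == ',') {
        fields += current.toString // unquoted comma ends the field
        current.clear()
      } else {
        current.append(c)
      }
      i += 1
    }
    fields += current.toString // last field on the line
    fields.result()
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical record: only the middle field contains a comma, so only it is quoted.
    val record = "Baby Wipes,\"Soft, gentle and cheap\",5"
    splitCsvLine(record).foreach(println)
  }
}
```

Running a suspect line from the file through something like this and
comparing the field count with the header may help show whether the file or
the reader is at fault.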

Regards
Meeraj

On Sat, Nov 19, 2016 at 10:10 PM, Meeraj Kunnumpurath <mee...@servicesymphony.com> wrote:

> Hello,
>
> I have the following code that trains a classifier mapping review text to
> ratings. I use a tokenizer to split each review into words, and a count
> vectorizer to turn those words into feature vectors. However, when I train
> the classifier I get a MatchError. Any pointers would be very helpful.
>
> The code is below:
>
> import org.apache.spark.ml.classification.LogisticRegression
> import org.apache.spark.ml.feature.{CountVectorizer, Tokenizer}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.udf
>
> val spark = SparkSession.builder().appName("Logistic Regression").master("local").getOrCreate()
> import spark.implicits._
>
> // Read the reviews, letting Spark infer the column types
> val df = spark.read.option("header", "true").option("inferSchema", "true").csv("data/amazon_baby.csv")
> val tk = new Tokenizer().setInputCol("review").setOutputCol("words")
> val cv = new CountVectorizer().setInputCol("words").setOutputCol("features")
>
> // Ratings of 4 and above count as good (label 1), everything else as 0
> val isGood = udf((x: Int) => if (x >= 4) 1 else 0)
>
> val words = tk.transform(df.withColumn("label", isGood('rating)))
> val Array(training, test) = cv.fit(words).transform(words).randomSplit(Array(0.8, 0.2), 1)
>
> val classifier = new LogisticRegression()
>
> training.show(10)
>
> val simpleModel = classifier.fit(training)
> simpleModel.evaluate(test).predictions.select("words", "label", "prediction", "probability").show(10)
>
>
> And the error I get is below.
>
> 16/11/19 22:06:45 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 9)
> scala.MatchError: [null,1.0,(257358,[0,1,2,3,4,
> 5,6,7,8,9,10,13,15,16,20,25,27,29,34,37,40,42,45,48,49,52,
> 58,68,71,76,77,86,89,93,98,99,100,108,109,116,122,124,129,
> 169,208,219,221,235,249,255,260,353,355,371,431,442,641,
> 711,972,1065,1411,1663,1776,1925,2596,2957,3355,3828,4860,
> 6288,7294,8951,9758,12203,18319,21779,48525,72732,75420,
> 146476,192184],[3.0,8.0,1.0,1.0,4.0,2.0,7.0,4.0,2.0,1.0,1.0,
> 2.0,1.0,4.0,3.0,1.0,1.0,1.0,1.0,1.0,5.0,1.0,1.0,1.0,2.0,2.0,
> 1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,
> 1.0,1.0,1.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,
> 1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,
> 1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])] (of class
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
> at org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
> at org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
> at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
> at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
> at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
> at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>
> Many thanks
> --
> *Meeraj Kunnumpurath*
>
>
> *Director and Executive Principal*
> *Service Symphony Ltd*
> *00 44 7702 693597*
> *00 971 50 409 0169*
> *mee...@servicesymphony.com*
>


