Hi,
Just after I sent the mail, I realized that the error might be with the training dataset, not the test dataset.
1. It might be that you are feeding the full Y vector for training.
2. That could mean you are using a ~50-50 training-test split.
3. Take a good look at the code that does the data split and at which datasets the pieces are assigned to.

Cheers
<k/>

On Sun, Aug 21, 2016 at 4:37 PM, Krishna Sankar <ksanka...@gmail.com> wrote:

> Hi,
> Looks like the test-dataset has different sizes for X & Y. Possible
> steps:
>
> 1. What is the test-data-size ?
>    - If it is 15,909, check the prediction variable vector - it is now
>      29,471, should be 15,909
>    - If you expect it to be 29,471, then the X Matrix is not right.
> 2. It is also probable that the size of the test-data is something
>    else. If so, check the data pipeline.
> 3. If you print the count() of the various vectors, I think you can
>    find the error.
>
> Cheers & Good Luck
> <k/>
>
> On Sun, Aug 21, 2016 at 3:16 PM, janardhan shetty <janardhan...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have built the logistic regression model using training-dataset.
>> When I am predicting on a test-dataset, it is throwing the below error of
>> size mismatch.
>>
>> Steps done:
>> 1. String indexers on categorical features.
>> 2. One hot encoding on these indexed features.
>>
>> Any help is appreciated to resolve this issue or is it a bug ?
>>
>> SparkException: *Job aborted due to stage failure: Task 0 in stage 635.0
>> failed 1 times, most recent failure: Lost task 0.0 in stage 635.0 (TID
>> 19421, localhost): java.lang.IllegalArgumentException: requirement failed:
>> BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes:
>> x.size = 15909, y.size = 29471*
>>   at scala.Predef$.require(Predef.scala:224)
>>   at org.apache.spark.ml.linalg.BLAS$.dot(BLAS.scala:104)
>>   at org.apache.spark.ml.classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:505)
>>   at org.apache.spark.ml.classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:504)
>>   at org.apache.spark.ml.classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:594)
>>   at org.apache.spark.ml.classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:484)
>>   at org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$1.apply(ProbabilisticClassifier.scala:112)
>>   at org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$1.apply(ProbabilisticClassifier.scala:111)
>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalExpr137$(Unknown Source)
>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
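
The size checks suggested in this thread can be sketched in plain Python, without Spark. This is only an illustration: `dot_checked`, `size_report`, `one_hot` and the toy data are hypothetical names made up for the sketch, not Spark APIs. In the trace, the two operands of BLAS.dot are most likely the row's feature vector (x) and the model's coefficient vector (y). One common way for their sizes to differ, assumed here and not confirmed anywhere in the thread, is that the categorical encoding was derived separately from the training data and the test data.

```python
from collections import Counter


def dot_checked(x, y):
    """Dot product that rejects vectors of different length,
    similar in spirit to the requirement that failed in the trace."""
    if len(x) != len(y):
        raise ValueError(
            f"non-matching sizes: x.size = {len(x)}, y.size = {len(y)}"
        )
    return sum(a * b for a, b in zip(x, y))


def size_report(rows):
    """Count how many rows have each feature-vector length."""
    return Counter(len(row) for row in rows)


def one_hot(value, categories):
    """Encode a value against a fixed, ordered list of categories."""
    return [1.0 if value == c else 0.0 for c in categories]


# Hypothetical toy data: the test set contains fewer distinct categories.
train_values = ["a", "b", "c", "d"]
test_values = ["a", "b"]

# Category lists derived separately from each dataset (the assumed mistake)
train_categories = sorted(set(train_values))
test_categories = sorted(set(test_values))

# A stand-in "model": one coefficient per category seen during training.
coefficients = [0.5] * len(train_categories)

# Test rows encoded with the test-only categories are too narrow ...
bad_test = [one_hot(v, test_categories) for v in test_values]
# ... while rows encoded with the training categories match the model.
good_test = [one_hot(v, train_categories) for v in test_values]

print("coefficient size:", len(coefficients))
print("test sizes (encoded separately):", size_report(bad_test))
print("test sizes (training encoding reused):", size_report(good_test))

# Only the rows that share the training encoding can be scored.
print("score:", dot_checked(good_test[0], coefficients))
```

If the size report for the test rows shows a different width than the coefficient vector, the encoding step is the first place to look. In Spark, the usual way to keep the widths equal is to fit the indexing stage (and the rest of the pipeline) once on the training data and reuse the fitted pipeline to transform the test data.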