Thanks, Krishna, for your response.
The features in the training set have more categories than those in the
test set, so after VectorAssembler is applied the resulting vector sizes
usually differ. I believe that is expected, right?

The belief here is that the test dataset usually will not have as many
categories in its features as the training dataset.
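
To illustrate the point about category counts, here is a minimal plain-Python
sketch. It is not Spark code: the helper names fit_vocabulary and one_hot and
the color data are made up for illustration. It shows that an encoder fitted
separately on each dataset yields vectors of different widths, whereas reusing
the vocabulary fitted on the training data keeps the width the same, which is
presumably what a model trained on the wider vectors expects.

```python
# Hypothetical helpers; they only mimic what an indexer plus one-hot encoder do.

def fit_vocabulary(values):
    """Map each distinct category to an index, in first-seen order."""
    vocab = {}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab)
    return vocab


def one_hot(value, vocab):
    """Encode one value as a dense vector with one slot per known category."""
    vec = [0.0] * len(vocab)
    index = vocab.get(value)
    if index is not None:  # unseen categories are left as all zeros
        vec[index] = 1.0
    return vec


# Made-up data: training has three categories, test has only two.
train_colors = ["red", "green", "blue", "green"]
test_colors = ["red", "green"]

train_vocab = fit_vocabulary(train_colors)
test_vocab = fit_vocabulary(test_colors)  # fitted separately on test data

# Width when the test data gets its own vocabulary vs. the training one.
separate_width = len(one_hot("red", test_vocab))
shared_width = len(one_hot("red", train_vocab))

print("separately fitted width:", separate_width)
print("training-fitted width:", shared_width)
```

In Spark terms this would roughly correspond to fitting the StringIndexer on
the training data and applying that same fitted model to the test data, though
I have not verified this against the pipeline in question.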

On Sun, Aug 21, 2016 at 4:44 PM, Krishna Sankar <> wrote:

> Hi,
>    Just after I sent the mail, I realized that the error might be with the
> training-dataset not the test-dataset.
>    1. It might be that you are feeding the full Y vector for training.
>    2. That could mean you are using a ~50-50 training-test split.
>    3. Take a good look at the code that does the data split and the
>    datasets where they are allocated to.
> Cheers
> <k/>
> On Sun, Aug 21, 2016 at 4:37 PM, Krishna Sankar <> wrote:
>> Hi,
>>   Looks like the test-dataset has different sizes for X & Y. Possible
>> steps:
>>    1. What is the test-data-size ?
>>       - If it is 15,909, check the prediction variable vector - it is
>>       now 29,471, should be 15,909
>>       - If you expect it to be 29,471, then the X Matrix is not right.
>>    2. It is also probable that the size of the test-data is something
>>    else. If so, check the data pipeline.
>>    3. If you print the count() of the various vectors, I think you can
>>    find the error.
>> Cheers & Good Luck
>> <k/>
>> On Sun, Aug 21, 2016 at 3:16 PM, janardhan shetty <> wrote:
>>> Hi,
>>> I have built a logistic regression model using the training dataset.
>>> When I predict on the test dataset, it throws the size-mismatch error
>>> below.
>>> Steps done:
>>> 1. String indexers on categorical features.
>>> 2. One hot encoding on these indexed features.
>>> Any help in resolving this issue is appreciated. Or is it a bug?
>>> SparkException: *Job aborted due to stage failure: Task 0 in stage
>>> 635.0 failed 1 times, most recent failure: Lost task 0.0 in stage 635.0
>>> (TID 19421, localhost): java.lang.IllegalArgumentException: requirement
>>> failed: Vector, y:Vector) was given Vectors with non-matching
>>> sizes: x.size = 15909, y.size = 29471*
>>> at scala.Predef$.require(Predef.scala:224)
>>> at$.dot(BLAS.scala:104)
>>> at anonfun$19.apply(LogisticRegression.scala:505)
>>> at .classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:504)
>>> at redictRaw(LogisticRegression.scala:594)
>>> at .classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:484)
>>> at onModel$$anonfun$1.apply(ProbabilisticClassifier.scala:112)
>>> at onModel$$anonfun$1.apply(ProbabilisticClassifier.scala:111)
>>> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalExpr137$(Unknown Source)
>>> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>>> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>>> at scala.collection.Iterator$$anon$
