I believe this may be caused by SPARK-13030
(https://issues.apache.org/jira/browse/SPARK-13030). OneHotEncoder (OHE) is
a transformer rather than an estimator, so it is never fit on the training
data. Hence in cases where the number of categories differs between train
and test, it's not usable in its current form.
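
To illustrate how the widths can diverge, here is a toy sketch in plain
Python. It is not the Spark ML API: fit_index and one_hot are hypothetical
helpers, and it assumes the category-to-column mapping ends up being derived
separately for each dataset, which may or may not match your pipeline.

```python
# Toy illustration only: plain Python, not the Spark ML API.
# fit_index and one_hot are hypothetical helpers invented for this sketch.

def fit_index(values):
    """Map each distinct category to a column index, in sorted order."""
    return {cat: i for i, cat in enumerate(sorted(set(values)))}

def one_hot(value, index):
    """Encode a value against a fixed index; unseen categories become all zeros."""
    vec = [0.0] * len(index)
    if value in index:
        vec[index[value]] = 1.0
    return vec

train = ["red", "green", "blue", "green"]
test = ["red", "blue"]  # fewer distinct categories than train

# Deriving the mapping separately per dataset yields different vector widths.
train_index = fit_index(train)
test_index = fit_index(test)
width_train = len(one_hot("red", train_index))
width_test_refit = len(one_hot("red", test_index))

# Reusing the mapping learned on the training data keeps the width stable.
width_test_reused = len(one_hot("red", train_index))

print(width_train, width_test_refit, width_test_reused)
```

A model trained on vectors of the first width cannot take a dot product with
vectors of the second width, which is the kind of mismatch the BLAS.dot
error reports.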

It's tricky to work around, though one option is to use feature hashing
instead of the StringIndexer -> OHE combo (see
https://lists.apache.org/thread.html/a7e06426fd958665985d2c4218ea2f9bf9ba136ddefe83e1ad6f1727@%3Cuser.spark.apache.org%3E
for some details).
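
As a rough sketch of why feature hashing sidesteps the problem, the plain
Python below hashes each "feature=value" pair into a fixed number of
buckets. It is not Spark code: NUM_BUCKETS, bucket and hash_features are
hypothetical names, the rows are made-up data, and MD5 is used only to get a
hash that is stable across runs.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical vector width, chosen only for this sketch

def bucket(feature, value, num_buckets=NUM_BUCKETS):
    """Deterministically map a feature=value pair to a bucket index."""
    key = f"{feature}={value}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

def hash_features(row, num_buckets=NUM_BUCKETS):
    """Turn a dict of categorical features into a fixed-width count vector."""
    vec = [0.0] * num_buckets
    for feature, value in row.items():
        vec[bucket(feature, value, num_buckets)] += 1.0
    return vec

train_vec = hash_features({"color": "green", "city": "Pune"})
# Categories never seen in training still land in one of the same buckets.
test_vec = hash_features({"color": "magenta", "city": "Oslo"})

print(len(train_vec), len(test_vec))
```

The width depends only on the bucket count, never on which categories show
up in a given dataset. The trade-off is that distinct categories can collide
in the same bucket, so the bucket count has to be large enough for the data.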



On Mon, 22 Aug 2016 at 03:20 janardhan shetty <janardhan...@gmail.com>
wrote:

> Thanks Krishna for your response.
> Features in the training set have more categories than in the test set,
> so when VectorAssembler is used the vector sizes usually differ. I
> believe that is expected, right?
>
> The assumption here is that the test dataset usually will not have as
> many categories in its features as the training dataset.
>
> On Sun, Aug 21, 2016 at 4:44 PM, Krishna Sankar <ksanka...@gmail.com>
> wrote:
>
>> Hi,
>>    Just after I sent the mail, I realized that the error might be with
>> the training-dataset not the test-dataset.
>>
>>    1. it might be that you are feeding the full Y vector for training.
>>    2. Which could mean, you are using ~50-50 training-test split.
>>    3. Take a good look at the code that does the data split and the
>>    datasets where they are allocated to.
>>
>> Cheers
>> <k/>
>>
>> On Sun, Aug 21, 2016 at 4:37 PM, Krishna Sankar <ksanka...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>   Looks like the test-dataset has different sizes for X & Y. Possible
>>> steps:
>>>
>>>    1. What is the test-data-size ?
>>>       - If it is 15,909, check the prediction variable vector - it is
>>>       now 29,471, should be 15,909
>>>       - If you expect it to be 29,471, then the X Matrix is not right.
>>>    2. It is also probable that the size of the test-data is
>>>    something else. If so, check the data pipeline.
>>>    3. If you print the count() of the various vectors, I think you can
>>>    find the error.
>>>
>>> Cheers & Good Luck
>>> <k/>
>>>
>>> On Sun, Aug 21, 2016 at 3:16 PM, janardhan shetty <
>>> janardhan...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have built the logistic regression model using training-dataset.
>>>> When I am predicting on a test-dataset, it is throwing the below error
>>>> of size mismatch.
>>>>
>>>> Steps done:
>>>> 1. String indexers on categorical features.
>>>> 2. One hot encoding on these indexed features.
>>>>
>>>> Any help is appreciated to resolve this issue or is it a bug ?
>>>>
>>>> SparkException: *Job aborted due to stage failure: Task 0 in stage
>>>> 635.0 failed 1 times, most recent failure: Lost task 0.0 in stage 635.0
>>>> (TID 19421, localhost): java.lang.IllegalArgumentException: requirement
>>>> failed: BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching
>>>> sizes: x.size = 15909, y.size = 29471*
>>>>   at scala.Predef$.require(Predef.scala:224)
>>>>   at org.apache.spark.ml.linalg.BLAS$.dot(BLAS.scala:104)
>>>>   at org.apache.spark.ml.classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:505)
>>>>   at org.apache.spark.ml.classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:504)
>>>>   at org.apache.spark.ml.classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:594)
>>>>   at org.apache.spark.ml.classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:484)
>>>>   at org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$1.apply(ProbabilisticClassifier.scala:112)
>>>>   at org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$1.apply(ProbabilisticClassifier.scala:111)
>>>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalExpr137$(Unknown Source)
>>>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>>>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>>>>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>>>>
>>>
>>>
>>
>
