Thanks, Nick. This Jira seems to have been stagnant for a while. Any update on when this will be released?
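For anyone else hitting this: the failure mode Nick describes below can be reproduced outside Spark. This is a minimal plain-Python sketch (ToyOneHotEncoder is a hypothetical stand-in for the StringIndexer -> OneHotEncoder combo, not Spark API). The usual workaround is to fit the encoder once on the training data and reuse that same fitted encoder to transform the test data, so both produce vectors of the same size:

```python
class ToyOneHotEncoder:
    """Toy stand-in for StringIndexer -> OneHotEncoder: maps each
    distinct category seen in fit() to a one-hot vector."""

    def fit(self, values):
        self.categories = sorted(set(values))
        return self

    def transform(self, values):
        index = {c: i for i, c in enumerate(self.categories)}
        size = len(self.categories)
        return [[1.0 if index.get(v) == i else 0.0 for i in range(size)]
                for v in values]

train = ["a", "b", "c", "d"]  # 4 distinct categories
test = ["a", "b"]             # only 2 distinct categories

# Fitting separately on train and test yields different vector sizes,
# which is exactly what makes BLAS.dot fail with non-matching sizes:
train_vecs = ToyOneHotEncoder().fit(train).transform(train)
bad_test_vecs = ToyOneHotEncoder().fit(test).transform(test)
print(len(train_vecs[0]), len(bad_test_vecs[0]))  # 4 2 -> mismatch

# Fitting once on the training data and reusing that fitted encoder
# keeps train and test vectors the same size:
enc = ToyOneHotEncoder().fit(train)
good_test_vecs = enc.transform(test)
print(len(good_test_vecs[0]))  # 4 -> matches the model's expected size
```

In Spark ML terms this corresponds to fitting the whole Pipeline on the training DataFrame and calling transform() on the test DataFrame with the resulting PipelineModel, rather than fitting any stage on the test data.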
On Mon, Aug 22, 2016 at 5:07 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
> I believe it may be because of this issue
> (https://issues.apache.org/jira/browse/SPARK-13030). OHE is not an
> estimator - hence in cases where the number of categories differs
> between train and test, it's not usable in its current form.
>
> It's tricky to work around, though one option is to use feature hashing
> instead of the StringIndexer -> OHE combo (see
> https://lists.apache.org/thread.html/a7e06426fd958665985d2c4218ea2f9bf9ba136ddefe83e1ad6f1727@%3Cuser.spark.apache.org%3E
> for some details).
>
> On Mon, 22 Aug 2016 at 03:20 janardhan shetty <janardhan...@gmail.com> wrote:
>> Thanks Krishna for your response. The features in the training set have
>> more categories than those in the test set, so when VectorAssembler is
>> used these sizes are usually different - I believe that is expected,
>> right?
>>
>> The assumption here is that the test dataset will usually not have as
>> many categories in its features as the training set does.
>>
>> On Sun, Aug 21, 2016 at 4:44 PM, Krishna Sankar <ksanka...@gmail.com> wrote:
>>> Hi,
>>> Just after I sent the mail, I realized that the error might be with
>>> the training dataset, not the test dataset.
>>>
>>> 1. It might be that you are feeding the full Y vector for training.
>>> 2. That could mean you are using a ~50-50 training-test split.
>>> 3. Take a good look at the code that does the data split and the
>>>    datasets each part is allocated to.
>>>
>>> Cheers
>>> <k/>
>>>
>>> On Sun, Aug 21, 2016 at 4:37 PM, Krishna Sankar <ksanka...@gmail.com> wrote:
>>>> Hi,
>>>> Looks like the test dataset has different sizes for X & Y. Possible
>>>> steps:
>>>>
>>>> 1. What is the test-data size?
>>>>    - If it is 15,909, check the prediction variable vector - it is
>>>>      now 29,471 but should be 15,909.
>>>>    - If you expect it to be 29,471, then the X matrix is not right.
>>>> 2. It is also possible that the size of the test data is something
>>>>    else entirely; if so, check the data pipeline.
>>>> 3. If you print the count() of the various vectors, I think you can
>>>>    find the error.
>>>>
>>>> Cheers & Good Luck
>>>> <k/>
>>>>
>>>> On Sun, Aug 21, 2016 at 3:16 PM, janardhan shetty <janardhan...@gmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> I have built a logistic regression model using the training dataset.
>>>>> When I predict on a test dataset, it throws the size-mismatch error
>>>>> below.
>>>>>
>>>>> Steps done:
>>>>> 1. String indexers on categorical features.
>>>>> 2. One-hot encoding on these indexed features.
>>>>>
>>>>> Any help resolving this is appreciated - or is it a bug?
>>>>>
>>>>> SparkException: Job aborted due to stage failure: Task 0 in stage
>>>>> 635.0 failed 1 times, most recent failure: Lost task 0.0 in stage
>>>>> 635.0 (TID 19421, localhost): java.lang.IllegalArgumentException:
>>>>> requirement failed: BLAS.dot(x: Vector, y: Vector) was given Vectors
>>>>> with non-matching sizes: x.size = 15909, y.size = 29471
>>>>>   at scala.Predef$.require(Predef.scala:224)
>>>>>   at org.apache.spark.ml.linalg.BLAS$.dot(BLAS.scala:104)
>>>>>   at org.apache.spark.ml.classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:505)
>>>>>   at org.apache.spark.ml.classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:504)
>>>>>   at org.apache.spark.ml.classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:594)
>>>>>   at org.apache.spark.ml.classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:484)
>>>>>   at org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$1.apply(ProbabilisticClassifier.scala:112)
>>>>>   at org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$1.apply(ProbabilisticClassifier.scala:111)
>>>>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalExpr137$(Unknown Source)
>>>>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>>>>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>>>>>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
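Nick's feature-hashing suggestion can be sketched in a few lines of plain Python (illustrative only; in Spark the analogous tool is HashingTF, as discussed in the thread he links). The key property is that each category is hashed into one of a fixed number of buckets chosen up front, so the vector size never depends on which categories happen to appear in train versus test:

```python
import hashlib

NUM_FEATURES = 8  # fixed vector size, chosen up front

def hash_encode(value, num_features=NUM_FEATURES):
    """Hash a category string into a fixed-size indicator vector."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % num_features
    vec = [0.0] * num_features
    vec[bucket] = 1.0
    return vec

train = ["red", "green", "blue", "violet"]
test = ["red", "chartreuse"]  # category unseen in training: no problem

train_vecs = [hash_encode(v) for v in train]
test_vecs = [hash_encode(v) for v in test]

# Every vector has the same size regardless of which categories appear,
# so BLAS.dot never sees mismatched dimensions:
assert {len(v) for v in train_vecs + test_vecs} == {NUM_FEATURES}
```

The trade-off is hash collisions: two distinct categories can land in the same bucket, which is usually acceptable when num_features is large enough relative to the true number of categories.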