Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-22 Thread janardhan shetty
Thanks, Nick.
This JIRA seems to have been stagnant for a while. Is there any update on when
the fix will be released?



Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-22 Thread Nick Pentreath
I believe it may be because of this issue
(https://issues.apache.org/jira/browse/SPARK-13030). OHE is not an estimator,
hence in cases where the number of categories differs between train and
test, it's not usable in its current form.

It's tricky to work around, though one option is to use feature hashing
instead of the StringIndexer -> OHE combo (see
https://lists.apache.org/thread.html/a7e06426fd958665985d2c4218ea2f9bf9ba136ddefe83e1ad6f1727@%3Cuser.spark.apache.org%3E
for some details).
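
To illustrate why feature hashing sidesteps the problem, here is a minimal sketch in plain Python (not Spark code). The helper name hash_features, the column names, and the bucket count of 16 are all made up for illustration. The idea is only that the output width is fixed up front, so it cannot depend on which categories happen to appear in a given dataset.

```python
import zlib

def hash_features(row, num_features=16):
    """Map {column: category} pairs into a fixed-length count vector."""
    vec = [0.0] * num_features
    for column, category in row.items():
        # crc32 is deterministic across runs, unlike the built-in hash() for str
        key = f"{column}={category}".encode("utf-8")
        vec[zlib.crc32(key) % num_features] += 1.0
    return vec

# Hypothetical rows: the test row holds categories never seen in training
train_row = {"city": "Paris", "device": "mobile"}
test_row = {"city": "Lima", "device": "tablet"}

train_vec = hash_features(train_row)
test_vec = hash_features(test_row)

# Both vectors have the same width regardless of the categories involved
print(len(train_vec), len(test_vec))
```

The trade-off is that two different categories may collide in the same bucket, so the bucket count should be chosen generously relative to the number of distinct categories.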





Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread janardhan shetty
Thanks, Krishna, for your response.
Features in the training set have more categories than in the test set, so when
VectorAssembler is used the resulting vector sizes usually differ. I believe
that is expected, right?

The assumption here is that the test dataset will usually not have as many
categories in its features as the training dataset.
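
The situation described above can be sketched in plain Python (not Spark code; fit_index, one_hot, and the colour values are hypothetical names and data). If the category index is derived separately from each dataset, the encoded widths differ. If it is derived once from the training data and reused, they match.

```python
def fit_index(values):
    """Assign each distinct category a position, in order of first appearance."""
    index = {}
    for v in values:
        index.setdefault(v, len(index))
    return index

def one_hot(value, index):
    """Encode one value against a fixed index; unseen values become all zeros."""
    vec = [0.0] * len(index)
    if value in index:
        vec[index[value]] = 1.0
    return vec

train = ["red", "green", "blue", "green"]
test = ["green", "red"]  # fewer distinct categories than train

# Deriving the index from each dataset separately gives different widths
print(len(fit_index(train)), len(fit_index(test)))

# Reusing the index learned on train keeps the test vectors the same width
train_index = fit_index(train)
encoded_test = [one_hot(v, train_index) for v in test]
print([len(v) for v in encoded_test])
```

Mapping unseen values to an all-zero vector is just one possible choice made for this sketch; raising an error or using a dedicated "unknown" slot would work as well.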



Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread Krishna Sankar
Hi,
   Just after I sent the mail, I realized that the error might be with the
training dataset, not the test dataset.

   1. It might be that you are feeding the full Y vector for training.
   2. That could mean you are using a ~50-50 training-test split.
   3. Take a good look at the code that does the data split and at which
      datasets the pieces are allocated to.

Cheers




Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread Krishna Sankar
Hi,
  Looks like the test-dataset has different sizes for X & Y. Possible steps:

   1. What is the test-data-size?
      - If it is 15,909, check the prediction variable vector - it is now
        29,471, should be 15,909.
      - If you expect it to be 29,471, then the X matrix is not right.
   2. It is also probable that the size of the test-data is something
      else. If so, check the data pipeline.
   3. If you print the count() of the various vectors, I think you can find
      the error.

Cheers & Good Luck
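
A rough sketch of the kind of check suggested in step 3, in plain Python rather than Spark code. The helper size_histogram and the toy widths (5 and 3) are invented for illustration. It tallies how many feature vectors of each length a dataset holds, which makes a width mismatch between train and test visible at a glance.

```python
from collections import Counter

def size_histogram(vectors):
    """Count how many feature vectors there are of each length."""
    return Counter(len(v) for v in vectors)

# Toy stand-ins for the assembled feature columns of each dataset
train_features = [[0.0] * 5 for _ in range(4)]
test_features = [[0.0] * 3 for _ in range(2)]

train_sizes = size_histogram(train_features)
test_sizes = size_histogram(test_features)

print("train:", dict(train_sizes))
print("test:", dict(test_sizes))

if set(train_sizes) != set(test_sizes):
    print("feature widths differ between train and test")
```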


On Sun, Aug 21, 2016 at 3:16 PM, janardhan shetty wrote:

> Hi,
>
> I have built a logistic regression model using the training dataset.
> When I predict on a test dataset, it throws the size-mismatch error
> below.
>
> Steps done:
> 1. String indexers on categorical features.
> 2. One-hot encoding on these indexed features.
>
> Any help resolving this issue is appreciated. Or is it a bug?
>
> SparkException: *Job aborted due to stage failure: Task 0 in stage 635.0
> failed 1 times, most recent failure: Lost task 0.0 in stage 635.0 (TID
> 19421, localhost): java.lang.IllegalArgumentException: requirement failed:
> BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes:
> x.size = 15909, y.size = 29471*
>   at scala.Predef$.require(Predef.scala:224)
>   at org.apache.spark.ml.linalg.BLAS$.dot(BLAS.scala:104)
>   at org.apache.spark.ml.classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:505)
>   at org.apache.spark.ml.classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:504)
>   at org.apache.spark.ml.classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:594)
>   at org.apache.spark.ml.classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:484)
>   at org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$1.apply(ProbabilisticClassifier.scala:112)
>   at org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$1.apply(ProbabilisticClassifier.scala:111)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalExpr137$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>
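
The requirement that fails in the trace above can be mimicked with a small plain-Python sketch. This is not Spark's implementation: the dot function here is a stand-in, and reading x as the test feature vector and y as the model's coefficient vector is an assumption based on the trace passing through predictRaw.

```python
def dot(x, y):
    """Dot product that refuses vectors of different lengths."""
    if len(x) != len(y):
        raise ValueError(
            f"vectors with non-matching sizes: x.size = {len(x)}, y.size = {len(y)}"
        )
    return sum(a * b for a, b in zip(x, y))

# Assumed roles: a test row encoded to 15909 features, a model trained on 29471
features = [1.0] * 15909
coefficients = [0.5] * 29471

try:
    dot(features, coefficients)
except ValueError as err:
    message = str(err)
    print(message)
```

Under that reading, the model was trained on wider vectors than the ones it is asked to score, which is consistent with the training set containing more categories than the test set.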