Thanks, Nick. This Jira seems to have been stagnant for a while. Any update on when this will be released?
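For anyone else hitting this: the failure mode Nick describes below can be reproduced outside Spark. This is a minimal plain-Python sketch (ToyOneHotEncoder is a hypothetical stand-in for the StringIndexer -> OneHotEncoder combo, not Spark API). The usual workaround is to fit the encoder once on the training data and reuse that same fitted encoder to transform the test data, so both produce vectors of the same size:

```python
class ToyOneHotEncoder:
    """Toy stand-in for StringIndexer -> OneHotEncoder: maps each
    distinct category seen in fit() to a one-hot vector."""

    def fit(self, values):
        self.categories = sorted(set(values))
        return self

    def transform(self, values):
        index = {c: i for i, c in enumerate(self.categories)}
        size = len(self.categories)
        return [[1.0 if index.get(v) == i else 0.0 for i in range(size)]
                for v in values]

train = ["a", "b", "c", "d"]  # 4 distinct categories
test = ["a", "b"]             # only 2 distinct categories

# Fitting separately on train and test yields different vector sizes,
# which is exactly what makes BLAS.dot fail with non-matching sizes:
train_vecs = ToyOneHotEncoder().fit(train).transform(train)
bad_test_vecs = ToyOneHotEncoder().fit(test).transform(test)
print(len(train_vecs[0]), len(bad_test_vecs[0]))  # 4 2 -> mismatch

# Fitting once on the training data and reusing that fitted encoder
# keeps train and test vectors the same size:
enc = ToyOneHotEncoder().fit(train)
good_test_vecs = enc.transform(test)
print(len(good_test_vecs[0]))  # 4 -> matches the model's expected size
```

In Spark ML terms this corresponds to fitting the whole Pipeline on the training DataFrame and calling transform() on the test DataFrame with the resulting PipelineModel, rather than fitting any stage on the test data.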
On Mon, Aug 22, 2016 at 5:07 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
> I believe it may be because of this issue
> (https://issues.apache.org/jira/browse/SPARK-13030). OHE is not an
> estimator - hence in cases where the number of categories differs
> between train and test, it's not usable in its current form.
>
> It's tricky to work around, though one option is to use feature hashing
> instead of the StringIndexer -> OHE combo (see
> https://lists.apache.org/thread.html/a7e06426fd958665985d2c4218ea2f9bf9ba136ddefe83e1ad6f1727@%3Cuser.spark.apache.org%3E
> for some details).
>
> On Mon, 22 Aug 2016 at 03:20 janardhan shetty <janardhan...@gmail.com> wrote:
>> Thanks Krishna for your response. The features in the training set have
>> more categories than those in the test set, so when VectorAssembler is
>> used these sizes are usually different - I believe that is expected,
>> right?
>>
>> The assumption here is that the test dataset will usually not have as
>> many categories in its features as the training set does.
>>
>> On Sun, Aug 21, 2016 at 4:44 PM, Krishna Sankar <ksanka...@gmail.com> wrote:
>>> Hi,
>>> Just after I sent the mail, I realized that the error might be with
>>> the training dataset, not the test dataset.
>>>
>>> 1. It might be that you are feeding the full Y vector for training.
>>> 2. That could mean you are using a ~50-50 training-test split.
>>> 3. Take a good look at the code that does the data split and the
>>>    datasets each part is allocated to.
>>>
>>> Cheers
>>> <k/>
>>>
>>> On Sun, Aug 21, 2016 at 4:37 PM, Krishna Sankar <ksanka...@gmail.com> wrote:
>>>> Hi,
>>>> Looks like the test dataset has different sizes for X & Y. Possible
>>>> steps:
>>>>
>>>> 1. What is the test-data size?
>>>>    - If it is 15,909, check the prediction variable vector - it is
>>>>      now 29,471 but should be 15,909.
>>>>    - If you expect it to be 29,471, then the X matrix is not right.
>>>> 2. It is also possible that the size of the test data is something
>>>>    else entirely; if so, check the data pipeline.
>>>> 3. If you print the count() of the various vectors, I think you can
>>>>    find the error.
>>>>
>>>> Cheers & Good Luck
>>>> <k/>
>>>>
>>>> On Sun, Aug 21, 2016 at 3:16 PM, janardhan shetty <janardhan...@gmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> I have built a logistic regression model using the training dataset.
>>>>> When I predict on a test dataset, it throws the size-mismatch error
>>>>> below.
>>>>>
>>>>> Steps done:
>>>>> 1. String indexers on categorical features.
>>>>> 2. One-hot encoding on these indexed features.
>>>>>
>>>>> Any help resolving this is appreciated - or is it a bug?
>>>>>
>>>>> SparkException: Job aborted due to stage failure: Task 0 in stage
>>>>> 635.0 failed 1 times, most recent failure: Lost task 0.0 in stage
>>>>> 635.0 (TID 19421, localhost): java.lang.IllegalArgumentException:
>>>>> requirement failed: BLAS.dot(x: Vector, y: Vector) was given Vectors
>>>>> with non-matching sizes: x.size = 15909, y.size = 29471
>>>>>   at scala.Predef$.require(Predef.scala:224)
>>>>>   at org.apache.spark.ml.linalg.BLAS$.dot(BLAS.scala:104)
>>>>>   at org.apache.spark.ml.classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:505)
>>>>>   at org.apache.spark.ml.classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:504)
>>>>>   at org.apache.spark.ml.classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:594)
>>>>>   at org.apache.spark.ml.classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:484)
>>>>>   at org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$1.apply(ProbabilisticClassifier.scala:112)
>>>>>   at org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$1.apply(ProbabilisticClassifier.scala:111)
>>>>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalExpr137$(Unknown Source)
>>>>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>>>>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>>>>>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
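Nick's feature-hashing suggestion can be sketched in a few lines of plain Python (illustrative only; in Spark the analogous tool is HashingTF, as discussed in the thread he links). The key property is that each category is hashed into one of a fixed number of buckets chosen up front, so the vector size never depends on which categories happen to appear in train versus test:

```python
import hashlib

NUM_FEATURES = 8  # fixed vector size, chosen up front

def hash_encode(value, num_features=NUM_FEATURES):
    """Hash a category string into a fixed-size indicator vector."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % num_features
    vec = [0.0] * num_features
    vec[bucket] = 1.0
    return vec

train = ["red", "green", "blue", "violet"]
test = ["red", "chartreuse"]  # category unseen in training: no problem

train_vecs = [hash_encode(v) for v in train]
test_vecs = [hash_encode(v) for v in test]

# Every vector has the same size regardless of which categories appear,
# so BLAS.dot never sees mismatched dimensions:
assert {len(v) for v in train_vecs + test_vecs} == {NUM_FEATURES}
```

The trade-off is hash collisions: two distinct categories can land in the same bucket, which is usually acceptable when num_features is large enough relative to the true number of categories.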