Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-22 Thread janardhan shetty
Thanks Nick. This JIRA seems to have been stagnant for a while. Any update on when this will be released? On Mon, Aug 22, 2016 at 5:07 AM, Nick Pentreath wrote: > I believe it may be because of this issue (https://issues.apache.org/jira/browse/SPARK-13030). OHE is

Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-22 Thread Nick Pentreath
I believe it may be because of this issue (https://issues.apache.org/jira/browse/SPARK-13030). OHE is not an estimator, hence in cases where the number of categories differs between train and test, it's not usable in its current form. It's tricky to work around, though one option is to use
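To illustrate the point about the encoder not being an estimator, here is a minimal pure-Python sketch (not Spark code, and simpler than Spark's actual encoder). It shows why an encoder that derives its category set from whatever data it is handed produces vectors of different widths for train and test, and how reusing an index fitted on the training data keeps the width stable. The function names and sample values are hypothetical.

```python
def fit_index(values):
    """Map each distinct category to a position, in first-seen order."""
    index = {}
    for v in values:
        index.setdefault(v, len(index))
    return index


def one_hot(value, index):
    """Encode one value against a fixed index; unseen values become all zeros."""
    vec = [0.0] * len(index)
    pos = index.get(value)
    if pos is not None:
        vec[pos] = 1.0
    return vec


# Illustrative data: the test set contains fewer categories than the training set.
train = ["red", "green", "blue", "green"]
test = ["green", "red"]

# Stateless encoding: each dataset builds its own index, so the widths differ.
train_width = len(one_hot(train[0], fit_index(train)))
test_width = len(one_hot(test[0], fit_index(test)))

# Fitted encoding: the index learned on the training data is reused for the test data.
shared = fit_index(train)
test_fixed = [one_hot(v, shared) for v in test]
```

With the shared index, every test vector has the same width as the training vectors, which is what a model trained on the training vectors expects.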

Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread janardhan shetty
Thanks Krishna for your response. Features in the training set have more categories than in the test set, so when VectorAssembler is used the resulting vector sizes usually differ, and I believe that is expected, right? The test dataset usually will not have so many categories in its features as Train is the

Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread Krishna Sankar
Hi, Just after I sent the mail, I realized that the error might be with the training dataset, not the test dataset. 1. It might be that you are feeding the full Y vector for training. 2. That could mean you are using a ~50-50 training-test split. 3. Take a good look at the code that

Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread Krishna Sankar
Hi, Looks like the test dataset has different sizes for X and Y. Possible steps:
1. What is the test data size?
   - If it is 15,909, check the prediction variable vector: it is now 29,471 and should be 15,909.
   - If you expect it to be 29,471, then the X matrix is not right.
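The checks above amount to comparing row counts and feature-vector widths. A minimal sketch in plain Python, assuming the feature vectors and labels have been collected into ordinary lists; the helper names `vector_widths` and `check_shapes` and the tiny sample data are hypothetical, not part of Spark.

```python
from collections import Counter


def vector_widths(rows):
    """Return a Counter mapping feature-vector length to number of rows."""
    return Counter(len(row) for row in rows)


def check_shapes(features, labels, expected_width):
    """Collect human-readable problems instead of raising on the first one."""
    problems = []
    if len(features) != len(labels):
        problems.append(
            f"row count mismatch: {len(features)} features vs {len(labels)} labels"
        )
    for width, count in sorted(vector_widths(features).items()):
        if width != expected_width:
            problems.append(
                f"{count} row(s) have width {width}, expected {expected_width}"
            )
    return problems


# Illustrative data: the model was trained on width-3 vectors, the test rows have width 2.
X_test = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
y_test = [1, 0, 1]
issues = check_shapes(X_test, y_test, expected_width=3)
```

An empty list from `check_shapes` would mean the row counts agree and every vector has the width the model was trained on.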

Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread janardhan shetty
Hi, I have built a logistic regression model using the training dataset. When I predict on the test dataset, it throws the size-mismatch error below. Steps done: 1. String indexers on categorical features. 2. One-hot encoding on these indexed features. Any help is appreciated to