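StringIndexer is an estimator: fitting it on the training data produces a model (a StringIndexerModel) that holds the string-to-index map, and that same model is used in both training and prediction. So the mapping is consistent between training and prediction, provided you transform the new data with the already-fitted pipeline model instead of fitting the pipeline again.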
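To make the estimator/model split concrete, here is a minimal pure-Python sketch. It is a toy stand-in and not Spark's API: `ToyStringIndexer` and `ToyStringIndexerModel` are hypothetical names, and the alphabetical tie-break is an assumption of the sketch. It uses the A/B/C example from the question below and compares reusing the fitted model with refitting on the new data.

```python
from collections import Counter


class ToyStringIndexerModel:
    """Fitted model: holds a frozen label -> index map (hypothetical toy class)."""

    def __init__(self, mapping):
        self.mapping = mapping

    def transform(self, labels):
        # Look up each label in the map learned at fit time.
        # A label never seen during fit raises KeyError in this toy.
        return [self.mapping[label] for label in labels]


class ToyStringIndexer:
    """Estimator: learns the label -> index map from data (hypothetical toy class)."""

    def fit(self, labels):
        counts = Counter(labels)
        # Most frequent label gets index 0.0; ties are broken alphabetically
        # (the tie-break is an assumption of this sketch).
        ordered = sorted(counts, key=lambda s: (-counts[s], s))
        return ToyStringIndexerModel({s: float(i) for i, s in enumerate(ordered)})


train = ["A"] * 3 + ["B"] * 2 + ["C"]      # A is most frequent, then B, then C
new_data = ["C"] * 3 + ["B"] * 2 + ["A"]   # C is most frequent, then B, then A

model = ToyStringIndexer().fit(train)                        # fit once, on training data
reused = model.transform(new_data)                           # reuse the fitted model
refit = ToyStringIndexer().fit(new_data).transform(new_data) # refit on new data (the mistake)

print(model.mapping)  # {'A': 0.0, 'B': 1.0, 'C': 2.0}
print(reused)         # [2.0, 2.0, 2.0, 1.0, 1.0, 0.0]
print(refit)          # [0.0, 0.0, 0.0, 1.0, 1.0, 2.0]
```

Reusing the fitted model keeps C at 2.0, as in training, while refitting moves C to 0.0. In Spark terms this corresponds to calling `fit` on the training data once and then calling `transform` on the returned pipeline model for the new data.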
You may want to read this section of the Spark ML docs:
http://spark.apache.org/docs/latest/ml-guide.html#how-it-works

On Mon, Nov 30, 2015 at 12:52 AM, Vishnu Viswanath <vishnu.viswanat...@gmail.com> wrote:

> Thanks for the reply Yanbo.
>
> I understand that the model will be trained using the indexer map created
> during the training stage.
>
> But since I am getting a new set of data during prediction, I have to
> do StringIndexing on the new data also.
> Right now I am using a new StringIndexer for this purpose. Is there any
> way that I can reuse the indexer used for the training stage?
>
> Note: I have a pipeline with a StringIndexer in it, and I am fitting my
> train data in it and building the model. Then later, when I get the new data
> for prediction, I am using the same pipeline to fit the data again and do
> the prediction.
>
> Thanks and Regards,
> Vishnu Viswanath
>
> On Sun, Nov 29, 2015 at 8:14 AM, Yanbo Liang <yblia...@gmail.com> wrote:
>
>> Hi Vishnu,
>>
>> The string and indexer map is generated at the model training step and
>> used at the model prediction step.
>> It means that the string and indexer map will not change during
>> prediction. You will use the original trained model when you do
>> prediction.
>>
>> 2015-11-29 4:33 GMT+08:00 Vishnu Viswanath <vishnu.viswanat...@gmail.com>:
>> > Hi All,
>> >
>> > I have a general question on using StringIndexer.
>> > StringIndexer gives an index to each label in the feature, starting
>> > from 0 (0 for the most frequent word).
>> >
>> > Suppose I am building a model, and I use StringIndexer for transforming
>> > one of my columns.
>> > e.g., suppose A was the most frequent word, followed by B and C.
>> >
>> > So the StringIndexer will generate
>> >
>> > A 0.0
>> > B 1.0
>> > C 2.0
>> >
>> > After building the model, I am going to do some prediction using this
>> > model, so I do the same transformation on the new data which I need to
>> > predict. And suppose the new dataset has C as the most frequent word,
>> > followed by B and A. So the StringIndexer will assign indexes as
>> >
>> > C 0.0
>> > B 1.0
>> > A 2.0
>> >
>> > These indexes are different from what we used for modeling. So won’t
>> > this give me a wrong prediction if I use StringIndexer?
>> >
>> > --
>> > Thanks and Regards,
>> > Vishnu Viswanath,
>> > www.vishnuviswanath.com
>
> --
> Thanks and Regards,
> Vishnu Viswanath,
> www.vishnuviswanath.com <http://www.vishnuviswanath.com>

--
Best Regards

Jeff Zhang