Re: General question on using StringIndexer in SparkML

Jeff Zhang Sun, 29 Nov 2015 17:37:03 -0800

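StringIndexer is an estimator: fitting it on the training data produces a model
that holds the string-to-index map, and that same fitted model is used in both
training and prediction. So the indexes stay consistent between training and
prediction, as long as you transform the new data with the fitted model rather
than fitting again.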


You may want to read this section of the Spark ML guide:
http://spark.apache.org/docs/latest/ml-guide.html#how-it-works
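
To make the fit-once, transform-many idea concrete, here is a minimal sketch in
plain Python. It is not Spark code: ToyStringIndexer and ToyStringIndexerModel
are hypothetical stand-ins that only mimic the frequency-ordered indexing
described in this thread, and the alphabetical tie-break is an assumption made
for this example. It uses the A/B/C data from the original question.

```python
from collections import Counter


class ToyStringIndexerModel:
    """Toy stand-in for a fitted model: it only stores the learned map."""

    def __init__(self, mapping):
        self.mapping = mapping

    def transform(self, values):
        # Apply the stored map; nothing is re-learned from `values`.
        return [self.mapping[v] for v in values]


class ToyStringIndexer:
    """Toy stand-in for the estimator: fit() learns the label -> index map."""

    def fit(self, values):
        counts = Counter(values)
        # Most frequent label gets index 0.0. Ties are broken alphabetically
        # here purely to keep the example deterministic (an assumption).
        ordered = sorted(counts, key=lambda v: (-counts[v], v))
        return ToyStringIndexerModel({v: float(i) for i, v in enumerate(ordered)})


train = ["A", "A", "A", "B", "B", "C"]      # A most frequent, then B, then C
new_data = ["C", "C", "C", "B", "B", "A"]   # C most frequent, then B, then A

# Fit once on the training data, then reuse the fitted model on new data.
model = ToyStringIndexer().fit(train)
reused = model.transform(new_data)

# Fitting again on the new data learns a different map (the problem case).
refit = ToyStringIndexer().fit(new_data).transform(new_data)

print(model.mapping)
print(reused)
print(refit)
```

In Spark terms, this corresponds to calling fit on the pipeline once with the
training data, keeping the fitted pipeline model it returns, and calling
transform on that fitted model for the new data instead of calling fit again.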



On Mon, Nov 30, 2015 at 12:52 AM, Vishnu Viswanath <
vishnu.viswanat...@gmail.com> wrote:

> Thanks for the reply Yanbo.
>
> I understand that the model will be trained using the indexer map created
> during the training stage.
>
> But I get a new set of data during prediction, and I have to do
> StringIndexing on the new data as well.
> Right now I am using a new StringIndexer for this purpose. Is there any
> way that I can reuse the indexer used in the training stage?
>
> Note: I have a pipeline with a StringIndexer in it, and I am fitting my
> train data in it and building the model. Then later, when I get the new data
> for prediction, I am using the same pipeline to fit the data again and do
> the prediction.
>
> Thanks and Regards,
> Vishnu Viswanath
>
>
> On Sun, Nov 29, 2015 at 8:14 AM, Yanbo Liang <yblia...@gmail.com> wrote:
>
>> Hi Vishnu,
>>
>> The string-to-index map is generated at the model training step and
>> used at the model prediction step.
>> This means that the string-to-index map will not change at prediction
>> time. You will use the original trained model when you do
>> prediction.
>>
>> 2015-11-29 4:33 GMT+08:00 Vishnu Viswanath <vishnu.viswanat...@gmail.com>:
>> > Hi All,
>> >
>> > I have a general question on using StringIndexer.
>> > StringIndexer gives an index to each label in the feature, starting from
>> > 0 (0 for the most frequent word).
>> >
>> > Suppose I am building a model, and I use StringIndexer for transforming
>> > one of my columns.
>> > e.g., suppose A was the most frequent word, followed by B and C.
>> >
>> > So the StringIndexer will generate
>> >
>> > A  0.0
>> > B  1.0
>> > C  2.0
>> >
>> > After building the model, I am going to do some prediction using this
>> > model, so I do the same transformation on the new data which I need to
>> > predict on. Suppose the new dataset has C as the most frequent word,
>> > followed by B and A. Then the StringIndexer will assign the indexes as
>> >
>> > C 0.0
>> > B 1.0
>> > A 2.0
>> >
>> > These indexes are different from what we used for modeling. So won’t
>> > this give me a wrong prediction if I use StringIndexer?
>> >
>> > --
>> > Thanks and Regards,
>> > Vishnu Viswanath,
>> > www.vishnuviswanath.com
>>
>
>
>
> --
> Thanks and Regards,
> Vishnu Viswanath,
> *www.vishnuviswanath.com <http://www.vishnuviswanath.com>*
>



-- 
Best Regards

Jeff Zhang
