Re: General question on using StringIndexer in SparkML

Yanbo Liang Wed, 02 Dec 2015 18:13:17 -0800

You can currently get 1.6.0-RC1 from
http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-rc1-bin/
but it's not the final release version.
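
For readers following the "handleInvalid" suggestion quoted further down, the difference between failing on an unseen label and skipping it can be illustrated with a minimal plain-Python sketch. This is a stand-in that only mimics the behaviour discussed in this thread; it is not Spark's implementation, and the function name `index_labels`, its `handle_invalid` argument, and the sample data are all hypothetical.

```python
def index_labels(labels, label_index, handle_invalid="error"):
    """Map labels to indexes using a map built from the training data.

    handle_invalid="error": raise on a label missing from the map.
    handle_invalid="skip":  silently drop entries with such labels.
    """
    if handle_invalid not in ("error", "skip"):
        raise ValueError(f"unsupported handle_invalid: {handle_invalid!r}")
    indexed = []
    for label in labels:
        if label in label_index:
            indexed.append((label, label_index[label]))
        elif handle_invalid == "error":
            raise ValueError(f"Unseen label: {label}")
        # "skip": drop the entry and carry on with the next label
    return indexed


# Hypothetical map learned from training data that never contained "D".
label_index = {"A": 0.0, "B": 1.0, "C": 2.0}
test_labels = ["A", "D", "C"]

skipped = index_labels(test_labels, label_index, handle_invalid="skip")
print(skipped)  # [('A', 0.0), ('C', 2.0)]
```

With "skip", the entry for "D" is dropped from the output, so any downstream prediction simply has no result for it. Whether that is acceptable depends on the application.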


2015-12-02 23:57 GMT+08:00 Vishnu Viswanath <vishnu.viswanat...@gmail.com>:

> Thank you Yanbo,
>
> It looks like this is available only in version 1.6.
> Can you tell me how and when I can download version 1.6?
>
> Thanks and Regards,
> Vishnu Viswanath,
>
> On Wed, Dec 2, 2015 at 4:37 AM, Yanbo Liang <yblia...@gmail.com> wrote:
>
>> You can set "handleInvalid" to "skip", which helps you skip the labels
>> that do not exist in the training dataset.
>>
>> 2015-12-02 14:31 GMT+08:00 Vishnu Viswanath <vishnu.viswanat...@gmail.com>:
>>
>>> Hi Jeff,
>>>
>>> I went through the link you provided, and I now understand how
>>> fit() and transform() work.
>>> I tried to use the pipeline in my code and I am getting an exception: Caused
>>> by: org.apache.spark.SparkException: Unseen label:
>>>
>>> The reason for this error, as per my understanding, is:
>>> For the column on which I am doing StringIndexing, the test data
>>> has values which were not there in the train data.
>>> Since fit() is done only on the train data, the indexing fails.
>>>
>>> Can you suggest what can be done in this situation?
>>>
>>> Thanks,
>>>
>>> On Mon, Nov 30, 2015 at 12:32 AM, Vishnu Viswanath <
>>> vishnu.viswanat...@gmail.com> wrote:
>>>
>>>> Thank you Jeff.
>>>>
>>>> On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>>
>>>>> StringIndexer is an estimator that trains a model, and that model is
>>>>> used in both training & prediction. So the indexing is consistent
>>>>> between training & prediction.
>>>>>
>>>>> You may want to read this section of the Spark ML doc:
>>>>> http://spark.apache.org/docs/latest/ml-guide.html#how-it-works
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Nov 30, 2015 at 12:52 AM, Vishnu Viswanath <
>>>>> vishnu.viswanat...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for the reply Yanbo.
>>>>>>
>>>>>> I understand that the model will be trained using the indexer map
>>>>>> created during the training stage.
>>>>>>
>>>>>> But since I am getting a new set of data during prediction, I
>>>>>> have to do StringIndexing on the new data as well.
>>>>>> Right now I am using a new StringIndexer for this purpose. Is
>>>>>> there any way that I can reuse the indexer used for the training stage?
>>>>>>
>>>>>> Note: I have a pipeline with a StringIndexer in it, and I am
>>>>>> fitting my train data with it and building the model. Later, when I get
>>>>>> the new data for prediction, I am using the same pipeline to fit the data
>>>>>> again and do the prediction.
>>>>>>
>>>>>> Thanks and Regards,
>>>>>> Vishnu Viswanath
>>>>>>
>>>>>>
>>>>>> On Sun, Nov 29, 2015 at 8:14 AM, Yanbo Liang <yblia...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Vishnu,
>>>>>>>
>>>>>>> The string-to-index map is generated at the model training step and
>>>>>>> used at the model prediction step.
>>>>>>> It means that the string-to-index map will not change during
>>>>>>> prediction. You will use the original trained model when you do
>>>>>>> prediction.
>>>>>>>
>>>>>>> 2015-11-29 4:33 GMT+08:00 Vishnu Viswanath <
>>>>>>> vishnu.viswanat...@gmail.com>:
>>>>>>> > Hi All,
>>>>>>> >
>>>>>>> > I have a general question on using StringIndexer.
>>>>>>> > StringIndexer gives an index to each label in the feature, starting
>>>>>>> > from 0 (0 for the most frequent word).
>>>>>>> >
>>>>>>> > Suppose I am building a model, and I use StringIndexer for
>>>>>>> > transforming one of my columns.
>>>>>>> > e.g., suppose A was the most frequent word, followed by B and C.
>>>>>>> >
>>>>>>> > So the StringIndexer will generate
>>>>>>> >
>>>>>>> > A  0.0
>>>>>>> > B  1.0
>>>>>>> > C  2.0
>>>>>>> >
>>>>>>> > After building the model, I am going to do some prediction using
>>>>>>> > this model, so I do the same transformation on the new data which I
>>>>>>> > need to predict. Suppose the new dataset has C as the most frequent
>>>>>>> > word, followed by B and A. Then the StringIndexer will assign indexes as
>>>>>>> >
>>>>>>> > C 0.0
>>>>>>> > B 1.0
>>>>>>> > A 2.0
>>>>>>> >
>>>>>>> > These indexes are different from what we used for modeling. So
>>>>>>> > won’t this give me a wrong prediction if I use StringIndexer?
>>>>>>> >
>>>>>>> >
>>>>>>>
>>>>>>
>>>>>>
>>>>> --
>>>>> Best Regards
>>>>>
>>>>> Jeff Zhang
>>>>>
>>>>
>>>>
>>>>
>>>> 
>>>
>>
>>
>
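
The answer to the original question in this thread (fit once on the training data, then reuse the fitted map for new data) can be illustrated with a short plain-Python sketch. It is only a stand-in for the estimator/model split described above, not Spark code: `fit_label_index` and `apply_label_index` are hypothetical names, and breaking frequency ties alphabetically is an assumption made here so the output is deterministic.

```python
from collections import Counter


def fit_label_index(labels):
    """Learn a label -> index map, giving index 0.0 to the most frequent label.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def apply_label_index(labels, label_index):
    """Index labels with an already fitted map; nothing is re-learned here."""
    return [label_index[label] for label in labels]


train = ["A", "A", "A", "B", "B", "C"]      # A most frequent, then B, then C
new_data = ["C", "C", "C", "B", "B", "A"]   # C most frequent, then B, then A

# Fit once, on the training data only.
label_index = fit_label_index(train)

# Reuse the fitted map for prediction: C keeps the index it had in training.
new_indexed = apply_label_index(new_data, label_index)

# Re-fitting on the new data (what the question worried about) changes the map.
refit_index = fit_label_index(new_data)

print(label_index)   # {'A': 0.0, 'B': 1.0, 'C': 2.0}
print(new_indexed)   # [2.0, 2.0, 2.0, 1.0, 1.0, 0.0]
print(refit_index)   # {'C': 0.0, 'B': 1.0, 'A': 2.0}
```

The mismatch in the last line is what would happen if the pipeline were fitted again on the prediction data. Applying the map learned during training avoids it.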
