Re: Spark ML Random Forest output.

2015-12-05 Thread Eugene Morozov
Benjamin, thanks a lot!

--
Be well!
Jean Morozov

On Sat, Dec 5, 2015 at 3:46 PM, Benjamin Fradet wrote:

> [...]

Re: Spark ML Random Forest output.

2015-12-05 Thread Benjamin Fradet
Hi,

To get back the original labels after indexing them with StringIndexer, I
usually use IndexToString, like so:

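import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

// Fit the indexer up front so that its labels are available below.
val labelIndexer = new StringIndexer()
  .setInputCol(myInputLabelColumnName)
  .setOutputCol(myIndexedLabelColumnName)
  .fit(myData)

// Train on the indexed label column.
val randomForest = new RandomForestClassifier()
  .setLabelCol(myIndexedLabelColumnName)
  .setFeaturesCol(myFeaturesColumnName)

// Map each predicted index back to the original label.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol(myPredictionColumnWithTheOriginalLabels)
  .setLabels(labelIndexer.labels)

val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, randomForest, labelConverter))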

Hope that helps,
Ben.

On Sat, Dec 5, 2015 at 12:26 PM, Eugene Morozov wrote:

> [...]

Re: Spark ML Random Forest output.

2015-12-05 Thread Eugene Morozov
Figured that out.

StringIndexerModel has a field / method labels(), which returns the array of
labels.
Currently the prediction is an index into that array, which is subject to
change: https://issues.apache.org/jira/browse/SPARK-7126.

With my pipeline model serialized to a file and read back from it, the labels
are available like this:

((StringIndexerModel)readModel.stages()[0]).labels()

readModel here is a PipelineModel.
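
For illustration, the lookup from a prediction to its original label then
comes down to an array access. The sketch below is plain Java with no Spark
dependency: the labels array is copied from the "vals" metadata in the
original post and stands in for what labels() returned there, and the class
and method names are hypothetical.

```java
// Illustrative only: PredictionToLabel is a hypothetical helper, not Spark API.
final class PredictionToLabel {

    // Converts a prediction such as 7.0 into the label stored at that index.
    static String labelFor(double prediction, String[] labels) {
        int index = (int) prediction;
        // Reject fractional, negative or out-of-range predictions.
        if (index != prediction || index < 0 || index >= labels.length) {
            throw new IllegalArgumentException("not a valid label index: " + prediction);
        }
        return labels[index];
    }

    public static void main(String[] args) {
        // Order taken from the "vals" metadata quoted in the original post.
        String[] labels = {"1.0", "7.0", "3.0", "9.0", "2.0", "6.0", "0.0", "4.0", "8.0", "5.0"};
        System.out.println(labelFor(7.0, labels)); // 4.0
        System.out.println(labelFor(3.0, labels)); // 9.0
        System.out.println(labelFor(4.0, labels)); // 2.0
    }
}
```

With these labels, the three sample rows from the original post (predictions
7.0, 3.0 and 4.0) map back to 4.0, 9.0 and 2.0, which are their true labels.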


--
Be well!
Jean Morozov

On Sat, Dec 5, 2015 at 12:06 PM, Eugene Morozov wrote:

> [...]


Re: Spark ML Random Forest output.

2015-12-05 Thread Eugene Morozov
Vishnu, thanks for the response.

The problem is that I do not actually have the index labels: they are hidden
in the DataFrame as metadata, and anyone who wants to use them has to apply
an ugly hack.

The issue might be even worse if I serialize my model to a file for later
use. When I read it back from the file, I do not have such a map at all. The
only workaround is to store the map along with the serialized model, which is
not really great.
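
A minimal sketch of that workaround, in plain Java with no Spark dependency:
the labels are written one per line to a file kept next to the model, so the
line number doubles as the index. LabelSidecar and its methods are
hypothetical names, and the sketch assumes that no label contains a line
break.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Illustrative only: LabelSidecar is a hypothetical helper, not part of Spark.
final class LabelSidecar {

    // Writes one label per line; the line number is the label's index.
    static void save(Path file, List<String> labels) throws IOException {
        Files.write(file, labels, StandardCharsets.UTF_8);
    }

    // Reads the labels back in index order.
    static List<String> load(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("labels", ".txt");
        save(file, Arrays.asList("1.0", "7.0", "3.0"));
        System.out.println(load(file).get(1)); // 7.0
        Files.delete(file);
    }
}
```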

--
Be well!
Jean Morozov

On Sat, Dec 5, 2015 at 2:24 AM, Vishnu Viswanath wrote:

> [...]


Re: Spark ML Random Forest output.

2015-12-04 Thread Vishnu Viswanath
Hi,

As per my understanding, the probability matrix gives the probability that a
particular item belongs to each class, so the one with the highest
probability is your predicted class.
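
As a quick check of this reading, the index of the largest entry in each
"prob" array can be computed by hand and compared with the reported
prediction. The sketch below is plain Java with no Spark dependency; the
class and method names are made up for illustration, and the two arrays are
the rows labelled 4.0 and 9.0 in the original post.

```java
// Illustrative only: ArgMaxCheck and argMax are hypothetical names, not Spark API.
final class ArgMaxCheck {

    // Returns the index of the largest value; the first one wins on ties.
    static int argMax(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("empty array");
        }
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // "prob" arrays copied from the original post.
        double[] probFor4 = {0.004708283878223195, 0.08478236104777455, 0.03594642191080532,
                0.19286506771018885, 0.038304389235523435, 0.02841307797386,
                0.003334431932056404, 0.5685242322346109, 0.018564705500837587,
                0.024557028569980155};
        double[] probFor9 = {0.018432404716456248, 0.16837195846781422, 0.05995559403934031,
                0.32282148259583565, 0.018374168600855455, 0.04792285114398864,
                0.018226352623526704, 0.1611650363085499, 0.11703073969440755,
                0.06769941180922535};
        System.out.println(argMax(probFor4)); // 7, matching prediction=7.0
        System.out.println(argMax(probFor9)); // 3, matching prediction=3.0
    }
}
```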

Since you have converted your labels to index labels, according to the model
the classes are 0.0 to 9.0, and I see you are getting a prediction that is a
value in [0.0, 1.0, ..., 9.0], which is correct.

So what you want is a reverse map that can convert your predicted class
back to the String. I don't know if StringIndexer has such an option; maybe
you can create your own map and reverse map (label to index and index to
label) and use them to get back your original label.
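
A rough sketch of such a pair of maps, in plain Java without Spark. It
assumes indices are handed out by descending label frequency, which is how
StringIndexer orders labels by default; the alphabetical tie-break and all
names here are choices made for this sketch, not Spark's behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: LabelMaps is a hypothetical helper, not part of Spark.
final class LabelMaps {
    final Map<String, Integer> labelToIndex = new HashMap<>();
    final List<String> indexToLabel = new ArrayList<>();

    // Most frequent label gets index 0; ties are broken alphabetically (sketch assumption).
    LabelMaps(List<String> observedLabels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : observedLabels) {
            counts.merge(label, 1, Integer::sum);
        }
        indexToLabel.addAll(counts.keySet());
        indexToLabel.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        // The position in the sorted list is the index; record the forward map too.
        for (int i = 0; i < indexToLabel.size(); i++) {
            labelToIndex.put(indexToLabel.get(i), i);
        }
    }

    public static void main(String[] args) {
        LabelMaps maps = new LabelMaps(Arrays.asList("1.0", "7.0", "1.0", "3.0", "7.0", "1.0"));
        System.out.println(maps.labelToIndex.get("7.0")); // 1
        System.out.println(maps.indexToLabel.get(0));     // 1.0
    }
}
```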

Maybe there is a better way to do this.

Regards,
Vishnu

On Fri, Dec 4, 2015 at 4:56 PM, Eugene Morozov wrote:

> [...]


Spark ML Random Forest output.

2015-12-04 Thread Eugene Morozov
Hello,

I've got an input dataset of handwritten digits and working Java code that
uses the random forest classification algorithm to determine the numbers. My
test set is just some lines from the same input dataset, just to be sure
I'm doing the right thing. My understanding is that a correct classifier
would in this case give me the correct prediction.
At the moment overfitting is not an issue.

After applying StringIndexer to my input DataFrame I've applied an ugly
trick and got "indexedLabel" metadata:
{"ml_attr":{"vals":["1.0","7.0","3.0","9.0","2.0","6.0","0.0","4.0","8.0","5.0"],"type":"nominal","name":"indexedLabel"}}


So, my algorithm gives me the following result. The question is that I'm not
sure I understand the meaning of "prediction" in the output. It looks like
it's just the index of the highest probability value in the "prob" array.
Shouldn't "prediction" be the actual class, i.e. one of "0.0", "1.0", ...,
"9.0"? If the prediction is just an ordinal number, then I have to map it to
my classes manually, but for that I have to either specify the classes
manually to know their order or somehow get them out of the classifier.
Neither of these ways seems to be accessible.

(4.0 -> prediction=7.0,
prob=[0.004708283878223195,0.08478236104777455,0.03594642191080532,0.19286506771018885,0.038304389235523435,0.02841307797386,0.003334431932056404,0.5685242322346109,0.018564705500837587,0.024557028569980155]
(9.0 -> prediction=3.0,
prob=[0.018432404716456248,0.16837195846781422,0.05995559403934031,0.32282148259583565,0.018374168600855455,0.04792285114398864,0.018226352623526704,0.1611650363085499,0.11703073969440755,0.06769941180922535]
(2.0 -> prediction=4.0,
prob=[0.017918245251872154,0.029243677407669404,0.06228050320552064,0.03633295481094746,0.45707974962418885,0.09675606366289394,0.03921437851648226,0.043917057390743426,0.14132883075087405,0.0759285393788078]

So, what is the prediction here? How can I specify the classes manually or
get proper access to them?
--
Be well!
Jean Morozov