RE: ML Random Forest Classifier
It looks like all of that is building up to Spark 2.0 (for random forests / GBTs / etc.). Ah well... thanks for your help. It was interesting digging into the depths.

Date: Wed, 13 Apr 2016 09:48:32 +0100
Subject: Re: ML Random Forest Classifier
From: ja...@gluru.co
To: as...@live.com
CC: user@spark.apache.org

Hi Ashic,

Unfortunately I don't know how to work around that - I suggested this line as it looked promising (I had considered it once before deciding to use a different algorithm), but I never actually tried it.

Regards,
James

On 13 April 2016 at 02:29, Ashic Mahtab wrote:

It looks like the issue is around impurity stats. After converting an RF model to old, and back to new (without disk storage or anything), and specifying the same number of features, the same categorical features map, etc., DecisionTreeClassifier::predictRaw throws a null pointer exception here:

    override protected def predictRaw(features: Vector): Vector = {
      Vectors.dense(rootNode.predictImpl(features).impurityStats.stats.clone())
    }

It appears impurityStats is always null (even though impurity does have a value). Any known workarounds? It's looking like I'll have to revert to using mllib instead :(

-Ashic.

From: as...@live.com
To: ja...@gluru.co
CC: user@spark.apache.org
Subject: RE: ML Random Forest Classifier
Date: Wed, 13 Apr 2016 02:20:53 +0100

I managed to get to the map using MetadataUtils (it's a private ml package). There are still some issues around feature names, etc. Trying to pin them down.

From: as...@live.com
To: ja...@gluru.co
CC: user@spark.apache.org
Subject: RE: ML Random Forest Classifier
Date: Wed, 13 Apr 2016 00:50:31 +0100

Hi James,

Following on from the previous email, is there a way to get the categoricalFeatures of a Spark ML random forest? Essentially something I can pass to

    RandomForestClassificationModel.fromOld(oldModel, parent, categoricalFeatures, numClasses, numFeatures)

I could construct it by hand, but I was hoping for a more automated way of getting the map. Since the trained model already knows about the value, perhaps it's possible to grab it for storage?

Thanks,
Ashic.

From: as...@live.com
To: ja...@gluru.co
CC: user@spark.apache.org
Subject: RE: ML Random Forest Classifier
Date: Mon, 11 Apr 2016 23:21:53 +0100

Thanks, James. That looks promising.

Date: Mon, 11 Apr 2016 10:41:07 +0100
Subject: Re: ML Random Forest Classifier
From: ja...@gluru.co
To: as...@live.com
CC: user@spark.apache.org

To add a bit more detail, perhaps something like this might work:

    package org.apache.spark.ml

    import org.apache.spark.ml.classification.RandomForestClassificationModel
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.mllib.tree.model.{RandomForestModel => OldRandomForestModel}

    object RandomForestModelConverter {

      def fromOld(oldModel: OldRandomForestModel,
          parent: RandomForestClassifier = null,
          categoricalFeatures: Map[Int, Int],
          numClasses: Int,
          numFeatures: Int = -1): RandomForestClassificationModel = {
        RandomForestClassificationModel.fromOld(
          oldModel, parent, categoricalFeatures, numClasses, numFeatures)
      }

      def toOld(newModel: RandomForestClassificationModel): OldRandomForestModel = {
        newModel.toOld
      }
    }

Regards,
James

On 11 April 2016 at 10:36, James Hammerton wrote:

There are methods for converting the dataframe-based random forest models to the old RDD-based models and vice versa. Perhaps using these will help, given that the old models can be saved and loaded? In order to use them, however, you will need to write code in the org.apache.spark.ml package. I've not actually tried doing this myself, but it looks as if it might work.

Regards,
James

On 11 April 2016 at 10:29, Ashic Mahtab wrote:

Hello,

I'm trying to save a pipeline with a random forest classifier. If I try to save the pipeline, it complains that the classifier is not Writable, and indeed the classifier itself doesn't have a write function. There's a pull request that's been merged that enables this for Spark 2.0 (any dates around when that'll release?). I am, however, using the Spark Cassandra Connector, which doesn't seem to be able to create a CqlContext with Spark 2.0 snapshot builds. Seeing that MLlib's random forest classifier supports storing and loading models, is there a way to create a Spark ML pipeline in Spark 1.6 with a random forest classifier that'll allow me to store and load the model? The model takes a significant amount of time to train, and I really don't want to have to train it every time my application launches.

Thanks,
Ashic.
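The thread ends with Ashic considering a revert to mllib. For reference, a minimal sketch of that fallback in Spark 1.6: the old RDD-based RandomForestModel implements Saveable, so it can be persisted and reloaded directly without any ml-package conversion. The path argument and the helper name here are illustrative, not from the thread.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.tree.model.RandomForestModel

// Sketch of the mllib fallback: persist a trained RDD-based model to disk
// (or HDFS) and read it back on a later application launch, avoiding retraining.
object MllibModelStore {

  def save(sc: SparkContext, model: RandomForestModel, path: String): Unit =
    model.save(sc, path) // writes model metadata and tree data under `path`

  def load(sc: SparkContext, path: String): RandomForestModel =
    RandomForestModel.load(sc, path) // reconstructs the model from `path`
}
```

The trade-off, as the thread notes, is giving up the ML pipeline API: the reloaded model predicts on RDDs of mllib Vectors rather than slotting into a DataFrame-based Pipeline.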
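Putting the pieces of the thread together, a save/load helper built on the discussed converter might look like the sketch below. This is untested (as James says, he never actually tried it), the object name and path are hypothetical, and, per Ashic's later finding, the reloaded model's trees can end up with null impurityStats, making predictRaw throw an NPE, so this route may not work on Spark 1.6 as-is.

```scala
// Must live in the org.apache.spark.ml package, because fromOld/toOld are
// package-private.
package org.apache.spark.ml

import org.apache.spark.SparkContext
import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.mllib.tree.model.{RandomForestModel => OldRandomForestModel}

object RandomForestPersistence {

  // Save: convert the new ml model to the old mllib format, which supports save.
  def save(sc: SparkContext,
           model: RandomForestClassificationModel,
           path: String): Unit =
    model.toOld.save(sc, path)

  // Load: read the old model back and rebuild the ml wrapper.
  // categoricalFeatures and numClasses must match those used at training time;
  // the thread suggests MetadataUtils (private ml package) as one way to
  // recover the categoricalFeatures map from the training DataFrame's schema.
  def load(sc: SparkContext,
           path: String,
           categoricalFeatures: Map[Int, Int],
           numClasses: Int): RandomForestClassificationModel = {
    val oldModel = OldRandomForestModel.load(sc, path)
    RandomForestClassificationModel.fromOld(
      oldModel, null, categoricalFeatures, numClasses)
  }
}
```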