RE: ML Random Forest Classifier

Ashic Mahtab Wed, 13 Apr 2016 03:22:03 -0700

It looks like all of that is building up to spark 2.0 (for random forests / 
gbts / etc.). Ah well...thanks for your help. Was interesting digging into the 
depths.

Date: Wed, 13 Apr 2016 09:48:32 +0100
Subject: Re: ML Random Forest Classifier
From: ja...@gluru.co
To: as...@live.com
CC: user@spark.apache.org

Hi Ashic,
Unfortunately I don't know how to work around that - I suggested this line as 
it looked promising (I had considered it once before deciding to use a 
different algorithm) but I never actually tried it. 
Regards,
James
On 13 April 2016 at 02:29, Ashic Mahtab <as...@live.com> wrote:

It looks like the issue is around impurity stats. After converting an rf model 
to old, and back to new (without disk storage or anything), and specifying the 
same num of features, same categorical features map, etc., 
DecisionTreeClassifier::predictRaw throws a null pointer exception here:
 override protected def predictRaw(features: Vector): Vector = {    
Vectors.dense(rootNode.predictImpl(features).impurityStats.stats.clone())  }
It appears impurityStats is always null (even though impurity does have a 
value). Any known workarounds? It's looking like I'll have to revert to using 
mllib instead :(
-Ashic.
From: as...@live.com
To: ja...@gluru.co
CC: user@spark.apache.org
Subject: RE: ML Random Forest Classifier
Date: Wed, 13 Apr 2016 02:20:53 +0100

I managed to get to the map using MetadataUtils (it's a private ml package). 
There are still some issues, around feature names, etc. Trying to pin them down.

From: as...@live.com
To: ja...@gluru.co
CC: user@spark.apache.org
Subject: RE: ML Random Forest Classifier
Date: Wed, 13 Apr 2016 00:50:31 +0100

Hi James,Following on from the previous email, is there a way to get the 
categoricalFeatures of a Spark ML Random Forest? Essentially something I can 
pass to
RandomForestClassificationModel.fromOld(oldModel, parent, categoricalFeatures, 
numClasses, numFeatures)
I could construct it by hand, but I was hoping for a more automated way of 
getting the map. Since the trained model already knows about the value, perhaps 
it's possible to grab it for storage?
Thanks,Ashic.

From: as...@live.com
To: ja...@gluru.co
CC: user@spark.apache.org
Subject: RE: ML Random Forest Classifier
Date: Mon, 11 Apr 2016 23:21:53 +0100

Thanks, James. That looks promising. 

Date: Mon, 11 Apr 2016 10:41:07 +0100
Subject: Re: ML Random Forest Classifier
From: ja...@gluru.co
To: as...@live.com
CC: user@spark.apache.org

To add a bit more detail perhaps something like this might work:

package org.apache.spark.ml

import org.apache.spark.ml.classification.RandomForestClassificationModel

import org.apache.spark.ml.classification.DecisionTreeClassificationModel

import org.apache.spark.ml.classification.LogisticRegressionModel

import org.apache.spark.mllib.tree.model.{ RandomForestModel => 
OldRandomForestModel }

import org.apache.spark.ml.classification.RandomForestClassifier

object RandomForestModelConverter {

  def fromOld(oldModel: OldRandomForestModel, parent: RandomForestClassifier = 
null,

    categoricalFeatures: Map[Int, Int], numClasses: Int, numFeatures: Int = 
-1): RandomForestClassificationModel = {

    RandomForestClassificationModel.fromOld(oldModel, parent, 
categoricalFeatures, numClasses, numFeatures)

  }

  def toOld(newModel: RandomForestClassificationModel): OldRandomForestModel = {

    newModel.toOld

  }

}

Regards,
James 
On 11 April 2016 at 10:36, James Hammerton <ja...@gluru.co> wrote:
There are methods for converting the dataframe based random forest models to 
the old RDD based models and vice versa. Perhaps using these will help given 
that the old models can be saved and loaded?
In order to use them however you will need to write code in the 
org.apache.spark.ml package.
I've not actually tried doing this myself but it looks as if it might work.
Regards,
James

On 11 April 2016 at 10:29, Ashic Mahtab <as...@live.com> wrote:

Hello,I'm trying to save a pipeline with a random forest classifier. If I try 
to save the pipeline, it complains that the classifier is not Writable, and 
indeed the classifier itself doesn't have a write function. There's a pull 
request that's been merged that enables this for Spark 2.0 (any dates around 
when that'll release?). I am, however, using the Spark Cassandra Connector 
which doesn't seem to be able to create a CqlContext with spark 2.0 snapshot 
builds. Seeing that ML Lib's random forest classifier supports storing and 
loading models, is there a way to create a Spark ML pipeline in Spark 1.6 with 
a random forest classifier that'll allow me to store and load the model? The 
model takes significant amount of time to train, and I really don't want to 
have to train it every time my application launches.
Thanks,Ashic.

RE: ML Random Forest Classifier

Reply via email to