[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aseem Bansal updated SPARK-19449:
---------------------------------
Description:

I wrote some code to convert an ml package RandomForestClassificationModel into an mllib package RandomForestModel. We needed this because we have to make predictions with latencies on the order of milliseconds. I found that the results are inconsistent even though the underlying DecisionTreeModels are exactly the same, so the behavior of the two implementations is inconsistent, which should not be the case.

The code below can be used to reproduce the issue. It can be run as a simple Java app as long as the Spark dependencies are set up properly.

{noformat}
import org.apache.spark.ml.Transformer;
import org.apache.spark.ml.classification.*;
import org.apache.spark.ml.linalg.*;
import org.apache.spark.ml.regression.RandomForestRegressionModel;
import org.apache.spark.mllib.linalg.DenseVector;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.tree.configuration.Algo;
import org.apache.spark.mllib.tree.model.DecisionTreeModel;
import org.apache.spark.mllib.tree.model.RandomForestModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import scala.Enumeration;

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

abstract class Predictor {
    abstract double predict(Vector vector);
}

public class MainConvertModels {

    public static final int seed = 42;

    public static void main(String[] args) {
        int numRows = 1000;
        int numFeatures = 3;
        int numClasses = 2;
        double trainFraction = 0.8;
        double testFraction = 0.2;

        SparkSession spark = SparkSession.builder()
                .appName("conversion app")
                .master("local")
                .getOrCreate();

        Dataset<Row> data = getDummyData(spark, numRows, numFeatures, numClasses);
        Dataset<Row>[] splits = data.randomSplit(new double[]{trainFraction, testFraction}, seed);
        Dataset<Row> trainingData = splits[0];
        Dataset<Row> testData = splits[1];
        testData.cache();

        List<Double> labels = getLabels(testData);
        List<DenseVector> features = getFeatures(testData);

        DecisionTreeClassifier classifier1 = new DecisionTreeClassifier();
        DecisionTreeClassificationModel model1 = classifier1.fit(trainingData);
        final DecisionTreeModel convertedModel1 = convertDecisionTreeModel(model1, Algo.Classification());

        RandomForestClassifier classifier = new RandomForestClassifier();
        RandomForestClassificationModel model2 = classifier.fit(trainingData);
        final RandomForestModel convertedModel2 = convertRandomForestModel(model2);

        System.out.println(
                "****** DecisionTreeClassifier\n" +
                "** Original **" + getInfo(model1, testData) + "\n" +
                "** New **" + getInfo(new Predictor() {
                    double predict(Vector vector) {return convertedModel1.predict(vector);}
                }, labels, features) + "\n" +
                "\n" +
                "****** RandomForestClassifier\n" +
                "** Original **" + getInfo(model2, testData) + "\n" +
                "** New **" + getInfo(new Predictor() {
                    double predict(Vector vector) {return convertedModel2.predict(vector);}
                }, labels, features) + "\n" +
                "\n" +
                "");
    }

    static Dataset<Row> getDummyData(SparkSession spark, int numberRows, int numberFeatures, int labelUpperBound) {
        StructType schema = new StructType(new StructField[]{
                new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
                new StructField("features", new VectorUDT(), false, Metadata.empty())
        });
        double[][] vectors = prepareData(numberRows, numberFeatures);
        Random random = new Random(seed);
        List<Row> dataTest = new ArrayList<>();
        for (double[] vector : vectors) {
            double label = (double) random.nextInt(2);
            dataTest.add(RowFactory.create(label, Vectors.dense(vector)));
        }
        return spark.createDataFrame(dataTest, schema);
    }

    static double[][] prepareData(int numRows, int numFeatures) {
        Random random = new Random(seed);
        double[][] result = new double[numRows][numFeatures];
        for (int row = 0; row < numRows; row++) {
            for (int feature = 0; feature < numFeatures; feature++) {
                result[row][feature] = random.nextDouble();
            }
        }
        return result;
    }

    static String getInfo(Predictor predictor, List<Double> labels, List<DenseVector> features) {
        Long startTime = System.currentTimeMillis();
        List<Double> predictions = new ArrayList<>();
        for (DenseVector feature : features) {
            predictions.add(predictor.predict(feature));
        }
        return getInfo(startTime, labels, predictions);
    }

    static List<Double> getLabels(Dataset<Row> testData) {
        List<Double> labels = new ArrayList<>();
        for (Row row : testData.collectAsList()) {
            labels.add(row.getDouble(0));
        }
        return labels;
    }

    static List<DenseVector> getFeatures(Dataset<Row> testData) {
        List<DenseVector> features = new ArrayList<>();
        for (Row row : testData.collectAsList()) {
            features.add(new DenseVector(((org.apache.spark.ml.linalg.Vector) row.get(1)).toArray()));
        }
        return features;
    }

    static String getInfo(Transformer model, Dataset<Row> testData) {
        Dataset<Row> predictions = model.transform(testData);
        predictions.cache();
        Dataset<Row> correctPredictions = predictions.filter("label == prediction");
        correctPredictions.cache();
        Dataset<Row> incorrectPredictions = predictions.filter("label != prediction");
        incorrectPredictions.cache();
        Long truePositives = correctPredictions.filter("prediction == 1.0").count();
        Long trueNegatives = correctPredictions.filter("prediction == 0.0").count();
        Long falsePositives = incorrectPredictions.filter("prediction == 1.0").count();
        Long falseNegatives = incorrectPredictions.filter("prediction == 0.0").count();
        return getInfo(null, truePositives, trueNegatives, falsePositives, falseNegatives);
    }

    static String getInfo(Long startTime, List<Double> labels, List<Double> predictions) {
        Long endTime = System.currentTimeMillis();
        if (labels.size() != predictions.size()) {
            throw new RuntimeException("labels size is " + labels.size() +
                    " but predictions size is " + predictions.size());
        }
        Long truePositives = 0L;
        Long trueNegatives = 0L;
        Long falsePositives = 0L;
        Long falseNegatives = 0L;
        for (int i = 0; i < labels.size(); i++) {
            double label = labels.get(i);
            double prediction = predictions.get(i);
            if (label == prediction) {
                if (prediction == 1.0) {
                    truePositives += 1;
                } else {
                    trueNegatives += 1;
                }
            } else {
                if (prediction == 1.0) {
                    falsePositives += 1;
                } else {
                    falseNegatives += 1;
                }
            }
        }
        return getInfo(endTime - startTime, truePositives, trueNegatives, falsePositives, falseNegatives);
    }

    static double ratio(Long numerator, Long denominator) {
        if (numerator == 0 || denominator == 0) {
            return 0;
        }
        return ((double) numerator) / denominator;
    }

    static String getInfo(Long timeTakenMilliseconds, Long truePositives, Long trueNegatives,
                          Long falsePositives, Long falseNegatives) {
        Long testDataCount = truePositives + trueNegatives + falsePositives + falseNegatives;
        double accuracy = ratio(truePositives + trueNegatives, testDataCount);
        double precision = ratio(truePositives, truePositives + falsePositives);
        double recall = ratio(truePositives, truePositives + falseNegatives);
        String last = "";
        if (timeTakenMilliseconds != null) {
            last = ", Average time taken (ms) " + ratio(timeTakenMilliseconds, testDataCount);
        }
        return (
                "true positives " + truePositives
                + ", true negatives " + trueNegatives
                + ", false positives " + falsePositives
                + ", false negatives " + falseNegatives
                + ", total " + testDataCount
                + "\n\t accuracy " + accuracy
                + ", precision " + precision
                + ", recall " + recall
                + last
        );
    }

    static DecisionTreeModel convertDecisionTreeModel(org.apache.spark.ml.tree.DecisionTreeModel model,
                                                      Enumeration.Value algo) {
        return new DecisionTreeModel(model.rootNode().toOld(1), algo);
    }

    static RandomForestModel convertRandomForestModel(org.apache.spark.ml.tree.TreeEnsembleModel model) {
        Enumeration.Value algo;
        if (model instanceof RandomForestRegressionModel) {
            algo = Algo.Regression();
        } else {
            algo = Algo.Classification();
        }
        Object[] decisionTreeModels = model.trees();
        DecisionTreeModel[] convertedDecisionTreeModels = new DecisionTreeModel[decisionTreeModels.length];
        for (int i = 0; i < decisionTreeModels.length; i++) {
            org.apache.spark.ml.tree.DecisionTreeModel originalModel =
                    (org.apache.spark.ml.tree.DecisionTreeModel) decisionTreeModels[i];
            convertedDecisionTreeModels[i] = convertDecisionTreeModel(originalModel, algo);
        }
        return new RandomForestModel(algo, convertedDecisionTreeModels);
    }
}
{noformat}

The output looks like the below, where "Original" refers to the ml package version and "New" refers to the converted mllib package version.

- I converted the ml version decision tree to the mllib version decision tree. I gave both versions the same input and received exactly the same output.
- I then converted the ml version random forest to the mllib version random forest, giving both forests the same underlying decision trees (using the previous conversion method). I gave both versions the same input but received different output.

{noformat}
****** DecisionTreeClassifier
** Original **true positives 8128, true negatives 1923, false positives 7942, false negatives 1897, total 19890
	 accuracy 0.5053293112116641, precision 0.5057871810827629, recall 0.8107730673316709
** New **true positives 8128, true negatives 1923, false positives 7942, false negatives 1897, total 19890
	 accuracy 0.5053293112116641, precision 0.5057871810827629, recall 0.8107730673316709, Average time taken (ms) 0.001558572146807441

****** RandomForestClassifier
** Original **true positives 3940, true negatives 5915, false positives 3950, false negatives 6085, total 19890
	 accuracy 0.49547511312217196, precision 0.49936628643852976, recall 0.39301745635910224
** New **true positives 2461, true negatives 7350, false positives 2515, false negatives 7564, total 19890
	 accuracy 0.4932629462041227, precision 0.4945739549839228, recall 0.2454862842892768, Average time taken (ms) 0.01085972850678733
{noformat}
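A possible explanation, which is my assumption and not something I have verified against the Spark source: the two implementations may aggregate the per-tree predictions differently. The old mllib RandomForestModel appears to take a majority vote over the trees' hard 0.0/1.0 predictions, while the ml RandomForestClassificationModel appears to average the trees' class probabilities and take the argmax. Identical trees can then still disagree at the forest level. The sketch below (plain Java, no Spark types; the class name and the mapping of strategies to packages are hypothetical) shows how the two strategies can diverge on the same per-tree outputs.

{noformat}
import java.util.Arrays;

// Hypothetical illustration of two ways to aggregate per-tree predictions.
// The mapping of each strategy to the mllib/ml packages is an assumption.
public class VotingSketch {

    // Majority vote over hard per-tree predictions (assumed mllib-style).
    // Ties go to class 0.0 here.
    static double majorityVote(double[] treePredictions) {
        long ones = Arrays.stream(treePredictions).filter(p -> p == 1.0).count();
        return ones * 2 > treePredictions.length ? 1.0 : 0.0;
    }

    // Average the per-tree probabilities of class 1, then threshold
    // (assumed ml-style soft vote).
    static double softVote(double[] treeProbabilitiesOfClass1) {
        double mean = Arrays.stream(treeProbabilitiesOfClass1).average().orElse(0.0);
        return mean > 0.5 ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        // Three trees: two lean weakly towards class 0, one strongly towards class 1.
        double[] probabilities = {0.45, 0.45, 0.95};
        // Hard predictions derived from those same leaves: 0.0, 0.0, 1.0.
        double[] hardPredictions = {0.0, 0.0, 1.0};

        System.out.println("majority vote: " + majorityVote(hardPredictions)); // 0.0
        System.out.println("soft vote:     " + softVote(probabilities));       // 1.0 (mean is about 0.617)
    }
}
{noformat}

If that is the cause, the converted mllib forest cannot reproduce the ml forest's decisions by construction, because the hard per-tree votes discard the leaf probabilities that the ml version averages.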
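To check that hypothesis against the actual models, one way is to verify that the converted forest's output always matches a majority vote computed manually over its own trees. The helper below is a sketch meant to be added to the MainConvertModels class above (it relies on the same imports), and the method name is mine. If it always returns zero while the ml and mllib outputs still differ, the disagreement is introduced by the vote aggregation rather than by the trees themselves.

{noformat}
// Hypothetical helper for the MainConvertModels class above. Counts how often
// the converted mllib forest's prediction differs from a manual majority vote
// over its own trees (ties go to class 0.0 here).
static long countAggregationDisagreements(RandomForestModel convertedModel,
                                          List<DenseVector> features) {
    long disagreements = 0;
    for (DenseVector feature : features) {
        long ones = 0;
        for (DecisionTreeModel tree : convertedModel.trees()) {
            if (tree.predict(feature) == 1.0) {
                ones += 1;
            }
        }
        double majority = ones * 2 > convertedModel.trees().length ? 1.0 : 0.0;
        if (majority != convertedModel.predict(feature)) {
            disagreements += 1;
        }
    }
    return disagreements;
}
{noformat}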
> Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19449
>                 URL: https://issues.apache.org/jira/browse/SPARK-19449
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.1.0
>            Reporter: Aseem Bansal