PySpark ML: Get best set of parameters from TrainValidationSplit

Aakash Basu Mon, 16 Apr 2018 07:53:09 -0700

Hi,

I am running a Random Forest model on a dataset and tuning its
hyperparameters with Spark's ParamGridBuilder and TrainValidationSplit.


Can anyone tell me how to get the best values for all four parameters?

I used:

model.bestModel()
model.metrics()


But neither of them seems to work.


Below is the code chunk:

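from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

paramGrid = ParamGridBuilder() \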
        .addGrid(rf.numTrees, [50, 100, 150, 200]) \
        .addGrid(rf.maxDepth, [5, 10, 15, 20]) \
        .addGrid(rf.minInfoGain, [0.001, 0.01, 0.1, 0.6]) \
        .addGrid(rf.minInstancesPerNode, [5, 15, 30, 50, 100]) \
        .build()

tvs = TrainValidationSplit(estimator=pipeline,
                           estimatorParamMaps=paramGrid,
                           evaluator=MulticlassClassificationEvaluator(),
                           # 80% of the data will be used for training,
                           # 20% for validation.
                           trainRatio=0.8)

model = tvs.fit(trainingData)

predictions = model.transform(testData)

evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Accuracy = %g" % accuracy)
print("Test Error = %g" % (1.0 - accuracy))


Any help?


Thanks,
Aakash.
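
For reference, a minimal sketch of one way to read the best combination
off the fit above. It relies on two things: TrainValidationSplit scores
the parameter maps in the order they appear in paramGrid, and (this is
an assumption about the installed Spark version) the fitted model
exposes those scores as the list model.validationMetrics. The helper
name best_param_map and the demo values are hypothetical and exist only
so the sketch runs without a Spark session.

```python
def best_param_map(param_maps, metrics, larger_is_better=True):
    """Return (params, metric) for the best-scoring grid point.

    param_maps: list of dicts, one per grid point (e.g. the paramGrid list).
    metrics:    list of validation scores in the same order as param_maps.
    """
    if len(param_maps) != len(metrics):
        raise ValueError("param_maps and metrics must have the same length")
    if not metrics:
        raise ValueError("no metrics to compare")
    pick = max if larger_is_better else min
    best_index = pick(range(len(metrics)), key=lambda i: metrics[i])
    # Spark Param objects carry a .name; plain string keys pass through as-is.
    readable = {getattr(p, "name", p): v
                for p, v in param_maps[best_index].items()}
    return readable, metrics[best_index]


# Stand-in values so the sketch runs without Spark. With a fitted model one
# would pass paramGrid and model.validationMetrics instead (assumption: the
# installed Spark version provides validationMetrics).
demo_grid = [
    {"numTrees": 50, "maxDepth": 5},
    {"numTrees": 100, "maxDepth": 10},
    {"numTrees": 150, "maxDepth": 15},
]
demo_metrics = [0.71, 0.83, 0.79]  # made-up scores, for illustration only
best_params, best_metric = best_param_map(demo_grid, demo_metrics)
print(best_params, best_metric)
```

With the objects from the message, the call would look like
best_param_map(paramGrid, model.validationMetrics,
MulticlassClassificationEvaluator().isLargerBetter()). Note also that
bestModel is an attribute, not a method, so it is read as
model.bestModel without parentheses; because the estimator here is a
pipeline, it is a PipelineModel whose stages list holds the fitted
stages.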
