Hi there,

I've posted this question on StackOverflow as well but got no answers, so
maybe you guys can help me out.

I'm building a Random Forest model with Spark and want to save it so I can
reuse it later. I'm running this in pyspark (Spark 2.0.1) without HDFS, so
the files are saved to the local file system.

I've tried to do it like so:

import pyspark.sql.types as T
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Toy dataset: label is a XOR b
data = [[0, 0, 0.],
        [0, 1, 1.],
        [1, 0, 1.],
        [1, 1, 0.]]

schema = T.StructType([
    T.StructField('a', T.IntegerType(), True),
    T.StructField('b', T.IntegerType(), True),
    T.StructField('label', T.DoubleType(), True)])

df = sqlContext.createDataFrame(data, schema)

# Combine the two input columns into a single feature vector
assembler = VectorAssembler(inputCols=['a', 'b'], outputCol='features')
df = assembler.transform(df)

# Train the random forest
classifier = RandomForestClassifier(numTrees=10, maxDepth=15,
                                    labelCol='label', featuresCol='features')
model = classifier.fit(df)

# Persist the fitted model to the local file system, overwriting any previous save
model.write().overwrite().save('saved_model')


And then, to load the model:

from pyspark.ml.classification import RandomForestClassificationModel

# Load the saved model back from the same local path
loaded_model = RandomForestClassificationModel.load('saved_model')


But I get this error:

Py4JJavaError: An error occurred while calling o108.load.
: java.lang.UnsupportedOperationException: empty collection

I'm not sure which collection it's referring to. Any ideas on how to
properly load (or save) the model?
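
For what it's worth, the only other variant I can think of trying is passing
an explicit file:// URI so the path isn't resolved against some default
filesystem; the absolute path below is just a placeholder and I don't know
whether it actually makes a difference:

from pyspark.ml.classification import RandomForestClassificationModel

# placeholder absolute path with an explicit local-filesystem scheme
loaded_model = RandomForestClassificationModel.load('file:///path/to/saved_model')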

Cheers,
--
Matheus Braun Magrin
