By the way, always try to use `ml` instead of `mllib`:

import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.classification.RandomForestClassifier

or

import org.apache.spark.ml.regression.RandomForestRegressor

For more details, see http://spark.apache.org/docs/latest/ml-classification-regression.html.

On Mon, Apr 10, 2017 at 1:45 PM, 颜发才(Yan Facai) <facai....@gmail.com> wrote:

> How about using
>
> val dataset = spark.read.format("libsvm")
>   .option("numFeatures", "780")
>   .load("data/mllib/sample_libsvm_data.txt")
>
> instead of
>
> val dataset = MLUtils.loadLibSVMFile(spark.sparkContext, "data/mnist.bz2")
>
> On Mon, Apr 10, 2017 at 11:19 AM, Ryan <ryan.hd....@gmail.com> wrote:
>
>> You could write a udf using the asML method along with some type casting,
>> then apply the udf to the data after PCA.
>>
>> When using a pipeline, that udf needs to be wrapped in a customized
>> transformer, I think.
>>
>> On Sun, Apr 9, 2017 at 10:07 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>>
>>> Why not use the RandomForest from Spark ML?
>>>
>>> On Sun, 9 Apr 2017 at 16:01, Md. Rezaul Karim <rezaul.ka...@insight-centre.org> wrote:
>>>
>>>> I have already posted this question on StackOverflow
>>>> <http://stackoverflow.com/questions/43263942/how-to-convert-spark-mllib-vector-to-ml-vector>.
>>>> However, I have not received any response. I'm trying to use the
>>>> RandomForest algorithm for classification after applying PCA,
>>>> since the dataset is pretty high-dimensional.
>>>> Here's my source code:
>>>>
>>>> import org.apache.spark.mllib.util.MLUtils
>>>> import org.apache.spark.mllib.tree.RandomForest
>>>> import org.apache.spark.mllib.tree.model.RandomForestModel
>>>> import org.apache.spark.mllib.regression.LabeledPoint
>>>> import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
>>>> import org.apache.spark.sql._
>>>> import org.apache.spark.sql.SQLContext
>>>> import org.apache.spark.sql.SparkSession
>>>>
>>>> import org.apache.spark.ml.feature.PCA
>>>> import org.apache.spark.rdd.RDD
>>>>
>>>> object PCAExample {
>>>>   def main(args: Array[String]): Unit = {
>>>>     val spark = SparkSession
>>>>       .builder
>>>>       .master("local[*]")
>>>>       .config("spark.sql.warehouse.dir", "E:/Exp/")
>>>>       .appName(s"OneVsRestExample")
>>>>       .getOrCreate()
>>>>
>>>>     val dataset = MLUtils.loadLibSVMFile(spark.sparkContext, "data/mnist.bz2")
>>>>
>>>>     val splits = dataset.randomSplit(Array(0.7, 0.3), seed = 12345L)
>>>>     val (trainingData, testData) = (splits(0), splits(1))
>>>>
>>>>     val sqlContext = new SQLContext(spark.sparkContext)
>>>>     import sqlContext.implicits._
>>>>     val trainingDF = trainingData.toDF("label", "features")
>>>>
>>>>     val pca = new PCA()
>>>>       .setInputCol("features")
>>>>       .setOutputCol("pcaFeatures")
>>>>       .setK(100)
>>>>       .fit(trainingDF)
>>>>
>>>>     val pcaTrainingData = pca.transform(trainingDF)
>>>>     //pcaTrainingData.show()
>>>>
>>>>     val labeled = pca.transform(trainingDF).rdd.map(row => LabeledPoint(
>>>>       row.getAs[Double]("label"),
>>>>       row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")))
>>>>
>>>>     //val labeled = pca.transform(trainingDF).rdd.map(row => LabeledPoint(row.getAs[Double]("label"),
>>>>     //  Vector.fromML(row.getAs[org.apache.spark.ml.linalg.SparseVector]("features"))))
>>>>
>>>>     val numClasses = 10
>>>>     val categoricalFeaturesInfo = Map[Int, Int]()
>>>>     val numTrees = 10 // Use more in practice.
>>>>     val featureSubsetStrategy = "auto" // Let the algorithm choose.
>>>>     val impurity = "gini"
>>>>     val maxDepth = 20
>>>>     val maxBins = 32
>>>>
>>>>     val model = RandomForest.trainClassifier(labeled, numClasses,
>>>>       categoricalFeaturesInfo,
>>>>       numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
>>>>   }
>>>> }
>>>>
>>>> However, I'm getting the following error:
>>>>
>>>> *Exception in thread "main" java.lang.IllegalArgumentException:
>>>> requirement failed: Column features must be of type
>>>> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually
>>>> org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.*
>>>>
>>>> What am I doing wrong in my code? Actually, I'm getting the above
>>>> exception in this line:
>>>>
>>>> val pca = new PCA()
>>>>   .setInputCol("features")
>>>>   .setOutputCol("pcaFeatures")
>>>>   .setK(100)
>>>>   .fit(trainingDF) /// GETTING EXCEPTION HERE
>>>>
>>>> Please, someone, help me to solve the problem.
>>>>
>>>> Kind regards,
>>>> *Md. Rezaul Karim*
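
Putting the suggestions from this thread together, a rough, untested sketch might look like the following. It assumes Spark 2.1 or later, that the libsvm data source can read the compressed "data/mnist.bz2" file from the original post, and that 780 is the right feature count for that file (the value is taken from the example above, which used a different file). The object name PCARandomForestSketch and the use of a Pipeline are my own choices for illustration, not something anyone in the thread prescribed.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.PCA
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this sketch.
object PCARandomForestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("PCARandomForestSketch")
      .getOrCreate()

    // The libsvm data source produces a DataFrame with a "label" column and a
    // "features" column holding org.apache.spark.ml.linalg vectors, which is
    // the type ml.feature.PCA expects. Path and numFeatures are assumptions.
    val dataset = spark.read.format("libsvm")
      .option("numFeatures", "780")
      .load("data/mnist.bz2")

    val Array(trainingDF, testDF) =
      dataset.randomSplit(Array(0.7, 0.3), seed = 12345L)

    // Unfitted PCA estimator; the Pipeline fits it on the training data.
    val pca = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(100)

    // ml RandomForestClassifier with the same hyperparameters as the
    // mllib RandomForest.trainClassifier call in the original post.
    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("pcaFeatures")
      .setNumTrees(10)
      .setFeatureSubsetStrategy("auto")
      .setImpurity("gini")
      .setMaxDepth(20)
      .setMaxBins(32)

    val model = new Pipeline().setStages(Array(pca, rf)).fit(trainingDF)

    // Apply PCA and the forest to the held-out split.
    model.transform(testDF).select("label", "prediction").show(5)

    spark.stop()
  }
}
```

Because everything stays in DataFrames with `ml` vectors, no conversion between `mllib` and `ml` vector types is needed. If the data must be loaded through `MLUtils.loadLibSVMFile`, `MLUtils.convertVectorColumnsToML` can convert the vector column of the resulting DataFrame before PCA is applied.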