By the way, always try to use the `ml` package instead of `mllib`.

import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.classification.RandomForestClassifier
or, for regression:
import org.apache.spark.ml.regression.RandomForestRegressor


For more details, see
http://spark.apache.org/docs/latest/ml-classification-regression.html.
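
For example, here is a minimal, untested sketch of an ml-only version of the
code in the quoted question below (the file path, k = 100 and the tree
settings are simply carried over from that question):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.PCA
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("RandomForestWithPCA")
  .getOrCreate()

// The "libsvm" data source already yields org.apache.spark.ml.linalg.Vector
// values in the "features" column, so no mllib-to-ml conversion is needed.
val dataset = spark.read.format("libsvm").load("data/mnist.bz2")
val Array(trainingData, testData) = dataset.randomSplit(Array(0.7, 0.3), seed = 12345L)

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(100)

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("pcaFeatures")
  .setNumTrees(10)
  .setMaxDepth(20)

// Chain PCA and the classifier so both run on ml vectors end to end.
val model = new Pipeline().setStages(Array(pca, rf)).fit(trainingData)
model.transform(testData).select("label", "prediction").show(5)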



On Mon, Apr 10, 2017 at 1:45 PM, 颜发才(Yan Facai) <facai....@gmail.com> wrote:

> how about using
>
> val dataset = spark.read.format("libsvm")
>   .option("numFeatures", "780")
>   .load("data/mllib/sample_libsvm_data.txt")
>
> instead of
> val dataset = MLUtils.loadLibSVMFile(spark.sparkContext, "data/mnist.bz2")
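>
> A DataFrame loaded through the "libsvm" data source already holds
> org.apache.spark.ml.linalg vectors in its "features" column, so the ml PCA
> and classifiers can consume it directly, without any mllib-to-ml conversion.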
>
>
>
>
>
> On Mon, Apr 10, 2017 at 11:19 AM, Ryan <ryan.hd....@gmail.com> wrote:
>
>> You could write a udf using the asML method along with some type casting,
>> then apply the udf to the data after PCA.
>>
>> When using a pipeline, that udf needs to be wrapped in a customized
>> transformer, I think.
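>>
>> For example, a minimal, untested sketch of such a udf (assuming Spark 2.x
>> and that the mllib vectors sit in a column named "features", as in the
>> question below):
>>
>> import org.apache.spark.sql.functions.{col, udf}
>> import org.apache.spark.mllib.linalg.{Vector => OldVector}
>>
>> // Wrap asML in a udf so the conversion can be applied column-wise.
>> val toML = udf((v: OldVector) => v.asML)
>> val converted = trainingDF.withColumn("features", toML(col("features")))
>>
>> (MLUtils.convertVectorColumnsToML offers a similar conversion without a
>> hand-written udf, if I remember correctly.)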
>>
>> On Sun, Apr 9, 2017 at 10:07 PM, Nick Pentreath <nick.pentre...@gmail.com
>> > wrote:
>>
>>> Why not use the RandomForest from Spark ML?
>>>
>>> On Sun, 9 Apr 2017 at 16:01, Md. Rezaul Karim <
>>> rezaul.ka...@insight-centre.org> wrote:
>>>
>>>> I have already posted this question on StackOverflow
>>>> <http://stackoverflow.com/questions/43263942/how-to-convert-spark-mllib-vector-to-ml-vector>,
>>>> but I have not received any response yet. I'm trying to use the
>>>> RandomForest algorithm for classification after applying PCA, since the
>>>> dataset is pretty high-dimensional. Here's my source code:
>>>>
>>>> import org.apache.spark.mllib.util.MLUtils
>>>> import org.apache.spark.mllib.tree.RandomForest
>>>> import org.apache.spark.mllib.tree.model.RandomForestModel
>>>> import org.apache.spark.mllib.regression.LabeledPoint
>>>> import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
>>>> import org.apache.spark.sql._
>>>> import org.apache.spark.sql.SQLContext
>>>> import org.apache.spark.sql.SparkSession
>>>>
>>>> import org.apache.spark.ml.feature.PCA
>>>> import org.apache.spark.rdd.RDD
>>>>
>>>> object PCAExample {
>>>>   def main(args: Array[String]): Unit = {
>>>>     val spark = SparkSession
>>>>       .builder
>>>>       .master("local[*]")
>>>>       .config("spark.sql.warehouse.dir", "E:/Exp/")
>>>>       .appName(s"OneVsRestExample")
>>>>       .getOrCreate()
>>>>
>>>>     val dataset = MLUtils.loadLibSVMFile(spark.sparkContext, 
>>>> "data/mnist.bz2")
>>>>
>>>>     val splits = dataset.randomSplit(Array(0.7, 0.3), seed = 12345L)
>>>>     val (trainingData, testData) = (splits(0), splits(1))
>>>>
>>>>     val sqlContext = new SQLContext(spark.sparkContext)
>>>>     import sqlContext.implicits._
>>>>     val trainingDF = trainingData.toDF("label", "features")
>>>>
>>>>     val pca = new PCA()
>>>>       .setInputCol("features")
>>>>       .setOutputCol("pcaFeatures")
>>>>       .setK(100)
>>>>       .fit(trainingDF)
>>>>
>>>>     val pcaTrainingData = pca.transform(trainingDF)
>>>>     //pcaTrainingData.show()
>>>>
>>>>     val labeled = pca.transform(trainingDF).rdd.map(row => LabeledPoint(
>>>>       row.getAs[Double]("label"),
>>>>       row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")))
>>>>
>>>>     // val labeled = pca.transform(trainingDF).rdd.map(row =>
>>>>     //   LabeledPoint(row.getAs[Double]("label"),
>>>>     //     Vector.fromML(row.getAs[org.apache.spark.ml.linalg.SparseVector]("features"))))
>>>>
>>>>     val numClasses = 10
>>>>     val categoricalFeaturesInfo = Map[Int, Int]()
>>>>     val numTrees = 10 // Use more in practice.
>>>>     val featureSubsetStrategy = "auto" // Let the algorithm choose.
>>>>     val impurity = "gini"
>>>>     val maxDepth = 20
>>>>     val maxBins = 32
>>>>
>>>>     val model = RandomForest.trainClassifier(labeled, numClasses, 
>>>> categoricalFeaturesInfo,
>>>>       numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
>>>>   }
>>>> }
>>>>
>>>> However, I'm getting the following error:
>>>>
>>>> Exception in thread "main" java.lang.IllegalArgumentException:
>>>> requirement failed: Column features must be of type
>>>> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually
>>>> org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.
>>>>
>>>> What am I doing wrong in my code? The exception is actually thrown at
>>>> this line:
>>>>
>>>> val pca = new PCA()
>>>>       .setInputCol("features")
>>>>       .setOutputCol("pcaFeatures")
>>>>       .setK(100)
>>>>       .fit(trainingDF) /// GETTING EXCEPTION HERE
>>>>
>>>> Could someone please help me solve this problem?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Kind regards,
>>>> *Md. Rezaul Karim*
>>>>
>>>
>>
>
