You could write a UDF using the asML method along with some type casting,
then apply the UDF to the data after PCA.

When using a Pipeline, that UDF needs to be wrapped in a custom
transformer, I think.
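
A rough, untested sketch of the UDF idea (assuming Spark 2.x; trainingDF and
the "features" column come from your code, and the toML / toMLlib names are
just placeholders):

import org.apache.spark.mllib.linalg.{Vector => MLlibVector, Vectors => MLlibVectors}
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.sql.functions.{col, udf}

// mllib -> ml, so that ml.feature.PCA accepts the "features" column
val toML = udf((v: MLlibVector) => v.asML)

// ml -> mllib, to turn PCA's output back into something that
// mllib's LabeledPoint / RandomForest can consume
val toMLlib = udf((v: MLVector) => MLlibVectors.fromML(v))

val mlTrainingDF = trainingDF.withColumn("features", toML(col("features")))
// pca.fit(mlTrainingDF) should then see an ml.linalg vector column,
// and after PCA something like
//   pca.transform(mlTrainingDF).withColumn("pcaFeatures", toMLlib(col("pcaFeatures")))
// converts the projected features back for the mllib API.

If I remember correctly, MLUtils.convertVectorColumnsToML(trainingDF,
"features") does roughly the same conversion without a hand-written UDF,
but I haven't checked it against your Spark version.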

On Sun, Apr 9, 2017 at 10:07 PM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> Why not use the RandomForest from Spark ML?
>
> On Sun, 9 Apr 2017 at 16:01, Md. Rezaul Karim <
> rezaul.ka...@insight-centre.org> wrote:
>
>> I have already posted this question to StackOverflow
>> <http://stackoverflow.com/questions/43263942/how-to-convert-spark-mllib-vector-to-ml-vector>,
>> but I haven't received any response yet. I'm trying to use the
>> RandomForest algorithm for classification after applying the PCA
>> technique, since the dataset is pretty high-dimensional. Here's my source
>> code:
>>
>> import org.apache.spark.mllib.util.MLUtils
>> import org.apache.spark.mllib.tree.RandomForest
>> import org.apache.spark.mllib.tree.model.RandomForestModel
>> import org.apache.spark.mllib.regression.LabeledPoint
>> import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
>> import org.apache.spark.sql._
>> import org.apache.spark.sql.SQLContext
>> import org.apache.spark.sql.SparkSession
>>
>> import org.apache.spark.ml.feature.PCA
>> import org.apache.spark.rdd.RDD
>>
>> object PCAExample {
>>   def main(args: Array[String]): Unit = {
>>     val spark = SparkSession
>>       .builder
>>       .master("local[*]")
>>       .config("spark.sql.warehouse.dir", "E:/Exp/")
>>       .appName(s"OneVsRestExample")
>>       .getOrCreate()
>>
>>     val dataset = MLUtils.loadLibSVMFile(spark.sparkContext, "data/mnist.bz2")
>>
>>     val splits = dataset.randomSplit(Array(0.7, 0.3), seed = 12345L)
>>     val (trainingData, testData) = (splits(0), splits(1))
>>
>>     val sqlContext = new SQLContext(spark.sparkContext)
>>     import sqlContext.implicits._
>>     val trainingDF = trainingData.toDF("label", "features")
>>
>>     val pca = new PCA()
>>       .setInputCol("features")
>>       .setOutputCol("pcaFeatures")
>>       .setK(100)
>>       .fit(trainingDF)
>>
>>     val pcaTrainingData = pca.transform(trainingDF)
>>     //pcaTrainingData.show()
>>
>>     val labeled = pca.transform(trainingDF).rdd.map(row => LabeledPoint(
>>       row.getAs[Double]("label"),
>>       row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")))
>>
>>     //val labeled = pca.transform(trainingDF).rdd.map(row => LabeledPoint(row.getAs[Double]("label"),
>>     //  Vector.fromML(row.getAs[org.apache.spark.ml.linalg.SparseVector]("features"))))
>>
>>     val numClasses = 10
>>     val categoricalFeaturesInfo = Map[Int, Int]()
>>     val numTrees = 10 // Use more in practice.
>>     val featureSubsetStrategy = "auto" // Let the algorithm choose.
>>     val impurity = "gini"
>>     val maxDepth = 20
>>     val maxBins = 32
>>
>>     val model = RandomForest.trainClassifier(labeled, numClasses, categoricalFeaturesInfo,
>>       numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
>>   }
>> }
>>
>> However, I'm getting the following error:
>>
>> Exception in thread "main" java.lang.IllegalArgumentException:
>> requirement failed: Column features must be of type
>> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually
>> org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.
>>
>> What am I doing wrong in my code? The exception above is actually thrown
>> at this line:
>>
>> val pca = new PCA()
>>       .setInputCol("features")
>>       .setOutputCol("pcaFeatures")
>>       .setK(100)
>>       .fit(trainingDF) /// GETTING EXCEPTION HERE
>>
>> Could someone please help me solve this problem?
>>
>> Kind regards,
>> Md. Rezaul Karim
>>
>
