How about loading the data as a DataFrame, so that the features column already holds ml.linalg vectors:

val dataset = spark.read.format("libsvm")
  .option("numFeatures", "780")
  .load("data/mllib/sample_libsvm_data.txt")

instead of the RDD-based loader, which returns mllib.linalg vectors:
val dataset = MLUtils.loadLibSVMFile(spark.sparkContext, "data/mnist.bz2")
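
To make that concrete, here is a rough, untested sketch of the whole flow: load through the libsvm DataFrame source, fit ml.feature.PCA, then train the RandomForestClassifier from Spark ML that Nick suggested, so no mllib/ml vector conversion is needed anywhere. The object name PcaThenForest and the load helper are made up for illustration; the path comes from the original code, numFeatures from the snippet above, and the tree settings from the original code. I am assuming Spark 2.1 or so.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.PCA
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical object name, for illustration only.
object PcaThenForest {

  // Load a LIBSVM file through the DataFrame reader. The result has a
  // "label" column and a "features" column of ml.linalg vectors, which is
  // the type ml.feature.PCA checks for.
  def load(spark: SparkSession, path: String, numFeatures: Int): DataFrame =
    spark.read.format("libsvm")
      .option("numFeatures", numFeatures.toString)
      .load(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("PcaThenForest")
      .getOrCreate()

    // Path and feature count are taken from this thread, not verified.
    val dataset = load(spark, "data/mnist.bz2", 780)
    val Array(trainingDF, testDF) =
      dataset.randomSplit(Array(0.7, 0.3), seed = 12345L)

    // Fit PCA on the training split only.
    val pcaModel = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(100)
      .fit(trainingDF)

    // DataFrame-based random forest, trained on the PCA output column.
    // Assumption: labels are 0.0 .. 9.0; some Spark versions may also want
    // the label column indexed with StringIndexer first.
    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("pcaFeatures")
      .setNumTrees(10)
      .setMaxDepth(20)
      .setMaxBins(32)

    val model = rf.fit(pcaModel.transform(trainingDF))

    // Reuse the same fitted PCA model on the test split.
    model.transform(pcaModel.transform(testDF))
      .select("label", "prediction")
      .show(5)

    spark.stop()
  }
}
```

Since everything stays in the DataFrame API, the PCA and the classifier could also be chained as stages of a Pipeline.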

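If you would rather keep MLUtils.loadLibSVMFile and the mllib RandomForest, the UDF route Ryan describes below could look roughly like this. Note that the mllib-to-ml conversion (asML) has to happen before PCA.fit, since that is where the exception is raised; going back to mllib vectors for LabeledPoint after PCA uses Vectors.fromML. VectorBridge, toML, withMLFeatures and toLabeledPoints are hypothetical names, and this is an untested sketch assuming Spark 2.x.

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper object, for illustration only.
object VectorBridge {

  // mllib -> ml: wraps the asML method in a UDF.
  val toML = udf((v: OldVector) => v.asML)

  // Replace an mllib vector column with its ml.linalg equivalent so that
  // ml.feature.PCA accepts the DataFrame.
  def withMLFeatures(df: DataFrame, column: String = "features"): DataFrame =
    df.withColumn(column, toML(col(column)))

  // ml -> mllib: turn the PCA output back into LabeledPoints for the
  // RDD-based RandomForest.trainClassifier.
  def toLabeledPoints(df: DataFrame,
                      featuresCol: String = "pcaFeatures"): RDD[LabeledPoint] =
    df.rdd.map { row =>
      LabeledPoint(
        row.getAs[Double]("label"),
        OldVectors.fromML(row.getAs[MLVector](featuresCol)))
    }
}
```

In the original program that would mean calling VectorBridge.withMLFeatures(trainingData.toDF("label", "features")) before fitting the PCA, and VectorBridge.toLabeledPoints(pca.transform(...)) before RandomForest.trainClassifier. MLUtils.convertVectorColumnsToML(trainingDF, "features") should do the first step without a hand-written UDF.
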




On Mon, Apr 10, 2017 at 11:19 AM, Ryan <ryan.hd....@gmail.com> wrote:

> You could write a UDF using the asML method, along with some type casting,
> and then apply the UDF to the data after PCA.
>
> When using a Pipeline, that UDF needs to be wrapped in a custom
> Transformer, I think.
>
> On Sun, Apr 9, 2017 at 10:07 PM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> Why not use the RandomForest from Spark ML?
>>
>> On Sun, 9 Apr 2017 at 16:01, Md. Rezaul Karim <
>> rezaul.ka...@insight-centre.org> wrote:
>>
>>> I have already posted this question on StackOverflow
>>> <http://stackoverflow.com/questions/43263942/how-to-convert-spark-mllib-vector-to-ml-vector>,
>>> but have not received any response. I'm trying to use the
>>> RandomForest algorithm for classification after applying PCA,
>>> since the dataset is pretty high-dimensional. Here's my source
>>> code:
>>>
>>> import org.apache.spark.mllib.util.MLUtils
>>> import org.apache.spark.mllib.tree.RandomForest
>>> import org.apache.spark.mllib.tree.model.RandomForestModel
>>> import org.apache.spark.mllib.regression.LabeledPoint
>>> import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
>>> import org.apache.spark.sql._
>>> import org.apache.spark.sql.SQLContext
>>> import org.apache.spark.sql.SparkSession
>>>
>>> import org.apache.spark.ml.feature.PCA
>>> import org.apache.spark.rdd.RDD
>>>
>>> object PCAExample {
>>>   def main(args: Array[String]): Unit = {
>>>     val spark = SparkSession
>>>       .builder
>>>       .master("local[*]")
>>>       .config("spark.sql.warehouse.dir", "E:/Exp/")
>>>       .appName(s"OneVsRestExample")
>>>       .getOrCreate()
>>>
>>>     val dataset = MLUtils.loadLibSVMFile(spark.sparkContext, "data/mnist.bz2")
>>>
>>>     val splits = dataset.randomSplit(Array(0.7, 0.3), seed = 12345L)
>>>     val (trainingData, testData) = (splits(0), splits(1))
>>>
>>>     val sqlContext = new SQLContext(spark.sparkContext)
>>>     import sqlContext.implicits._
>>>     val trainingDF = trainingData.toDF("label", "features")
>>>
>>>     val pca = new PCA()
>>>       .setInputCol("features")
>>>       .setOutputCol("pcaFeatures")
>>>       .setK(100)
>>>       .fit(trainingDF)
>>>
>>>     val pcaTrainingData = pca.transform(trainingDF)
>>>     //pcaTrainingData.show()
>>>
>>>     val labeled = pca.transform(trainingDF).rdd.map(row => LabeledPoint(
>>>       row.getAs[Double]("label"),
>>>       row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")))
>>>
>>>     //val labeled = pca.transform(trainingDF).rdd.map(row => LabeledPoint(row.getAs[Double]("label"),
>>>     //  Vector.fromML(row.getAs[org.apache.spark.ml.linalg.SparseVector]("features"))))
>>>
>>>     val numClasses = 10
>>>     val categoricalFeaturesInfo = Map[Int, Int]()
>>>     val numTrees = 10 // Use more in practice.
>>>     val featureSubsetStrategy = "auto" // Let the algorithm choose.
>>>     val impurity = "gini"
>>>     val maxDepth = 20
>>>     val maxBins = 32
>>>
>>>     val model = RandomForest.trainClassifier(labeled, numClasses, categoricalFeaturesInfo,
>>>       numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
>>>   }
>>> }
>>>
>>> However, I'm getting the following error:
>>>
>>> *Exception in thread "main" java.lang.IllegalArgumentException:
>>> requirement failed: Column features must be of type
>>> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually
>>> org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.*
>>>
>>> What am I doing wrong in my code? The above exception is thrown
>>> at this line:
>>>
>>> val pca = new PCA()
>>>       .setInputCol("features")
>>>       .setOutputCol("pcaFeatures")
>>>       .setK(100)
>>>       .fit(trainingDF) /// GETTING EXCEPTION HERE
>>>
>>> Please, someone, help me to solve the problem.
>>>
>>>
>>>
>>>
>>>
>>> Kind regards,
>>> *Md. Rezaul Karim*
>>>
>>
>
