Re: How to convert Spark MLlib vector to ML Vector?

2017-04-09 Thread Yan Facai
By the way, always try to use the `ml` package instead of `mllib`.

import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.classification.RandomForestClassifier
or
import org.apache.spark.ml.regression.RandomForestRegressor


For more details, see
http://spark.apache.org/docs/latest/ml-classification-regression.html.
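
For example, a rough sketch of the whole PCA + random forest flow done entirely in the `ml` API (untested; it assumes an existing SparkSession named `spark`, and the path, k and tree count are only placeholders taken from the code below):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.PCA

// the DataFrame-based libsvm source already produces ml.linalg vectors
val data = spark.read.format("libsvm").load("data/mnist.bz2")
val Array(train, test) = data.randomSplit(Array(0.7, 0.3), seed = 12345L)

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(100)

// depending on the label metadata, a StringIndexer on the label may be needed first
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("pcaFeatures")
  .setNumTrees(10)

val model = new Pipeline().setStages(Array(pca, rf)).fit(train)
val predictions = model.transform(test)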



On Mon, Apr 10, 2017 at 1:45 PM, 颜发才(Yan Facai)  wrote:

> how about using
>
> val dataset = spark.read.format("libsvm")
>   .option("numFeatures", "780")
>   .load("data/mllib/sample_libsvm_data.txt")
>
> instead of
> val dataset = MLUtils.loadLibSVMFile(spark.sparkContext, "data/mnist.bz2")


Re: How to convert Spark MLlib vector to ML Vector?

2017-04-09 Thread Yan Facai
how about using

val dataset = spark.read.format("libsvm")
  .option("numFeatures", "780")
  .load("data/mllib/sample_libsvm_data.txt")

instead of
val dataset = MLUtils.loadLibSVMFile(spark.sparkContext, "data/mnist.bz2")
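
The point is that the DataFrame-based libsvm source already gives you `ml` vectors, so no conversion is needed before ml.feature.PCA. A quick check (path and feature count as in the snippet above):

val dataset = spark.read.format("libsvm")
  .option("numFeatures", "780")
  .load("data/mllib/sample_libsvm_data.txt")

dataset.printSchema()
// the "features" column is an org.apache.spark.ml.linalg.VectorUDT (printed as "vector"),
// which is exactly what the ml.feature.PCA transformer expects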





On Mon, Apr 10, 2017 at 11:19 AM, Ryan  wrote:

> you could write a udf using the asML method along with some type casting,
> then apply the udf to data after pca.
>
> when using pipeline, that udf need to be wrapped in a customized
> transformer, I think.
>


Re: Does spark 2.1.0 structured streaming support jdbc sink?

2017-04-09 Thread lucas.g...@gmail.com
Interesting, does anyone know whether we'll be seeing a JDBC sink in an upcoming
release?

Thanks!

Gary Lucas

On 9 April 2017 at 13:52, Silvio Fiorito 
wrote:

> The JDBC sink is not in 2.1. You can see here for an example implementation
> using the ForeachWriter sink instead:
> https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html


pandas DF Dstream to Spark DF

2017-04-09 Thread Yogesh Vyas
Hi,

I am writing a PySpark streaming job in which I return a pandas DataFrame as a
DStream. Now I want to save this DStream of DataFrames to a Parquet file. How
can I do that?

I am trying to convert it to a Spark DataFrame, but I am getting multiple
errors. Please suggest how to do this.

Regards,
Yogesh


pandas DF DStream to Spark dataframe

2017-04-09 Thread Yogesh Vyas
Hi,

I am writing a PySpark streaming job in which I return a pandas DataFrame as a
DStream. Now I want to save this DStream of DataFrames to a Parquet file. How
can I do that?

I am trying to convert it to a Spark DataFrame, but I am getting multiple
errors. Please suggest how to do this.


Regards,
Yogesh


spark off heap memory

2017-04-09 Thread Georg Heiler
Hi,
I thought that with the integration of Project Tungsten, Spark would
automatically use off-heap memory.

What are spark.memory.offHeap.size and spark.memory.offHeap.enabled for? Do
I need to manually specify the amount of off-heap memory for Tungsten here?
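
For reference, this is how I would set them if they do need to be set explicitly (just a sketch of my current understanding; the size value is arbitrary):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("offheap-test")
  .master("local[*]")
  .config("spark.memory.offHeap.enabled", "true")  // off-heap execution memory seems to be opt-in
  .config("spark.memory.offHeap.size", "2g")       // apparently must be > 0 when enabled
  .getOrCreate()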

Regards,
Georg


Re: How to convert Spark MLlib vector to ML Vector?

2017-04-09 Thread Ryan
You could write a udf using the asML method along with some type casting,
then apply the udf to the data after PCA.

When using a pipeline, that udf needs to be wrapped in a customized
transformer, I think.
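
Something along these lines should work (a rough sketch, untested; trainingDF and the column name come from the code posted earlier in the thread):

import org.apache.spark.mllib.linalg.{Vector => OldVector}
import org.apache.spark.sql.functions.{col, udf}

// wrap asML in a udf so an mllib vector column becomes an ml vector column
val asMLVector = udf((v: OldVector) => v.asML)

// e.g. convert the "features" column before handing the DataFrame to ml.feature.PCA
val converted = trainingDF.withColumn("features", asMLVector(col("features")))

// if I remember correctly, MLUtils.convertVectorColumnsToML(trainingDF, "features")
// achieves the same conversion without a hand-written udf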

On Sun, Apr 9, 2017 at 10:07 PM, Nick Pentreath wrote:
wrote:

> Why not use the RandomForest from Spark ML?
>


Re: Does spark 2.1.0 structured streaming support jdbc sink?

2017-04-09 Thread Silvio Fiorito
The JDBC sink is not in 2.1. You can see here for an example implementation using 
the ForeachWriter sink instead: 
https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
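
The gist of that approach, roughly (a sketch only; the driver URL, table name and column types below are made up):

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.{ForeachWriter, Row}

// a ForeachWriter that writes each output row over JDBC
class JdbcSink(url: String, user: String, pwd: String) extends ForeachWriter[Row] {
  private var connection: Connection = _
  private var statement: PreparedStatement = _

  override def open(partitionId: Long, version: Long): Boolean = {
    // open a connection per partition/epoch; return true to process this partition
    connection = DriverManager.getConnection(url, user, pwd)
    statement = connection.prepareStatement("INSERT INTO aggregates (k, cnt) VALUES (?, ?)")
    true
  }

  override def process(row: Row): Unit = {
    statement.setString(1, row.getString(0))
    statement.setLong(2, row.getLong(1))
    statement.executeUpdate()
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (connection != null) connection.close()
  }
}

// usage:
// streamingDF.writeStream
//   .foreach(new JdbcSink("jdbc:mysql://host:3306/db", "user", "pwd"))
//   .outputMode("complete")
//   .start()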




Spark 2.1 and Hive Metastore

2017-04-09 Thread Benjamin Kim
I’m curious whether and when Spark SQL will ever remove its dependency on the Hive 
Metastore. Now that Spark 2.1’s SparkSession has superseded the need for 
HiveContext, are there plans for Spark to replace the Hive Metastore service with a 
“SparkSchema” service backed by a PostgreSQL, MySQL, etc. database? Hive is growing 
long in the tooth, and it would be nice to retire it someday.

Cheers,
Ben



Does spark 2.1.0 structured streaming support jdbc sink?

2017-04-09 Thread Hemanth Gudela
Hello Everyone,
I am new to Spark, especially spark streaming.

I am trying to read an input stream from Kafka, perform windowed aggregations 
in spark using structured streaming, and finally write aggregates to a sink.
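
For concreteness, here is a simplified version of what I am doing (it assumes a SparkSession named spark; the broker address, topic name and window length are just placeholders):

import org.apache.spark.sql.functions._
import spark.implicits._

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "events")
  .load()

// windowed count per value, using the timestamp column provided by the Kafka source
val streamingDF = input
  .selectExpr("CAST(value AS STRING) AS value", "timestamp")
  .groupBy(window($"timestamp", "10 minutes"), $"value")
  .count()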

-  MySQL as an output sink doesn’t seem to be an option, because this 
block of code throws an error

streamingDF.writeStream.format("jdbc").start("jdbc:mysql…”)

java.lang.UnsupportedOperationException: Data source jdbc does not support 
streamed writing

This is strange because this document shows that jdbc is supported as an 
output sink!



-  Parquet doesn’t seem to be an option, because it doesn’t support 
“complete” output mode, but “append” only. As I’m performing windowed 
aggregations in spark streaming, the output mode has to be complete, and cannot 
be “append”.


-  Memory and console sinks are good for debugging, but are not 
suitable for production jobs.

So, please correct me if I’m missing something in my code to enable the jdbc 
output sink.
If the jdbc output sink is not an option, please suggest an alternative output 
sink that suits my needs better.

Or, since structured streaming is still ‘alpha’, should I resort to Spark 
DStreams to achieve the use case described above?
Please suggest.

Thanks in advance,
Hemanth


Re: Why dataframe can be more efficient than dataset?

2017-04-09 Thread Koert Kuipers
In this case there is no difference in performance. Both will do the
operation directly on the internal representation of the data (the
InternalRow).

It is also worth pointing out that switching back and forth between
Dataset[X] and DataFrame is free.
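
You can check both points in the shell by comparing physical plans; a quick sketch (it reuses the Person case class from the quoted code below):

case class Person(name: String, age: Long)
val ds = Seq(Person("A", 32), Person("B", 18)).toDS()
val df = ds.toDF()            // Dataset[Person] -> DataFrame, no copy of the data

// both filters go through Catalyst and yield the same physical plan,
// operating directly on InternalRow
df.filter("age < 20").explain()
ds.filter("age < 20").explain()

// converting back is just as cheap
val ds2 = df.as[Person]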

On Sun, Apr 9, 2017 at 1:28 PM, Shiyuan  wrote:

> Thank you for the detailed explanation! You point out two reasons why
> Dataset is not as efficient as DataFrame:
> 1). Spark cannot look into lambdas and therefore cannot optimize.
> 2). The type conversion occurs under the hood, e.g. from X to internal
> row.
>
> Just to check my understanding: some methods of Dataset can also take a sql
> expression string instead of a lambda function. In this case, does the
> type conversion still happen under the hood, so that Dataset is still
> not as efficient as DataFrame?  Here is the code,
>
> //define a dataset and a dataframe, same content, but one is stored as a
> Dataset[Person], the other as a DataFrame (i.e. Dataset[Row])
> scala> case class Person(name: String, age: Long)
> scala> val ds = Seq(Person("A",32), Person("B", 18)).toDS
> ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
> scala> val df = Seq(Person("A",32), Person("B", 18)).toDF
> df: org.apache.spark.sql.DataFrame = [name: string, age: bigint]
>
> //Which filtering is more efficient? both use sql expression string.
> scala> df.filter("age < 20")
> res7: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [name:
> string, age: bigint]
>
> scala> ds.filter("age < 20")
> res8: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
>
>
>
>
>
>
>
>


Re: Why dataframe can be more efficient than dataset?

2017-04-09 Thread Shiyuan
Thank you for the detailed explanation! You point out two reasons why
Dataset is not as efficient as DataFrame:
1). Spark cannot look into lambdas and therefore cannot optimize.
2). The type conversion occurs under the hood, e.g. from X to internal
row.

Just to check my understanding: some methods of Dataset can also take a sql
expression string instead of a lambda function. In this case, does the
type conversion still happen under the hood, so that Dataset is still
not as efficient as DataFrame?  Here is the code,

//define a dataset and a dataframe, same content, but one is stored as a
Dataset[Person], the other as a DataFrame (i.e. Dataset[Row])
scala> case class Person(name: String, age: Long)
scala> val ds = Seq(Person("A",32), Person("B", 18)).toDS
ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
scala> val df = Seq(Person("A",32), Person("B", 18)).toDF
df: org.apache.spark.sql.DataFrame = [name: string, age: bigint]

//Which filtering is more efficient? both use sql expression string.
scala> df.filter("age < 20")
res7: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [name:
string, age: bigint]

scala> ds.filter("age < 20")
res8: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
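
One way to check this is to compare the physical plans of an expression filter and a lambda filter (a sketch, using the ds defined above; the comments restate my understanding from the earlier explanation):

// Catalyst can see into the expression string, so the predicate is applied
// directly to the internal representation (InternalRow)
ds.filter("age < 20").explain()

// the lambda is opaque to Catalyst: each internal row has to be deserialized
// into a Person before the function can be applied
ds.filter(_.age < 20).explain()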








On Sat, Apr 8, 2017 at 7:22 PM, Koert Kuipers  wrote:

> how would you use only relational transformations on dataset?
>
> On Sat, Apr 8, 2017 at 2:15 PM, Shiyuan  wrote:
>
>> Hi Spark-users,
>> I came across a few sources which mentioned DataFrame can be more
>> efficient than Dataset.  I can understand this is true because Dataset
>> allows functional transformation which Catalyst cannot look into and hence
>> cannot optimize well. But can DataFrame be more efficient than Dataset even
>> if we only use the relational transformation on dataset? If so, can anyone
>> give some explanation why  it is so? Any benchmark comparing dataset vs.
>> dataframe?   Thank you!
>>
>> Shiyuan
>>
>
>


Re: How to convert Spark MLlib vector to ML Vector?

2017-04-09 Thread Nick Pentreath
Why not use the RandomForest from Spark ML?

On Sun, 9 Apr 2017 at 16:01, Md. Rezaul Karim <
rezaul.ka...@insight-centre.org> wrote:

> I have already posted this question to StackOverflow. However, I have not
> gotten any response yet. I'm trying to use the RandomForest algorithm for
> classification after applying the PCA technique, since the dataset is pretty
> high-dimensional. Here's my source code:
>
> import org.apache.spark.mllib.util.MLUtils
> import org.apache.spark.mllib.tree.RandomForest
> import org.apache.spark.mllib.tree.model.RandomForestModel
> import org.apache.spark.mllib.regression.LabeledPoint
> import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
> import org.apache.spark.sql._
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.SparkSession
>
> import org.apache.spark.ml.feature.PCA
> import org.apache.spark.rdd.RDD
>
> object PCAExample {
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession
>   .builder
>   .master("local[*]")
>   .config("spark.sql.warehouse.dir", "E:/Exp/")
>   .appName(s"OneVsRestExample")
>   .getOrCreate()
>
> val dataset = MLUtils.loadLibSVMFile(spark.sparkContext, "data/mnist.bz2")
>
> val splits = dataset.randomSplit(Array(0.7, 0.3), seed = 12345L)
> val (trainingData, testData) = (splits(0), splits(1))
>
> val sqlContext = new SQLContext(spark.sparkContext)
> import sqlContext.implicits._
> val trainingDF = trainingData.toDF("label", "features")
>
> val pca = new PCA()
>   .setInputCol("features")
>   .setOutputCol("pcaFeatures")
>   .setK(100)
>   .fit(trainingDF)
>
> val pcaTrainingData = pca.transform(trainingDF)
> //pcaTrainingData.show()
>
> val labeled = pca.transform(trainingDF).rdd.map(row => LabeledPoint(
>   row.getAs[Double]("label"),
>   row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")))
>
> //val labeled = pca.transform(trainingDF).rdd.map(row => 
> LabeledPoint(row.getAs[Double]("label"),
> //  
> Vector.fromML(row.getAs[org.apache.spark.ml.linalg.SparseVector]("features"
>
> val numClasses = 10
> val categoricalFeaturesInfo = Map[Int, Int]()
> val numTrees = 10 // Use more in practice.
> val featureSubsetStrategy = "auto" // Let the algorithm choose.
> val impurity = "gini"
> val maxDepth = 20
> val maxBins = 32
>
> val model = RandomForest.trainClassifier(labeled, numClasses, 
> categoricalFeaturesInfo,
>   numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
>   }
> }
>
> However, I'm getting the following error:
>
> *Exception in thread "main" java.lang.IllegalArgumentException:
> requirement failed: Column features must be of type
> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually
> org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.*
>
> What am I doing wrong in my code?  Actually, I'm getting the above
> exception in this line:
>
> val pca = new PCA()
>   .setInputCol("features")
>   .setOutputCol("pcaFeatures")
>   .setK(100)
>   .fit(trainingDF) /// GETTING EXCEPTION HERE
>
> Please, someone, help me to solve the problem.
>
>
>
>
>
> Kind regards,
> *Md. Rezaul Karim*
>


How to convert Spark MLlib vector to ML Vector?

2017-04-09 Thread Md. Rezaul Karim
I have already posted this question to StackOverflow. However, I have not
gotten any response yet. I'm trying to use the RandomForest algorithm for
classification after applying the PCA technique, since the dataset is pretty
high-dimensional. Here's my source code:

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SparkSession

import org.apache.spark.ml.feature.PCA
import org.apache.spark.rdd.RDD

object PCAExample {
  def main(args: Array[String]): Unit = {
val spark = SparkSession
  .builder
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "E:/Exp/")
  .appName(s"OneVsRestExample")
  .getOrCreate()

val dataset = MLUtils.loadLibSVMFile(spark.sparkContext, "data/mnist.bz2")

val splits = dataset.randomSplit(Array(0.7, 0.3), seed = 12345L)
val (trainingData, testData) = (splits(0), splits(1))

val sqlContext = new SQLContext(spark.sparkContext)
import sqlContext.implicits._
val trainingDF = trainingData.toDF("label", "features")

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(100)
  .fit(trainingDF)

val pcaTrainingData = pca.transform(trainingDF)
//pcaTrainingData.show()

val labeled = pca.transform(trainingDF).rdd.map(row => LabeledPoint(
  row.getAs[Double]("label"),
  row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")))

//val labeled = pca.transform(trainingDF).rdd.map(row =>
LabeledPoint(row.getAs[Double]("label"),
//  
Vector.fromML(row.getAs[org.apache.spark.ml.linalg.SparseVector]("features"

val numClasses = 10
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 20
val maxBins = 32

val model = RandomForest.trainClassifier(labeled, numClasses,
categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
  }
}

However, I'm getting the following error:

*Exception in thread "main" java.lang.IllegalArgumentException: requirement
failed: Column features must be of type
org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually
org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.*

What am I doing wrong in my code?  Actually, I'm getting the above
exception in this line:

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(100)
  .fit(trainingDF) /// GETTING EXCEPTION HERE

Please, someone, help me to solve the problem.





Kind regards,
*Md. Rezaul Karim*