Thanks, Peter. It works! Why is udf needed?
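(A sketch of an answer, for anyone who finds this thread later; this explanation is mine, not from the replies below. Dataset.map needs an implicit Encoder for its result type, and Spark ships no encoder for mllib's Vector, which is exactly the "Unable to find encoder" error quoted further down. A udf sidesteps this because Vector carries a SQL UserDefinedType, so it can live in a DataFrame column without an Encoder. If you really do want map, an explicit kryo encoder compiles, assuming Spark 2.x:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.Encoders
import spark.implicits._

// Workaround sketch, assuming Spark 2.x: supply the missing Encoder
// explicitly so that map compiles.
implicit val vectorEncoder = Encoders.kryo[Vector]

val dataStr = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF
// This now compiles, but the result is a Dataset[Vector] backed by
// kryo-serialized binary, not a queryable vector column.
val parsed = dataStr.map(row => Vectors.parse(row.getString(1)))

Because the kryo-encoded result is opaque binary, the udf route is still the right one when you need a real vector column, e.g. as the features column of an ML estimator.)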
On Wed, Sep 21, 2016 at 12:00 AM, Peter Figliozzi <pete.figlio...@gmail.com> wrote:

> Hi Yan, I agree, it IS really confusing. Here is the technique for
> transforming a column. It is very general because you can make "myConvert"
> do whatever you want.
>
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.sql.functions.{col, udf}
>
> val df = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF
>
> df.show()
> // The columns were named "_1" and "_2".
> // Very confusing, because it looks like a Scala wildcard when we
> // refer to it in code.
>
> val myConvert = (x: String) => { Vectors.parse(x) }
> val myConvertUDF = udf(myConvert)
>
> val newDf = df.withColumn("parsed", myConvertUDF(col("_2")))
>
> newDf.show()
>
> On Mon, Sep 19, 2016 at 3:29 AM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote:
>
>> Hi, all.
>> I find this really confusing.
>>
>> I can use Vectors.parse to create a DataFrame containing a Vector column:
>>
>> scala> val dataVec = Seq((0, Vectors.parse("[1,3,5]")), (1,
>> Vectors.parse("[2,4,6]"))).toDF
>> dataVec: org.apache.spark.sql.DataFrame = [_1: int, _2: vector]
>>
>> But using map to convert String to Vector throws an error:
>>
>> scala> val dataStr = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF
>> dataStr: org.apache.spark.sql.DataFrame = [_1: int, _2: string]
>>
>> scala> dataStr.map(row => Vectors.parse(row.getString(1)))
>> <console>:30: error: Unable to find encoder for type stored in a
>> Dataset. Primitive types (Int, String, etc) and Product types (case
>> classes) are supported by importing spark.implicits._ Support for
>> serializing other types will be added in future releases.
>> dataStr.map(row => Vectors.parse(row.getString(1)))
>>
>> Can anyone help me?
>> Thanks very much!
>>
>> On Tue, Sep 6, 2016 at 9:58 PM, Peter Figliozzi <pete.figlio...@gmail.com> wrote:
>>
>>> Hi Yan, I think you'll have to map the features column to a new
>>> numerical features column.
>>>
>>> Here's one way to do the individual transform:
>>>
>>> scala> val x = "[1, 2, 3, 4, 5]"
>>> x: String = [1, 2, 3, 4, 5]
>>>
>>> scala> val y: Array[Int] = x.slice(1, x.length - 1).replace(",", "").split(" ").map(_.toInt)
>>> y: Array[Int] = Array(1, 2, 3, 4, 5)
>>>
>>> If you don't know about the Scala command line, just type "scala" in a
>>> terminal window. It's a good place to try things out.
>>>
>>> You can make a function out of this transformation and apply it to your
>>> features column to make a new column, then add it with
>>> Dataset.withColumn.
>>>
>>> See here
>>> <http://stackoverflow.com/questions/35227568/applying-function-to-spark-dataframe-column>
>>> on how to apply a function to a Column to make a new column.
>>>
>>> On Tue, Sep 6, 2016 at 1:56 AM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> I have a csv file like:
>>>>
>>>> uid   mid    features         label
>>>> 123   5231   [0, 1, 3, ...]   True
>>>>
>>>> Both the "features" and "label" columns are used for GBTClassifier.
>>>>
>>>> However, when I read the file:
>>>>
>>>> Dataset<Row> samples = sparkSession.read().csv(file);
>>>>
>>>> the type of samples.select("features") is String.
>>>>
>>>> My question is: how can I map samples.select("features") to Vector, or
>>>> any other appropriate type, so that I can use it for training like:
>>>>
>>>> GBTClassifier gbdt = new GBTClassifier()
>>>>     .setLabelCol("label")
>>>>     .setFeaturesCol("features")
>>>>     .setMaxIter(2)
>>>>     .setMaxDepth(7);
>>>>
>>>> Thanks.
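For reference, the whole thread stitched into one runnable sketch, from reading the csv to training. Assumptions not in the original messages: the file path and header row are hypothetical, the label column holds the strings "True"/"False", and the features field is quoted in the file so its internal commas survive csv parsing. Also note that the ml-package GBTClassifier in Spark 2.x expects an ml.linalg vector column, so the mllib parse result is converted with .asML:

import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical file, with columns uid, mid, features, label as in Yan's sample.
val raw = spark.read.option("header", "true").csv("/path/to/samples.csv")

// Parse "[1,3,5]" into a vector (the udf technique from Peter's reply),
// then convert to the ml.linalg type the ml estimators expect.
val parseVec = udf((s: String) => Vectors.parse(s).asML)
// Assumed label encoding: the file stores the strings "True"/"False".
val toLabel = udf((s: String) => if (s == "True") 1.0 else 0.0)

val samples = raw
  .withColumn("features", parseVec(col("features")))
  .withColumn("label", toLabel(col("label")))

val gbdt = new GBTClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(2)
  .setMaxDepth(7)

val model = gbdt.fit(samples)

The udfs do the per-row work, but everything stays in the DataFrame world, which is why no Encoder is ever needed.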