My expectation is:

root
 |-- tag: vector

Namely, I want to convert from:

[[tagCategory_060, 0.8], [tagCategory_029, 0.7]]

to:

Vectors.sparse(100, Array(60, 29), Array(0.8, 0.7))
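As a rough illustration of that mapping, here is a minimal Spark-free sketch. The names `TagParsing`, `tagIndex`, and `toIndexedWeights` are hypothetical, and it assumes the index is simply the digits after the last underscore in the tag name. It returns the indices in ascending order (29 before 60), because `Vectors.sparse(size, indices, values)` is documented as requiring strictly increasing indices.

```scala
object TagParsing {
  // Hypothetical helper: "tagCategory_060" -> 60.
  // Assumes the index is the run of digits after the last underscore.
  def tagIndex(tag: String): Int =
    tag.substring(tag.lastIndexOf('_') + 1).toInt

  // Turns (category, weight) string pairs into (indices, values),
  // sorted by index so they can be passed to Vectors.sparse(size, indices, values).
  def toIndexedWeights(tags: Seq[(String, String)]): (Array[Int], Array[Double]) = {
    val sorted = tags
      .map { case (category, weight) => (tagIndex(category), weight.toDouble) }
      .sortBy(_._1)
    (sorted.map(_._1).toArray, sorted.map(_._2).toArray)
  }
}
```

With the first row of the data, `toIndexedWeights` yields the indices 29 and 60 paired with the weights 0.7 and 0.8.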
I believe it needs two steps:

1. val tag2vec = {tag: Array[Structure] => Vector}
2. mblog_tags.withColumn("vec", tag2vec(col("tag")))

But I have no idea how to describe the Array[Structure] in the DataFrame.

On Fri, Oct 21, 2016 at 4:51 PM, lk_spark <lk_sp...@163.com> wrote:

> How about changing the schema from
>
> root
>  |-- category.firstCategory: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- category: string (nullable = true)
>  |    |    |-- weight: string (nullable = true)
>
> to:
>
> root
>  |-- category: string (nullable = true)
>  |-- weight: string (nullable = true)
>
> 2016-10-21
> ------------------------------
> lk_spark
> ------------------------------
>
> *From:* 颜发才(Yan Facai) <yaf...@gmail.com>
> *Sent:* 2016-10-21 15:35
> *Subject:* Re: How to iterate the element of an array in DataFrame?
> *To:* "user.spark" <user@spark.apache.org>
> *Cc:*
>
> I don't know how to construct
> `array<struct<category:string, weight:string>>`.
> Could anyone help me?
>
> I tried to get the array with:
> scala> mblog_tags.map(_.getSeq[(String, String)](0))
>
> but the result is:
> res40: org.apache.spark.sql.Dataset[Seq[(String, String)]] = [value: array<struct<_1:string,_2:string>>]
>
> How do I express `struct<string, string>`?
>
> On Thu, Oct 20, 2016 at 4:34 PM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote:
>
>> Hi, I want to extract the attribute `weight` of an array, and combine
>> them to construct a sparse vector.
>>
>> ### My data is like this:
>>
>> scala> mblog_tags.printSchema
>> root
>>  |-- category.firstCategory: array (nullable = true)
>>  |    |-- element: struct (containsNull = true)
>>  |    |    |-- category: string (nullable = true)
>>  |    |    |-- weight: string (nullable = true)
>>
>> scala> mblog_tags.show(false)
>> +------------------------------------------------+
>> |category.firstCategory                          |
>> +------------------------------------------------+
>> |[[tagCategory_060, 0.8], [tagCategory_029, 0.7]]|
>> |[[tagCategory_029, 0.9]]                        |
>> |[[tagCategory_029, 0.8]]                        |
>> +------------------------------------------------+
>>
>> ### And expected:
>> Vectors.sparse(100, Array(60, 29), Array(0.8, 0.7))
>> Vectors.sparse(100, Array(29), Array(0.9))
>> Vectors.sparse(100, Array(29), Array(0.8))
>>
>> How to iterate an array in DataFrame?
>> Thanks.
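
On the question of iterating the array column, one possible sketch (not from the thread, and assuming Spark 2.x with `org.apache.spark.ml.linalg`) is a UDF that receives the column as `Seq[Row]`, which is how Spark hands `array<struct<...>>` values to a Scala UDF. The names `Tag`, `TagsToVector`, `toVector`, and `VectorSize`, as well as the stand-in DataFrame, are hypothetical. Fields are read by position (0 = category, 1 = weight) to match the printed schema, and the index is assumed to be the digits after the last underscore. Null arrays or null elements are not handled.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical stand-in for the struct<category:string, weight:string> element.
case class Tag(category: String, weight: String)

object TagsToVector {
  // Assumed vector size, taken from the expected output in the thread.
  val VectorSize = 100

  // An array<struct<...>> column arrives in a Scala UDF as Seq[Row].
  def toVector(tags: Seq[Row]): Vector = {
    val elements = tags.map { row =>
      val category = row.getString(0) // e.g. "tagCategory_060"
      val weight = row.getString(1)   // e.g. "0.8"
      (category.substring(category.lastIndexOf('_') + 1).toInt, weight.toDouble)
    }
    // This overload takes (index, value) pairs and sorts them by index.
    Vectors.sparse(VectorSize, elements)
  }

  val tag2vec = udf(toVector _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("tags-to-vector")
      .getOrCreate()
    import spark.implicits._

    // Stand-in data shaped like the rows shown in the thread.
    val mblogTags = Seq(
      Tuple1(Seq(Tag("tagCategory_060", "0.8"), Tag("tagCategory_029", "0.7"))),
      Tuple1(Seq(Tag("tagCategory_029", "0.9"))),
      Tuple1(Seq(Tag("tagCategory_029", "0.8")))
    ).toDF("tag")

    // The original column name contains a dot, so it would need backticks:
    // col("`category.firstCategory`")
    mblogTags.withColumn("vec", tag2vec(col("tag"))).show(false)

    spark.stop()
  }
}
```

Using the `Seq[(Int, Double)]` overload of `Vectors.sparse` avoids sorting the indices by hand, so the first row comes out with index 29 before index 60 rather than in the order the tags appear.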