I don't know if this is the best way or not, but:

val indexer = new StringIndexer().setInputCol("vr").setOutputCol("vrIdx")
val indexModel = indexer.fit(data)
val indexedData = indexModel.transform(data)
val variables = indexModel.labels.length

val toSeq = udf((a: Double, b: Double) => Seq(a, b))
val toVector = udf((seq: Seq[Seq[Double]]) => {
  new SparseVector(variables, seq.map(_(0).toInt).toArray,
seq.map(_(1)).toArray)
})
val result = indexedData
  .withColumn("val", toSeq(col("vrIdx"), col("value")))
  .groupBy("ID")
  .agg(collect_set(col("val")).name("collected_val"))
  .withColumn("collected_val",
toVector(col("collected_val")).as[Row](Encoders.javaSerialization(classOf[Row])))


at least works.  The indices still aren't in order in the vector - I don't
know if this matters much, but if it does, it's easy enough to sort them in
toVector (and to remove duplicates)


On Tue, Jun 12, 2018 at 2:24 PM, Patrick McCarthy <pmccar...@dstillery.com>
wrote:

> I work with a lot of data in a long format, cases in which an ID column is
> repeated, followed by a variable and a value column like so:
>
> +---+-----+-------+
> |ID | var | value |
> +---+-----+-------+
> | A | v1  | 1.0   |
> | A | v2  | 2.0   |
> | B | v1  | 1.5   |
> | B | v3  | -1.0  |
> +---+-----+-------+
>
> It seems to me that Spark doesn't provide any clear native way to
> transform data of this format into a Vector() or VectorUDT() type suitable
> for machine learning algorithms.
>
> The best solution I've found so far (which isn't very good) is to group by
> ID, perform a collect_list, and then use a UDF to translate the resulting
> array into a vector datatype.
>
> I can get kind of close like so:
>
> indexer = MF.StringIndexer(inputCol = 'var', outputCol = 'varIdx')
>
> (indexed_df
> .withColumn('val',F.concat(F.col('varIdx').astype(T.
> IntegerType()).astype(T.StringType()), F.lit(':'),F.col('value')))
> .groupBy('ID')
> .agg(F.collect_set('val'))
> )
>
> But the resultant 'val' vector is out of index order, and still would need
> to be parsed.
>
> What's the current preferred way to solve a problem like this?
>

Reply via email to