I work with a lot of data in a long format, cases in which an ID column is
repeated, followed by a variable and a value column like so:

+---+-----+-------+
|ID | var | value |
+---+-----+-------+
| A | v1  | 1.0   |
| A | v2  | 2.0   |
| B | v1  | 1.5   |
| B | v3  | -1.0  |
+---+-----+-------+

It seems to me that Spark doesn't provide any clear native way to transform
data of this format into a Vector() or VectorUDT() type suitable for
machine learning algorithms.

The best solution I've found so far (which isn't very good) is to group by
ID, perform a collect_list, and then use a UDF to translate the resulting
array into a vector datatype.

I can get kind of close like so:

import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.ml.feature as MF

indexer = MF.StringIndexer(inputCol='var', outputCol='varIdx')
indexed_df = indexer.fit(long_df).transform(long_df)

(indexed_df
 .withColumn('val', F.concat(F.col('varIdx').astype(T.IntegerType()).astype(T.StringType()),
                             F.lit(':'), F.col('value')))
 .groupBy('ID')
 .agg(F.collect_set('val'))
)

But the resulting 'val' array is out of index order, and the 'idx:value'
strings would still need to be parsed into an actual vector type.

What's the current preferred way to solve a problem like this?
