I work with a lot of data in long format, where an ID column is repeated alongside a variable column and a value column, like so:

    +---+-----+-------+
    |ID | var | value |
    +---+-----+-------+
    | A | v1  | 1.0   |
    | A | v2  | 2.0   |
    | B | v1  | 1.5   |
    | B | v3  | -1.0  |
    +---+-----+-------+

It seems to me that Spark doesn't provide any clear native way to transform data in this format into a Vector() or VectorUDT() type suitable for machine learning algorithms. The best solution I've found so far (which isn't very good) is to group by ID, perform a collect_list, and then use a UDF to translate the resulting array into a vector datatype.

I can get kind of close like so:

    from pyspark.ml import feature as MF
    from pyspark.sql import functions as F, types as T

    # df is the long-format DataFrame shown above
    indexer = MF.StringIndexer(inputCol='var', outputCol='varIdx')
    indexed_df = indexer.fit(df).transform(df)

    # Build "index:value" strings and gather them per ID
    (indexed_df
        .withColumn('val', F.concat(
            F.col('varIdx').astype(T.IntegerType()).astype(T.StringType()),
            F.lit(':'),
            F.col('value')))
        .groupBy('ID')
        .agg(F.collect_set('val').alias('val'))
    )

But the resulting 'val' column is an array of strings that is out of index order and would still need to be parsed into a vector.

What's the current preferred way to solve a problem like this?
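
For concreteness, the group-by, collect_list, and UDF approach described above could be sketched roughly as follows. This is only an illustration, not a claim about the preferred solution: `sorted_sparse_parts` and `long_to_vectors` are hypothetical helper names, the column names match the example table, and it assumes each (ID, var) pair appears at most once and that no values are null. Collecting (index, value) structs instead of concatenated strings avoids the parsing step, and sorting inside the UDF takes care of the ordering problem.

```python
def sorted_sparse_parts(pairs):
    """Turn unordered (index, value) pairs into index-sorted parallel lists."""
    ordered = sorted((int(i), float(v)) for i, v in pairs)
    indices = [i for i, _ in ordered]
    values = [v for _, v in ordered]
    return indices, values


def long_to_vectors(df, id_col="ID", var_col="var", value_col="value"):
    """Hypothetical helper: long-format DataFrame -> one SparseVector per ID."""
    # Imported here so the pure-Python helper above can be used without Spark.
    from pyspark.ml.feature import StringIndexer
    from pyspark.ml.linalg import SparseVector, VectorUDT
    from pyspark.sql import functions as F

    model = StringIndexer(inputCol=var_col, outputCol="varIdx").fit(df)
    size = len(model.labels)  # one vector slot per distinct variable

    @F.udf(returnType=VectorUDT())
    def to_vector(pairs):
        # pairs arrives as a list of Rows with fields idx and val
        indices, values = sorted_sparse_parts((p["idx"], p["val"]) for p in pairs)
        return SparseVector(size, indices, values)

    return (
        model.transform(df)
        .groupBy(id_col)
        .agg(F.collect_list(F.struct(
            F.col("varIdx").cast("int").alias("idx"),
            F.col(value_col).cast("double").alias("val"),
        )).alias("pairs"))
        .select(id_col, to_vector("pairs").alias("features"))
    )
```

Calling `long_to_vectors(df)` on the example table should give one row per ID with a `features` column of sparse vectors. Which slot each variable lands in depends on how StringIndexer orders the labels.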