I'm still validating my results, but my current solution is below. For now
I'm working with one-hot encoded values, so every value in my array is 1:
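from pyspark.sql import functions as F
from pyspark.ml.linalg import SparseVector, VectorUDT  # assumed path; pyspark.mllib.linalg for the RDD-based API
def udfMaker(feature_len):
    # x is a list of active indices; every value is 1.0 (one-hot)
    return F.udf(lambda x: SparseVector(feature_len, sorted(x),
                                        [1.0] * len(x)), VectorUDT())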
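To make it clearer what the lambda above hands to SparseVector, here is a rough plain-Python sketch of the same (size, indices, values) triple, with no Spark required. The helper name one_hot_parts is made up for illustration, and the de-duplication and range check are my own additions, not part of the UDF. My understanding is that SparseVector wants strictly increasing indices, so repeated indices in x would need handling along these lines.

```python
def one_hot_parts(feature_len, indices):
    """Return the (size, indices, values) triple for a one-hot style sparse row.

    Hypothetical helper: mirrors the arguments the lambda builds, but also
    drops repeated indices and rejects out-of-range ones (assumed extras).
    """
    idx = sorted(set(indices))  # unique indices in ascending order
    if idx and (idx[0] < 0 or idx[-1] >= feature_len):
        raise ValueError("index out of range for feature_len=%d" % feature_len)
    return feature_len, idx, [1.0] * len(idx)


size, idx, vals = one_hot_parts(5, [3, 0, 3])
print(size, idx, vals)  # 5 [0, 3] [1.0, 1.0]
```

If the input lists are already unique, this reduces to the sorted(x) and [1.0]*len(x) arguments used in the UDF.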
I don't know if this is the best way or not, but:
val indexer = new StringIndexer().setInputCol("vr").setOutputCol("vrIdx")
val indexModel = indexer.fit(data)
val indexedData = indexModel.transform(data)
val variables = indexModel.labels.length
val toSeq = udf((a: Double, b: Double) => Seq(a,
I work with a lot of data in long format, where an ID column is repeated
and is followed by a variable column and a value column, like so:
+----+-----+-------+
| ID | var | value |
+----+-----+-------+
| A  | v1  | 1.0   |
| A  | v2  | 2.0   |
| B  | v1  | 1.5   |
| B  | v3  | -1.0  |
+----+-----+-------+
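To pin down the target shape, here is a small plain-Python sketch (no Spark) of turning the long-format rows above into one sparse row per ID. The function name long_to_sparse and the alphabetical column ordering are assumptions made for illustration; an indexer such as StringIndexer may assign indices in a different order, so treat this only as a description of the desired output.

```python
from collections import defaultdict

# Rows from the example table: (ID, var, value)
rows = [("A", "v1", 1.0), ("A", "v2", 2.0), ("B", "v1", 1.5), ("B", "v3", -1.0)]


def long_to_sparse(rows):
    """Group long-format rows by ID into (size, indices, values) triples."""
    # Map each distinct variable to a column index (alphabetical order is an assumption).
    var_index = {v: i for i, v in enumerate(sorted({var for _, var, _ in rows}))}
    grouped = defaultdict(dict)
    for row_id, var, value in rows:
        grouped[row_id][var_index[var]] = value  # a later duplicate overwrites an earlier one
    size = len(var_index)
    sparse = {
        row_id: (size, sorted(cols), [cols[i] for i in sorted(cols)])
        for row_id, cols in grouped.items()
    }
    return sparse, var_index


sparse, var_index = long_to_sparse(rows)
print(var_index)    # {'v1': 0, 'v2': 1, 'v3': 2}
print(sparse["A"])  # (3, [0, 1], [1.0, 2.0])
print(sparse["B"])  # (3, [0, 2], [1.5, -1.0])
```

Each triple has the same shape as the arguments of a sparse vector constructor: total width, sorted column indices, and the matching values.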