Re: Building SparkML vectors from long data

2018-07-03 Thread Patrick McCarthy
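I'm still validating my results, but my solution for the moment looks like the below. I'm presently dealing with one-hot encoded values, so all the numbers in my array are 1:

    from pyspark.sql import functions as F
    from pyspark.ml.linalg import SparseVector, VectorUDT

    def udfMaker(feature_len):
        # x is the list of active feature indices for one row
        return F.udf(lambda x: SparseVector(feature_len, sorted(x), [1.0]*len(x)), VectorUDT())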
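One thing that may be worth checking while validating: as far as I know, SparseVector expects strictly increasing indices, so a repeated index in x would be rejected even after sorting. The plain-Python sketch below shows one way to prepare the arguments defensively. The helper name one_hot_parts is hypothetical and not part of the original solution; it only builds the (size, indices, values) triple and does not need Spark to run.

```python
def one_hot_parts(feature_len, indices):
    """Build (size, indices, values) for a one-hot style sparse vector.

    Hypothetical helper, not from the original post: it de-duplicates and
    sorts the indices, then checks that they fit in a vector of feature_len.
    """
    unique = sorted(set(indices))  # sorted and free of duplicates
    if unique and (unique[0] < 0 or unique[-1] >= feature_len):
        raise ValueError("index out of range for a vector of length %d" % feature_len)
    return feature_len, unique, [1.0] * len(unique)


# A repeated index (3 appears twice) collapses to a single entry.
size, idx, vals = one_hot_parts(5, [3, 0, 3])
print(size, idx, vals)
```

Inside the UDF, the triple could then be unpacked into the constructor, for example SparseVector(*one_hot_parts(feature_len, x)), assuming x is a list of integer indices.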

Re: Building SparkML vectors from long data

2018-06-12 Thread Nathan Kronenfeld
I don't know if this is the best way or not, but:

    val indexer = new StringIndexer().setInputCol("vr").setOutputCol("vrIdx")
    val indexModel = indexer.fit(data)
    val indexedData = indexModel.transform(data)
    val variables = indexModel.labels.length
    val toSeq = udf((a: Double, b: Double) => Seq(a,

Building SparkML vectors from long data

2018-06-12 Thread Patrick McCarthy
I work with a lot of data in a long format, cases in which an ID column is repeated, followed by a variable and a value column like so:

+----+-----+-------+
| ID | var | value |
+----+-----+-------+
| A  | v1  | 1.0   |
| A  | v2  | 2.0   |
| B  | v1  | 1.5   |
| B  | v3  | -1.0  |
+----+-----+-------+
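To make the target shape concrete, here is a minimal plain-Python sketch of turning the rows above into one sparse triple per ID. It is only an illustration and not a Spark solution: the function name long_to_sparse and the alphabetical ordering of variables are assumptions made for this example (Spark's StringIndexer, for instance, orders labels by frequency by default).

```python
from collections import defaultdict

# Rows copied from the example table: (ID, var, value).
rows = [
    ("A", "v1", 1.0),
    ("A", "v2", 2.0),
    ("B", "v1", 1.5),
    ("B", "v3", -1.0),
]


def long_to_sparse(rows):
    """Group long-format rows into one (size, indices, values) triple per ID."""
    # Assumption: variables are indexed in alphabetical order.
    var_index = {v: i for i, v in enumerate(sorted({var for _, var, _ in rows}))}
    by_id = defaultdict(dict)
    for row_id, var, value in rows:
        # If an (ID, var) pair repeats, the later value overwrites the earlier one.
        by_id[row_id][var_index[var]] = value
    size = len(var_index)
    return {
        row_id: (size, sorted(cells), [cells[i] for i in sorted(cells)])
        for row_id, cells in by_id.items()
    }


sparse = long_to_sparse(rows)
print(sparse)
```

Each triple has the same layout as the arguments to a sparse vector: total length, sorted indices of the variables present, and their values.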