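How about using CountVectorizer? It accepts a column of string arrays, and with binary=True each row becomes a 0/1 vector over the fitted vocabulary, which is effectively a one-hot (multi-hot) encoding of the array.
http://spark.apache.org/docs/latest/ml-features.html#countvectorizer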
On Tue, Apr 25, 2017 at 9:31 AM, Zeming Yu <zemin...@gmail.com> wrote:
> How do I one-hot encode a column of arrays, e.g. ['TG', 'CA']?
>
> FYI, here's my code for one-hot encoding normal categorical columns. How do I
> make it work for a column of arrays?
>
> from pyspark.ml import Pipeline
> from pyspark.ml.feature import StringIndexer
>
> indexers = [StringIndexer(inputCol=column,
>     outputCol=column+"_index").fit(flight3) for column in list(set(['ColA',
>     'ColB', 'ColC']))]
>
> pipeline = Pipeline(stages=indexers)
> flight4 = pipeline.fit(flight3).transform(flight3)