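How about using CountVectorizer? With binary=True it outputs a 0/1 indicator
vector per array, which is effectively a one-hot (multi-hot) encoding: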
http://spark.apache.org/docs/latest/ml-features.html#countvectorizer





On Tue, Apr 25, 2017 at 9:31 AM, Zeming Yu <zemin...@gmail.com> wrote:

> How do I one-hot encode a column of arrays, e.g. ['TG', 'CA']?
>
>
> FYI, here's my code for one-hot encoding normal categorical columns. How do
> I make it work for a column of arrays?
>
>
> from pyspark.ml import Pipeline
> from pyspark.ml.feature import StringIndexer
>
> # One StringIndexer per categorical column; Pipeline.fit() fits each stage.
> indexers = [StringIndexer(inputCol=column, outputCol=column + "_index")
>             for column in ["ColA", "ColB", "ColC"]]
>
> pipeline = Pipeline(stages=indexers)
> flight4 = pipeline.fit(flight3).transform(flight3)
>
>
>
>
