Re: one hot encode a column of vector

2017-04-24 Thread Yan Facai
How about using CountVectorizer? It takes a column of string arrays as input.
http://spark.apache.org/docs/latest/ml-features.html#countvectorizer
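
A minimal sketch of that suggestion, not tested against your data. CountVectorizer with binary=True turns each array into a 0/1 vector with one slot per distinct value, which amounts to a multi-hot encoding. The column names `carriers` and `carriers_vec` are made up for illustration, `multi_hot` is a hypothetical plain-Python helper that only shows what the output means, and `encode_with_spark` assumes a pyspark version whose CountVectorizer has the `binary` parameter.

```python
def multi_hot(rows, vocabulary):
    """Plain-Python reference: one 0/1 slot per vocabulary term.

    Hypothetical helper, only here to show what the encoded rows mean.
    """
    index = {term: i for i, term in enumerate(vocabulary)}
    encoded = []
    for tokens in rows:
        vec = [0.0] * len(vocabulary)
        for token in tokens:
            if token in index:  # terms outside the vocabulary are ignored
                vec[index[token]] = 1.0
        encoded.append(vec)
    return encoded


def encode_with_spark(df, input_col="carriers", output_col="carriers_vec"):
    """Fit CountVectorizer on an array<string> column and transform df.

    input_col / output_col are assumed names, not taken from the thread.
    """
    # Imported lazily so the reference helper above runs without pyspark.
    from pyspark.ml.feature import CountVectorizer

    # binary=True caps every count at 1, so repeated values do not add up.
    cv = CountVectorizer(inputCol=input_col, outputCol=output_col, binary=True)
    model = cv.fit(df)
    return model, model.transform(df)


# What the encoding means for two rows and the vocabulary ['CA', 'TG']:
print(multi_hot([["TG", "CA"], ["CA"]], ["CA", "TG"]))
```

With the DataFrame from the question, usage would look like `model, flight4 = encode_with_spark(flight3)`. `model.vocabulary` then lists which value each vector slot stands for. Spark stores the result as sparse vectors, whereas the helper returns dense lists.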





On Tue, Apr 25, 2017 at 9:31 AM, Zeming Yu wrote:

> How do I one-hot encode a column of arrays, e.g. ['TG', 'CA']?
>
>
> FYI, here's my code for one-hot encoding normal categorical columns. How do I
> make it work for a column of arrays?
>
>
> from pyspark.ml import Pipeline
> from pyspark.ml.feature import StringIndexer
>
> # Pipeline.fit() fits each stage, so the indexers are left unfitted here.
> indexers = [StringIndexer(inputCol=column, outputCol=column + "_index")
>             for column in ['ColA', 'ColB', 'ColC']]
>
> pipeline = Pipeline(stages=indexers)
> flight4 = pipeline.fit(flight3).transform(flight3)
>
>
>
>


one hot encode a column of vector

2017-04-24 Thread Zeming Yu
How do I one-hot encode a column of arrays, e.g. ['TG', 'CA']?


FYI, here's my code for one-hot encoding normal categorical columns.
How do I make it work for a column of arrays?


from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

# Pipeline.fit() fits each stage, so the indexers are left unfitted here.
indexers = [StringIndexer(inputCol=column, outputCol=column + "_index")
            for column in ['ColA', 'ColB', 'ColC']]

pipeline = Pipeline(stages=indexers)
flight4 = pipeline.fit(flight3).transform(flight3)