For now, you must follow this approach of constructing a pipeline
consisting of a StringIndexer for each categorical column. See
https://issues.apache.org/jira/browse/SPARK-11215 for the related JIRA to
allow multiple columns for StringIndexer, which is being worked on
currently.
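Once that JIRA lands, the per-column boilerplate should collapse into a single multi-column indexer. A sketch of the proposed API (the `setInputCols`/`setOutputCols` names come from the JIRA discussion and may change before it is released):

```scala
// Sketch of the multi-column API proposed in SPARK-11215 -- not yet available.
// `categoricalCols` is a hypothetical Array[String] of categorical column names.
val indexer = new StringIndexer()
  .setInputCols(categoricalCols)
  .setOutputCols(categoricalCols.map(_ + "_indexed"))
val indexedDF = indexer.fit(df).transform(df)
```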

The reason you're seeing an NPE is this line:

var indexers: Array[StringIndexer] = null

and then you're trying to append an element to something that is null.

Try this instead:

var indexers: Array[StringIndexer] = Array()


But even better is a more functional approach:

val indexers = featureCol.map { colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "_indexed")
}
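Putting that together with a Pipeline, here is a minimal sketch (assuming `trainingDF` and the `featureCol` array from your code, and that every column in it is categorical):

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.StringIndexer

val featureCol = trainingDF.columns

// One StringIndexer per column; each writes to "<column>_indexed".
val indexers = featureCol.map { colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "_indexed")
}

// Widen to Array[PipelineStage] -- on older Spark versions setStages
// does not accept an Array[StringIndexer] directly.
val pipeline = new Pipeline().setStages(indexers.toArray[PipelineStage])
val newDF = pipeline.fit(trainingDF).transform(trainingDF)
newDF.show()
```

As in the PySpark version, each fitted indexer assigns label indices by descending frequency within its own column, so the `_indexed` columns can then be passed to VectorAssembler.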


On Fri, 27 Oct 2017 at 22:29 Md. Rezaul Karim <
rezaul.ka...@insight-centre.org> wrote:

> Hi All,
>
> There are several categorical columns in my dataset as follows:
> [image: grafik.png]
>
> How can I transform the values in each categorical column into numeric
> values using StringIndexer, so that the resulting DataFrame can be fed into
> VectorAssembler to generate a feature vector?
>
> A naive approach would be to apply a separate StringIndexer to each
> categorical column. But that sounds tedious, I know.
> A possible workaround
> <https://stackoverflow.com/questions/36942233/apply-stringindexer-to-several-columns-in-a-pyspark-dataframe>in
> PySpark is combining several StringIndexer on a list and use a Pipeline
> to execute them all as follows:
>
> from pyspark.ml import Pipeline
> from pyspark.ml.feature import StringIndexer
> indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df)
>             for column in list(set(df.columns) - set(['date']))]
> pipeline = Pipeline(stages=indexers)
> df_r = pipeline.fit(df).transform(df)
> df_r.show()
>
> How I can do the same in Scala? I tried the following:
>
>     val featureCol = trainingDF.columns
>     var indexers: Array[StringIndexer] = null
>
>     for (colName <- featureCol) {
>       val index = new StringIndexer()
>         .setInputCol(colName)
>         .setOutputCol(colName + "_indexed")
>         //.fit(trainDF)
>       indexers = indexers :+ index
>     }
>
>      val pipeline = new Pipeline()
>                     .setStages(indexers)
>     val newDF = pipeline.fit(trainingDF).transform(trainingDF)
>     newDF.show()
>
> However, I am experiencing NullPointerException at
>
> for (colName <- featureCol)
>
> I am sure I am doing something wrong. Any suggestions?
>
>
>
> Regards,
> _________________________________
> *Md. Rezaul Karim*, BSc, MSc
> Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html
>
