For now, you must follow this approach: construct a pipeline with one StringIndexer per categorical column. See https://issues.apache.org/jira/browse/SPARK-11215 for the related JIRA to allow multiple columns for StringIndexer, which is currently being worked on.
The reason you're seeing an NPE is this line:

var indexers: Array[StringIndexer] = null

You're then trying to append an element to something that is null. Try this instead:

var indexers: Array[StringIndexer] = Array()

But even better is a more functional approach:

val indexers = featureCol.map { colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "_indexed")
}

On Fri, 27 Oct 2017 at 22:29 Md. Rezaul Karim <
rezaul.ka...@insight-centre.org> wrote:

> Hi All,
>
> There are several categorical columns in my dataset as follows:
> [image: grafik.png]
>
> How can I transform the values in each categorical column into numeric
> values using StringIndexer, so that the resulting DataFrame can be fed
> into VectorAssembler to generate a feature vector?
>
> A naive approach would be to apply a StringIndexer to each categorical
> column separately, but that seems clumsy, I know.
> A possible workaround
> <https://stackoverflow.com/questions/36942233/apply-stringindexer-to-several-columns-in-a-pyspark-dataframe>
> in PySpark is to combine several StringIndexers in a list and use a
> Pipeline to execute them all, as follows:
>
> from pyspark.ml import Pipeline
> from pyspark.ml.feature import StringIndexer
>
> indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df)
>             for column in list(set(df.columns) - set(['date']))]
> pipeline = Pipeline(stages=indexers)
> df_r = pipeline.fit(df).transform(df)
> df_r.show()
>
> How can I do the same in Scala?
> I tried the following:
>
> val featureCol = trainingDF.columns
> var indexers: Array[StringIndexer] = null
>
> for (colName <- featureCol) {
>   val index = new StringIndexer()
>     .setInputCol(colName)
>     .setOutputCol(colName + "_indexed")
>     //.fit(trainDF)
>   indexers = indexers :+ index
> }
>
> val pipeline = new Pipeline()
>   .setStages(indexers)
> val newDF = pipeline.fit(trainingDF).transform(trainingDF)
> newDF.show()
>
> However, I am experiencing a NullPointerException at
>
> for (colName <- featureCol)
>
> I am sure I am doing something wrong. Any suggestions?
>
>
>
> Regards,
> _________________________________
> *Md. Rezaul Karim*, BSc, MSc
> Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html
> <http://139.59.184.114/index.html>
>
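For the archive, here is a self-contained sketch of the corrected Scala version, runnable as a script (e.g. in spark-shell). The toy DataFrame and its column names are made up for illustration; `trainingDF` here stands in for the real dataset. Typing the array as `Array[PipelineStage]` keeps it compatible with `setStages` across Spark versions, since Scala's `Array` is invariant.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("StringIndexerPerColumn")
  .getOrCreate()
import spark.implicits._

// Toy data standing in for the real dataset.
val trainingDF = Seq(
  ("red",  "small"),
  ("blue", "large"),
  ("red",  "large")
).toDF("color", "size")

// One StringIndexer per categorical column, built functionally
// instead of appending to a mutable (and null) array.
val indexers: Array[PipelineStage] = trainingDF.columns.map { colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "_indexed")
}

val pipeline = new Pipeline().setStages(indexers)
val newDF = pipeline.fit(trainingDF).transform(trainingDF)
newDF.show()
```

The resulting DataFrame carries one extra `*_indexed` numeric column per input column, ready to be fed into VectorAssembler.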