An alternative approach would be to translate your categorical variables into dummy variables. If your strings represent N classes/categories you would generate N-1 dummy variables containing 0/1 values.
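To make the N-1 scheme concrete, here is a minimal sketch in plain Python (no Spark dependency). The function name `dummy_encode` and the sample colour values are made up for illustration, and treating the alphabetically first category as the all-zeros reference level is an assumption rather than anything MLlib prescribes.

```python
def dummy_encode(values):
    """Map each categorical string to N-1 dummy variables holding 0/1.

    The first category in sorted order is used as the reference level
    and is encoded as all zeros (an arbitrary choice for this sketch).
    """
    categories = sorted(set(values))
    # Drop the first category; it becomes the implicit baseline.
    dummy_levels = categories[1:]
    rows = [
        [1.0 if value == level else 0.0 for level in dummy_levels]
        for value in values
    ]
    return dummy_levels, rows


# Three categories (blue, green, red) yield two dummy columns.
levels, rows = dummy_encode(["red", "green", "blue", "green"])
for row in rows:
    print(row)
```

Each resulting row is a list of doubles, so it could be used as the feature values of a LabeledPoint.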
Auto-magically creating dummy variables from categorical data definitely comes in handy. I assume this is what SPARK-1216 is referring to, but I am not sure from the description.

https://issues.apache.org/jira/browse/SPARK-1216

Auto-magically doing the scheme that Sean mentioned is referenced in SPARK-4081, I believe.

https://issues.apache.org/jira/browse/SPARK-4081

On Fri, Jan 16, 2015 at 4:45 PM, Sean Owen <so...@cloudera.com> wrote:

> The implementation accepts an RDD of LabeledPoint only, so you
> couldn't feed in strings from a text file directly. LabeledPoint is a
> wrapper around double values rather than strings. How were you trying
> to create the input then?
>
> No, it only accepts numeric values, although you can encode
> categorical values as 0, 1, 2 ... and tell the implementation which
> features are categorical so that it treats them as such.
>
> On Fri, Jan 16, 2015 at 9:25 PM, Asaf Lahav <asaf.la...@gmail.com> wrote:
> > Hi,
> >
> > I have been playing around with the new version of the Spark MLlib
> > Random forest implementation, and in the process tried it with a
> > file with String features.
> > While training, it fails with:
> > java.lang.NumberFormatException: For input string.
> >
> > Is MLlib Random forest adapted to run on top of numeric data only?
> >
> > Thanks

--
Nick Allen <n...@nickallen.org>
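For reference, the 0, 1, 2 ... encoding that Sean describes could be sketched roughly as follows in plain Python. The helper `index_encode` and the sample rows are hypothetical. The returned dict has the shape {feature index: number of categories}, which I believe is what the `categoricalFeaturesInfo` argument of MLlib's `RandomForest.trainClassifier` expects, but please check that against the docs for your Spark version.

```python
def index_encode(rows, categorical_columns):
    """Replace string categories with 0.0, 1.0, 2.0, ... per column.

    Returns the encoded rows plus a {column index: number of categories}
    dict describing which columns are categorical.
    """
    mappings = {}
    for col in categorical_columns:
        # Sort so the category-to-index assignment is deterministic.
        distinct = sorted({row[col] for row in rows})
        mappings[col] = {value: float(i) for i, value in enumerate(distinct)}
    encoded = [
        [mappings[c][v] if c in mappings else float(v) for c, v in enumerate(row)]
        for row in rows
    ]
    arity = {col: len(mapping) for col, mapping in mappings.items()}
    return encoded, arity


# Column 0 is categorical; column 1 is numeric but was read as a string.
raw = [["sunny", "1.5"], ["rainy", "0.2"], ["sunny", "3.0"]]
encoded, categorical_features_info = index_encode(raw, [0])
print(encoded)
print(categorical_features_info)
```

Parsing every field up front like this avoids the NumberFormatException, since no raw string ever reaches the point where the features are converted to doubles.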