We are working on the pipeline features, which would make this procedure much easier in MLlib. This is still a WIP and the main JIRA is at:
https://issues.apache.org/jira/browse/SPARK-1856

Best,
Xiangrui

On Mon, Oct 27, 2014 at 8:56 AM, chirag lakhani <chirag.lakh...@gmail.com> wrote:
> Hello,
>
> I have been prototyping a text classification model that my company would
> like to eventually put into production. Our technology stack is currently
> Java based, but we would like to be able to build our models in Spark/MLlib
> and then export something like a PMML file which can be used for model
> scoring in real time.
>
> I have been using scikit-learn, where I am able to take the training data,
> convert the text data into a sparse data format, and then take the other
> features and use the dictionary vectorizer to do one-hot encoding for the
> other categorical variables. All of those things seem to be possible in
> MLlib, but I am still puzzled about how that can be packaged in such a way
> that the incoming data can first be made into feature vectors and then
> evaluated as well.
>
> Are there any best practices for this type of thing in Spark? I hope this
> is clear, but if anything is confusing then please let me know.
>
> Thanks,
>
> Chirag
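
For readers following the thread, here is a minimal sketch of the packaging idea raised in the question: keeping feature extraction and scoring in one object so that raw incoming records go through exactly the same steps as the training data. This is plain Python and not the MLlib pipeline API from SPARK-1856. The class name `FeaturePipeline`, its parameters, the CRC32-based token hashing, and the linear scoring rule are all assumptions made for illustration.

```python
import zlib


class FeaturePipeline:
    """Hypothetical helper that bundles featurization and scoring."""

    def __init__(self, num_text_features, categories, weights, bias=0.0):
        # Number of hashed buckets reserved for text tokens.
        self.num_text_features = num_text_features
        # Fixed category -> slot mapping, decided at training time.
        self.cat_index = {c: i for i, c in enumerate(categories)}
        # One weight per hashed bucket plus one per known category.
        self.weights = weights
        self.bias = bias

    def featurize(self, text, category):
        """Return a sparse feature vector as a dict {index: value}."""
        vec = {}
        for token in text.lower().split():
            # CRC32 is deterministic across processes, unlike built-in hash().
            idx = zlib.crc32(token.encode("utf-8")) % self.num_text_features
            vec[idx] = vec.get(idx, 0.0) + 1.0
        # One-hot encode the categorical value; unseen categories are skipped.
        slot = self.cat_index.get(category)
        if slot is not None:
            vec[self.num_text_features + slot] = 1.0
        return vec

    def score(self, text, category):
        """Featurize a raw record, then apply a linear model to it."""
        vec = self.featurize(text, category)
        return self.bias + sum(self.weights[i] * v for i, v in vec.items())


# Usage: 8 hashed text buckets followed by 2 one-hot slots, all weights 1.0.
pipeline = FeaturePipeline(8, ["red", "blue"], [1.0] * 10)
print(pipeline.score("spark mllib spark", "blue"))
```

Because the hashing width and the category mapping live inside the object, whatever gets exported for real-time scoring carries the feature definitions along with the model weights.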