[ https://issues.apache.org/jira/browse/SPARK-10513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956832#comment-14956832 ]
Yanbo Liang commented on SPARK-10513:
-------------------------------------

[~mengxr] A simple model training example has been posted [here|https://github.com/yanboliang/Springleaf/blob/master/src/main/scala/com/ybliang/kaggle/Springleaf.scala]. The code may look naive in places because I deliberately use only the components provided by Spark DataFrame and ML/MLlib. I will keep updating it as the issues found in this example are resolved.

I found the Springleaf competition more difficult than San Francisco Crime Classification because of the very large number of columns (1934), the lack of semantic knowledge about the columns, and the missing and erroneous data, but that also makes it more useful for testing the ML pipeline. In general, we should start with a GBT model if we know little about the data. However, SPARK-10055 already has an example using a Decision Tree model, so I train an LR model here to illustrate the usage of logistic regression.

Below are the issues and requirements I found during model training and prediction:

1. Although spark-csv has an option to infer the schema, it can only identify DoubleType and StringType. For example, a timestamp column (which should be loaded as TimestampType) is wrongly identified as StringType, which leads OneHotEncoder to produce a massive number of encoded features.
2. Missing and erroneous data. I think DataFrame should provide a method to replace null or "" values with a user-specified value, just like "train[is.na(train)] <- -1" in R. It would be even better if we could provide methods to remove illegal values. I see that DataFrame has a ".na.drop()" method, but it is not enough.
3. OneHotEncoder can currently only be fitted on columns of DoubleType. I think we can extend it to also support other NumericTypes such as IntegerType.
4. OneHotEncoder should handle "" values in a better way.
5. We often need to use OneHotEncoder to encode multiple columns at the same time, but ML does not provide this ability AFAIK.
6. I found that StringIndexer and OneHotEncoder are often used together. Should we provide a binding of these two feature transformers?

I look forward to your comments on these points, which come from my coding practice. I can submit patches to fix these issues if they are within the scope of the Spark roadmap, and after that I will update the code of this example.

> Springleaf Marketing Response
> -----------------------------
>
>                 Key: SPARK-10513
>                 URL: https://issues.apache.org/jira/browse/SPARK-10513
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Yanbo Liang
>            Assignee: Yanbo Liang
>
> Apply ML pipeline API to Springleaf Marketing Response
> (https://www.kaggle.com/c/springleaf-marketing-response)
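
For illustration only, points 2, 5 and 6 of the comment can be approximated in user code roughly as follows. This is a minimal sketch and not the code from the linked example: the object `CategoricalEncoding`, its two helper methods, the `_idx`/`_vec` column suffixes and the `"missing"` placeholder are all hypothetical names chosen here. It assumes the single-column `setInputCol`/`setOutputCol` setters on `StringIndexer` and `OneHotEncoder`, and the `fill` and `replace` methods of `DataFrameNaFunctions`; whether these cover the cases described in the comment depends on the Spark version in use.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.DataFrame

// Hypothetical helpers, not part of Spark.
object CategoricalEncoding {

  // Point 2: replace nulls in numeric columns with -1, then turn nulls and
  // empty strings in string columns into an explicit placeholder category.
  def fillMissing(df: DataFrame): DataFrame =
    df.na.fill(-1.0)
      .na.fill("missing")
      .na.replace("*", Map("" -> "missing"))

  // Points 5 and 6: build one StringIndexer -> OneHotEncoder pair per column
  // and chain all pairs into a single Pipeline.
  def encodingPipeline(columns: Seq[String]): Pipeline = {
    val stages: Array[PipelineStage] = columns.flatMap { c =>
      val indexer = new StringIndexer()
        .setInputCol(c)
        .setOutputCol(s"${c}_idx")
      val encoder = new OneHotEncoder()
        .setInputCol(s"${c}_idx")
        .setOutputCol(s"${c}_vec")
      Seq[PipelineStage](indexer, encoder)
    }.toArray
    new Pipeline().setStages(stages)
  }
}
```

A caller would run something like `CategoricalEncoding.encodingPipeline(Seq("colA", "colB")).fit(cleaned).transform(cleaned)`, where `colA` and `colB` stand in for real column names and `cleaned` is the output of `fillMissing`. Building the stages in a loop is only a workaround; it does not replace the built-in multi-column support or the combined transformer asked for in the comment.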