[ 
https://issues.apache.org/jira/browse/SPARK-10513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956832#comment-14956832
 ] 

Yanbo Liang commented on SPARK-10513:
-------------------------------------

[~mengxr] A simple model training example has been posted 
[here|https://github.com/yanboliang/Springleaf/blob/master/src/main/scala/com/ybliang/kaggle/Springleaf.scala].
 The code looks naive in places because I want to use only the components 
provided by Spark DataFrame and ML/MLlib. I will keep updating the code as the 
issues found in this example are resolved.

I found the Springleaf competition more difficult than the San Francisco crime 
classification because of the large number of columns (1934), little semantic 
knowledge of the columns, and missing and erroneous data, but that makes it 
more useful for testing the ML pipeline.

In the general case, we should start with a GBT model if we have little 
knowledge of the data. However, because SPARK-10055 already has an example 
using a Decision Tree model, I chose to train an LR model to illustrate the 
usage of logistic regression.

I list the issues and requirements that I have found during the model training 
and prediction process:

1, Although spark-csv has an option to infer the schema, it can only identify 
DoubleType and StringType. For example, a timestamp column (which should be 
loaded as TimestampType) is mistakenly identified as StringType, which leads 
OneHotEncoder to produce a massive number of encoded features.
2, Missing and erroneous data. I think DataFrame should provide a method to 
replace null or "" values with a user-specified value, just like 
"train[is.na(train)] <- -1" in R. It would be even better if we could provide 
methods to remove illegal values. I see that DataFrame has a ".na.drop()" 
method, but it is not enough.
3, OneHotEncoder can currently only be applied to columns of DoubleType. I 
think we can extend it to support other NumericTypes such as IntegerType.
4, OneHotEncoder should handle "" values in a better way.
5, We often need to use OneHotEncoder to encode multiple columns at the same 
time, but ML does not provide this ability AFAIK.
6, I found that StringIndexer and OneHotEncoder are often used together. Should 
we provide a combination of these two feature transformers?
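
As a rough sketch of how items 2, 5 and 6 might be worked around with the 
existing API: DataFrameNaFunctions (df.na) has fill and replace in addition to 
drop, and a StringIndexer/OneHotEncoder pair can be generated per column and 
handed to a Pipeline. The object and method names, the "_idx"/"_vec" suffixes, 
and the placeholder values -1.0 and "NA" are assumptions made for illustration, 
not part of the linked example.

```scala
import org.apache.spark.ml.PipelineStage
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.DataFrame

// Hypothetical helper object; none of these names come from the linked code.
object SpringleafSketch {

  // Item 2: replace missing values with user-chosen placeholders.
  // -1.0 and "NA" are arbitrary choices for this sketch.
  def fillMissing(df: DataFrame, stringCols: Seq[String]): DataFrame = {
    df.na.fill(-1.0)                           // null/NaN in numeric columns -> -1.0
      .na.fill("NA", stringCols)               // null in the listed string columns -> "NA"
      .na.replace(stringCols, Map("" -> "NA")) // "" in the listed string columns -> "NA"
  }

  // Items 5 and 6: build one StringIndexer + OneHotEncoder pair per column,
  // so that several categorical columns can be encoded by a single Pipeline.
  def indexAndEncodeStages(cols: Seq[String]): Array[PipelineStage] =
    cols.flatMap { c =>
      val indexer = new StringIndexer()
        .setInputCol(c)
        .setOutputCol(c + "_idx")
      val encoder = new OneHotEncoder()
        .setInputCol(c + "_idx")
        .setOutputCol(c + "_vec")
      Seq[PipelineStage](indexer, encoder)
    }.toArray
}
```

The returned array can be passed to new Pipeline().setStages(...), optionally 
followed by a VectorAssembler over the "_vec" columns. This does not address 
item 4, and it still creates two stages per column rather than a single 
combined transformer.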

I look forward to your comments on these points from my coding practice. I can 
submit patches to fix these issues if they are within the scope of the Spark 
roadmap, and after that I will update the code of this example.

> Springleaf Marketing Response
> -----------------------------
>
>                 Key: SPARK-10513
>                 URL: https://issues.apache.org/jira/browse/SPARK-10513
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Yanbo Liang
>            Assignee: Yanbo Liang
>
> Apply ML pipeline API to Springleaf Marketing Response 
> (https://www.kaggle.com/c/springleaf-marketing-response)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
