[jira] [Commented] (SPARK-10055) San Francisco Crime Classification

2015-10-13 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956276#comment-14956276
 ] 

Xusen Yin commented on SPARK-10055:
---

Yes, I will find a new dataset soon and ping you on JIRA.

> San Francisco Crime Classification
> --
>
> Key: SPARK-10055
> URL: https://issues.apache.org/jira/browse/SPARK-10055
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Kai Sasaki
>
> Apply ML pipeline API to San Francisco Crime Classification 
> (https://www.kaggle.com/c/sf-crime).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10055) San Francisco Crime Classification

2015-08-28 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720232#comment-14720232
 ] 

Xiangrui Meng commented on SPARK-10055:
---

[~yinxusen] Since [~kaisasak] already implemented it, I re-assigned the ticket 
to him. Shall we try another dataset?







[jira] [Commented] (SPARK-10055) San Francisco Crime Classification

2015-08-28 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720245#comment-14720245
 ] 

Xiangrui Meng commented on SPARK-10055:
---

Thanks for posting the feedback! For spark-csv, in the master branch there is 
an option to infer schema automatically. Do you mind checking `RFormula`? It 
handles the string indexers and one-hot encoding automatically.
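
As a rough illustration of what "string indexers and one-hot encoding" amount to, the following plain-Python sketch (not Spark code, and not how {{RFormula}} is implemented) indexes a string column by descending frequency and then one-hot encodes it. The helper names {{fit_string_index}} and {{one_hot}}, the sample values, and the alphabetical tie-break are assumptions made for the example.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index, most frequent first.

    Ties are broken alphabetically; that tie-break is an assumption
    of this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: i for i, value in enumerate(ordered)}


def one_hot(index, size):
    """Return a dense 0/1 list with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Hypothetical values for a categorical column such as a day of the week.
days = ["Friday", "Monday", "Friday", "Sunday", "Friday", "Monday"]
index = fit_string_index(days)
encoded = [one_hot(index[d], len(index)) for d in days]
```

Each distinct string gets its own slot here; a real encoder may drop one category to avoid a redundant column.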







[jira] [Commented] (SPARK-10055) San Francisco Crime Classification

2015-08-25 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711298#comment-14711298
 ] 

Kai Sasaki commented on SPARK-10055:
---


I submitted an initial version for this competition. The score is not good, 
but I ran into several issues while using the Spark ML API. Some of them may 
simply be due to my lack of knowledge of Spark ML, so if existing code already 
solves any of them, please let me know.

* There does not seem to be a {{Transformer}} that can cast column types. In 
  this case, {{X}} and {{Y}} are read as String by default by 
  [spark-csv|http://spark-packages.org/package/databricks/spark-csv]. 
  To apply {{StandardScaler}} to {{X}} and {{Y}}, they must be numeric, and I 
  cannot make them numeric with a Spark ML {{Transformer}}. Fortunately, 
  {{spark-csv}} can infer the schema by reading all the data once. But when 
  the reading library has no such option, I think it would be better to be 
  able to cast column types inside the Spark ML pipeline.
  
* {{StringIndexer}} exports its labels ordered by frequency. But in this 
  competition the output must be written in alphabetical order, so we have to 
  write extra code to convert the frequency-ordered labels to alphabetical 
  order.
  
* {{StandardScaler}} only accepts vector data as input. In this case, I want 
  to scale {{X}} and {{Y}} with {{StandardScaler}}, but they are plain Double 
  columns, so they must first be assembled into a feature vector. Is there a 
  way to apply {{StandardScaler}} directly to plain Int or Double columns, or 
  do we always have to assemble them into a feature vector before scaling?
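
The casting and scaling points in the list above can be sketched in plain Python, independent of Spark: cast the string columns to numbers, assemble them into one vector per row, then scale each dimension. The helper names ({{cast_columns}}, {{assemble}}, {{standardize}}) and the sample rows are hypothetical, and centring before scaling is a choice made for this sketch rather than a statement about {{StandardScaler}} defaults.

```python
from statistics import mean, stdev


def cast_columns(rows, columns, to_type=float):
    """Return copies of `rows` with the named columns converted by `to_type`."""
    return [
        {k: (to_type(v) if k in columns else v) for k, v in row.items()}
        for row in rows
    ]


def assemble(rows, columns):
    """Collect the named numeric columns of each row into one feature vector."""
    return [[row[c] for c in columns] for row in rows]


def standardize(vectors):
    """Centre each dimension and scale it to unit sample standard deviation.

    Assumes at least two rows and no constant dimension (non-zero deviation).
    """
    dims = list(zip(*vectors))
    means = [mean(d) for d in dims]
    stds = [stdev(d) for d in dims]
    return [
        [(x - m) / s for x, m, s in zip(vec, means, stds)]
        for vec in vectors
    ]


# Hypothetical rows, as a CSV reader without schema inference would yield them.
rows = [
    {"X": "1.0", "Y": "10.0"},
    {"X": "2.0", "Y": "20.0"},
    {"X": "3.0", "Y": "30.0"},
]
typed = cast_columns(rows, ["X", "Y"])
scaled = standardize(assemble(typed, ["X", "Y"]))
```

The order of the steps mirrors the constraint described above: the scaler in this sketch only ever sees vectors, so the cast and the assembly have to happen first.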
  
The code is 
[here|https://github.com/Lewuathe/kaggle-jobs/blob/master/src/main/scala/com/lewuathe/SfCrimeClassification.scala].
 Thank you.
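
For the label ordering issue, the extra code mentioned above boils down to a reordering like the following plain-Python sketch. It is not taken from the linked Scala code; the function name, label names and probabilities are made up for illustration.

```python
def reorder_to_alphabetical(labels, probabilities):
    """Reorder per-class probabilities so columns follow sorted label names.

    `labels[i]` names the class that `probabilities[i]` refers to, for
    example the frequency-ordered label list produced by an indexer.
    """
    if len(labels) != len(probabilities):
        raise ValueError("labels and probabilities must have the same length")
    by_label = dict(zip(labels, probabilities))
    header = sorted(labels)
    return header, [by_label[name] for name in header]


# Hypothetical labels in descending-frequency order, with made-up probabilities.
freq_labels = ["THEFT", "ASSAULT", "BURGLARY"]
probs = [0.6, 0.3, 0.1]
header, row = reorder_to_alphabetical(freq_labels, probs)
```

Going through a label-to-value mapping, rather than permuting by position, keeps the result correct even if the frequency order changes between training runs.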


 San Francisco Crime Classification
 --

 Key: SPARK-10055
 URL: https://issues.apache.org/jira/browse/SPARK-10055
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng
Assignee: Xusen Yin

 Apply ML pipeline API to San Francisco Crime Classification 
 (https://www.kaggle.com/c/sf-crime).


