[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-12-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047905#comment-15047905
 ] 

Joseph K. Bradley commented on SPARK-8517:
--

I merged this PR, but am leaving the JIRA open since it has remaining subtasks.

> Improve the organization and style of MLlib's user guide
>
> Key: SPARK-8517
> URL: https://issues.apache.org/jira/browse/SPARK-8517
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, ML, MLlib
> Reporter: Xiangrui Meng
> Assignee: Timothy Hunter
>
> The current MLlib's user guide (and spark.ml's), especially the main page, 
> doesn't have a nice style. We could update it and re-organize the content to 
> make it easier to navigate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047342#comment-15047342
 ] 

Apache Spark commented on SPARK-8517:
-

User 'thunterdb' has created a pull request for this issue:
https://github.com/apache/spark/pull/10207




[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-12-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047270#comment-15047270
 ] 

Joseph K. Bradley commented on SPARK-8517:
--

+1 on copying content




[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-12-02 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036718#comment-15036718
 ] 

Xiangrui Meng commented on SPARK-8517:
--

[~timhunter] I agree with most of your points. I'd recommend the following 
steps:

1. Reorganize the content of the spark.ml guide based on the goals, but do not 
introduce new content if possible.
2. Create JIRAs for the remaining tasks, which could be done in parallel, 
including:
  * spark.ml branding
  * move model selection / cross validation to tuning
  * enhance the guide for individual algorithms, etc.




[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-12-02 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036710#comment-15036710
 ] 

Xiangrui Meng commented on SPARK-8517:
--

* I'm not sure whether the mathematical formulation is helpful. It might be 
useful for explaining the parameters, but it seems unnecessary for us to 
explain how the models work. I'm okay with copying the content.
* We don't have linear SVM in spark.ml.
* +1 on moving perceptron classifier to classification.




[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-12-02 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036704#comment-15036704
 ] 

Xiangrui Meng commented on SPARK-8517:
--

* We should only mention MLlib specific types, like vectors and matrices. 
However, UDTs are not public and this doesn't seem to be a must to me.
* I think we can separate model selection from the basic concepts. But 
estimator/transformer/pipeline should get introduced together and the simple 
text classification pipeline is not very complicated to read.
* We didn't put a link because it is tricky to decide which branch/tag to use. 
The release process validates links in the user guide, so we dropped the link. 
See SPARK-11336 and its PR.
* As a workaround, I usually add a field called "id" to avoid "Tuple1.apply":

{code}
val data = Seq((0, -0.5), (1, -0.3), (2, 0.0), (3, 0.2))
val df = sqlContext.createDataFrame(data).toDF("id", "features")
{code}




[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-12-02 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036690#comment-15036690
 ] 

Xiangrui Meng commented on SPARK-8517:
--

* Agree that the focus of spark.ml should be not only on pipelines but also on 
DataFrames.
* +1 on reorganizing the spark.ml menu based on the goal.
* Users should be able to use individual algorithms under spark.ml. There are 
some missing features, but this is not part of this JIRA.




[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-11-25 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15027512#comment-15027512
 ] 

Timothy Hunter commented on SPARK-8517:
---

 - A couple of pages such as {{ml-ensembles}} and {{ml-linear-methods}} refer 
to MLlib "for details". It is unclear what the differences are. I suggest we 
either copy the content or clearly state that we refer to the mathematical 
formulation only: the parameters may have the same names (number of 
iterations, regularization parameter), but the API is different. By the way, I 
do not see SVM yet in the spark.ml documentation.
 - The perceptron classifier does not need to be a top-level section of 
spark.ml. I would make it a subsection of classification.




[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-11-25 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15027491#comment-15027491
 ] 

Timothy Hunter commented on SPARK-8517:
---

- We need a whole page on best practices for dataframes containing numerical 
data (vector UDTs). That was a big pain point for me. We have a whole page for 
spark.mllib and we should have something similar for dataframes.
- In `ml-guide`, I would split the high-level concepts (`fit`, `transform`, 
etc.) from chaining them together with a pipeline. From reading the current 
document, spark.ml seems harder to use than spark.mllib because it introduces 
complicated examples right at the start (model selection with 
cross-validation).
- Small nit: the links under each example should point to the GitHub file; 
right now they are not very useful. Do we have a ticket for that?
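
To make the split between the basic concepts and pipelines concrete, here is a minimal sketch in plain Scala. Everything in it is a toy stand-in invented for illustration (`ToyTransformer`, `ToyEstimator`, `MeanCenterer`, `UnitScaler`, `ToyPipeline`); none of it is the spark.ml API. It only mimics the shape of `fit` and `transform` on a `Seq[Double]`, and then shows that chaining is a separate, optional layer on top.

```scala
object FitTransformSketch {
  // Toy stand-ins, NOT the spark.ml API: a transformer maps data to data,
  // an estimator learns from data and returns a fitted transformer.
  trait ToyTransformer { def transform(data: Seq[Double]): Seq[Double] }
  trait ToyEstimator { def fit(data: Seq[Double]): ToyTransformer }

  // Hypothetical estimator: learns the mean; the fitted model subtracts it.
  object MeanCenterer extends ToyEstimator {
    def fit(data: Seq[Double]): ToyTransformer = {
      val mean = if (data.isEmpty) 0.0 else data.sum / data.size
      new ToyTransformer {
        def transform(d: Seq[Double]): Seq[Double] = d.map(x => x - mean)
      }
    }
  }

  // Hypothetical estimator: learns the largest absolute value; the model divides by it.
  object UnitScaler extends ToyEstimator {
    def fit(data: Seq[Double]): ToyTransformer = {
      val maxAbs = data.map(x => math.abs(x)).foldLeft(0.0)((a, b) => math.max(a, b))
      val scale = if (maxAbs == 0.0) 1.0 else maxAbs // avoid dividing by zero
      new ToyTransformer {
        def transform(d: Seq[Double]): Seq[Double] = d.map(x => x / scale)
      }
    }
  }

  // Chaining is a separate concept: each stage is fitted on the output of the
  // previous fitted stage, and the result is again a single transformer.
  final case class ToyPipeline(stages: Seq[ToyEstimator]) extends ToyEstimator {
    def fit(data: Seq[Double]): ToyTransformer = {
      val (_, models) = stages.foldLeft((data, Vector.empty[ToyTransformer])) {
        case ((current, fitted), stage) =>
          val model = stage.fit(current)
          (model.transform(current), fitted :+ model)
      }
      new ToyTransformer {
        def transform(d: Seq[Double]): Seq[Double] =
          models.foldLeft(d)((acc, m) => m.transform(acc))
      }
    }
  }
}
```

A single algorithm can be used on its own with `MeanCenterer.fit(data).transform(data)`; `ToyPipeline(Seq(MeanCenterer, UnitScaler)).fit(data)` only becomes relevant once several stages need to be combined.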


Building examples:
The current way to build a dead-simple dataframe is as follows. It is rather 
noisy compared to Python. I would recommend we move all the example 
code generation to a library, and thoroughly explain there what the dataframe 
contains (or make it part of the graph). For example:
{code}
val data = Array(-0.5, -0.3, 0.0, 0.2)
val dataFrame = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
{code}
This requires some understanding about tuple packing, the synthetic apply 
method, etc. Definitely more complicated than the python or RDD equivalent. I 
do not have a good solution right now, but I find this a bit unsettling when 
this is the first line I read in an example.
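
For readers unfamiliar with the idiom, the following sketch shows in plain Scala (no Spark required) what `data.map(Tuple1.apply)` produces. The working assumption is that `createDataFrame` expects a sequence of `Product` values such as tuples or case classes, which is why bare doubles have to be wrapped first. The object name `Tuple1Packing` and the `withId` variant are illustrative only.

```scala
object Tuple1Packing {
  // Raw values, as in the example above.
  val data: Array[Double] = Array(-0.5, -0.3, 0.0, 0.2)

  // Tuple1.apply is the compiler-generated factory method of the Tuple1 case
  // class; passing it to map wraps every value in a one-element tuple.
  val packed: Array[Tuple1[Double]] = data.map(Tuple1.apply)

  // The same packing written out with an explicit lambda.
  val packedExplicit: Array[Tuple1[Double]] = data.map(x => Tuple1(x))

  // Pairing each value with its index yields ordinary two-element tuples,
  // which sidesteps Tuple1 entirely (illustrative variant).
  val withId: Array[(Int, Double)] =
    data.zipWithIndex.map { case (x, i) => (i, x) }
}
```

Each element of `packed` exposes its value as `_1`, which is the single column that `toDF("features")` then names.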

Other examples are easier to read, I find:
{code}
val training = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")
{code}




[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-11-25 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15027375#comment-15027375
 ] 

Timothy Hunter commented on SPARK-8517:
---

Here are a few comments I have at a high level:
 - There is branding confusion about spark.mllib vs. spark.ml vs. the union of 
the two. It is a bit hard right now to see the difference when you navigate to 
the first page.
 - The focus of spark.ml is on pipelines. It should be on dataframes, which 
clearly separates it from spark.mllib, which is built on RDDs.
 - Make pipelines a sub-concept of spark.ml (instead of saying that spark.ml 
is pipelines). Say that you can build pipelines with spark.ml.
 - Make sure that all algorithms in spark.ml have the same level of usability 
as in mllib. You should not be forced to make a pipeline to use a single 
algorithm.
 - Reorganize the spark.ml menu around the goal and not around the content. 
Users want to solve problems (clustering, regression, classification), but we 
organize by theoretical concepts (decision trees, ensembles, linear methods). 
We should do as mllib and scikit-learn do:
{code}
- MLlib: machine learning on RDDs
...
- SparkML: machine learning with (Spark) Dataframes
  - General concepts and overview
  - Building and transforming features
  - Classification and Regression
  - Clustering
  - Collaborative filtering
  - Chaining transforms with pipelines
  - Advanced: Evaluation, import/export, developer APIs
  - Examples
{code}
Some pieces are missing from this outline, such as dimensionality reduction. 
Also, the scikit-learn guide has a more academic focus, splitting roughly into 
supervised vs. unsupervised.
I am going to drill down more into the sections for some suggestions.
