[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047905#comment-15047905 ]

Joseph K. Bradley commented on SPARK-8517:

I merged this PR, but am leaving the JIRA open since it has remaining subtasks.

> Improve the organization and style of MLlib's user guide
>
> Key: SPARK-8517
> URL: https://issues.apache.org/jira/browse/SPARK-8517
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, ML, MLlib
> Reporter: Xiangrui Meng
> Assignee: Timothy Hunter
>
> The current MLlib's user guide (and spark.ml's), especially the main page, doesn't have a nice style. We could update it and re-organize the content to make it easier to navigate.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047342#comment-15047342 ]

Apache Spark commented on SPARK-8517:

User 'thunterdb' has created a pull request for this issue: https://github.com/apache/spark/pull/10207
[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047270#comment-15047270 ]

Joseph K. Bradley commented on SPARK-8517:

+1 on copying content
[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036718#comment-15036718 ]

Xiangrui Meng commented on SPARK-8517:

[~timhunter] I agree with most of your points. I'd recommend the following steps:

1. Reorganize the content of the spark.ml guide based on the goals, but do not introduce new content if possible.
2. Create JIRAs for the remaining tasks, which could be done in parallel, including:
* spark.ml branding
* move model selection / cross validation to tuning
* enhance guide for individual algorithms, etc.
[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036710#comment-15036710 ]

Xiangrui Meng commented on SPARK-8517:

* I'm not sure whether the mathematical formulation is helpful or not. It might be useful for explaining the parameters, but it seems unnecessary for us to explain how the models work. I'm okay with copying the content.
* We don't have linear SVM in spark.ml.
* +1 on moving perceptron classifier to classification.
[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036704#comment-15036704 ]

Xiangrui Meng commented on SPARK-8517:

* We should only mention MLlib-specific types, like vectors and matrices. However, UDTs are not public, and this doesn't seem to be a must to me.
* I think we can separate model selection from the basic concepts. But estimator/transformer/pipeline should be introduced together, and the simple text classification pipeline is not very complicated to read.
* We didn't put a link because it is tricky to decide which branch/tag to use. The release process validates links on the user guide. So we dropped the link. See SPARK-11336 and its PR.
* As a workaround, I usually add a field called "id" to avoid "Tuple1.apply":
{code}
val data = Seq((0, -0.5), (1, -0.3), (2, 0.0), (3, 0.2))
val df = sqlContext.createDataFrame(data).toDF("id", "features")
{code}
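When the values start out as a plain array, the "id" field in this workaround does not have to be typed by hand. The following is a minimal sketch in plain Scala (no Spark required), assuming the id is simply each element's position. The names `IdColumnSketch`, `values` and `rows` are illustrative and do not come from the guide. The resulting pairs have the same shape as the `data` sequence in the workaround, so they could presumably be passed to `createDataFrame(...).toDF("id", "features")` in the same way.

```scala
// Illustrative sketch: derive an "id" column from element positions.
object IdColumnSketch {
  // Hypothetical input: bare feature values without ids.
  val values: Array[Double] = Array(-0.5, -0.3, 0.0, 0.2)

  // zipWithIndex yields (value, index) pairs; swap puts the id first.
  val rows: Seq[(Int, Double)] = values.zipWithIndex.map(_.swap).toSeq

  def main(args: Array[String]): Unit =
    rows.foreach(println)
}
```

Because each row is now a two-element tuple, no `Tuple1` wrapper is needed.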
[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036690#comment-15036690 ]

Xiangrui Meng commented on SPARK-8517:

* Agree that the focus of spark.ml should be not only pipelines but also DataFrames.
* +1 on reorganizing the spark.ml menu based on the goal.
* Users should be able to use individual algorithms under spark.ml. There are some missing features, but this is not part of this JIRA.
[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15027512#comment-15027512 ]

Timothy Hunter commented on SPARK-8517:

- A couple of pages such as {{ml-ensembles}} and {{ml-linear-methods}} refer to MLlib "for details". It is unclear what the differences are. I suggest we either copy the content or clearly state that we refer to the mathematical formulation only, and that the parameters may have the same names (num iterations, regularization parameter) but the API is different. By the way, I do not see SVM in the spark.ml documentation yet.
- The perceptron classifier does not need to be a top-level section of spark.ml. I would make it a subsection of classification.
[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15027491#comment-15027491 ]

Timothy Hunter commented on SPARK-8517:

- We need a whole page about best practices for DataFrames containing numerical data (vector UDTs). That was a big pain point for me. We have a whole page on spark.mllib and we should have something similar for DataFrames.
- In `ml-guide`, I would split the high-level concepts (`fit`, `transform`, etc.) from chaining them together with a pipeline. From reading the current document, spark.ml seems harder to use than spark.mllib because it introduces complicated examples right at the start (model selection with cross-validation).
- Small nit: the links under each example should link to the GitHub file; right now they are not very useful. Do we have a ticket for that?

Building examples: The current way to build a dead-simple DataFrame is as follows. It is rather noisy compared to Python. I would recommend we move all the example code generation to a library, and thoroughly explain there what the DataFrame contains (or make it part of the graph). For example:
{code}
val data = Array(-0.5, -0.3, 0.0, 0.2)
val dataFrame = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
{code}
This requires some understanding of tuple packing, the synthetic apply method, etc. It is definitely more complicated than the Python or RDD equivalent. I do not have a good solution right now, but I find this a bit unsettling when it is the first line I read in an example.
Other examples are easier to read, I find:
{code}
val training = sqlContext.createDataFrame(Seq((1.0, Vectors.dense(0.0, 1.2, -0.5)))).toDF("label", "features")
{code}
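For readers puzzled by the `Tuple1.apply` step discussed in this comment: `createDataFrame` on a local sequence expects elements that are `Product` values (tuples or case classes), and a bare `Double` is not one. The sketch below, in plain Scala without Spark, shows only what the wrapping step produces. The object name `Tuple1Sketch` is illustrative and not part of any guide.

```scala
// Illustrative sketch: what data.map(Tuple1.apply) does to a plain array.
object Tuple1Sketch {
  val data: Array[Double] = Array(-0.5, -0.3, 0.0, 0.2)

  // Tuple1.apply is the factory method on the Tuple1 companion object;
  // this line is equivalent to data.map(x => Tuple1(x)).
  val wrapped: Array[Tuple1[Double]] = data.map(Tuple1.apply)

  def main(args: Array[String]): Unit =
    // Each element is a one-field tuple; _1 retrieves the original value.
    wrapped.foreach(row => println(row._1))
}
```

Each wrapped element is a single-field row, which is why the resulting DataFrame in the example has exactly one column, named by `toDF("features")`.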
[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15027375#comment-15027375 ]

Timothy Hunter commented on SPARK-8517:

Here are a few comments I have at a high level:
- There is branding confusion about spark.mllib vs spark.ml vs the union of the two. Right now it is a bit hard to see the difference when you navigate to the first page.
- The focus of spark.ml is on pipelines. It should be on DataFrames. That clearly separates it from spark.mllib, which is built on RDDs.
- Make pipelines a sub-concept of spark.ml (instead of saying that spark.ml is pipelines). Say that you can build pipelines with spark.ml.
- Make sure that all algorithms in spark.ml have the same level of usability as in mllib. You should not be forced to make a pipeline to use a single algorithm.
- Reorganize the spark.ml menu around the goal and not around the content. Users want to solve problems (clustering, regression, classification), but we organize by theoretical concepts (decision trees, ensembles, linear methods). We should do as mllib and scikit-learn do:
{code}
- MLlib: machine learning on RDDs
  ...
- SparkML: machine learning with (Spark) Dataframes
  - General concepts and overview
  - Building and transforming features
  - Classification and Regression
  - Clustering
  - Collaborative filtering
  - Chaining transforms with pipelines
  - Advanced: Evaluation, import/export, developer APIs
  - Examples
{code}
Some pieces are missing from this, such as dimensionality reduction. Also, the scikit-learn guide has a more academic focus, splitting roughly at supervised vs unsupervised.

I am going to drill down more into the sections for some suggestions.