[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133760#comment-14133760 ]

Sean Owen commented on SPARK-3530:
----------------------------------

A few high-level questions:

Is this a rewrite of MLlib? I see the old code will be deprecated. I assume the 
algorithms will come along, but in a fairly different form. I think that's 
actually a good thing. But is this targeted at a 2.x release, or sooner?

How does this relate to MLI and MLbase? I had thought they would in theory 
handle things like grid search, but I haven't seen activity or mention of them 
in a while. Is this at all a merge of the two, or is MLlib going to take over 
these concerns?

I don't think you will need or want to use this code, but the oryx project 
already has an implementation of grid search on Spark. It's at least another 
take on the API for such a thing to consider: 
https://github.com/OryxProject/oryx/tree/master/oryx-ml/src/main/java/com/cloudera/oryx/ml/param

Big +1 for parameter tuning; that belongs as a first-class citizen. I'm also 
intrigued by doing better than training every possible combination of 
parameters separately, maybe by sharing partial results to speed up several 
models' training. Is this realistic for any parameter besides things like the 
number of iterations, which isn't really a hyperparameter? For example, I don't 
know of a way to build N models with N different overfitting parameters (e.g., 
regularization) while sharing some of the work. I would love to learn that it's 
possible. Good to design for it anyway.
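
To make the iterations case concrete, here is a rough sketch of the kind of 
sharing I mean. Everything in it (Model, trainUpTo, the warm-start behavior) 
is hypothetical, not from the design doc:

{code}
// Hypothetical sketch: a learner whose training can resume from a previous
// model (warm start). With such an API, fitting models at iteration counts
// 10, 50, and 100 costs roughly 100 iterations total instead of 160.
case class Model(weights: Array[Double], itersDone: Int)

// Stand-in for one real optimizer step; the details don't matter here.
def step(m: Model): Model = m.copy(itersDone = m.itersDone + 1)

// Continue iterating from `init` until the target iteration count is reached.
def trainUpTo(totalIters: Int, init: Model): Model =
  Iterator.iterate(init)(step).dropWhile(_.itersDone < totalIters).next()

val start = Model(Array.fill(3)(0.0), 0)
val grid = Seq(10, 50, 100)
// Each model continues from the previous one, sharing all earlier iterations.
val models = grid.scanLeft(start)((m, iters) => trainUpTo(iters, m)).tail
{code}

I don't see how to get the same effect for something like a regularization 
parameter, which is exactly my question above.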

I see mention of a Dataset abstraction, which I'm assuming contains some type 
information, like distinguishing categorical and numeric features. I think 
that's very good!
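
For what it's worth, the kind of per-column type information I'm hoping for is 
something like the following. This is a sketch of the idea, not the proposed 
API, and all the names are made up:

{code}
// A sketch of per-column type information, so that algorithms can tell
// categorical features from numeric ones without extra user hints.
sealed trait FeatureType
case object Numeric extends FeatureType
case class Categorical(numCategories: Int) extends FeatureType

case class Field(name: String, featureType: FeatureType)
case class Schema(fields: Seq[Field])

val schema = Schema(Seq(
  Field("age", Numeric),
  Field("country", Categorical(numCategories = 195)),
  Field("income", Numeric)
))

// A tree learner, say, could consult the schema to choose split types.
val categoricalFields =
  schema.fields.filter(_.featureType.isInstanceOf[Categorical])
{code}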

I've always found the 'pipeline' part hard to build. It's tempting to construct 
a framework for feature extraction, and to some degree you can, by providing 
transformations, 1-hot encoding, etc. (see the sketch after this paragraph). 
But a framework for understanding arbitrary databases, fields, and so on 
quickly becomes endlessly large in scope. To me, Spark Core is already the 
right abstraction for upstream ETL of data before it enters an ML framework. I 
mention this only because it appears in the first picture; I don't see later 
discussion of actually doing user/product attribute selection, so maybe it's 
not meant to be part of the proposal.
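
By 'providing transformations' I mean an interface about this small, rather 
than a full data-understanding framework. Again a hypothetical sketch, with 
made-up names:

{code}
// Hypothetical minimal transformation interface: enough to compose feature
// extraction steps such as 1-hot encoding, without trying to model
// arbitrary upstream databases.
trait Transformer {
  def transform(row: Array[Double]): Array[Double]
}

// 1-hot encode column `col`, whose values are category indices 0..k-1.
class OneHotEncoder(col: Int, k: Int) extends Transformer {
  def transform(row: Array[Double]): Array[Double] = {
    val encoded = Array.fill(k)(0.0)
    encoded(row(col).toInt) = 1.0
    // Replace the original column with its k-wide encoding.
    row.patch(col, encoded, 1)
  }
}

val encode = new OneHotEncoder(col = 1, k = 3)
encode.transform(Array(0.5, 2.0, 7.1)) // -> Array(0.5, 0.0, 0.0, 1.0, 7.1)
{code}

Anything beyond composing steps like these belongs upstream, in plain Spark.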

I'd certainly like to keep up with your work here. This is a big step forward 
in making MLlib more relevant to production deployments, rather than just a 
collection of pure algorithm implementations.

> Pipeline and Parameters
> -----------------------
>
>                 Key: SPARK-3530
>                 URL: https://issues.apache.org/jira/browse/SPARK-3530
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Critical
>
> This part of the design doc is for pipelines and parameters. I put the design 
> doc at
> https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
> I will copy the proposed interfaces to this JIRA later. Some sample code can 
> be viewed at: https://github.com/mengxr/spark-ml/
> Please help review the design and post your comments here. Thanks!


