[ https://issues.apache.org/jira/browse/SPARK-11106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15424689#comment-15424689 ]

Jeff Levy commented on SPARK-11106:
-----------------------------------

The objections to implementing RFormula in pySpark that Max raises strike me as 
trivial - not because the issues with assumptions aren't real, but because 
users of other data analysis tools _are already entirely familiar with working 
this way._  As was suggested, the option of doing each step explicitly should 
stay in place for users who need it, but leaving it as the _only_ way to do 
what generally takes one or two clean lines in every other comparable 
environment (Python Statsmodels, R, SparkR, Stata, SAS) strikes me as a huge 
barrier to getting people to use pySpark.
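
To make the comparison concrete, here is the kind of one-or-two-line workflow 
I mean, using Python Statsmodels (the column names are made up for the 
example):

{code:python}
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0],
    "x1": [0.5, 1.5, 2.5, 3.5],
    "group": ["a", "b", "a", "b"],
})

# One line: the formula handles encoding the categorical column `group`.
model = smf.ols("y ~ x1 + C(group)", data=df).fit()
{code}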

I was quite disappointed to see this wasn't already in the Spark 2.0 release, 
especially given the ML package's focus on DataFrames.

> Should ML Models contain single models or Pipelines?
> -----------------------------------------------------
>
>                 Key: SPARK-11106
>                 URL: https://issues.apache.org/jira/browse/SPARK-11106
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Priority: Critical
>
> This JIRA is for discussing whether ML Estimators should do feature 
> processing.
> h2. Issue
> Currently, almost all ML Estimators require strict input types.  E.g., 
> DecisionTreeClassifier requires that the label column be Double type and have 
> metadata indicating the number of classes.
> This requires users to know how to preprocess data.
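> For concreteness, a minimal sketch of what that preprocessing looks like 
> today, assuming a DataFrame df with a String label column "label_str" and 
> numeric feature columns "f1", "f2" (all names illustrative):
> {code:python}
> from pyspark.ml.feature import StringIndexer, VectorAssembler
> from pyspark.ml.classification import DecisionTreeClassifier
>
> # Index the label to Double; this also attaches the class-count
> # metadata the classifier requires.
> indexer = StringIndexer(inputCol="label_str", outputCol="label")
> # Assemble the raw feature columns into a single Vector column.
> assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
>
> prepared = assembler.transform(indexer.fit(df).transform(df))
> model = DecisionTreeClassifier(labelCol="label").fit(prepared)
> {code}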
> h2. Ideal workflow
> A user should be able to pass any reasonable data to a Transformer or 
> Estimator and have it "do the right thing."
> E.g.:
> * If DecisionTreeClassifier is given a String column for labels, it should 
> know to index the Strings (sketched after this list).
> * See [SPARK-10513] for a similar issue with OneHotEncoder.
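> A hypothetical sketch of that ideal (this does not work today; the automatic 
> indexing is what would need to be added):
> {code:python}
> from pyspark.ml.classification import DecisionTreeClassifier
>
> # Hypothetical: labelCol points at a raw String column and the
> # estimator indexes it internally before fitting.
> model = DecisionTreeClassifier(labelCol="label_str").fit(df)
> {code}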
> h2. Possible solutions
> There are a few solutions I have thought of.  Please comment with feedback or 
> alternative ideas!
> h3. Leave as is
> Pro: The current setup is good in that it forces the user to be very aware of 
> what they are doing.  Feature transformations will not happen silently.
> Con: The user has to write boilerplate code for transformations.  The API is 
> not what some users would expect; e.g., coming from R, a user might expect 
> some automatic transformations.
> h3. All Transformers can contain PipelineModels
> We could allow all Transformers and Models to contain arbitrary 
> PipelineModels.  E.g., if a DecisionTreeClassifier were given a String label 
> column, it might return a Model which contains a simple fitted PipelineModel 
> containing StringIndexer + DecisionTreeClassificationModel.
> The API could present this to the user, or it could be hidden from the user.  
> Ideally, it would be hidden from the beginner user, but accessible for 
> experts.
> The main problem is that we might have to break APIs.  E.g., OneHotEncoder 
> may need to do indexing if given a String input column.  This means it should 
> no longer be a Transformer; it should be an Estimator.
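> For reference, this wrapping is what a user builds by hand today; the 
> proposal amounts to the estimator constructing the equivalent internally. A 
> sketch, assuming df already has a "features" Vector column and a String 
> label column "label_str":
> {code:python}
> from pyspark.ml import Pipeline
> from pyspark.ml.feature import StringIndexer
> from pyspark.ml.classification import DecisionTreeClassifier
>
> # The fitted PipelineModel contains a StringIndexerModel followed by
> # a DecisionTreeClassificationModel.
> pipeline = Pipeline(stages=[
>     StringIndexer(inputCol="label_str", outputCol="label"),
>     DecisionTreeClassifier(labelCol="label", featuresCol="features"),
> ])
> pipeline_model = pipeline.fit(df)
> {code}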
> h3. All Estimators should use RFormula
> The best option I have thought of is to make RFormula the primary method 
> for automatic feature transformation.  We could start adding an RFormula 
> Param to all Estimators, and it could handle most of these feature 
> transformation issues (see the sketch after the list below).
> We could maintain old APIs:
> * If a user sets the input column names, then those can be used in the 
> traditional (no automatic transformation) way.
> * If a user sets the RFormula Param, then it can be used instead.  (This 
> should probably take precedence over the old API.)
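> For reference, a sketch of RFormula as it exists today as a standalone 
> Estimator; the per-Estimator Param described above is hypothetical (column 
> names illustrative):
> {code:python}
> from pyspark.ml.feature import RFormula
> from pyspark.ml.classification import DecisionTreeClassifier
>
> # RFormula indexes the String label and encodes/assembles the features
> # into a single Vector column in one step.
> rf = RFormula(formula="label_str ~ f1 + f2")
> prepared = rf.fit(df).transform(df)  # adds "features" and "label"
> model = DecisionTreeClassifier().fit(prepared)
>
> # The Param-based API proposed above might look like (hypothetical):
> # model = DecisionTreeClassifier(formula="label_str ~ f1 + f2").fit(df)
> {code}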


