[ https://issues.apache.org/jira/browse/SPARK-11106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15424689#comment-15424689 ]
Jeff Levy commented on SPARK-11106:
-----------------------------------

The objections Max raises to implementing RFormula in pySpark strike me as trivial - not because the issues with its assumptions aren't real, but because users of other data analysis tools _are already entirely familiar with working this way._ As was suggested, keep the option of doing each step explicitly for users who need it, but leaving that as the _only_ way to do what generally takes one or two clean lines in every other comparable environment (Python statsmodels, R, SparkR, Stata, SAS) strikes me as a huge barrier to getting people to use pySpark. I was quite disappointed to see this wasn't already in the Spark 2.0 release, especially given ML's focus on DataFrames.

> Should ML Models contain single models or Pipelines?
> -----------------------------------------------------
>
>           Key: SPARK-11106
>           URL: https://issues.apache.org/jira/browse/SPARK-11106
>       Project: Spark
>    Issue Type: Sub-task
>    Components: ML
>      Reporter: Joseph K. Bradley
>      Priority: Critical
>
> This JIRA is for discussing whether ML Estimators should do feature processing.
>
> h2. Issue
>
> Currently, almost all ML Estimators require strict input types. E.g., DecisionTreeClassifier requires that the label column be of Double type and have metadata indicating the number of classes. This requires users to know how to preprocess their data.
>
> h2. Ideal workflow
>
> A user should be able to pass any reasonable data to a Transformer or Estimator and have it "do the right thing." E.g.:
> * If DecisionTreeClassifier is given a String column for labels, it should know to index the Strings.
> * See [SPARK-10513] for a similar issue with OneHotEncoder.
>
> h2. Possible solutions
>
> There are a few solutions I have thought of. Please comment with feedback or alternative ideas!
>
> h3. Leave as is
>
> Pro: The current setup forces the user to be very aware of what they are doing; feature transformations will not happen silently.
> Con: The user has to write boilerplate code for transformations, and the API is not what some users would expect; e.g., coming from R, a user might expect some automatic transformations.
>
> h3. All Transformers can contain PipelineModels
>
> We could allow all Transformers and Models to contain arbitrary PipelineModels. E.g., if a DecisionTreeClassifier were given a String label column, it might return a Model which contains a simple fitted PipelineModel chaining StringIndexer + DecisionTreeClassificationModel.
> The API could present this to the user, or it could be hidden. Ideally, it would be hidden from the beginner user but accessible to experts.
> The main problem is that we might have to break APIs. E.g., OneHotEncoder may need to do indexing if given a String input column. This means it should no longer be a Transformer; it should be an Estimator.
>
> h3. All Estimators should use RFormula
>
> The best option I have thought of is to make RFormula the primary method for automatic feature transformation. We could start adding an RFormula Param to all Estimators, and it could handle most of these feature transformation issues.
> We could maintain the old APIs:
> * If a user sets the input column names, those can be used in the traditional (no automatic transformation) way.
> * If a user sets the RFormula Param, it can be used instead. (This should probably take precedence over the old API.)