[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16275049#comment-16275049 ]
Joseph K. Bradley commented on SPARK-8418:
------------------------------------------

I just glanced through the various PRs adding multi-column support and wanted to get consensus about a few items to make sure we have consistent APIs. CC [~mlnick], [~yuhaoyan], [~yanboliang], [~WeichenXu123], [~huaxing], [~viirya] Let me know what you think!

*1. When both inputCol and inputCols are specified, what should we do?*
* [SPARK-20542]: Bucketizer: logWarning
* [SPARK-13030]: OneHotEncoder: n/a (no single-column support)
* [SPARK-11215]: StringIndexer: throw exception
* [SPARK-22397]: QuantileDiscretizer: logWarning
* my vote: throw exception (safer since it's easier for users to recognize their error)

*2. Should we have single- and multi-column support or just multi-column? E.g., should we have (a) inputCol and inputCols or (b) only inputCols?*
Currently, [SPARK-13030] only has multi-column support for the new OneHotEncoderEstimator. The other PRs have both single- and multi-column support since they are modifying existing APIs.
*Q*: Should we add single-column to OneHotEncoderEstimator for consistency or not bother? I'm ambivalent.

*3. Backwards compatibility for ML persistence*
We'll have to be aware of whether we're breaking compatibility. I don't see problems in most PRs but have not tested it manually. The only PR with an issue is [SPARK-13030] for OneHotEncoder; however, that's pretty reasonable to break compatibility for persistence there.

*4. Python APIs*
I don't see follow-ups for Python APIs yet. Are those planned for 2.3?

> Add single- and multi-value support to ML Transformers
> ------------------------------------------------------
>
>                 Key: SPARK-8418
>                 URL: https://issues.apache.org/jira/browse/SPARK-8418
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication
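
For reference, the "throw exception" option from item 1 could look roughly like the sketch below. It is plain Python with no Spark dependency, and it is not taken from any of the PRs listed above: the helper name resolve_in_out_cols, its signature, and the error messages are all hypothetical, chosen only to illustrate rejecting a configuration that sets both the single-column and the multi-column params.

```python
from typing import List, Optional, Sequence, Tuple


def resolve_in_out_cols(
    input_col: Optional[str] = None,
    output_col: Optional[str] = None,
    input_cols: Optional[Sequence[str]] = None,
    output_cols: Optional[Sequence[str]] = None,
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: normalise single- and multi-column params to lists.

    Raises ValueError when both styles are mixed, instead of logging a
    warning and silently preferring one of them.
    """
    single_set = input_col is not None or output_col is not None
    multi_set = input_cols is not None or output_cols is not None

    # Item 1: both inputCol and inputCols specified -> fail fast.
    if single_set and multi_set:
        raise ValueError(
            "Set either inputCol/outputCol or inputCols/outputCols, not both."
        )

    if multi_set:
        if input_cols is None or output_cols is None:
            raise ValueError("inputCols and outputCols must be set together.")
        if len(input_cols) != len(output_cols):
            raise ValueError("inputCols and outputCols must have equal length.")
        return list(input_cols), list(output_cols)

    if input_col is None or output_col is None:
        raise ValueError("inputCol and outputCol must be set together.")
    # Treat the single-column case as a one-element multi-column case.
    return [input_col], [output_col]
```

Normalising the single-column case to one-element lists means the transform logic only needs a multi-column code path, which is one way to approach the "code sharing to reduce duplication" point in the issue description. Whether the real transformers should share code this way is a design decision for the PRs themselves.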