[ https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16516861#comment-16516861 ]
Nick Pentreath commented on SPARK-24467: ---------------------------------------- One option is to do that same as we did for one hot encoder: we could create a new Estimator/Model pair, and deprecate the old one, for 2.4.0. Then for 3.0, we could remove the old one. > VectorAssemblerEstimator > ------------------------ > > Key: SPARK-24467 > URL: https://issues.apache.org/jira/browse/SPARK-24467 > Project: Spark > Issue Type: New Feature > Components: ML > Affects Versions: 2.4.0 > Reporter: Joseph K. Bradley > Priority: Major > > In [SPARK-22346], I believe I made a wrong API decision: I recommended added > `VectorSizeHint` instead of making `VectorAssembler` into an Estimator since > I thought the latter option would break most workflows. However, I should > have proposed: > * Add a Param to VectorAssembler for specifying the sizes of Vectors in the > inputCols. This Param can be optional. If not given, then VectorAssembler > will behave as it does now. If given, then VectorAssembler can use that info > instead of figuring out the Vector sizes via metadata or examining Rows in > the data (though it could do consistency checks). > * Add a VectorAssemblerEstimator which gets the Vector lengths from data and > produces a VectorAssembler with the vector lengths Param specified. > This will not break existing workflows. Migrating to > VectorAssemblerEstimator will be easier than adding VectorSizeHint since it > will not require users to manually input Vector lengths. > Note: Even with this Estimator, VectorSizeHint might prove useful for other > things in the future which require vector length metadata, so we could > consider keeping it rather than deprecating it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org