[ 
https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16516861#comment-16516861
 ] 

Nick Pentreath commented on SPARK-24467:
----------------------------------------

One option is to do that same as we did for one hot encoder: we could create a 
new Estimator/Model pair, and deprecate the old one, for 2.4.0. Then for 3.0, 
we could remove the old one.

> VectorAssemblerEstimator
> ------------------------
>
>                 Key: SPARK-24467
>                 URL: https://issues.apache.org/jira/browse/SPARK-24467
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 2.4.0
>            Reporter: Joseph K. Bradley
>            Priority: Major
>
> In [SPARK-22346], I believe I made a wrong API decision: I recommended added 
> `VectorSizeHint` instead of making `VectorAssembler` into an Estimator since 
> I thought the latter option would break most workflows.  However, I should 
> have proposed:
> * Add a Param to VectorAssembler for specifying the sizes of Vectors in the 
> inputCols.  This Param can be optional.  If not given, then VectorAssembler 
> will behave as it does now.  If given, then VectorAssembler can use that info 
> instead of figuring out the Vector sizes via metadata or examining Rows in 
> the data (though it could do consistency checks).
> * Add a VectorAssemblerEstimator which gets the Vector lengths from data and 
> produces a VectorAssembler with the vector lengths Param specified.
> This will not break existing workflows.  Migrating to 
> VectorAssemblerEstimator will be easier than adding VectorSizeHint since it 
> will not require users to manually input Vector lengths.
> Note: Even with this Estimator, VectorSizeHint might prove useful for other 
> things in the future which require vector length metadata, so we could 
> consider keeping it rather than deprecating it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to