Joseph K. Bradley created SPARK-24467:
-----------------------------------------

             Summary: VectorAssemblerEstimator
                 Key: SPARK-24467
                 URL: https://issues.apache.org/jira/browse/SPARK-24467
             Project: Spark
          Issue Type: New Feature
          Components: ML
    Affects Versions: 2.4.0
            Reporter: Joseph K. Bradley


In [SPARK-22346], I believe I made a wrong API decision: I recommended adding 
`VectorSizeHint` instead of making `VectorAssembler` into an Estimator, since I 
thought the latter option would break most workflows.  However, I should have 
proposed:
* Add a Param to VectorAssembler for specifying the sizes of Vectors in the 
inputCols.  This Param can be optional.  If not given, then VectorAssembler 
will behave as it does now.  If given, then VectorAssembler can use that info 
instead of figuring out the Vector sizes via metadata or examining Rows in the 
data (though it could do consistency checks).
* Add a VectorAssemblerEstimator which gets the Vector lengths from data and 
produces a VectorAssembler with the vector lengths Param specified.
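
The two pieces above could fit together roughly as in the following sketch. It is plain Python rather than Spark code, and it only illustrates the proposed split of responsibilities: the class names `SketchVectorAssembler` and `SketchVectorAssemblerEstimator`, the `vector_sizes` argument, and the choice to read sizes from the first row are all assumptions made for illustration, not an existing or agreed-upon API.

```python
from typing import Dict, List, Optional, Sequence, Union

Value = Union[float, int, Sequence[float]]
Row = Dict[str, Value]  # column name -> scalar or vector (list of floats)


def _as_values(value: Value) -> List[float]:
    """Treat a scalar as a length-1 vector; copy a vector as floats."""
    if isinstance(value, (list, tuple)):
        return [float(v) for v in value]
    return [float(value)]


class SketchVectorAssembler:
    """Toy stand-in for VectorAssembler with an optional sizes Param (hypothetical)."""

    def __init__(self, input_cols: Sequence[str],
                 vector_sizes: Optional[Dict[str, int]] = None):
        self.input_cols = list(input_cols)
        # None means "behave as today": sizes are only discovered from the data.
        self.vector_sizes = dict(vector_sizes) if vector_sizes is not None else None

    def output_size(self) -> Optional[int]:
        """Output length is known without touching data only if sizes were given."""
        if self.vector_sizes is None:
            return None
        return sum(self.vector_sizes[c] for c in self.input_cols)

    def transform(self, rows: Sequence[Row]) -> List[List[float]]:
        assembled_rows = []
        for row in rows:
            assembled: List[float] = []
            for col in self.input_cols:
                values = _as_values(row[col])
                # Optional consistency check against the declared sizes.
                if self.vector_sizes is not None and len(values) != self.vector_sizes[col]:
                    raise ValueError(
                        f"column {col!r}: expected size {self.vector_sizes[col]}, "
                        f"got {len(values)}")
                assembled.extend(values)
            assembled_rows.append(assembled)
        return assembled_rows


class SketchVectorAssemblerEstimator:
    """Toy Estimator: learns vector sizes from data, returns a configured assembler."""

    def __init__(self, input_cols: Sequence[str]):
        self.input_cols = list(input_cols)

    def fit(self, rows: Sequence[Row]) -> SketchVectorAssembler:
        if not rows:
            raise ValueError("cannot infer vector sizes from an empty dataset")
        # Assumption: the first row is representative of every row.
        first = rows[0]
        sizes = {c: len(_as_values(first[c])) for c in self.input_cols}
        return SketchVectorAssembler(self.input_cols, sizes)
```

In this sketch, a user who constructs the assembler directly without sizes gets the current behavior, while a user who calls `fit` gets back an assembler whose output length is known up front, without typing any sizes by hand.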

This will not break existing workflows.  Migrating to VectorAssemblerEstimator 
will be easier than adding VectorSizeHint since it will not require users to 
manually input Vector lengths.

Note: Even with this Estimator, VectorSizeHint might prove useful for other 
things in the future which require vector length metadata, so we could consider 
keeping it rather than deprecating it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

