[ 
https://issues.apache.org/jira/browse/SPARK-6915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6915:
-------------------------------------
    Description: 
This covers several improvements to VectorIndexer.  They could be handled 
separately or in one PR.

*Preserving metadata*

Currently, VectorIndexer preserves non-ML metadata, unlike StringIndexer.  We 
should change it so that it does not maintain non-ML metadata.

Currently, it does not preserve ML-specific input metadata in the output 
column.  If a feature is already marked as categorical or continuous, we should 
preserve that metadata (rather than recomputing it).  We should also check that 
the input data is valid for that metadata.
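
As a rough illustration of the "preserve rather than recompute" idea, here is 
a minimal plain-Python sketch.  It is not Spark code: FeatureMeta and 
resolve_feature are hypothetical names invented for this example, and the 
fallback rule (treat a feature as categorical when it has at most 
max_categories distinct values) is only meant to mirror the general idea 
behind VectorIndexer's maxCategories parameter.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class FeatureMeta:
    """Hypothetical stand-in for per-feature ML metadata."""
    kind: str                 # "categorical" or "continuous"
    num_categories: int = 0   # only meaningful when kind == "categorical"


def resolve_feature(values: Sequence[float],
                    declared: Optional[FeatureMeta],
                    max_categories: int) -> FeatureMeta:
    """Keep declared metadata (after validating the data); otherwise infer it."""
    if declared is not None:
        if declared.kind == "categorical":
            # Declared categorical: values must be integer indices in range.
            for v in values:
                if not float(v).is_integer() or not 0 <= v < declared.num_categories:
                    raise ValueError(f"value {v} is invalid for {declared}")
        # Declared metadata is preserved as-is, not recomputed.
        return declared
    # No metadata on the input: fall back to inferring from distinct values.
    distinct = set(values)
    if len(distinct) <= max_categories:
        return FeatureMeta("categorical", len(distinct))
    return FeatureMeta("continuous")
```

With this sketch, a feature declared as continuous stays continuous even if 
it happens to have few distinct values, and a feature declared as categorical 
with 3 categories is rejected if the data contains the value 3.0 or 0.5.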

*Allow unknown categories*

Add an option for allowing unknown categories, probably via a parameter like 
"allowUnknownCategories".
If true, unknown categories encountered during transform are assigned to an 
extra category index.
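
The intended behavior can be sketched in plain Python as follows.  This is 
only an illustration under stated assumptions, not the VectorIndexer 
implementation: fit_category_map and transform_feature are hypothetical 
helpers, and allow_unknown_categories simply mirrors the parameter name 
suggested above.

```python
from typing import Dict, Iterable, List


def fit_category_map(values: Iterable[float]) -> Dict[float, int]:
    """Map each distinct value seen while fitting to an index 0..k-1 (sorted order)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def transform_feature(values: Iterable[float],
                      category_map: Dict[float, int],
                      allow_unknown_categories: bool = False) -> List[int]:
    """Index values; unseen values get the extra index k, or raise if not allowed."""
    unknown_index = len(category_map)  # one past the last known category index
    indexed = []
    for v in values:
        if v in category_map:
            indexed.append(category_map[v])
        elif allow_unknown_categories:
            indexed.append(unknown_index)
        else:
            raise ValueError(f"unseen category value: {v}")
    return indexed


category_map = fit_category_map([0.0, 2.0, 5.0, 2.0])   # three known categories
indexed = transform_feature([5.0, 7.0, 0.0], category_map,
                            allow_unknown_categories=True)
```

Here 7.0 was not seen during fitting, so it receives index 3, one past the 
three known categories.  With the flag left at False, the same call raises 
an error instead.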

*Index particular features*

Add an option for limiting indexing to particular features.
This could be specified by a parameter, or we could handle it via the 
"Preserving metadata" task above, where users would mark features as 
continuous in order to have VectorIndexer ignore them.
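
A minimal plain-Python sketch of the first variant (an explicit list of 
features to index) might look like this.  The function names and the 
features_to_index argument are hypothetical and chosen only for illustration. 
Rows are plain lists of floats rather than Spark vectors.

```python
from typing import Dict, List, Optional, Sequence


def fit_selected(rows: Sequence[Sequence[float]],
                 max_categories: int,
                 features_to_index: Optional[Sequence[int]] = None
                 ) -> Dict[int, Dict[float, int]]:
    """Build category maps only for the selected features that look categorical."""
    num_features = len(rows[0])
    candidates = range(num_features) if features_to_index is None else features_to_index
    maps: Dict[int, Dict[float, int]] = {}
    for j in candidates:
        distinct = sorted({row[j] for row in rows})
        if len(distinct) <= max_categories:
            maps[j] = {v: i for i, v in enumerate(distinct)}
    return maps


def transform_row(row: Sequence[float],
                  maps: Dict[int, Dict[float, int]]) -> List[float]:
    """Replace indexed features by their category index; pass others through."""
    return [float(maps[j][v]) if j in maps else v for j, v in enumerate(row)]


rows = [[1.0, 10.0, 0.5], [3.0, 20.0, 0.7], [1.0, 10.0, 0.9]]
maps_all = fit_selected(rows, max_categories=2)
maps_first = fit_selected(rows, max_categories=2, features_to_index=[0])
```

In this example, features 0 and 1 each have two distinct values, so both are 
indexed by default.  When indexing is limited to feature 0, feature 1 is left 
untouched even though it would otherwise qualify as categorical.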

*Performance optimizations*

See the TODO items within VectorIndexer.scala


  was:
This covers several improvements to VectorIndexer.  They could be handled 
separately or in 1 PR.

*Preserve metadata*

Currently, it does not preserve ML-specific input metadata in the output 
column.  If a feature is already marked as categorical or continuous, we should 
preserve that metadata (rather than recomputing it).  We should also check that 
the input data is valid for that metadata.

*Allow unknown categories*

Add option for allowing unknown categories, probably via a parameter like 
"allowUnknownCategories."
If true, then handle unknown categories during transform by assigning them to 
an extra category index.

*Index particular features*

Add option for limiting indexing to particular features.
This could be specified by an option, or we could handle it via the "Preserve 
metadata" task above, where users would denote features as continuous in order 
to have VectorIndexer ignore them.

*Performance optimizations*

See the TODO items within VectorIndexer.scala



> VectorIndexer improvements
> --------------------------
>
>                 Key: SPARK-6915
>                 URL: https://issues.apache.org/jira/browse/SPARK-6915
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.4.0
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> This covers several improvements to VectorIndexer.  They could be handled 
> separately or in one PR.
>
> *Preserving metadata*
> Currently, VectorIndexer preserves non-ML metadata, unlike StringIndexer.  
> We should change it so that it does not maintain non-ML metadata.
> Currently, it does not preserve ML-specific input metadata in the output 
> column.  If a feature is already marked as categorical or continuous, we 
> should preserve that metadata (rather than recomputing it).  We should also 
> check that the input data is valid for that metadata.
>
> *Allow unknown categories*
> Add an option for allowing unknown categories, probably via a parameter like 
> "allowUnknownCategories".
> If true, unknown categories encountered during transform are assigned to an 
> extra category index.
>
> *Index particular features*
> Add an option for limiting indexing to particular features.
> This could be specified by a parameter, or we could handle it via the 
> "Preserving metadata" task above, where users would mark features as 
> continuous in order to have VectorIndexer ignore them.
>
> *Performance optimizations*
> See the TODO items within VectorIndexer.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

