[ 
https://issues.apache.org/jira/browse/SPARK-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2272:
---------------------------------

    Assignee: DB Tsai

> Feature scaling which standardizes the range of independent variables or 
> features of data.
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-2272
>                 URL: https://issues.apache.org/jira/browse/SPARK-2272
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: DB Tsai
>            Assignee: DB Tsai
>
> Feature scaling is a method used to standardize the range of independent 
> variables or features of data. In data processing, it is also known as data 
> normalization and is generally performed during the data preprocessing step.
> In this work, a trait called `VectorTransformer` is defined for generic 
> transformation of a vector. It contains two methods: `apply`, which applies 
> the transformation to a vector, and `unapply`, which applies the inverse 
> transformation to a vector.
> There are three concrete implementations of `VectorTransformer`, and they all 
> can be easily extended with PMML transformation support. 
> 1) `VectorStandardizer` - Standardizes a vector given the mean and variance. 
> Since subtracting the mean turns zero entries into nonzero ones, the output 
> is always in dense vector format.
> 2) `VectorRescaler` - Rescales a vector into a target range specified either 
> by a tuple of two double values or by two vectors giving the new target 
> minimum and maximum. Since the rescaling first subtracts the minimum of each 
> column, the output is always a dense vector regardless of the input vector 
> type.
> 3) `VectorDivider` - Transforms a vector by dividing it by a constant or by 
> another vector, element by element. This transformation preserves the type 
> of the input vector without densifying the result.
> Utility helper methods take an `RDD[Vector]` as input and return both the 
> transformed `RDD[Vector]` and the transformer. Helpers are provided for 
> dividing, rescaling, normalization, and standardization.
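
The transformer contract and the "return both the transformed data and the
transformer" helper described above might look roughly like the sketch below.
It is plain Python, not the Spark code from this issue: the class and method
names mirror the description, but the `standardize` helper, the use of lists
in place of `RDD[Vector]`, the population variance (divide by n), and the
handling of zero-variance columns are all assumptions made for illustration.

```python
import math
from abc import ABC, abstractmethod
from typing import List, Sequence, Tuple


class VectorTransformer(ABC):
    """Hypothetical Python stand-in for the trait described in the issue."""

    @abstractmethod
    def apply(self, vector: Sequence[float]) -> List[float]:
        """Apply the transformation to a vector."""

    @abstractmethod
    def unapply(self, vector: Sequence[float]) -> List[float]:
        """Apply the inverse transformation to a vector."""


class VectorStandardizer(VectorTransformer):
    """Standardizes each component as (x - mean) / sqrt(variance)."""

    def __init__(self, mean: Sequence[float], variance: Sequence[float]):
        if len(mean) != len(variance):
            raise ValueError("mean and variance must have the same length")
        self.mean = list(mean)
        # Assumption: a zero-variance column is left unscaled (std treated as 1).
        self.std = [math.sqrt(v) if v > 0 else 1.0 for v in variance]

    def apply(self, vector: Sequence[float]) -> List[float]:
        return [(x - m) / s for x, m, s in zip(vector, self.mean, self.std)]

    def unapply(self, vector: Sequence[float]) -> List[float]:
        return [z * s + m for z, m, s in zip(vector, self.mean, self.std)]


def standardize(
    vectors: Sequence[Sequence[float]],
) -> Tuple[List[List[float]], VectorStandardizer]:
    """Hypothetical helper: fit on the data, return (transformed, transformer)."""
    n = len(vectors)
    dims = len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(dims)]
    # Assumption: population variance (divide by n); the issue does not say which.
    variance = [
        sum((v[j] - mean[j]) ** 2 for v in vectors) / n for j in range(dims)
    ]
    transformer = VectorStandardizer(mean, variance)
    return [transformer.apply(v) for v in vectors], transformer


data = [[1.0, 0.0], [3.0, 4.0]]
transformed, transformer = standardize(data)
restored = [transformer.unapply(v) for v in transformed]
```

Returning the fitted transformer alongside the data lets the same mean and
variance be reused on new vectors, and `unapply` maps results back to the
original scale.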

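The difference in sparsity between dividing and rescaling can be seen in a
small sketch. Again this is plain Python and not the code from this issue:
the dictionary-based sparse vector and the function names `divide_sparse` and
`rescale_to_dense` are hypothetical, and mapping a constant column to the new
minimum is an assumption.

```python
from typing import Dict, List, Sequence

# Hypothetical sparse representation: index -> nonzero value.
# Indices that are absent are implicitly 0.0.
SparseVector = Dict[int, float]


def divide_sparse(vector: SparseVector, divisor: Sequence[float]) -> SparseVector:
    """Element-wise division; 0 / d stays 0, so no new entries are created."""
    return {i: x / divisor[i] for i, x in vector.items()}


def rescale_to_dense(
    vector: SparseVector,
    col_min: Sequence[float],
    col_max: Sequence[float],
    new_min: float = 0.0,
    new_max: float = 1.0,
) -> List[float]:
    """Min-max rescaling; subtracting col_min moves implicit zeros off zero."""
    out = []
    for i, (lo, hi) in enumerate(zip(col_min, col_max)):
        x = vector.get(i, 0.0)
        span = hi - lo
        # Assumption: a constant column (span == 0) maps to new_min.
        frac = (x - lo) / span if span != 0 else 0.0
        out.append(new_min + frac * (new_max - new_min))
    return out


sparse = {1: 6.0}  # the vector [0.0, 6.0, 0.0]
divided = divide_sparse(sparse, [2.0, 3.0, 4.0])
rescaled = rescale_to_dense(
    sparse, col_min=[-2.0, 2.0, -1.0], col_max=[2.0, 10.0, 3.0]
)
```

Here `divided` still has a single stored entry, while `rescaled` has a
nonzero value in every position, which is why a rescaler would emit dense
output regardless of the input type.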


--
This message was sent by Atlassian JIRA
(v6.2#6252)
