Hi
It seems that the output of MLlib's StandardScaler(withMean=True,
withStd=True) is not as expected.
The above configuration is expected to do the following transformation:
Y = (X - Mean) / Std    (Eq. 1)
This transformation (a.k.a. Standardization) should result in a
"standardized" vector with unit-variance and zero-mean.
I'll demonstrate my claim using the current documentation example:
>>> vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]
>>> dataset = sc.parallelize(vs)
>>> standardizer = StandardScaler(True, True)
>>> model = standardizer.fit(dataset)
>>> result = model.transform(dataset)
>>> for r in result.collect(): print(r)
DenseVector([-0.7071, 0.7071, -0.7071])
DenseVector([0.7071, -0.7071, 0.7071])
This results in std = sqrt(1/2) for each column instead of std = 1.
Applying the standardization transformation (Eq. 1) to the above 2 vectors
should instead result in the following output:
DenseVector([-1.0, 1.0, -1.0])
DenseVector([1.0, -1.0, 1.0])
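Both sets of numbers can be reproduced without Spark. The following is a minimal sketch in plain Python, where `standardize` and its `ddof` parameter are hypothetical names introduced only for this illustration. It applies Eq. 1 column by column and differs only in whether the variance is divided by n (`ddof=0`) or by n-1 (`ddof=1`). That the n-1 variant matches the MLlib output is an inference from the numbers, not something confirmed against the MLlib source.

```python
import math


def standardize(rows, ddof):
    """Apply Y = (X - Mean) / Std to each column.

    ddof=0 divides the variance by n (population std);
    ddof=1 divides it by n - 1 (sample std).
    Hypothetical helper for illustration, not part of MLlib.
    """
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / (n - ddof))
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


rows = [[-2.0, 2.3, 0.0], [3.8, 0.0, 1.9]]

population = standardize(rows, ddof=0)  # the output expected in this report
sample = standardize(rows, ddof=1)      # matches the values MLlib returned

print([[round(v, 4) for v in r] for r in population])
# [[-1.0, 1.0, -1.0], [1.0, -1.0, 1.0]]
print([[round(v, 4) for v in r] for r in sample])
# [[-0.7071, 0.7071, -0.7071], [0.7071, -0.7071, 0.7071]]
```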
Another example:
Adding another DenseVector([2.4, 0.8, 3.5]) to the above, we get 3
rows of DenseVectors:
[DenseVector([-2.0, 2.3, 0.0]), DenseVector([3.8, 0.0, 1.9]),
DenseVector([2.4, 0.8, 3.5])]
StandardScaler returns the following scaled vectors:
[DenseVector([-1.12339, 1.084829, -1.02731]),
DenseVector([0.792982, -0.88499, 0.057073]),
DenseVector([0.330409, -0.19984, 0.970241])]
This result has std = sqrt(2/3) for each column.
Instead, it should have produced 3 vectors with std = 1 for each column.
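The std = sqrt(2/3) figure can be checked directly on the three scaled vectors reported above. This is a minimal sketch using only the Python standard library: `statistics.pstdev` divides the variance by n, while `statistics.stdev` divides by n-1. The variable names are only for illustration.

```python
import math
import statistics

# Scaled vectors as returned by StandardScaler for the 3-row example.
scaled = [
    [-1.12339, 1.084829, -1.02731],
    [0.792982, -0.88499, 0.057073],
    [0.330409, -0.19984, 0.970241],
]

n = len(scaled)
columns = list(zip(*scaled))

pop_stds = [statistics.pstdev(c) for c in columns]    # variance divided by n
sample_stds = [statistics.stdev(c) for c in columns]  # variance divided by n - 1

print("population std per column:", pop_stds)
print("sqrt((n - 1) / n):        ", math.sqrt((n - 1) / n))
print("sample std per column:    ", sample_stds)
```

Up to the rounding of the reported values, the population std of every column equals sqrt(2/3), while the sample std equals 1.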
Adding another vector (4 total) results in 4 scaled vectors with
std = sqrt(3/4) for each column instead of std = 1.
I hope these examples make my point clear.
I hope I am not missing something here.
Thank you
Gilad Barkan