Hi,

It seems that the output of MLlib's *StandardScaler*(*withMean*=True, *withStd*=True) is not as expected.

The above configuration is expected to perform the following transformation:

    X -> Y = (X - Mean) / Std        (Eq. 1)

This transformation (a.k.a. standardization) should result in a "standardized" vector with unit variance and zero mean.

I'll demonstrate my claim using the current documentation example:

    >>> vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]
    >>> dataset = sc.parallelize(vs)
    >>> standardizer = StandardScaler(True, True)
    >>> model = standardizer.fit(dataset)
    >>> result = model.transform(dataset)
    >>> for r in result.collect(): print r
    DenseVector([-0.7071, 0.7071, -0.7071])
    DenseVector([0.7071, -0.7071, 0.7071])

This results in std = sqrt(1/2) for each column instead of std = 1.

Applying the standardization transformation (Eq. 1) to the above 2 vectors should result in the following output:

    DenseVector([-1.0, 1.0, -1.0])
    DenseVector([1.0, -1.0, 1.0])

Another example: adding another DenseVector([2.4, 0.8, 3.5]) to the above, we get 3 rows of DenseVectors:

    [DenseVector([-2.0, 2.3, 0.0]), DenseVector([3.8, 0.0, 1.9]), DenseVector([2.4, 0.8, 3.5])]

The StandardScaler returns the following scaled vectors:

    [DenseVector([-1.12339, 1.084829, -1.02731]), DenseVector([0.792982, -0.88499, 0.057073]), DenseVector([0.330409, -0.19984, 0.970241])]

This result has std = sqrt(2/3). Instead, it should have produced 3 other vectors with std = 1 for each column.

Adding another vector (4 total) results in 4 scaled vectors with std = sqrt(3/4) instead of std = 1.

I hope these examples make my point clear, and that I'm not missing something here.

Thank you,
Gilad Barkan
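
P.S. Here is a minimal sketch, in plain Python rather than Spark, of one way to check the numbers above. It rests on an assumption that is not confirmed anywhere in this message: that the discrepancy comes from which estimator is used for Std in Eq. 1, the population formula (divide by N) or the sample formula (divide by N - 1). The helper names `standardize` and `fmt` are made up for illustration and are not part of MLlib.

```python
from statistics import mean, pstdev, stdev

# Rows from the documentation example quoted above.
rows = [[-2.0, 2.3, 0.0], [3.8, 0.0, 1.9]]


def standardize(rows, std_fn):
    """Apply Eq. 1 column by column, using std_fn as the Std estimator."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [std_fn(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]


def fmt(matrix):
    """Round every entry to 4 decimals for easier comparison."""
    return [[round(v, 4) for v in row] for row in matrix]


# Assumption: Std is the population standard deviation (divides by N).
by_population = standardize(rows, pstdev)

# Assumption: Std is the sample standard deviation (divides by N - 1).
by_sample = standardize(rows, stdev)

print(fmt(by_population))  # [[-1.0, 1.0, -1.0], [1.0, -1.0, 1.0]]
print(fmt(by_sample))      # [[-0.7071, 0.7071, -0.7071], [0.7071, -0.7071, 0.7071]]

# Population std of the first column after sample-std scaling.
print(round(pstdev([r[0] for r in by_sample]), 4))  # 0.7071, i.e. sqrt(1/2)
```

If that assumption holds, the divide-by-N version gives the vectors I expected, and the divide-by-(N - 1) version gives the vectors StandardScaler printed. It would also be consistent with the sqrt(2/3) and sqrt(3/4) values in the 3-row and 4-row cases, since each equals sqrt((N - 1)/N).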