The doc is reporting the sample variance, not the population variance (i.e. it divides by n-1, not n). I think that's justified, as the data are notionally a sample from a population.
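To make the difference concrete, here is a minimal sketch in plain Python (standard library only, not Spark). The `column` values are hypothetical and chosen only for illustration; the two lists of variances are the figures quoted in this thread. If the divisor is the only difference, each ratio of recalculated to documented variance should sit near (n-1)/n, which is 5/6 when there are six rows.

```python
import statistics

# Hypothetical column of six values, for illustration only (not from the thread).
column = [6.0, 0.0, 0.0, 0.0, 8.0, 8.0]

sample_var = statistics.variance(column)       # divides by n - 1
population_var = statistics.pvariance(column)  # divides by n

print(round(sample_var, 2), round(population_var, 2))

# Figures quoted in the thread: the doc's values and the recalculated ones.
doc_values = [16.67, 0.67, 8.17, 10.17, 5.07, 11.47]
recalculated = [13.89, 0.56, 6.81, 8.47, 4.22, 9.56]

# If only the divisor differs, each ratio should be close to (n - 1) / n.
ratios = [r / d for r, d in zip(recalculated, doc_values)]
print([round(x, 3) for x in ratios])
```

The ratios all land close to 5/6, which is what you would expect if the two sets of numbers differ only in dividing by n-1 versus n with six rows.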
On Thu, Sep 29, 2022 at 9:21 PM 姜鑫 <jiangxin...@gmail.com> wrote:

> Hi folks,
>
> Has anyone used VarianceThresholdSelector, as described at
> https://spark.apache.org/docs/latest/ml-features.html#variancethresholdselector
> ?
> In the doc, an example is given and says `The variance for the 6 features
> are 16.67, 0.67, 8.17, 10.17, 5.07, and 11.47 respectively`, but after
> calculating I found that the variance should be 13.89, 0.56, 6.81, 8.47,
> 4.22, 9.56, and there should be only 3 columns selected. Am I doing
> something wrong, or is this a bug?
>
>
> Regards,
> Xin
>