Shashi,

You are correct that this can be a problem, especially with vectors that have a large number of elements that are zero but not known to be such.
The definition as it stands is roughly an L^0 normalization. It is more common in clustering to use an L^1 or L^2 normalization. This would divide the terms by, respectively, the sum of the elements or the square root of the sum of the squares of the elements. Both L^1 and L^2 normalization avoid the problem you mention, since negligibly small elements will not contribute significantly to the norm.

Traditionally, L^2 norms are used with documents. This dates back to Salton and the term-vector model of text retrieval. That practice was, however, based on somewhat inappropriate geometric intuitions, and other norms are quite plausibly more appropriate. For instance, if normalized term frequencies are considered to be estimates of word generation probabilities, then the L^1 norm is much more appropriate. A rough sketch of both normalizations is appended below the quoted message.

On Wed, May 27, 2009 at 11:52 PM, Shashikant Kore <[email protected]> wrote:

> ...
> My concern in the following code is that the total is divided by
> numPoints. For a term, only a few of the numPoints vectors have
> contributed towards the weight. The rest had the value set to zero. That
> drags down the average, and it is much more pronounced in a large set of
> sparse vectors.
>
>
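To make the comparison concrete, here is a rough, self-contained sketch of the two normalizations. It is plain Java with a Map standing in for a sparse term vector; it is not Mahout's actual Vector API, just an illustration of the arithmetic.

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only -- a plain Map<termId, weight> stands in for a
// sparse term vector here rather than any real Mahout class.
public class NormalizationSketch {

  // L^1 normalization: divide each element by the sum of absolute values.
  // The resulting weights of a term-frequency vector can then be read as
  // estimates of word generation probabilities.
  static Map<Integer, Double> l1Normalize(Map<Integer, Double> v) {
    double sum = 0.0;
    for (double x : v.values()) {
      sum += Math.abs(x);
    }
    return scale(v, sum);
  }

  // L^2 normalization: divide each element by the square root of the sum
  // of squares. This is the traditional Salton-style document normalization.
  static Map<Integer, Double> l2Normalize(Map<Integer, Double> v) {
    double sumSq = 0.0;
    for (double x : v.values()) {
      sumSq += x * x;
    }
    return scale(v, Math.sqrt(sumSq));
  }

  private static Map<Integer, Double> scale(Map<Integer, Double> v, double norm) {
    Map<Integer, Double> result = new HashMap<Integer, Double>();
    if (norm == 0.0) {
      return result;  // an all-zero vector stays all-zero
    }
    for (Map.Entry<Integer, Double> e : v.entrySet()) {
      result.put(e.getKey(), e.getValue() / norm);
    }
    return result;
  }

  public static void main(String[] args) {
    // Only the non-zero terms are stored, so terms that happen to be zero
    // in a given document never enter the norm at all.
    Map<Integer, Double> tf = new HashMap<Integer, Double>();
    tf.put(3, 2.0);   // term id -> raw term frequency
    tf.put(17, 1.0);
    tf.put(42, 1.0);
    System.out.println("L1: " + l1Normalize(tf));  // weights sum to 1
    System.out.println("L2: " + l2Normalize(tf));  // squared weights sum to 1
  }
}

With either norm, a term that is zero in a document simply never contributes to the normalizer, so it cannot drag the weights down the way dividing a per-term total by numPoints does.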
