On Mon, Nov 5, 2012 at 6:03 PM, Ted Dunning <[email protected]> wrote:

> On Mon, Nov 5, 2012 at 4:44 AM, Dan Filimon <[email protected]
> >wrote:
>
> >
> > Ted told me that Mahout Centroids [1] are Weighted vectors that
> > additionally perform a Welford-style update of a vector.
> >
>
> I think that there may be an older Centroid definition that is different
> from this.
>


> So, in the code, for an existing Centroid c, with weight w_c, updating it
> > with a new Vector v whose weight is w_v, the result of an "update" is:
> >
> > (w_c * c[i] + w_v * v[i]) / (w_c + w_v), for all elements (i is the
> index)
> >
>
> Correct.
>
>
> > Since weights actually mean the number of elements in a certain cluster,
> > merging two clusters is exactly the operation described above.
> >
> > Why is this called a Welford update?
> >
>
> See here
>
> http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm


Ah, thank you! I've seen this in the Mahout clustering code.
By the older kind of Centroid, I think you're referring to AbstractCluster
from mahout/clustering/AbstractCluster.
I saw an issue on JIRA about refactoring that computation to use the
OnlineGaussianAccumulator class (I'd like to do that eventually :.


> > Also, why is the original Vector's function named assign?
>
>
> This is just one of several assign functions.  The name assign comes from
> the original name used in the Colt library and is intended to indicate that
> it is a destructive operation rather than a copying operation like times().
>
> It's really an implementation of the higher order function zipwith [2].
> >
>
> I don't know that zipwith is a more common name.  Haskell has historically
> had a very small community.
>

That's fair. Not to be a functional troll, but having a bit of background
makes it clear what the function does without having to read the code. :P
But I think documenting its behavior is the better approach.

Reply via email to