Re: [math] Discuss: New feature MiniBatchKMeansClusterer

Gilles Sadowski Thu, 05 Mar 2020 02:45:26 -0800

Hello.

2020-03-05 4:50 UTC+01:00, chentao...@qq.com <chentao...@qq.com>:
> Hi,
>
>>Hi.
>>
>>Le ven. 28 févr. 2020 à 05:04, chentao...@qq.com <chentao...@qq.com> a
>> écrit :
>>>
>>>
>>>
>>>
>>> --------------
>>> chentao...@qq.com
>>> >Hi.
>>> >
>>> >Le jeu. 27 févr. 2020 à 06:17, chentao...@qq.com <chentao...@qq.com> a
>>> > écrit :
>>> >>
>>> >> Hi,
>>> >>
>>> >> > [...]
>>> >> >> >>
>>> >> >> >> Do you mean I should fire a JIRA issue about
>>> >> >> >> reuse&nbsp;"centroidOf" and "chooseInitialCenters",
>>> >> >> >> then start a PR and a disscuss about "ClusterUtils"?
>>> >> >> >> And then&nbsp;start the PR of "MiniBatchKMeansClusterer" after
>>> >> >> >> all done?
>>> >> >> >
>>> >> >> >I cannot guarantee that the whole process will be streamlined.
>>> >> >> >In effect, you can work on multiple branches (one for each
>>> >> >> >prospective PR).
>>> >> >> >I'd say that you should start by describing (here on the ML) the
>>> >> >> >rationale for "ClusterUtils" (and contrast it with say, a common
>>> >> >> >base class).
>>> >> >> >[Only when the design has been agreed on,  a JIRA issue to
>>> >> >> >implement it should be created in order to track the actual
>>> >> >> >coding work).]
>>> >> >>
>>> >> >> OK, I think we should start from here:
>>> >> >>
>>> >> >> The method "centroidOf"  and "chooseInitialCenters" in
>>> >> >> KMeansPlusPlusClusterer
>>> >> >>  could be reused by other KMeans Clusterer like
>>> >> >> MiniBatchKMeansClusterer which I want to implement.
>>> >> >>
>>> >> >> There are two solution for reuse "centroidOf"  and
>>> >> >> "chooseInitialCenters":
>>> >> >> 1. Extract a abstract class for KMeans Clusterer named
>>> >> >> "AbstractKMeansClusterer",
>>> >> >>  and move "centroidOf"  and "chooseInitialCenters" as protected
>>> >> >> methods in it;
>>> >> >>  the EmptyClusterStrategy and related logic can also move to the
>>> >> >> "AbstractKMeansClusterer".
>>> >> >> 2. Create a static utility class, and move "centroidOf"  and
>>> >> >> "chooseInitialCenters" in it,
>>> >> >>  and some useful clustering method like predict(Predict which
>>> >> >> cluster is best for a specified point) can put in it.
>>> >> >>
>>> >> >
>>> >> >At first sight, I prefer option 1.
>>> >> >Indeed, o.a things "chooseInitialCenters" is a method that is of no
>>> >> > interest to
>>> >> >users of the functionality (and so should not be part of the "public"
>>> >> > API).
>>> >>
>>> >> Persuasive explain, and I agree with you, that extract a abstract
>>> >> class for KMeans is better.
>>> >> And how can we make a conclusion?
>>> >> ---------------------------------------------
>>> >>
>>> >> Mention the "public API", I suppose there should be a series of
>>> >> "CentroidInitializer",
>>> >>  that "chooseInitialCenters" with various of algorithms.
>>> >> The k-means++ cluster algorithm is a special implementation of k-means
>>> >>  which initialize cluster centers with k-means++ algorithm.
>>> >> So if there is a "CentroidInitializer", "KMeansPlusPlusClusterer" can
>>> >> be "KMeansClusterer"
>>> >>  with a "KMeansPlusPlusCentroidInitializer" strategy.
>>> >> When "KMeansClusterer" initialize with a "RandomCentroidInitializer",
>>> >> it is a common k-means.
>>> >>
>>> >> ----------------------------------------------------------
>>> >> >Method "centroidOf" looks generally useful.  Shouldn't it be part of
>>> >> >the "Cluster"
>>> >> >interface?  What is the difference with method "getCenter" (define by
>>> >> > class
>>> >> >"CentroidCluster")?
>>> >>
>>> >> My understanding is,:
>>> >>  * "Cluster" is a data class that carry the result of a clustering,
>>> >> "getCenter" is just a get method of CentroidCluster for get the value
>>> >> of a center point.
>>> >>  * "Cluster[er]" is a (Interface of )algorithm that classify points to
>>> >> sets of Cluster.
>>> >>  * "CentroidCluster" is the result of a group of special Clusterer
>>> >> algorithm like k-means,
>>> >>  "centroidOf" is a specific logic to calculate the center point for a
>>> >> collection of points.
>>> >> [Instead the DBScan cluster algorithm dose not care about the
>>> >> "Centroid"]
>>> >>
>>> >> So, "centroidOf" may be a method of "CentroidCluster[er]"(not exists
>>> >> yet),
>>> >>  but different with "CentroidCluster.getCenter".
>>> >
>>> >I may be missing something about the existing design,
>>> >but it seems strange that "CentroidCluster" is initialized
>>> >with a given "center", yet it is possible to add points after
>>> >initialization (which IIUC would invalidate the "center").
>>>
>>> The "centroidOf" could be part of "CentroidCluster",
>>> but I think the existsing desgin was focus on decouple of
>>> "DistanceMeasure"("centroidOf" depends on it) and "CentroidCluster".
>>
>>I don't see why we need both "Cluster" and "CentroidCluster".
>>Indeed, as suggested before, the "center" can be computed
>>from a "Cluster", but does not need to be stored in it.
>
> Typical usecase for a List<CentroidCluster> is, when we need classify a new
> point,


Is there a use for "Cluster" instances that do not have a "center"?

> calculate the distance of the new point to each CentroidCluster.center is
> the simplest,
> and the center should be cached.

Yes, but not necessarily within a dedicated class.
I agree that there should be an easy way (API-wise) to classify
a point wrt a list of clusters but we should avoid the potential
inconsistency of a mutable "Cluster" instance.
Efficiency during cluster building (a.o. caching the current center)
is a separate issue from using the resulting list of clusters.

>
>>
>>>
>>> Center recalculate often happens in each iteration of k-means Clustering,
>>> always with points reassign to clusters.
>>> We often use k-means as two pharse:
>>> Pharse 1: Training, classify thousands of points to set of clusters.
>>> Pharse 2: Predict, predict which cluster is best for a new point,
>>> or add a new point to the best cluster in ClusterSet,
>>
>>Method "cluster" returns a "List<Cluster>"; there is no need for a
>>new "ClusterSet" class.
>>Also, IIUC the centers can be collected into a "List<Clusterable>",
>>so that the association is through the index into the list(s).
>>
>>> but we never update the cluster center until next retraining.
>>
>>IMO, that's the reason for *not*" storing the center (in such a
>>mutable instance).
>>
>>>
>>> The KMeansPlusPlusClusterer and other Cluster algorithm in "commons-math"
>>> just design for pharse "Training",
>>> it is clearly if we can consider "CentroidCluster" as a pure data class
>>> just for k-means clustering result.
>>
>>See above.
>>
>>Discussing the existing design further, I think that the "cluster" method
>> should
>>rather be:
>>---CUT---
>>public List<Cluster<T>> cluster(Collections<T> points, DistanceMeasure
>> dist)
>>---CUT---
>>
>>And, similarly,
>>---CUT---
>>@FunctionalInterface
>>public interface ClusterFinder<T extends Clusterable> {
>>    public int indexOf(T point, List<Cluster<T> clusters, DistanceMeasure
>> dist);
>>}
>>---CUT---
>
> I think there is a balance between flexibility, efficiency and usability.
> The DistanceMeasure should be assigned once for a kmeans,

Of course, during learning, only one distance measure is to
be used.
But it does not imply that it must be an instance field of the
clustering algorithm.

> and can not change
> in exists clusters.

The "Cluster" class does not define a "DistanceMeasure" field.

> So DistanceMeasure should be a property of Cluster, but redundant for a List
> of Cluster.

Above, you said that it must be a property of "kmeans"...

> Consider the Cluster fewly use separately, the DistanceMeasure should be a
> readonly property of List<Cluster>,

Why?
shouldn't a user be able to query the same list of clusters
using different distance measures?

> If there is a ClusterSet, so we can serialize and unserialize the Clusters
> correctly and with no redundancy.

I don't know about "ClusterSet"; what is its interface?

Anyways, "serialization" is yet another issue, and it should be
dealt with separately from the usage API.
Do you suggest to (de)serialize "Cluster" objects (i.e. instances
of an unknown type)?

> As a comparison, dl4j's design is easy to use for predict, especial the
> ClusterSet
> but leak of  flexibility(stroke coupling to nd4j) and efficiency(complexity
> on nd4j implemention):
> https://github.com/eclipse/deeplearning4j/tree/master/deeplearning4j/deeplearning4j-nearestneighbors-parent/nearestneighbor-core/src/main/java/org/deeplearning4j/clustering

Please show stripped down versions of the API that illustrates
the points which you want to highlight.  Thanks.

>>
>>> If we want the cluster result useful enough for parse "Predict",
>>>  the result of "KMeansPlusPlusClusterer.cluster" should return a
>>> "ClusterSet":
>>> ```java
>>> public interface ClusterSet<T extends Clusterable> extends Collection<T>
>>> {
>>>   // Retrun the cluster which the point should belong to.
>>>   Cluster predict(T point);
>>>   // Add a point to best cluster.
>>>   void addPoint(T point);
>>> }
>>> ```
>>
>>This  "ClusterSet" seems less flexible than a "List<Cluster>".
>
> "ClusterSet" is useful for predict as I mentioned above.

Cf. previous comment.

>>
>>> And "centroidOf"(just used in clustering iteration) can move up into a
>>> abstract class like "CenroidClusterer".
>>
>>It seems that this method could be useful for users too.
>>
>
> May be "centroidOf" is useful, but now it is just used in kmeans.

Is "Kmeans" the only algorithm that computes clusters centers?
Do users never need to compute a center?

Best regards,
Gilles

>>> >It would seem that "center" should be a property computed
>>> >from the contents of "Cluster" e.g.:
>>> >
>>> >@FunctionalInterface
>>> >public interface ClusterCenterComputer<T extends Clusterable> {
>>> >    T centroidOf(Cluster<T> cluster);
>>> >}
>>> >

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Re: [math] Discuss: New feature MiniBatchKMeansClusterer

Reply via email to