The contingency matrix (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cluster.contingency_matrix.html) counts how many times each pair of (true cluster, predicted cluster) occurs. It is a sufficient statistic for every "supervised" (i.e. ground truth-based) clustering evaluation metric in scikit-learn. In an incremental setting, you can simply add each new batch's counts to the running contingency matrix, so the full set of labels never has to be held in memory at once. In https://github.com/scikit-learn/scikit-learn/issues/8103 I proposed that we provide an API for calculating clustering metrics from the sufficient statistics alone, but it has not come to fruition.
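To make the incremental idea concrete, here is a rough sketch in plain Python. It is not scikit-learn API: the helper names `update_contingency` and `adjusted_rand_from_contingency` are made up for illustration, and the contingency matrix is stored as a `Counter` keyed by (true label, predicted label) so that batches with different label sets can be added without worrying about matrix shapes. The adjusted Rand index is used as one example of a metric that needs only those counts.

```python
from collections import Counter
from math import comb


def update_contingency(counts, labels_true, labels_pred):
    """Add one batch of (true, predicted) label pairs to the running counts.

    Hypothetical helper, not part of scikit-learn.
    """
    counts.update(zip(labels_true, labels_pred))
    return counts


def adjusted_rand_from_contingency(counts):
    """Adjusted Rand index computed from the contingency counts alone.

    Hypothetical helper, not part of scikit-learn.
    """
    row_sums = Counter()  # number of samples in each true cluster
    col_sums = Counter()  # number of samples in each predicted cluster
    for (t, p), n_tp in counts.items():
        row_sums[t] += n_tp
        col_sums[p] += n_tp
    n = sum(row_sums.values())

    # Numbers of sample pairs that share a cluster in both / true / predicted
    pairs_same_both = sum(comb(v, 2) for v in counts.values())
    pairs_same_true = sum(comb(v, 2) for v in row_sums.values())
    pairs_same_pred = sum(comb(v, 2) for v in col_sums.values())
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: treated as trivially perfect

    expected = pairs_same_true * pairs_same_pred / total_pairs
    max_index = (pairs_same_true + pairs_same_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (pairs_same_both - expected) / (max_index - expected)


# Two batches arriving one after the other; only the counts are kept.
counts = Counter()
batches = [
    ([0, 0, 1], [1, 1, 0]),
    ([1, 2, 2], [0, 2, 2]),
]
for y_true, y_pred in batches:
    update_contingency(counts, y_true, y_pred)

ari = adjusted_rand_from_contingency(counts)
print(ari)
```

The predicted labels in this toy example are just a relabelling of the true ones, so the score is that of a perfect clustering. The same accumulate-then-score pattern should apply to the other ground truth-based metrics, since they are all functions of the same counts.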

On Thu, 16 May 2019 at 11:47, lampahome <pahome.c...@mirlab.org> wrote:

> Joel Nothman <joel.noth...@gmail.com> wrote on Wed, 15 May 2019 at 12:16 PM:
>
>> Evaluating on large datasets is easy if the sufficient statistics are
>> just the contingency matrix.
>>
>>
> Sorry, I don't understand it. Can you explain detailly?
> You mean we could take subset of samples to evaluating if subset is
> contingency(normal distribution) matrix?
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>