Sounds like you need to use Spark. This project looks promising: https://github.com/xiaocai00/SparkPinkMST
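
For reference, here is a minimal sketch of the workflow quoted below (hierarchical clustering, then checking purity on a small labelled sample). It assumes the data fits in memory, so it does not solve the out-of-core case by itself. The synthetic data from make_blobs and the helper name mixed_cluster_fraction are my own placeholders, not something from the linked video or project.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic stand-in data (hypothetical; replace with your own features).
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# Hierarchical clustering on the full, in-memory dataset.
cluster_ids = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Pretend only a small random sample was labelled by hand.
rng = np.random.default_rng(0)
sample = rng.choice(len(X), size=30, replace=False)


def mixed_cluster_fraction(cluster_ids, labels):
    """Fraction of clusters whose labelled points carry more than one label."""
    clusters = np.unique(cluster_ids)
    mixed = sum(len(np.unique(labels[cluster_ids == c])) > 1 for c in clusters)
    return mixed / len(clusters)


frac_mixed = mixed_cluster_fraction(cluster_ids[sample], y_true[sample])
print(f"{frac_mixed:.0%} of sampled clusters contain mixed labels")
```

A lower fraction means purer clusters. If it comes out high, that is the signal to adjust n_clusters or the linkage and try again.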

On Tue, May 14, 2019 at 5:12 AM lampahome <pahome.c...@mirlab.org> wrote:

> Uri Goren <ugo...@gmail.com> wrote on Fri, May 3, 2019 at 7:29 PM:
>
>> I usually use clustering to save costs on labelling.
>> I like to apply hierarchical clustering, and then label a small sample
>> and fine-tune the clustering algorithm.
>>
>> That way, you can evaluate the effectiveness in terms of cluster purity
>> (how many clusters contain mixed labels)
>>
>> See an example with sklearn here:
>> https://youtu.be/GM8L324MuHc?list=PLqkckaeDLF4IDdKltyBwx8jLaz5nwDPQU
>>
>> But if my dataset is too large to load into memory, will it work?
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn