What happens with 8 million points and 1000 threads? Or 8 billion points?
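To put rough numbers on that question, here is a back-of-the-envelope sketch in Python. It starts from the figure quoted below (about 8k points in about 300 seconds on one core) and extrapolates under two assumptions that are mine, not measurements: that cost grows as the square of the number of points, as it does for a naive DBSCAN that compares every pair of points without a spatial index, and that adding workers gives ideal linear speedup. The function names are hypothetical and exist only for this illustration.

```python
def projected_seconds(base_points, base_seconds, points, workers, exponent=2.0):
    """Extrapolate wall-clock time from one measured run.

    Assumes cost ~ points**exponent (exponent=2.0 models all-pairs distance
    computation) and perfect linear speedup across `workers`. Both are
    simplifying assumptions, not properties of any particular implementation.
    """
    scale = (points / base_points) ** exponent
    return base_seconds * scale / workers


def speedup(serial_seconds, parallel_seconds):
    """Observed speedup of a parallel run over the serial run."""
    return serial_seconds / parallel_seconds


if __name__ == "__main__":
    # Figures reported in the thread: 8k points, ~300 s serial,
    # ~100-120 s with 3 parallel workers.
    print("speedup with 3 workers:", speedup(300, 120), "to", speedup(300, 100))

    # The two scenarios asked about, each with 1000 workers.
    for n in (8_000_000, 8_000_000_000):
        secs = projected_seconds(8_000, 300, n, workers=1000)
        print(f"{n:>13,d} points: {secs:.3g} s (~{secs / 86_400:.3g} days)")
```

Under these assumptions, 8 million points on 1000 workers comes to about 3e5 seconds (roughly 3.5 days), and 8 billion points to about 3e11 seconds (roughly 9,500 years). With a spatial index the growth can be much closer to n log n in favorable cases, so passing a smaller `exponent` gives a more optimistic bound. The point is only that near-linear speedup in the number of workers does not by itself settle how the algorithm scales in the number of points.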
On Sun, Nov 30, 2014 at 1:10 PM, 3316 Chirag Nagpal <chiragnagpal_12...@aitpune.edu.in> wrote:

> Hi Ted,
>
> Thanks for the reply.
>
> I have been using DBSCAN (in Python), the one implemented in the
> scikit-learn package. For a dataset with about 8k points, the running
> time on my Intel i7 4700QM comes to around 300 seconds.
>
> I have implemented a parallel version using the Python multiprocessing
> library, and the running time comes down to about 100-120 seconds when
> I use 3 parallel threads.
>
> Thus the scale-up is almost 'n'. I think scalability should not be an
> issue for a MapReduce implementation.
>
> Chirag Nagpal
> University of Pune, India
> www.chiragnagpal.com
> ________________________________________
> From: Ted Dunning <ted.dunn...@gmail.com>
> Sent: Sunday, November 30, 2014 6:29 PM
> To: user@mahout.apache.org
> Subject: Re: DBSCAN implementation in Mahout
>
> On Sat, Nov 29, 2014 at 8:31 PM, 3316 Chirag Nagpal <
> chiragnagpal_12...@aitpune.edu.in> wrote:
>
> > Since density-based clustering algorithms are being utilised
> > extensively, especially by the GIS research groups, it is a bit sad
> > that there isn't a MapReduce implementation available.
> >
> > I think I will propose to write MapReduce code for DBSCAN and OPTICS
> > for GSoC '15.
> >
> > I would like to take your input as to how much significance this
> > would have for the community in general.
>
> We have had proposals to add this to Mahout, but as far as I remember, no
> credible requests to use it.
>
> Also, there is the question of scalability of DBSCAN-like algorithms.