Re: DBSCAN implementation in Mahout

Ted Dunning Sun, 30 Nov 2014 12:48:23 -0800

What happens with 8 million points and 1000 threads?

Or 8 billion points?
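
To put rough numbers on these questions, here is a back-of-envelope sketch that extrapolates from the timing quoted below (about 8k points in about 300 seconds). It assumes the work grows quadratically with the number of points, as it would for a DBSCAN that compares all pairs without a spatial index, and that the speedup across workers is perfectly linear. Both are assumptions made for illustration, not measurements, and the function name and `exponent` parameter are hypothetical.

```python
# Back-of-envelope extrapolation; every modeling choice here is an assumption.
BASE_POINTS = 8_000    # dataset size reported in the thread
BASE_SECONDS = 300.0   # single-process running time reported in the thread


def estimated_seconds(n_points, n_workers, exponent=2.0):
    """Scale the reported timing by (n / n0) ** exponent, then divide by workers.

    exponent=2.0 models an all-pairs neighborhood search (assumed).
    Dividing by n_workers assumes ideal linear speedup, which is optimistic.
    """
    work = BASE_SECONDS * (n_points / BASE_POINTS) ** exponent
    return work / n_workers


for n, w in [(8_000, 3), (8_000_000, 1000), (8_000_000_000, 1000)]:
    secs = estimated_seconds(n, w)
    print(f"{n:>13,} points, {w:>4} workers: {secs:.3g} s ({secs / 86400:.3g} days)")
```

Under these assumptions, 8 million points on 1000 workers still comes to several days, and 8 billion points is out of reach. A lower `exponent` (for example, if a spatial index brings the neighborhood search closer to n log n) changes the picture considerably, which is why the scaling behavior matters more than the speedup on a small dataset.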


On Sun, Nov 30, 2014 at 1:10 PM, 3316 Chirag Nagpal <
chiragnagpal_12...@aitpune.edu.in> wrote:

> Hi Ted,
>
> Thanks for the reply.
>
> I have been using DBSCAN in Python, the implementation in the scikit-learn
> package. For a dataset with about 8k points, the running time on my Intel
> i7-4700QM comes to around 300 seconds.
>
> I have implemented a parallel version using the Python multiprocessing
> library, and the running time comes down to about 100-120 seconds when I
> use 3 parallel threads.
>
> Thus the speedup is almost linear in the number of threads, 'n'. I think
> scalability should not be an issue for a MapReduce implementation.
>
> Chirag Nagpal
> University of Pune, India
> www.chiragnagpal.com
> ________________________________________
> From: Ted Dunning <ted.dunn...@gmail.com>
> Sent: Sunday, November 30, 2014 6:29 PM
> To: user@mahout.apache.org
> Subject: Re: DBSCAN implementation in Mahout
>
> On Sat, Nov 29, 2014 at 8:31 PM, 3316 Chirag Nagpal <
> chiragnagpal_12...@aitpune.edu.in> wrote:
>
> > Since density-based clustering algorithms are being utilised extensively,
> > especially by the GIS research groups, it is a bit sad that there isn't a
> > MapReduce implementation available.
> >
> > I think I will propose to write MapReduce code for DBSCAN and OPTICS for
> > GSoC '15.
> >
> > I would like your input on how significant this would be to the
> > community in general.
> >
>
> We have had proposals to add this to Mahout, but as far as I remember, no
> credible requests to use it.
>
> Also, there is the question of the scalability of DBSCAN-like algorithms.
>
