Re: DBSCAN implementation in Mahout

2014-11-30 Thread Ted Dunning
On Sat, Nov 29, 2014 at 8:31 PM, 3316 Chirag Nagpal <
chiragnagpal_12...@aitpune.edu.in> wrote:

> Since density-based clustering algorithms are being utilised extensively,
> especially by GIS research groups, it is a bit sad that there isn't a
> MapReduce implementation available.
>
> I think I will propose to write MapReduce code for DBSCAN and OPTICS for
> GSoC '15.
>
> I would like your input on how significant this would be to the community
> in general.
>

We have had proposals to add this to Mahout, but as far as I remember, no
credible requests to use it.

Also, there is the question of the scalability of DBSCAN-like algorithms.
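
To make the scalability question concrete, here is a minimal sketch of DBSCAN with a brute-force neighbourhood query. It is an illustration written for this thread, not Mahout or scikit-learn code, and the names (region_query, dbscan, the counter argument) are hypothetical. It assumes no spatial index, so each of the n points triggers one scan over all n points.

```python
import math

def region_query(points, i, eps, counter):
    """Return indices of all points within eps of points[i] (brute force)."""
    neighbors = []
    for j, q in enumerate(points):
        counter[0] += 1  # one distance evaluation per candidate point
        if math.dist(points[i], q) <= eps:
            neighbors.append(j)
    return neighbors

def dbscan(points, eps, min_pts):
    """Naive DBSCAN. Returns (labels, distance_evaluations); label -1 is noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    counter = [0]
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps, counter)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbors)
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins cluster, not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps, counter)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is a core point: keep expanding
    return labels, counter[0]

# Toy data: two tight groups and one outlier.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels, evaluations = dbscan(points, eps=1.5, min_pts=3)
print(labels, evaluations)  # [0, 0, 0, 1, 1, 1, -1] 49
```

Every point is queried exactly once, so the sketch performs n * n distance evaluations (49 for the 7 points above). A spatial index can reduce the cost of each query, but without one the work grows quadratically with the number of points.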


Re: DBSCAN implementation in Mahout

2014-11-30 Thread 3316 Chirag Nagpal
Hi Ted,

Thanks for the reply.

I have been using DBSCAN in Python, specifically the implementation in the
scikit-learn package. For a dataset of about 8k points, the running time on my
Intel i7-4700QM comes to around 300 seconds.

I have implemented a parallel version using the Python multiprocessing library,
and the running time comes down to about 100 to 120 seconds when I use 3
parallel threads.

Thus the speedup is almost linear in the number of threads, n. I think
scalability should not be an issue for a MapReduce implementation.
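
As a rough check of the figures quoted above, the speedup and parallel efficiency can be computed directly. This is only a sketch; the function names are hypothetical, and the inputs are the approximate timings given in this message.

```python
def speedup(serial_seconds, parallel_seconds):
    """Ratio of serial run time to parallel run time."""
    return serial_seconds / parallel_seconds

def efficiency(serial_seconds, parallel_seconds, workers):
    """Speedup divided by the number of workers (1.0 means perfectly linear)."""
    return speedup(serial_seconds, parallel_seconds) / workers

serial = 300.0  # approximate single-process run time in seconds
workers = 3
for parallel in (100.0, 120.0):  # reported range of parallel run times
    s = speedup(serial, parallel)
    e = efficiency(serial, parallel, workers)
    print(f"{parallel:.0f} s: speedup {s:.2f}x, efficiency {e:.0%}")
# 100 s: speedup 3.00x, efficiency 100%
# 120 s: speedup 2.50x, efficiency 83%
```

So on 3 workers the speedup lies between 2.5x and 3x. This is a single measurement at one dataset size and says nothing yet about larger inputs or worker counts.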

Chirag Nagpal
University of Pune, India
www.chiragnagpal.com


Re: DBSCAN implementation in Mahout

2014-11-30 Thread Ted Dunning
What happens with 8 million points and 1000 threads?

Or 8 billion points?
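
One way to put numbers on these questions is a back-of-envelope extrapolation from the 8k-point, 300-second figure reported earlier in the thread. The sketch below rests on two assumptions that are not stated anywhere in the thread: that run time grows quadratically with the number of points (as it would for pairwise distance computations without a spatial index), and that speedup stays perfectly linear in the number of workers with no communication cost. The function name and the exponent parameter are hypothetical.

```python
BASE_POINTS = 8_000    # dataset size reported in the thread
BASE_SECONDS = 300.0   # single-process run time reported in the thread

def estimated_seconds(n_points, workers, exponent=2.0):
    """Extrapolated run time, assuming cost ~ n**exponent and ideal speedup."""
    scale = (n_points / BASE_POINTS) ** exponent
    return BASE_SECONDS * scale / workers

for n, w in ((8_000_000, 1000), (8_000_000_000, 1000)):
    secs = estimated_seconds(n, w)
    print(f"{n:,} points on {w} workers: ~{secs / 86_400:,.1f} days")
```

Under these assumptions, 8 million points on 1000 workers takes about 3.5 days, and 8 billion points takes roughly 9,500 years. The exact numbers matter less than the shape: with quadratic cost, multiplying the data by 1000 multiplies the work by a million, so adding workers alone cannot keep up.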





