Re: DBSCAN implementation in Mahout

2014-12-02 Thread Dmitriy Lyubimov
Correction: MR.SCAN is the University of Wisconsin's paper. Google Beijing
had another paper on the subject, but I found MR.SCAN to have a bit more
elegant simplicity in it.



Re: DBSCAN implementation in Mahout

2014-12-01 Thread Dmitriy Lyubimov
If memory serves me, DeLiClu (density-link) is the current best density
method, since it does not require parameter searches.

What parallelization strategy are you proposing?

I know there were a bunch of attempts to parallelize/partition the DBSCAN
problem. One of the more interesting is perhaps Google's MR.SCAN paper, but
even the latter is not quite embarrassingly parallel (it requires a
partitioning overlap between subtasks that is a function of the epsilon
neighborhood). Nevertheless, it seemed to yield significantly interesting
performance.

Also, the MR version of Mahout has (or used to have) mean shift, which is just
fine, if not better, for irregularly shaped density clustering. Not sure of
its performance, though. Translations of these into Spark would perhaps be
interesting enough.
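
A minimal sketch of the partition overlap mentioned above, to show why the subtasks are not fully independent. This is not the MR.SCAN method itself. The function name `partition_with_halo`, the vertical-strip layout, and the `cell_width` parameter are assumptions made only for illustration. Each strip owns the points that fall inside it and also receives a copy of any neighboring point lying within eps of the shared edge, so a local DBSCAN run would see the complete eps-neighborhood of every point it owns.

```python
from collections import defaultdict
import math


def partition_with_halo(points, cell_width, eps):
    """Split 2-D points into vertical strips of width `cell_width`.

    Returns {strip_index: {"owned": [...], "halo": [...]}}.
    "owned" holds the points whose x coordinate falls in the strip.
    "halo" holds copies of points owned by an adjacent strip that lie
    within `eps` of the shared edge (the overlap between subtasks).
    """
    if eps > cell_width:
        # Hypothetical restriction: keeps the halo limited to adjacent strips.
        raise ValueError("this sketch assumes eps <= cell_width")

    strips = defaultdict(lambda: {"owned": [], "halo": []})
    for p in points:
        x = p[0]
        home = math.floor(x / cell_width)
        strips[home]["owned"].append(p)
        # Close to the left edge: the strip on the left needs a copy.
        if x - home * cell_width <= eps:
            strips[home - 1]["halo"].append(p)
        # Close to the right edge: the strip on the right needs a copy.
        if (home + 1) * cell_width - x <= eps:
            strips[home + 1]["halo"].append(p)
    return dict(strips)
```

Calling `partition_with_halo(points, 1.0, 0.1)` on points near x = 1.0 puts them in two strips at once. The amount of duplicated data grows with eps, and the local cluster labels still have to be merged across strips afterwards, which this sketch does not attempt.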



Re: DBSCAN implementation in Mahout

2014-11-30 Thread Ted Dunning
On Sat, Nov 29, 2014 at 8:31 PM, 3316 Chirag Nagpal 
chiragnagpal_12...@aitpune.edu.in wrote:

 Since density-based clustering algorithms are being utilised extensively,
 especially by GIS research groups, it is a bit sad that there isn't a
 MapReduce implementation available.

 I think I will propose to write MapReduce code for DBSCAN and OPTICS for
 GSoC '15.

 I would like your input on how significant this would be to the community
 in general.


We have had proposals to add this to Mahout, but as far as I remember, no
credible requests to use it.

Also, there is the question of the scalability of DBSCAN-like algorithms.


Re: DBSCAN implementation in Mahout

2014-11-30 Thread 3316 Chirag Nagpal
Hi Ted,

Thanks for the reply.

I have been using DBSCAN (in Python), the one implemented in the scikit-learn
package. For a dataset with about 8k points, the running time on my Intel
i7-4700QM comes to around 300 seconds.

I have implemented a parallel version using the Python multiprocessing library,
and the running time comes down to about 100-120 seconds when I use 3 parallel
threads.

Thus the speed-up is almost 'n' for n parallel threads. I think scalability
should not be an issue for a MapReduce implementation.

Chirag Nagpal
University of Pune, India
www.chiragnagpal.com
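
The timings quoted in the message above can be turned into a speed-up and a parallel efficiency with a few lines of arithmetic. This is only a sketch: the helper names `speedup` and `efficiency` are hypothetical, and the inputs are the approximate figures from the message (300 seconds serial, 100 to 120 seconds on 3 workers).

```python
def speedup(serial_seconds, parallel_seconds):
    """Ratio of serial to parallel wall-clock time."""
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds, parallel_seconds, workers):
    """Speed-up divided by worker count; 1.0 means perfectly linear scaling."""
    return speedup(serial_seconds, parallel_seconds) / workers


serial = 300.0   # approximate serial running time from the message
workers = 3      # number of parallel workers from the message
for parallel in (100.0, 120.0):
    s = speedup(serial, parallel)
    e = efficiency(serial, parallel, workers)
    print(f"{parallel:.0f} s on {workers} workers: "
          f"speed-up {s:.2f}x, efficiency {e:.0%}")
```

These numbers describe a single measurement at 8k points and 3 workers. By themselves they say nothing about how the speed-up behaves at larger worker counts or data sizes.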



Re: DBSCAN implementation in Mahout

2014-11-30 Thread Ted Dunning
What happens with 8 million points and 1000 threads?

Or 8 billion points?
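
One way to read the question above, assuming a brute-force neighbor search that compares every pair of points (an assumption; an implementation backed by a spatial index does less work), is to count the distance computations at each data size. The helper name `pairwise_distance_count` is hypothetical.

```python
def pairwise_distance_count(n_points):
    """Number of distinct point pairs, n * (n - 1) / 2."""
    return n_points * (n_points - 1) // 2


# The three data sizes mentioned in the thread.
for n in (8_000, 8_000_000, 8_000_000_000):
    pairs = pairwise_distance_count(n)
    print(f"{n:>13,} points -> {pairs:.3e} distinct pairs")
```

Under that assumption the work grows with the square of the point count, so going from 8k to 8 million points multiplies the pair count by roughly a million, while the worker count in the question only grows by a few hundred times.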






Re: DBSCAN implementation in Mahout

2014-11-29 Thread Dmitriy Lyubimov
No, there is no DBSCAN, OPTICS, or any other density flavor, AFAIK.



Re: DBSCAN implementation in Mahout

2014-11-29 Thread 3316 Chirag Nagpal
Hi Dmitriy,

Thanks for the reply.

Since density-based clustering algorithms are being utilised extensively,
especially by GIS research groups, it is a bit sad that there isn't a
MapReduce implementation available.

I think I will propose to write MapReduce code for DBSCAN and OPTICS for
GSoC '15.

I would like your input on how significant this would be to the community
in general.

Thanks,

Chirag Nagpal
University of Pune, India
www.chiragnagpal.com



DBSCAN implementation in Mahout

2014-11-28 Thread 3316 Chirag Nagpal
Hello,

I am Chirag Nagpal, a third-year student of Computer Engineering at the
University of Pune, India, currently interning at SERC, Indian Institute of
Science, Bangalore.

My work involves using density-based clustering algorithms like DBSCAN on
geo-referenced data such as tweets. Typically the dataset consists of millions
of points. I would like to know if there is any MapReduce implementation of
DBSCAN available.

Thank you,
Chirag