Re: DBSCAN implementation in Mahout
Correction: MR.SCAN is the University of Wisconsin's paper. The Google Beijing paper was a different one on the same subject, but I found MR.SCAN to have a bit more elegant simplicity.

On Mon, Dec 1, 2014 at 12:41 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

If memory serves me, DeLiClu (density-link) is the current best density-based approach, since it does not require parameter searches. What parallelization strategy are you proposing? I know there have been a number of attempts to parallelize/partition the DBSCAN problem. One of the more interesting is perhaps Google's MR.SCAN paper, but even that one is not quite embarrassingly parallel (it requires overlapping partitions between subtasks, and the overlap is a function of the epsilon neighborhood). Nevertheless, it seemed to yield quite interesting performance.

Also, the MR version of Mahout has (or used to have) mean shift, which is just fine, if not better, for irregularly shaped density clustering. I am not sure of its performance, though. Translations of these into Spark would perhaps be interesting enough.
Re: DBSCAN implementation in Mahout
If memory serves me, DeLiClu (density-link) is the current best density-based approach, since it does not require parameter searches. What parallelization strategy are you proposing? I know there have been a number of attempts to parallelize/partition the DBSCAN problem. One of the more interesting is perhaps Google's MR.SCAN paper, but even that one is not quite embarrassingly parallel (it requires overlapping partitions between subtasks, and the overlap is a function of the epsilon neighborhood). Nevertheless, it seemed to yield quite interesting performance.

Also, the MR version of Mahout has (or used to have) mean shift, which is just fine, if not better, for irregularly shaped density clustering. I am not sure of its performance, though. Translations of these into Spark would perhaps be interesting enough.

On Sat, Nov 29, 2014 at 12:31 PM, 3316 Chirag Nagpal chiragnagpal_12...@aitpune.edu.in wrote:

Hi Dmitriy,

Thanks for the reply. Since density-based clustering algorithms are used extensively, especially by GIS research groups, it is a bit sad that there isn't a MapReduce implementation available. I think I will propose to write MapReduce code for DBSCAN and OPTICS for GSoC '15. I would like your input on how significant this would be to the community in general.

Thanks,
Chirag Nagpal
University of Pune, India
www.chiragnagpal.com
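To make the overlap requirement mentioned above concrete, here is a minimal sketch of partitioning with an epsilon-wide halo. It is not the scheme from the MR.SCAN paper; the function name `partition_with_halo`, the strip layout, and the `cell_width` parameter are all assumptions made for illustration. Points are split into vertical strips, and any point within `eps` of a strip boundary is also copied to the neighbouring strip, so each subtask sees the complete eps-neighbourhood of the points it owns.

```python
from collections import defaultdict


def partition_with_halo(points, eps, cell_width):
    """Assign 2-D points to vertical strips of width cell_width.

    Each point is owned by exactly one strip. A point that lies within eps
    of a strip boundary is also copied into the neighbouring strip's halo.
    (Hypothetical helper for illustration, not taken from any paper.)
    """
    if cell_width < eps:
        # With narrower strips a neighbour could sit two strips away.
        raise ValueError("cell_width must be at least eps")

    parts = defaultdict(lambda: {"owned": [], "halo": []})
    for p in points:
        x = p[0]
        cell = int(x // cell_width)          # index of the owning strip
        parts[cell]["owned"].append(p)

        offset = x - cell * cell_width       # distance from the left boundary
        if offset <= eps:
            parts[cell - 1]["halo"].append(p)
        if cell_width - offset <= eps:
            parts[cell + 1]["halo"].append(p)
    return dict(parts)


if __name__ == "__main__":
    pts = [(0.5, 0.0), (0.95, 0.0), (1.05, 0.0), (2.5, 0.0)]
    for cell, part in sorted(partition_with_halo(pts, eps=0.1, cell_width=1.0).items()):
        print(cell, part)
```

The amount of duplicated data grows with `eps` relative to the strip width, which is one way to see why the problem is not embarrassingly parallel: the subtasks share boundary points, and clusters that span strips still have to be merged afterwards.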
Re: DBSCAN implementation in Mahout
On Sat, Nov 29, 2014 at 8:31 PM, 3316 Chirag Nagpal chiragnagpal_12...@aitpune.edu.in wrote:

Since density-based clustering algorithms are used extensively, especially by GIS research groups, it is a bit sad that there isn't a MapReduce implementation available. I think I will propose to write MapReduce code for DBSCAN and OPTICS for GSoC '15. I would like your input on how significant this would be to the community in general.

We have had proposals to add this to Mahout but, as far as I remember, no credible requests to use it. There is also the question of the scalability of DBSCAN-like algorithms.
Re: DBSCAN implementation in Mahout
Hi Ted,

Thanks for the reply. I have been using the DBSCAN implementation in the scikit-learn Python package. For a dataset of about 8k points, the running time on my Intel i7 4700 QM is around 300 seconds. I have implemented a parallel version using Python's multiprocessing library, and the running time comes down to about 100 to 120 seconds when I use 3 parallel threads. Thus the speedup is almost 'n', that is, nearly linear in the number of threads. I think scalability should not be an issue for a MapReduce implementation.

Chirag Nagpal
University of Pune, India
www.chiragnagpal.com

From: Ted Dunning ted.dunn...@gmail.com
Sent: Sunday, November 30, 2014 6:29 PM
To: user@mahout.apache.org
Subject: Re: DBSCAN implementation in Mahout

We have had proposals to add this to Mahout but, as far as I remember, no credible requests to use it. There is also the question of the scalability of DBSCAN-like algorithms.
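For readers following the scalability discussion, a minimal sketch of the textbook DBSCAN procedure may help show where the cost comes from. This is not scikit-learn's implementation and not the parallel version described above; the names `region_query` and `dbscan` and the brute-force neighbour search are assumptions chosen for brevity. Every neighbourhood query scans all points, so the whole run costs on the order of n squared distance computations unless a spatial index is used.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (brute force, O(n))."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # already assigned or marked as noise
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE             # may later become a border point
            continue

        cluster += 1                      # points[i] is a core point: new cluster
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster       # border point reached from a core point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is also a core point: keep expanding
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    print(dbscan(pts, eps=1.5, min_pts=3))
```

The cluster expansion step is inherently sequential within a cluster, which is why parallel variants usually partition the space first and merge clusters across partition boundaries afterwards.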
Re: DBSCAN implementation in Mahout
What happens with 8 million points and 1000 threads? Or 8 billion points?

On Sun, Nov 30, 2014 at 1:10 PM, 3316 Chirag Nagpal chiragnagpal_12...@aitpune.edu.in wrote:

Hi Ted,

Thanks for the reply. I have been using the DBSCAN implementation in the scikit-learn Python package. For a dataset of about 8k points, the running time on my Intel i7 4700 QM is around 300 seconds. I have implemented a parallel version using Python's multiprocessing library, and the running time comes down to about 100 to 120 seconds when I use 3 parallel threads. Thus the speedup is almost 'n', that is, nearly linear in the number of threads. I think scalability should not be an issue for a MapReduce implementation.

Chirag Nagpal
University of Pune, India
www.chiragnagpal.com
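The question above can be put into rough numbers with a back-of-the-envelope sketch. The assumptions here are mine, not from the thread: cost is taken to be proportional to the number of point pairs (a brute-force neighbour search), the reported figure of about 300 seconds for 8k points is used as the baseline, and the work is assumed to split perfectly across workers with no communication overhead. The function names are hypothetical.

```python
def pairwise_distance_count(n):
    """Number of distinct point pairs a brute-force neighbour search compares."""
    return n * (n - 1) // 2


def extrapolated_seconds(n, workers, base_n=8_000, base_seconds=300.0):
    """Scale the baseline time quadratically, then divide by the worker count.

    Assumptions (not from the thread): cost is proportional to the number of
    point pairs, and the work parallelizes perfectly with zero overhead.
    """
    ratio = pairwise_distance_count(n) / pairwise_distance_count(base_n)
    return base_seconds * ratio / workers


if __name__ == "__main__":
    for n, workers in [(8_000, 1), (8_000_000, 1_000), (8_000_000_000, 1_000)]:
        secs = extrapolated_seconds(n, workers)
        print(
            f"n={n:>13,} workers={workers:>5,} "
            f"pairs={pairwise_distance_count(n):.3e} "
            f"time={secs:.3e} s ({secs / 86_400:.1f} days)"
        )
```

Under these deliberately pessimistic assumptions, 8 million points on 1000 workers comes out at roughly three and a half days, and 8 billion points at thousands of years. Spatial indexing or partitioning would change the picture considerably, but the sketch shows why a near-linear speedup on 3 threads says little about behaviour at that scale.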
Re: DBSCAN implementation in Mahout
No, there is no DBSCAN, OPTICS, or any other density flavor, afaik.

Sent from my phone.

On Nov 28, 2014 11:41 AM, 3316 Chirag Nagpal chiragnagpal_12...@aitpune.edu.in wrote:

Hello,

I am Chirag Nagpal, a third-year student of Computer Engineering at the University of Pune, India, currently interning at SERC, Indian Institute of Science, Bangalore. My work involves using density-based clustering algorithms like DBSCAN on geo-referenced data such as tweets. Typically the dataset consists of millions of points. I would like to know if there is any MapReduce implementation of DBSCAN available.

Thank you,
Chirag
Re: DBSCAN implementation in Mahout
Hi Dmitriy,

Thanks for the reply. Since density-based clustering algorithms are used extensively, especially by GIS research groups, it is a bit sad that there isn't a MapReduce implementation available. I think I will propose to write MapReduce code for DBSCAN and OPTICS for GSoC '15. I would like your input on how significant this would be to the community in general.

Thanks,
Chirag Nagpal
University of Pune, India
www.chiragnagpal.com

From: Dmitriy Lyubimov dlie...@gmail.com
Sent: Saturday, November 29, 2014 11:29 PM
To: user@mahout.apache.org
Subject: Re: DBSCAN implementation in Mahout

No, there is no DBSCAN, OPTICS, or any other density flavor, afaik.

Sent from my phone.
DBSCAN implementation in Mahout
Hello,

I am Chirag Nagpal, a third-year student of Computer Engineering at the University of Pune, India, currently interning at SERC, Indian Institute of Science, Bangalore. My work involves using density-based clustering algorithms like DBSCAN on geo-referenced data such as tweets. Typically the dataset consists of millions of points. I would like to know if there is any MapReduce implementation of DBSCAN available.

Thank you,
Chirag