[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2017-02-23 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880786#comment-15880786
 ] 

Xiangrui Meng commented on SPARK-5226:
--

I closed this ticket as "Won't Do" due to DBSCAN's high complexity and hence 
bad scalability as documented in 
http://staff.itee.uq.edu.au/taoyf/paper/sigmod15-dbscan.pdf.

> Add DBSCAN Clustering Algorithm to MLlib
> 
>
> Key: SPARK-5226
> URL: https://issues.apache.org/jira/browse/SPARK-5226
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Muhammad-Ali A'rabi
>Priority: Minor
>  Labels: DBSCAN, clustering
>
> MLlib is all k-means now, and I think we should add some new clustering 
> algorithms to it. First candidate is DBSCAN as I think.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2016-01-07 Thread mustafa elbehery (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15087301#comment-15087301
 ] 

mustafa elbehery commented on SPARK-5226:
-

I have tried to use Aliaksei's implementation on 500MB of GPS Trajectories. The 
algorithm never finished. Though, his implementation worked very well on the 
provided sample data. 

When I have created a scatter plot for both datasets; sample data && 
trajectories data, I found out that his data's distribution was Gaussian, while 
mine was very skewed. Moreover, this implementation has a bottleneck, because 
basically all the partition are merged together in a reduce step, which leads 
turns the algorithm into Serial again !!!.. 

I have commented below a better implementation to avoid this bottleneck, hope 
it would be more helpful.

> Add DBSCAN Clustering Algorithm to MLlib
> 
>
> Key: SPARK-5226
> URL: https://issues.apache.org/jira/browse/SPARK-5226
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Muhammad-Ali A'rabi
>Priority: Minor
>  Labels: DBSCAN, clustering
>
> MLlib is all k-means now, and I think we should add some new clustering 
> algorithms to it. First candidate is DBSCAN as I think.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2016-01-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15086591#comment-15086591
 ] 

Joseph K. Bradley commented on SPARK-5226:
--

Aliaksei did create a package: 
[http://spark-packages.org/package/alitouka/spark_dbscan]

It'd be great to get feedback on it & other implementations.  It may be a 
little while before we have reviewer bandwidth to get dbscan into MLlib itself, 
but a well-tested package would make doing so much easier!

> Add DBSCAN Clustering Algorithm to MLlib
> 
>
> Key: SPARK-5226
> URL: https://issues.apache.org/jira/browse/SPARK-5226
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Muhammad-Ali A'rabi
>Priority: Minor
>  Labels: DBSCAN, clustering
>
> MLlib is all k-means now, and I think we should add some new clustering 
> algorithms to it. First candidate is DBSCAN as I think.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-12-21 Thread mustafa elbehery (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067200#comment-15067200
 ] 

mustafa elbehery commented on SPARK-5226:
-

Better Implementation, based on research paper for parallel DBSCAN can be found 
here. https://github.com/irvingc/dbscan-on-spark .. 

The approach solved bottleneck of reduce step, in which discovered clusters are 
merged.

Hope it helps.

> Add DBSCAN Clustering Algorithm to MLlib
> 
>
> Key: SPARK-5226
> URL: https://issues.apache.org/jira/browse/SPARK-5226
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Muhammad-Ali A'rabi
>Priority: Minor
>  Labels: DBSCAN, clustering
>
> MLlib is all k-means now, and I think we should add some new clustering 
> algorithms to it. First candidate is DBSCAN as I think.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-12-21 Thread Stephen Boesch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067177#comment-15067177
 ] 

Stephen Boesch commented on SPARK-5226:
---

It seems that Aliaksei has not been able to add to spark-packages. So I presume 
DBScan is still an open ticket ?

> Add DBSCAN Clustering Algorithm to MLlib
> 
>
> Key: SPARK-5226
> URL: https://issues.apache.org/jira/browse/SPARK-5226
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Muhammad-Ali A'rabi
>Priority: Minor
>  Labels: DBSCAN, clustering
>
> MLlib is all k-means now, and I think we should add some new clustering 
> algorithms to it. First candidate is DBSCAN as I think.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-11-14 Thread mustafa elbehery (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005355#comment-15005355
 ] 

mustafa elbehery commented on SPARK-5226:
-

Hello, 

I would like to use DBSCAN on spark. [~alitouka] I have tried to use ur 
implementation, on 500 MG of data. However, I think the **Population of 
partition index** step is to expensive. 

Is this implementation is going to be online soon, 

Regards.

> Add DBSCAN Clustering Algorithm to MLlib
> 
>
> Key: SPARK-5226
> URL: https://issues.apache.org/jira/browse/SPARK-5226
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Muhammad-Ali A'rabi
>Priority: Minor
>  Labels: DBSCAN, clustering
>
> MLlib is all k-means now, and I think we should add some new clustering 
> algorithms to it. First candidate is DBSCAN as I think.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-09-01 Thread Rajeev Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14725284#comment-14725284
 ] 

Rajeev Reddy commented on SPARK-5226:
-

Hello Aliaksei Litouka, I have looked into your implementation you are taking 
coordinate points i.e double as input for clustering can you please tell me how 
I can extend this for clustering a set of Text Documents

> Add DBSCAN Clustering Algorithm to MLlib
> 
>
> Key: SPARK-5226
> URL: https://issues.apache.org/jira/browse/SPARK-5226
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Muhammad-Ali A'rabi
>Priority: Minor
>  Labels: DBSCAN, clustering
>
> MLlib is all k-means now, and I think we should add some new clustering 
> algorithms to it. First candidate is DBSCAN as I think.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-02-02 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14301792#comment-14301792
 ] 

Xiangrui Meng commented on SPARK-5226:
--

[~alitouka] Thanks for implementing DBSCAN on top of Spark! I'd like to 
recommend you registering it as a package on http://spark-packages.org, so it 
is more visible to the community.

 Add DBSCAN Clustering Algorithm to MLlib
 

 Key: SPARK-5226
 URL: https://issues.apache.org/jira/browse/SPARK-5226
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Muhammad-Ali A'rabi
Priority: Minor
  Labels: DBSCAN

 MLlib is all k-means now, and I think we should add some new clustering 
 algorithms to it. First candidate is DBSCAN as I think.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-01-31 Thread Aliaksei Litouka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14299902#comment-14299902
 ] 

Aliaksei Litouka commented on SPARK-5226:
-

[~angellandros] , FYI - I attempted to implement DBSCAN algorithm on top of 
Spark. My implementation is limited to Euclidean and Manhattan distance 
measures but maybe you'll find something helpful - 
https://github.com/alitouka/spark_dbscan . 

 Add DBSCAN Clustering Algorithm to MLlib
 

 Key: SPARK-5226
 URL: https://issues.apache.org/jira/browse/SPARK-5226
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Muhammad-Ali A'rabi
Priority: Minor
  Labels: DBSCAN

 MLlib is all k-means now, and I think we should add some new clustering 
 algorithms to it. First candidate is DBSCAN as I think.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-01-26 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292399#comment-14292399
 ] 

Dmitriy Lyubimov commented on SPARK-5226:
-

All attempts to parallelize dbscan in literature lately (or similar DeLiClu 
type of things) i read about include partitioning the task into smaller 
subtasks, solving each on individual level and merging it all back (see MR.Scan 
paper for example). Merging is of course is the new and the tricky thing.

As far as i understand, they all pretty much have limitations to reduce scope 
to euclidean distances  and captitalize on notions of euclidean geometry 
resulting from that, in order to solve partition and merge problems. Which 
substantially reduces attractiveness of general algorithm. However, the naive 
straightforward port of  simple DBScan algorithm is not terribly practical for 
big data because of total complexity of the problem (or impracticality of 
building something like huge distributed R-tree index system on shared-nothing 
programming models).

 Add DBSCAN Clustering Algorithm to MLlib
 

 Key: SPARK-5226
 URL: https://issues.apache.org/jira/browse/SPARK-5226
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Muhammad-Ali A'rabi
Priority: Minor
  Labels: DBSCAN

 MLlib is all k-means now, and I think we should add some new clustering 
 algorithms to it. First candidate is DBSCAN as I think.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-01-24 Thread Muhammad-Ali A'rabi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290523#comment-14290523
 ] 

Muhammad-Ali A'rabi commented on SPARK-5226:


That's right. For very huge data, it won't be a good implementation.
It is O(log n), actually. In preprocessing phase, we created a sorted map or 
something, and with a radius, we can retrieve all points with less distance in 
O(log n).
If we use the first implementation, for each region query we have to calculate 
lots of distances, and some of them are surely calculated before.
We can have both ways implemented, and user may use any of them depending on 
their need.
We can also use vector with norm and use the upper bound. But I don't trust 
this method and have to test it.

 Add DBSCAN Clustering Algorithm to MLlib
 

 Key: SPARK-5226
 URL: https://issues.apache.org/jira/browse/SPARK-5226
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Muhammad-Ali A'rabi
Priority: Minor
  Labels: DBSCAN

 MLlib is all k-means now, and I think we should add some new clustering 
 algorithms to it. First candidate is DBSCAN as I think.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-01-20 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284391#comment-14284391
 ] 

Xiangrui Meng commented on SPARK-5226:
--

The preprocessing step is parallel but very expensive. The distance matrix will 
have n^2 entries, which means it would be hard to handle beyond 1 million 
points. Is it true? In the main process, could you elaborate on how region 
queries are performed, and why it is O(1)?

 Add DBSCAN Clustering Algorithm to MLlib
 

 Key: SPARK-5226
 URL: https://issues.apache.org/jira/browse/SPARK-5226
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Muhammad-Ali A'rabi
Priority: Minor
  Labels: DBSCAN

 MLlib is all k-means now, and I think we should add some new clustering 
 algorithms to it. First candidate is DBSCAN as I think.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-01-15 Thread Muhammad-Ali A'rabi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279170#comment-14279170
 ] 

Muhammad-Ali A'rabi commented on SPARK-5226:


This is DBSCAN algorithm:

{noformat}
DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
  mark P as visited
  NeighborPts = regionQuery(P, eps)
  if sizeof(NeighborPts)  MinPts
 mark P as NOISE
  else
 C = next cluster
 expandCluster(P, NeighborPts, C, eps, MinPts)
  
expandCluster(P, NeighborPts, C, eps, MinPts)
   add P to cluster C
   for each point P' in NeighborPts 
  if P' is not visited
 mark P' as visited
 NeighborPts' = regionQuery(P', eps)
 if sizeof(NeighborPts') = MinPts
NeighborPts = NeighborPts joined with NeighborPts'
  if P' is not yet member of any cluster
 add P' to cluster C
  
regionQuery(P, eps)
   return all points within P's eps-neighborhood (including P)
{noformat}

As you can see, there are just two parameters. There is two ways of 
implementation. First one is faster (O(n log n), and requires more memory 
(O(n^2)). The other way is slower (O(n^2)) and requires less memory (O(n)). But 
I prefer the first one, as we are not short one memory.
There are two phases of running:
* Preprocessing. In this phase a distance matrix for all point is created and 
distances between every two points is calculated. Very parallel.
* Main Process. In this phase the algorithm will run, as described in 
pseudo-code, and two foreach's are parallelized. Region queries are done very 
fast (O(1)), because of preprocessing.

 Add DBSCAN Clustering Algorithm to MLlib
 

 Key: SPARK-5226
 URL: https://issues.apache.org/jira/browse/SPARK-5226
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Muhammad-Ali A'rabi
Priority: Minor
  Labels: DBSCAN

 MLlib is all k-means now, and I think we should add some new clustering 
 algorithms to it. First candidate is DBSCAN as I think.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-01-14 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277932#comment-14277932
 ] 

Xiangrui Meng commented on SPARK-5226:
--

[~angellandros] Before you start coding, could you describe the computation and 
communication cost of a parallel implementation of DBSCAN as well as the 
proposed APIs (parameters, methods)? It would help us decide whether to include 
it in MLlib.

 Add DBSCAN Clustering Algorithm to MLlib
 

 Key: SPARK-5226
 URL: https://issues.apache.org/jira/browse/SPARK-5226
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Muhammad-Ali A'rabi
Priority: Minor
  Labels: DBSCAN

 MLlib is all k-means now, and I think we should add some new clustering 
 algorithms to it. First candidate is DBSCAN as I think.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-01-14 Thread Muhammad-Ali A'rabi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278305#comment-14278305
 ] 

Muhammad-Ali A'rabi commented on SPARK-5226:


Yeah, of course. It will take me a day or two, so I commented to say that I'm 
working on it.

 Add DBSCAN Clustering Algorithm to MLlib
 

 Key: SPARK-5226
 URL: https://issues.apache.org/jira/browse/SPARK-5226
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Muhammad-Ali A'rabi
Priority: Minor
  Labels: DBSCAN

 MLlib is all k-means now, and I think we should add some new clustering 
 algorithms to it. First candidate is DBSCAN as I think.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-01-13 Thread Muhammad-Ali A'rabi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275981#comment-14275981
 ] 

Muhammad-Ali A'rabi commented on SPARK-5226:


Although I can't assign this task to myself, I am interested to do it.

 Add DBSCAN Clustering Algorithm to MLlib
 

 Key: SPARK-5226
 URL: https://issues.apache.org/jira/browse/SPARK-5226
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Muhammad-Ali A'rabi
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org