[jira] [Commented] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-10 Thread shahid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471128#comment-16471128
 ] 

shahid commented on SPARK-24217:


Thanks for the clarification, Joseph K. Bradley.


Is it really required to append the result to the input DataFrame? With the 
existing implementation, I am able to get the desired output with my fix.

 

For example:

    id    neighbor         similarity

    1     [2, 3, 4, 5]     [1.0, 1.0, 1.0, 1.0]
    6     [7, 8, 9, 10]    [1.0, 1.0, 1.0, 1.0]


Output in *spark.ml* (with my fix):

    id    prediction

    1     0
    2     0
    3     0
    4     0
    5     0
    6     1
    7     1
    8     1
    9     1
    10    1
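
The two output shapes under discussion can be sketched in plain Python (not Spark). This is only an illustration: the `assignments` dictionary is hypothetical and simply mirrors the table above, and `append_to_input` and `one_row_per_vertex` are made-up names, not Spark APIs.

```python
# Input rows in the adjacency-list format of the example:
# (id, neighbors, similarities).
rows = [
    (1, [2, 3, 4, 5], [1.0, 1.0, 1.0, 1.0]),
    (6, [7, 8, 9, 10], [1.0, 1.0, 1.0, 1.0]),
]

# Hypothetical per-vertex cluster assignments, copied from the table above.
assignments = {v: 0 for v in range(1, 6)}
assignments.update({v: 1 for v in range(6, 11)})


def append_to_input(rows, assignments):
    """Append a prediction to each input row: one output row per input row."""
    return [(vid, nbrs, sims, assignments[vid]) for vid, nbrs, sims in rows]


def one_row_per_vertex(assignments):
    """Return (id, prediction) pairs for every vertex, sorted by id."""
    return sorted(assignments.items())


appended = append_to_input(rows, assignments)
per_vertex = one_row_per_vertex(assignments)
print(len(appended), len(per_vertex))  # 2 10
```

Appending to the input can only ever produce as many rows as the input has, so vertices that appear solely as neighbors get no row of their own.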

> Power Iteration Clustering is not displaying cluster indices corresponding to 
> some vertices.
> 
>
> Key: SPARK-24217
> URL: https://issues.apache.org/jira/browse/SPARK-24217
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.4.0
> Reporter: shahid
> Priority: Major
> Fix For: 2.4.0
>
>
> We should display the prediction and ID for every node.
> Currently, PIC does not return the cluster indices of neighbour IDs that do
> not appear in the ID column.
> As per the definition of PIC clustering given in the code:
> PIC takes an affinity matrix between items (or vertices) as input. An
> affinity matrix is a symmetric matrix whose entries are non-negative
> similarities between items.
> PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each
> input row includes:
>  * {{idCol}}: vertex ID
>  * {{neighborsCol}}: neighbors of vertex in {{idCol}}
>  * {{similaritiesCol}}: non-negative weights (similarities) of edges between
>    the vertex in {{idCol}} and each neighbor in {{neighborsCol}}
>  * *"PIC returns a cluster assignment for each input vertex."* It appends a
>    new column {{predictionCol}} containing the cluster assignment in
>    {{[0,k)}} for each row (vertex).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-10 Thread spark_user (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471040#comment-16471040
 ] 

spark_user commented on SPARK-24217:


Thanks for the clarification. I am closing the PR.




[jira] [Commented] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470704#comment-16470704
 ] 

Joseph K. Bradley commented on SPARK-24217:
---

On the topic of eating my words, please check out my new comment here: 
[SPARK-15784].  We may need to rework the API.




[jira] [Commented] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-09 Thread spark_user (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16469859#comment-16469859
 ] 

spark_user commented on SPARK-24217:


The behaviour should be the same for both spark.ml and spark.mllib, right?




[jira] [Commented] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-09 Thread spark_user (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16469858#comment-16469858
 ] 

spark_user commented on SPARK-24217:


For the same input, spark.mllib gives a cluster ID for every vertex, but 
spark.ml does not.

For example:

    id    neighbor         similarity

    1     [2, 3, 4, 5]     [1.0, 1.0, 1.0, 1.0]
    6     [7, 8, 9, 10]    [1.0, 1.0, 1.0, 1.0]


Output in spark.ml:

    id    prediction

    1     0
    6     1


Output in spark.mllib:

    id    prediction

    1     0
    2     0
    3     0
    4     0
    5     0
    6     1
    7     1
    8     1
    9     1
    10    1
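
The gap between the two outputs comes down to which vertex set is reported. A minimal plain-Python sketch (no Spark involved; `vertices_in_id_column` and `all_vertices` are illustrative names, not library functions) lists the vertices that appear only as neighbors in the example input:

```python
# Example input in adjacency-list form: (id, neighbors, similarities).
rows = [
    (1, [2, 3, 4, 5], [1.0, 1.0, 1.0, 1.0]),
    (6, [7, 8, 9, 10], [1.0, 1.0, 1.0, 1.0]),
]


def vertices_in_id_column(rows):
    """Vertices that have a row of their own."""
    return {vid for vid, _, _ in rows}


def all_vertices(rows):
    """Every vertex mentioned anywhere, as an id or as a neighbor."""
    seen = set()
    for vid, nbrs, _ in rows:
        seen.add(vid)
        seen.update(nbrs)
    return seen


missing = sorted(all_vertices(rows) - vertices_in_id_column(rows))
print(missing)  # [2, 3, 4, 5, 7, 8, 9, 10]
```

These eight vertices are exactly the ones present in the spark.mllib output and absent from the spark.ml output above.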

 

 




[jira] [Commented] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16469562#comment-16469562
 ] 

Joseph K. Bradley commented on SPARK-24217:
---

But the reason that the IDs are missing from the "id" column is that the input 
is not symmetric.  If it were made symmetric, then there could not be any 
missing IDs.
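
To illustrate this point, here is a minimal plain-Python sketch (not Spark code) that mirrors every edge of an adjacency-list input. The `symmetrize` helper is hypothetical, and it assumes that when both directions of an edge are already given, their weights agree; otherwise the last one seen wins.

```python
from collections import defaultdict


def symmetrize(rows):
    """Return adjacency rows in which every edge (i, j, w) also appears as (j, i, w)."""
    adj = defaultdict(dict)
    for vid, nbrs, sims in rows:
        for nbr, w in zip(nbrs, sims):
            adj[vid][nbr] = w
            adj[nbr][vid] = w  # mirrored edge; assumes no conflicting weights
    return [
        (vid, sorted(adj[vid]), [adj[vid][n] for n in sorted(adj[vid])])
        for vid in sorted(adj)
    ]


rows = [
    (1, [2, 3, 4, 5], [1.0, 1.0, 1.0, 1.0]),
    (6, [7, 8, 9, 10], [1.0, 1.0, 1.0, 1.0]),
]

symmetric_rows = symmetrize(rows)
print([vid for vid, _, _ in symmetric_rows])  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

After mirroring, each former neighbor has its own row (for example, vertex 2 with neighbor 1), so no ID can be missing from the "id" column.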




[jira] [Commented] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-09 Thread spark_user (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16469245#comment-16469245
 ] 

spark_user commented on SPARK-24217:


PIC should return the cluster index of every vertex of the graph, as per the 
definition of PIC, which is also given in the comment in 
PowerIterationClustering.scala in spark.ml.




[jira] [Commented] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-09 Thread spark_user (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16469243#comment-16469243
 ] 

spark_user commented on SPARK-24217:


Thanks for the comment, Joseph K. Bradley.

Actually, the issue is not about the symmetry of the similarity matrix. The 
spark.mllib PIC assigns cluster indices to all the vertices of the similarity 
graph, but spark.ml doesn't return the cluster IDs of vertices that do not 
appear in the ID column.

This is clearly visible in the test cases of both spark.ml and spark.mllib.




[jira] [Commented] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16469230#comment-16469230
 ] 

Joseph K. Bradley commented on SPARK-24217:
---

I don't really think this is a bug.  PIC's documentation says pretty clearly 
that the input data has to represent a symmetric matrix, and this example seems 
to be failing because the input data is invalid.  I do think it could be 
valuable to throw a better error when the input is not symmetric, though we 
should make sure that any check we do for this is not too expensive.
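
As a rough illustration of what such a check could look like, here is a plain-Python sketch. It is not Spark code and not a proposed patch: `find_asymmetric_edges` and `validate_symmetric` are hypothetical names, and weights are compared with exact equality for simplicity.

```python
def find_asymmetric_edges(rows):
    """Return the (i, j) edges whose reverse edge (j, i) is absent or has a different weight."""
    weights = {}
    for vid, nbrs, sims in rows:
        for nbr, w in zip(nbrs, sims):
            weights[(vid, nbr)] = w
    return sorted(
        (i, j) for (i, j), w in weights.items() if weights.get((j, i)) != w
    )


def validate_symmetric(rows):
    """Raise a descriptive error if the input does not describe a symmetric matrix."""
    bad = find_asymmetric_edges(rows)
    if bad:
        raise ValueError(
            f"Affinity matrix is not symmetric: {len(bad)} edge(s) lack a "
            f"matching reverse edge, e.g. {bad[0]}"
        )


rows = [
    (1, [2, 3, 4, 5], [1.0, 1.0, 1.0, 1.0]),
    (6, [7, 8, 9, 10], [1.0, 1.0, 1.0, 1.0]),
]

try:
    validate_symmetric(rows)
except ValueError as err:
    print(err)
```

This sketch makes one pass over the edges and holds them all in a dictionary. On a distributed dataset the equivalent lookup would presumably need the edges to be matched against their reverses across partitions, which is where the cost concern comes in.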
