[ 
https://issues.apache.org/jira/browse/SPARK-42856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohammadreza Haghpanah updated SPARK-42856:
-------------------------------------------
    Description: 
We are experiencing inconsistent output from the label propagation algorithm in 
Graph X. When we run the algorithm on the same input, we observe different 
outputs each time. This behavior is unexpected since the algorithm is not 
designed to be random, and we should be getting the same output for the same 
input.

We suspect that the issue lies in the tie-breaking logic used by the 
[vertexProgram|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/LabelPropagation.scala#L65]
 when picking labels. Currently, the vertexProgram chooses the label with the 
maximum frequency, and in case of a tie, it selects the label that appears 
first. This logic does not handle tie cases correctly, resulting in different 
outputs for the same input.

{code:scala}
def vertexProgram(vid: VertexId, attr: Long, message: Map[VertexId, Long]): 
VertexId = {  
    if (message.isEmpty) attr else message.maxBy(_._2)._1
}
{code}

To solve this issue, we propose changing the tie-breaking logic to something 
like vertex ID. This change will ensure that the same label is always selected 
in case of a tie, resulting in consistent output from the algorithm for the 
same input.

{code:scala}
def vertexProgram(vid: VertexId, attr: Long, message: Map[VertexId, Long]): 
VertexId = {
    if (message.isEmpty) attr else message.maxBy{ case (key, value) => (value, 
key) }._1
}
{code}

Thanks!

  was:
We are experiencing inconsistent output from the label propagation algorithm in 
Graph X. When we run the algorithm on the same input, we observe different 
outputs each time. This behavior is unexpected since the algorithm is not 
designed to be random, and we should be getting the same output for the same 
input.

We suspect that the issue lies in the tie-breaking logic used by the 
[vertexProgram|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/LabelPropagation.scala#L65]
 when picking labels. Currently, the vertexProgram chooses the label with the 
maximum frequency, and in case of a tie, it selects the label that appears 
first. This logic does not handle tie cases correctly, resulting in different 
outputs for the same input.

To solve this issue, we propose changing the tie-breaking logic to something 
like vertex ID. This change will ensure that the same label is always selected 
in case of a tie, resulting in consistent output from the algorithm for the 
same input.

{code:scala}
def vertexProgram(vid: VertexId, attr: Long, message: Map[VertexId, Long]): 
VertexId = {
  if (message.isEmpty) attr else message.maxBy{ case (key, value) => (value, 
key) }._1
}
{code}



> Inconsistent output from label propagation algorithm in Graph X due to 
> tie-breaking logic in vertexProgram
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-42856
>                 URL: https://issues.apache.org/jira/browse/SPARK-42856
>             Project: Spark
>          Issue Type: Bug
>          Components: GraphX
>    Affects Versions: 1.1.0, 3.3.2
>            Reporter: Mohammadreza Haghpanah
>            Priority: Major
>
> We are experiencing inconsistent output from the label propagation algorithm 
> in Graph X. When we run the algorithm on the same input, we observe different 
> outputs each time. This behavior is unexpected since the algorithm is not 
> designed to be random, and we should be getting the same output for the same 
> input.
> We suspect that the issue lies in the tie-breaking logic used by the 
> [vertexProgram|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/LabelPropagation.scala#L65]
>  when picking labels. Currently, the vertexProgram chooses the label with the 
> maximum frequency, and in case of a tie, it selects the label that appears 
> first. This logic does not handle tie cases correctly, resulting in different 
> outputs for the same input.
> {code:scala}
> def vertexProgram(vid: VertexId, attr: Long, message: Map[VertexId, Long]): 
> VertexId = {  
>     if (message.isEmpty) attr else message.maxBy(_._2)._1
> }
> {code}
> To solve this issue, we propose changing the tie-breaking logic to something 
> like vertex ID. This change will ensure that the same label is always 
> selected in case of a tie, resulting in consistent output from the algorithm 
> for the same input.
> {code:scala}
> def vertexProgram(vid: VertexId, attr: Long, message: Map[VertexId, Long]): 
> VertexId = {
>     if (message.isEmpty) attr else message.maxBy{ case (key, value) => 
> (value, key) }._1
> }
> {code}
> Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to