[ https://issues.apache.org/jira/browse/SPARK-42856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mohammadreza Haghpanah updated SPARK-42856: ------------------------------------------- Description: We are experiencing inconsistent output from the label propagation algorithm in Graph X. When we run the algorithm on the same input, we observe different outputs each time. This behavior is unexpected since the algorithm is not designed to be random, and we should be getting the same output for the same input. We suspect that the issue lies in the tie-breaking logic used by the [vertexProgram|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/LabelPropagation.scala#L65] when picking labels. Currently, the vertexProgram chooses the label with the maximum frequency, and in case of a tie, it selects the label that appears first. This logic does not handle tie cases correctly, resulting in different outputs for the same input. {code:scala} def vertexProgram(vid: VertexId, attr: Long, message: Map[VertexId, Long]): VertexId = { if (message.isEmpty) attr else message.maxBy(_._2)._1 } {code} To solve this issue, we propose changing the tie-breaking logic to something like vertex ID. This change will ensure that the same label is always selected in case of a tie, resulting in consistent output from the algorithm for the same input. {code:scala} def vertexProgram(vid: VertexId, attr: Long, message: Map[VertexId, Long]): VertexId = { if (message.isEmpty) attr else message.maxBy{ case (key, value) => (value, key) }._1 } {code} Thanks! was: We are experiencing inconsistent output from the label propagation algorithm in Graph X. When we run the algorithm on the same input, we observe different outputs each time. This behavior is unexpected since the algorithm is not designed to be random, and we should be getting the same output for the same input. We suspect that the issue lies in the tie-breaking logic used by the [vertexProgram|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/LabelPropagation.scala#L65] when picking labels. Currently, the vertexProgram chooses the label with the maximum frequency, and in case of a tie, it selects the label that appears first. This logic does not handle tie cases correctly, resulting in different outputs for the same input. To solve this issue, we propose changing the tie-breaking logic to something like vertex ID. This change will ensure that the same label is always selected in case of a tie, resulting in consistent output from the algorithm for the same input. {code:scala} def vertexProgram(vid: VertexId, attr: Long, message: Map[VertexId, Long]): VertexId = { if (message.isEmpty) attr else message.maxBy{ case (key, value) => (value, key) }._1 } {code} > Inconsistent output from label propagation algorithm in Graph X due to > tie-breaking logic in vertexProgram > ---------------------------------------------------------------------------------------------------------- > > Key: SPARK-42856 > URL: https://issues.apache.org/jira/browse/SPARK-42856 > Project: Spark > Issue Type: Bug > Components: GraphX > Affects Versions: 1.1.0, 3.3.2 > Reporter: Mohammadreza Haghpanah > Priority: Major > > We are experiencing inconsistent output from the label propagation algorithm > in Graph X. When we run the algorithm on the same input, we observe different > outputs each time. This behavior is unexpected since the algorithm is not > designed to be random, and we should be getting the same output for the same > input. > We suspect that the issue lies in the tie-breaking logic used by the > [vertexProgram|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/LabelPropagation.scala#L65] > when picking labels. Currently, the vertexProgram chooses the label with the > maximum frequency, and in case of a tie, it selects the label that appears > first. This logic does not handle tie cases correctly, resulting in different > outputs for the same input. > {code:scala} > def vertexProgram(vid: VertexId, attr: Long, message: Map[VertexId, Long]): > VertexId = { > if (message.isEmpty) attr else message.maxBy(_._2)._1 > } > {code} > To solve this issue, we propose changing the tie-breaking logic to something > like vertex ID. This change will ensure that the same label is always > selected in case of a tie, resulting in consistent output from the algorithm > for the same input. > {code:scala} > def vertexProgram(vid: VertexId, attr: Long, message: Map[VertexId, Long]): > VertexId = { > if (message.isEmpty) attr else message.maxBy{ case (key, value) => > (value, key) }._1 > } > {code} > Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org