Christos Hadjinikolis created FLINK-8041:
--------------------------------------------

             Summary: Recommended Improvements for Gelly's Connected Components 
Algorithm
                 Key: FLINK-8041
                 URL: https://issues.apache.org/jira/browse/FLINK-8041
             Project: Flink
          Issue Type: Improvement
          Components: Gelly
    Affects Versions: 1.3.2
         Environment: Linux, IntelliJ IDEA
            Reporter: Christos Hadjinikolis
            Priority: Minor
             Fix For: 1.4.0


At the moment, the ConnectedComponents algorithm that comes with Flink's native 
Graph API (Gelly) has two issues:
1. It relies on the user to provide correct values for in the vertices DataSet. 
Based on how the algorithm works, these values must be of type Long and be 
unique for every vertex. If the user provides the same values for every vertex 
(e.g. 1) the algorithm still works but as those values are used for the 
identification of the different connected components, one will end up with a 
single connected component and will have no clue as to why this happened. This 
can be easily fixed in two ways: either by checking that the values that appear 
alongside vertex-ids are unique and informing the user if not, or by generating 
those values for every vertex before the algorithm is ran. I have a running 
implementation of the second way and I really think it is an appropriate 
solution to this problem.
2. Once the connected components are identified, one has to apply additional 
transformations and actions to find out which is the biggest or the order of 
the connected components in terms of their size. Alternatively, the algorithm 
can be implemented so that numerical ids that are given to each component 
reflect their ranking when ordered based on size, e.g. connected component 1 
will be the biggest, connected component 2 should be the second biggest and so 
on.  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to