[ https://issues.apache.org/jira/browse/FLINK-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Konstantin Knauf updated FLINK-8041: ------------------------------------ Labels: (was: auto-closed) Priority: Not a Priority (was: Minor) > Recommended Improvements for Gelly's Connected Components Algorithm > ------------------------------------------------------------------- > > Key: FLINK-8041 > URL: https://issues.apache.org/jira/browse/FLINK-8041 > Project: Flink > Issue Type: Improvement > Components: Library / Graph Processing (Gelly) > Affects Versions: 1.3.2 > Environment: Linux, IntelliJ IDEA > Reporter: Christos Hadjinikolis > Priority: Not a Priority > Original Estimate: 336h > Remaining Estimate: 336h > > At the moment, the ConnectedComponents algorithm that comes with Flink's > native Graph API (Gelly) has two issues: > 1. It relies on the user to provide correct values for in the vertices > DataSet. Based on how the algorithm works, these values must be of type Long > and be unique for every vertex. If the user provides the same values for > every vertex (e.g. 1) the algorithm still works but as those values are used > for the identification of the different connected components, one will end up > with a single connected component and will have no clue as to why this > happened. This can be easily fixed in two ways: either by checking that the > values that appear alongside vertex-ids are unique and informing the user if > not, or by generating those values for every vertex before the algorithm is > ran. I have a running implementation of the second way and I really think it > is an appropriate solution to this problem. > 2. Once the connected components are identified, one has to apply additional > transformations and actions to find out which is the biggest or the order of > the connected components in terms of their size. Alternatively, the algorithm > can be implemented so that numerical ids that are given to each component > reflect their ranking when ordered based on size, e.g. connected component 1 > will be the biggest, connected component 2 should be the second biggest and > so on. I have also solved this and I think it would make a nice addition. -- This message was sent by Atlassian Jira (v8.3.4#803005)