[ 
https://issues.apache.org/jira/browse/FLINK-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16253784#comment-16253784
 ] 

Greg Hogan commented on FLINK-8041:
-----------------------------------

Currently the vertex value can be any implementation of {{Comparable}}, not 
just {{Long}}. I can see the benefit to assigning vertices a label as with 
{{DataSetUtils#zipWithUniqueId}}. There looks to be a cost or limit on 
scalability with the second proposal, but it would be great to start a 
discussion on a PR.

> Recommended Improvements for Gelly's Connected Components Algorithm
> -------------------------------------------------------------------
>
>                 Key: FLINK-8041
>                 URL: https://issues.apache.org/jira/browse/FLINK-8041
>             Project: Flink
>          Issue Type: Improvement
>          Components: Gelly
>    Affects Versions: 1.3.2
>         Environment: Linux, IntelliJ IDEA
>            Reporter: Christos Hadjinikolis
>            Priority: Minor
>             Fix For: 1.4.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> At the moment, the ConnectedComponents algorithm that comes with Flink's 
> native Graph API (Gelly) has two issues:
> 1. It relies on the user to provide correct values for in the vertices 
> DataSet. Based on how the algorithm works, these values must be of type Long 
> and be unique for every vertex. If the user provides the same values for 
> every vertex (e.g. 1) the algorithm still works but as those values are used 
> for the identification of the different connected components, one will end up 
> with a single connected component and will have no clue as to why this 
> happened. This can be easily fixed in two ways: either by checking that the 
> values that appear alongside vertex-ids are unique and informing the user if 
> not, or by generating those values for every vertex before the algorithm is 
> ran. I have a running implementation of the second way and I really think it 
> is an appropriate solution to this problem.
> 2. Once the connected components are identified, one has to apply additional 
> transformations and actions to find out which is the biggest or the order of 
> the connected components in terms of their size. Alternatively, the algorithm 
> can be implemented so that numerical ids that are given to each component 
> reflect their ranking when ordered based on size, e.g. connected component 1 
> will be the biggest, connected component 2 should be the second biggest and 
> so on.  I have also solved this and I think it would make a nice addition. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to