Re: Can Connected Components run on a streaming dataset using iterate delta?

Arvid Heise Wed, 26 Feb 2020 06:37:33 -0800

Hi Kant,

there has not been high demand yet and it's always a matter of priority for
the scarce manpower.
I'd probably get inspired by gelly and implement it on DataStream in your
stead.


On Sat, Feb 22, 2020 at 11:23 AM kant kodali <kanth...@gmail.com> wrote:

> Hi,
>
> Thanks for that but Looks like it is already available
> https://github.com/vasia/gelly-streaming in streaming but I wonder why
> this is not part of Flink? there are no releases either.
>
> Thanks!
>
> On Tue, Feb 18, 2020 at 9:13 AM Yun Gao <yungao...@aliyun.com> wrote:
>
>>        Hi Kant,
>>
>>               As far as I know, I think the current example connected
>> components implementation based on DataSet API could not be extended to
>> streaming data or incremental batch directly.
>>
>>               From the algorithm's perspective, if the graph only add
>> edge and never remove edge, I think the connected components should be able
>> to be updated incrementally when the graph changes: When some edges are
>> added, a new search should be started from the sources of the added edges
>> to propagate its component ID. This will trigger a new pass of update of
>> the following vertices, and the updates continues until no vertices'
>> component ID get updated. However, if there are also edge removes, I think
>> the incremental computation should not be easily achieved.
>>
>>               To implement the above logic on Flink, I think currently
>> there should be two possible methods:
>>                     1) Use DataSet API and DataSet iteration, maintains
>> the graph structure and the latest computation result in a storage, and
>> whenever there are enough changes to the graph, submits a new DataSet job
>> to recompute the result. The job should load the edges, the latest
>> component id and whether it is the source of the newly added edges for each
>> graph vertex, and then start the above incremental computation logic.
>>                     2) Flink also provide DataStream iteration API[1]
>> that enables iterating on the unbounded data. In this case the graph
>> modification should be modeled as a datastream, and some operators inside
>> the iteration should maintain the graph structure and current component id.
>> whenever there are enough changes, it starts a new pass of computation.
>>
>>     Best,
>>      Yun
>>
>>         [1] Flink DataStream iteration,
>> https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/datastream_api.html#iterations
>>
>> ------------------------------------------------------------------
>> From:kant kodali <kanth...@gmail.com>
>> Send Time:2020 Feb. 18 (Tue.) 15:11
>> To:user <user@flink.apache.org>
>> Subject:Can Connected Components run on a streaming dataset using iterate
>> delta?
>>
>> Hi All,
>>
>> I am wondering if connected components
>> <https://ci.apache.org/projects/flink/flink-docs-stable/dev/batch/examples.html#connected-components>
>> can run on a streaming data? or say incremental batch?
>>
>> I see that with delta iteration not all vertices need to participate at
>> every iteration which is great but in my case the graph is evolving over
>> time other words new edges are getting added over time. If so, does the
>> algorithm needs to run on the entire graph or can it simply run on the new
>> batch of edges?
>>
>> Finally, What is the best way to compute connected components on Graphs
>> evolving over time? Should I use streaming or batch or any custom
>> incremental approach? Also, the documentation take maxIterations as an
>> input. How can I come up with a good number for max iterations? and once I
>> come up with a good number for max Iterations is the algorithm guaranteed
>> to converge?
>>
>>
>> Thanks,
>> Kant
>>
>>
>>

Re: Can Connected Components run on a streaming dataset using iterate delta?

Reply via email to