It looks like you want to take a PCollection of lists of items of the same
type (but not necessarily of the same length - in your example you pad them
to the same length but that's unnecessary), induce an undirected graph on
them where there's an edge between XS and YS if they have an element in
common*, and compute connected components on that graph.

*you can make the problem somewhat simpler if for each list you also add
nodes for its individual elements + edges from element to list. Then the
total size of the graph increases only linearly, and the connected
components are the same, but the number of edges is no longer quadratic in
the worst case.

This looks like a really specialized use case (I've never seen a
sufficiently similar problem in my career) so contributing it to the Beam
SDK might not be the best way to go, unless more people chime in that
they'd find it useful.

Unfortunately it is also likely not possible to implement in a scalable way
using Beam primitives, because Beam does not yet support iterative
computations, and computing connected components provably requires at least
O(log N) iterations. It is also easy to prove that your original problem
can not be solved faster: any connected components problem can be reduced
to yours by creating a PCollection with 1 element per edge, where the
element is {source, target}.

If you can elaborate why you think you need this algorithm, the community
might help you find a different way to accomplish the original task.

On Fri, Jun 7, 2019 at 12:14 PM Jan Lukavský <[email protected]> wrote:

> Hi,
>
> that sounds interesting, but it seems to be computationally intensive
> and might not be well scalable, if I understand it correctly. It looks
> like it needs a transitive closure, am I right?
>
>   Jan
>
> On 6/7/19 11:17 AM, i.am.moai wrote:
> > Hello everyone, nice to meet you
> >
> > I am Naoki Hyu(日宇尚記). a developer live in Tokyo. I often use scala and
> > python as my favorite language .
> >
> > I have no experience with OSS development, but as I use DataFlow at
> > work, I want to contribute to the development of Beam.
> >
> > In fact, there is a feature I want to develop, and now I have the
> > source code on my local PC.
> >
> > The feature I want to create is an extension of GroupBy to a multiple
> > key, which realizes more complex grouping.
> >
> > https://issues.apache.org/jira/browse/BEAM-7358
> >
> > Everyone, could you give me an opinion on this intent?
> >
>

Reply via email to