Hi Jan, This is a valuable but tricky problem. As you notice, it is an issue with watermarks. If you have a cycle, it may be convergent or not convergent (aka an infinite loop). Since a watermark measures completeness, it should advance only up to some event time where the cycle has converged. If your back edge topic has a watermark that advances - perhaps because you only write future-timestamped data to it - then your pipeline will make progress. Otherwise, there needs to be a separate "nested" watermark (or other notion of convergence) for the cycle, which does not exist in Beam. The most relevant work I know of is Naiad, which supports a nested logical clock for this very purpose.
Kenn On Wed, Jan 2, 2019 at 6:28 AM <[email protected]> wrote: > Hi all, > > In the documentation it says that a pipeline forms a DAG. > For my project however I need to introduce a loop (feed back some results > to an earlier phase of the pipeline). > For now I have solved this using a separate kafka topic, from which I read > somewhere in the beginning of the pipeline, and only write to . > > Using an extra kafka topic of course breaks the flow of eventtime and > watermarks. Also I'm achieving at-least-once behavior iso exactly-once. > For my application these behaviors are acceptable (although it could > perhaps come in handy if I would keep control over the flow of watermarks, > but it's not absolutely needed). > > Is there a better (more efficient) way to achieve this? > > I was thinking of creating a 'pipe'. > > interface Pipe<T> { > PTransform<PBegin, PCollection<T>> getSource(); > PTransform<PCollection<T>, PDone> getSink(); > } > > Pipe<T> createPipe(String name); > > the source and sink should be internally connected and participate > correctly in the snapshotting flow. (not sure yet about the required > details for this so that it is correct, any help/pointers would be > appreciated!) > > However, before I embark on this effort I would like to hear your thoughts > about it. > > Thanks, kind regards, > Jan > >
