Yes, there's a pending PR https://github.com/apache/beam/pull/3360
Note that there are some shading issues with the Dataflow streaming runner support. It passes ValidatesRunner tests, but will likely fail in a real job :-| I'm working on resolving this with +Kenn Knowles <k...@google.com> right now.

On Wed, Jun 21, 2017 at 2:18 PM peay <p...@protonmail.com> wrote:

> Thanks, great news! Are there plans to get ProcessContinuation from the
> original proposal into the API?
>
> -------- Original Message --------
> Subject: Re: Using watermarks with bounded sources
> Local Time: June 20, 2017 7:52 PM
> UTC Time: June 20, 2017 11:52 PM
> From: kirpic...@google.com
> To: peay <p...@protonmail.com>, k...@google.com <k...@google.com>,
> user@beam.apache.org <user@beam.apache.org>
>
> Hi!
>
> The PR just got submitted. You can play with SDF in the Dataflow streaming
> runner now :) Hope it doesn't get rolled back (fingers crossed)...
>
> On Mon, Jun 19, 2017 at 6:06 PM Eugene Kirpichov <kirpic...@google.com>
> wrote:
>
>> Hi,
>> The PR is ready and I'm just struggling with the setup of tests - Dataflow
>> ValidatesRunner tests currently don't have a streaming execution mode.
>> I think +Kenn Knowles <k...@google.com> was doing something about that,
>> or I might find a workaround.
>>
>> But basically, if you want to experiment: if you patch in the PR, you can
>> experiment with SDF in Dataflow in streaming mode. It passes tests against
>> the current production Dataflow Service.
>>
>> On Thu, Jun 15, 2017 at 8:54 AM peay <p...@protonmail.com> wrote:
>>
>>> Eugene, would you have an ETA on when splittable DoFn will be available
>>> in Dataflow in batch/streaming mode? I see that
>>> https://github.com/apache/beam/pull/1898 is still active.
>>>
>>> I've started to experiment with it using the DirectRunner, and this is
>>> a great API.
>>>
>>> Thanks!
>>>
>>> -------- Original Message --------
>>> Subject: Re: Using watermarks with bounded sources
>>> Local Time: April 23, 2017 10:18 AM
>>> UTC Time: April 23, 2017 2:18 PM
>>> From: p...@protonmail.com
>>> To: Eugene Kirpichov <kirpic...@google.com>,
>>> user@beam.apache.org <user@beam.apache.org>
>>>
>>> Ah, I didn't know about that. This is *really* great -- from a quick
>>> look, the API looks both very natural and very powerful. Thanks a lot for
>>> getting this into Beam!
>>>
>>> I see Flink support seems to have been merged already. Any idea on when
>>> https://github.com/apache/beam/pull/1898 will get merged?
>>>
>>> I see updateWatermark in the API, but not in the proposal's examples,
>>> which only use resume/withFutureOutputWatermark. Is there a reason
>>> why updateWatermark is not called after each output in the examples from
>>> the proposal? I guess it would be too fine-grained to update it for
>>> each individual record of a mini-batch?
>>>
>>> In my case with existing hourly files, would `outputElement(01:00 file),
>>> updateWatermark(01:00), outputElement(02:00), updateWatermark(02:00), ...`
>>> be the proper way to output per-hour elements while gradually moving the
>>> watermark forward as I go through an existing list? Or would you
>>> instead suggest still using resume (potentially with very small timeouts)?
>>>
>>> Thanks,
>>>
>>> -------- Original Message --------
>>> Subject: Re: Using watermarks with bounded sources
>>> Local Time: 22 April 2017 3:59 PM
>>> UTC Time: 22 April 2017 19:59
>>> From: kirpic...@google.com
>>> To: peay <p...@protonmail.com>, user@beam.apache.org <user@beam.apache.org>
>>>
>>> Hi! This is an excellent question; I don't have time to reply in much
>>> detail right now, but please take a look at
>>> http://s.apache.org/splittable-do-fn - it unifies the concepts of
>>> bounded and unbounded sources, and the use case you mentioned is one of the
>>> motivating examples.
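[The per-hour `outputElement` / `updateWatermark` pattern asked about above can be sketched without any Beam APIs. Everything below (`HourlyFile`, the `output` list, the `watermark` field) is an illustrative stand-in, not the actual SDF API; the real proposal's method names and semantics may differ.]

```java
import java.util.ArrayList;
import java.util.List;

// Beam-free sketch: after emitting all records of an hourly file, advance
// a monotone watermark to that file's hour. This mirrors the proposed
// outputElement(...) / updateWatermark(...) call sequence per hour, rather
// than updating the watermark after every individual record.
public class WatermarkSketch {
    static class HourlyFile {
        final long hourMillis;      // hour this file covers (epoch millis)
        final List<String> records;
        HourlyFile(long hourMillis, List<String> records) {
            this.hourMillis = hourMillis;
            this.records = records;
        }
    }

    long watermark = Long.MIN_VALUE;              // current output watermark
    final List<String> output = new ArrayList<>();

    // Process files in time order: emit records, then move the watermark
    // up to the file's hour. Math.max keeps it from ever moving backwards.
    void process(List<HourlyFile> timeOrderedFiles) {
        for (HourlyFile f : timeOrderedFiles) {
            output.addAll(f.records);                       // outputElement(...)
            watermark = Math.max(watermark, f.hourMillis);  // updateWatermark(...)
        }
    }
}
```

The hour granularity here is deliberate: one watermark update per file keeps the bookkeeping coarse, matching the "rough watermark at the granularity of hours" discussed later in the thread.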
>>>
>>> Also, see the recent discussions on pipeline termination semantics:
>>> technically nothing should prevent an unbounded source from saying it's
>>> done "for real" (no new data will appear); it's just that the current
>>> UnboundedSource API does not expose such a method (but Splittable DoFn does).
>>>
>>> On Sat, Apr 22, 2017 at 11:15 AM peay <p...@protonmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> A use case I find myself running into frequently is the following: I
>>>> have daily or hourly files, and a Beam pipeline with small to moderate
>>>> size windows. (Actually, I've just seen that support for per-window files
>>>> in file-based sinks was recently checked in, which is one way to
>>>> get there.)
>>>>
>>>> Now, Beam has no clue about the fact that each file corresponds to a
>>>> given time interval. My understanding is that when running the pipeline in
>>>> batch mode with a bounded source, there is no notion of a watermark, and we
>>>> have to load everything because we just don't know. This is pretty wasteful,
>>>> especially as you have to keep a lot of data in memory, while you could in
>>>> principle operate close to what you'd do in streaming mode: first read the
>>>> oldest files, then the newest files, moving the watermark forward as you go
>>>> through the input list of files.
>>>>
>>>> I see one way around this. Let's say that I have hourly files, and let's
>>>> not assume anything about the order of records within a file, to keep it
>>>> simple: I don't want a very precise record-level watermark, but rather a
>>>> rough watermark at the granularity of hours. Say we can easily get the
>>>> corresponding time interval from the filename. One can make an unbounded
>>>> source that essentially acts as a "list of bounded file-based sources". If
>>>> there are K splits, split k can read every file that has `index % K == k`
>>>> in the time-ordered list of files. `advance` can advance within the current
>>>> file, and move on to the next one if no records were read.
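[The `index % K == k` split assignment described above is easy to sketch in isolation. `filesForSplit` is an illustrative helper, not part of any Beam API; it assumes the file list is already sorted by time.]

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed split assignment: with K splits, split k owns
// every file whose index in the time-ordered list satisfies index % K == k.
// Each file lands in exactly one split, and each split sees its files in
// time order, so each split's watermark can advance monotonically.
public class SplitAssignment {
    static List<String> filesForSplit(List<String> timeOrderedFiles, int k, int K) {
        List<String> mine = new ArrayList<>();
        for (int i = 0; i < timeOrderedFiles.size(); i++) {
            if (i % K == k) {
                mine.add(timeOrderedFiles.get(i));
            }
        }
        return mine;
    }
}
```

Round-robin by index (rather than, say, contiguous ranges) spreads old and new files evenly across splits, so no single split lags far behind and holds back the combined watermark.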
>>>>
>>>> However, as far as I understand, this pipeline will never terminate,
>>>> since this is an unbounded source, and having the `advance` method of our
>>>> wrapping source return `false` won't make the pipeline terminate. Can
>>>> someone confirm if this is correct? If yes, what would be ways to work
>>>> around that? There's always the option to throw to make the pipeline fail,
>>>> but this is far from ideal.
>>>>
>>>> Thanks,
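[The termination problem raised above comes down to `advance()` returning `false` meaning only "no data right now", not "done forever". One convention (hedged: this is a sketch of the idea, not the UnboundedSource API, which as Eugene notes exposes no explicit "done" method) is to signal completion by advancing the watermark to an end-of-time sentinel, which a runner could treat as a termination signal. `DrainingReader`, `END_OF_TIME`, and `isDone` are all hypothetical names.]

```java
import java.util.Iterator;
import java.util.List;

// Beam-free sketch: a reader over a finite, time-ordered file list that
// mimics UnboundedReader.advance() (false = no record available). Once the
// file list is exhausted, it pushes its watermark to an end-of-time
// sentinel, letting an observer distinguish "idle" from "done for real".
public class DrainingReader {
    static final long END_OF_TIME = Long.MAX_VALUE; // sentinel watermark

    private final Iterator<String> files;
    private long watermark = Long.MIN_VALUE;

    DrainingReader(List<String> timeOrderedFiles) {
        this.files = timeOrderedFiles.iterator();
    }

    // false means "no record this time"; on its own, that never terminates
    // the pipeline. The sentinel watermark carries the extra "done" bit.
    boolean advance() {
        if (files.hasNext()) {
            files.next();            // "read" the next file
            return true;
        }
        watermark = END_OF_TIME;     // nothing left: declare ourselves done
        return false;
    }

    boolean isDone() {
        return watermark == END_OF_TIME;
    }
}
```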