I'm planning to take up the discussion about the Apex runner's current state and proposed next steps in a separate thread.

Thanks,
Thomas

On Tue, Oct 25, 2016 at 10:32 AM, Amit Sela <amitsel...@gmail.com> wrote:

> SparkRunner status:
>
> V1 (Spark 1.6.x - DStream/RDD API):
> *Batch*: Full model support for batch; a continuous ROS testing setup is in
> process now so that CI will validate constantly.
> *Streaming*: Support for UnboundedSource is in review
> <https://github.com/apache/incubator-beam/pull/1143>; starting to work on
> triggers and accumulation modes now.
>
> V2 (Spark 2.x - Dataset API):
> This is on hold for now, as the Spark 2.0 Dataset API for streaming (AKA
> "Structured Streaming") is marked Alpha.
> In addition, there are still some basic properties in the Dataset API that
> are missing and will be required to properly support Beam:
>
> - Stateful operators.
> - Encoders (Spark's new schema-based coders) optimization support for
>   classes that are a bit more sophisticated than POJOs (generics, inner
>   classes, etc.).
>
> Also waiting to see if watermarks and purging of late/stale data will be
> introduced in 2.1 (currently the Dataset grows indefinitely, which is not
> acceptable for production applications).
> Once this becomes clearer (2.1 release?) I will get back to working on
> this, because in general the Dataset API is preferred as it is actually a
> real unified API for batch and streaming (and the schema-based
> optimizations are also interesting).
>
> I hope this gives a clear view of the SparkRunner status; feel free to ping
> me for more details on the user/dev list or Slack.
>
> Thanks,
> Amit
>
> On Tue, Oct 25, 2016 at 6:57 PM Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
> > I think we might need to update the capability matrix with some of the
> > new features that have popped up. Immediate things that come to mind are:
> > * Timer/State API for user DoFns (coupled with new-style DoFn) (not yet
> >   completely in master)
> > * SplittableDoFn
> >
> > This would allow tracking the progress in each of these for each runner
> > and would not require hunting for that information in email threads.
> >
> > On Tue, 25 Oct 2016 at 08:12 Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> >
> > > +1. For me it's one of the most important points for the new website. We
> > > should give a clear and exhaustive list of what we have, both for
> > > runners and IOs (with supported features).
> > >
> > > Regards
> > > JB
> > >
> > > On Oct 24, 2016, at 21:52, "Ismaël Mejía" <ieme...@gmail.com>
> > > wrote:
> > > >Hello,
> > > >
> > > >I am really happy to see new runners being contributed to our community
> > > >(e.g. GearPump and Apex recently). We have not discussed a lot about
> > > >the current capabilities of both runners.
> > > >
> > > >Following the recent discussion about making ongoing work more explicit
> > > >in the mailing list, I would like to ask the people involved about the
> > > >current status of them. I think it is important to discuss this (apart
> > > >from creating the given JIRAs + updating the capability matrix docs)
> > > >because more people can eventually jump in and give a hand on open
> > > >issues.
> > > >
> > > >I remember there was a Google doc for the capabilities of each runner;
> > > >is this doc still available? (Sorry, I lost the link.) I suppose that
> > > >once these ongoing runners mature we can add this doc also to the
> > > >website.
> > > >https://beam.apache.org/learn/runners/capability-matrix/
> > > >
> > > >Regards,
> > > >Ismaël
> > > >
> > > >ps. @Amit, given that the Spark 2 (Dataset-based) runner also has a
> > > >feature branch, if you consider it worthwhile, can you please share a
> > > >bit about that work too.
> > > >
> > > >ps2. Can anyone please share the link to the Google doc I was talking
> > > >about? I can't find it after the recent changes to the website.