+1 for having plain English feature descriptions.

Nitpick: the capability matrix uses the "~" symbol, the meaning of which is
not entirely clear from the context. I think a legend would be helpful
given things have gone beyond ✘ and ✓.
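
For example, the legend could be generated from the same data that
drives the matrix, so the two can't drift apart. A hypothetical Python
sketch (the `LEGEND` structure, `render_legend`, and the symbol
meanings below are my own guesses, not anything from the Beam site):

```python
# Hypothetical: keep symbols and their meanings in one data structure
# so the legend on the docs page is always generated, never hand-edited.
LEGEND = {
    "\u2713": "fully supported",
    "~": "partially supported (see the cell's notes for caveats)",
    "\u2718": "not supported",
}

def render_legend(legend):
    """Return one plain-text legend line per symbol for the docs page."""
    return [f"{symbol}  {meaning}" for symbol, meaning in legend.items()]

for line in render_legend(LEGEND):
    print(line)
```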

-Stas

On Mon, Aug 28, 2017 at 7:23 PM Lukasz Cwik <lc...@google.com.invalid> wrote:

> I agree with you Aljoscha: a data-driven approach, summarizing which
> features work based on test results and which ones scale based on
> benchmarks, seems like a great way to differentiate runners' strengths.
>
> On Mon, Aug 28, 2017 at 8:39 AM, Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
> > I like where this is going!
> >
> > Regarding benchmarking, I think we could do this if we had common
> > benchmarking infrastructure and pipelines that regularly run on different
> > Runners so that we have up-to-date data.
> >
> > I think we can also have a more technical section where we show stats on
> > the level of support via the excluded ValidatesRunner tests. This is hard
> > data that we have on every Runner and we can annotate it to explain why a
> > certain Runner has a given restriction. This is a bit different from what
> > Kenn initially suggested but I think we should have both. Plus, this very
> > clearly specifies what feature is (somewhat) validated to work in a given
> > Runner.
> >
> > Regarding PCollectionView support in Flink, I think this actually works
> > and the ValidatesRunner tests pass for this. Not sure what is going on in
> > that test case yet. For reference, this is the Issue:
> > https://issues.apache.org/jira/browse/BEAM-2806
> >
> > Best,
> > Aljoscha
> >
> > > On 23. Aug 2017, at 21:24, Mingmin Xu <mingm...@gmail.com> wrote:
> > >
> > > I would like to have API compatibility testing. AFAIK there's still
> > > a gap to achieving our goal (one job for any runner), which means
> > > developers should be aware of the limitations when writing a job.
> > > For example, PCollectionView is not well supported in FlinkRunner
> > > (not quite sure of the current status, as my test job is broken) or
> > > SparkRunner streaming.
> > >
> > >> 5. Reorganize the windowing section to be just support for
> > >> merging / non-merging windowing.
> > > Sliding/fixed/session windows are more straightforward to me;
> > > merging/non-merging is more about the backend implementation.
> > >
> > >
> > > On Tue, Aug 22, 2017 at 7:28 PM, Kenneth Knowles
> > > <k...@google.com.invalid> wrote:
> > >
> > >> Oh, I missed:
> > >>
> > >> 11. Quantitative properties. This seems like an interesting and
> > >> important project all on its own. Since Beam is so generic, we need
> > >> pretty diverse measurements for a user to have a hope of
> > >> extrapolating to their use case.
> > >>
> > >> Kenn
> > >>
> > >> On Tue, Aug 22, 2017 at 7:22 PM, Kenneth Knowles <k...@google.com>
> > >> wrote:
> > >>
> > >>> OK, so adding these good ideas to the list:
> > >>>
> > >>> 8. Plain-English summary that comes before the nitty-gritty.
> > >>> 9. Comment on production readiness from maintainers. Maybe
> > >>> testimonials are helpful if they can be obtained?
> > >>> 10. Versioning of all of the above
> > >>>
> > >>> Any more thoughts? I'll summarize in a JIRA in a bit.
> > >>>
> > >>> Kenn
> > >>>
> > >>> On Tue, Aug 22, 2017 at 10:45 AM, Griselda Cuevas
> > >>> <g...@google.com.invalid> wrote:
> > >>>
> > >>>> Hi, I'd also like to ask if versioning as proposed in BEAM-166
> > >>>> <https://issues.apache.org/jira/browse/BEAM-166> is still
> > >>>> relevant? If it is, would this be something we want to add to
> > >>>> this proposal?
> > >>>>
> > >>>> G
> > >>>>
> > >>>> On 21 August 2017 at 08:31, Tyler Akidau
> > >>>> <taki...@google.com.invalid> wrote:
> > >>>>
> > >>>>> Is there any way we could add quantitative runner metrics to
> > >>>>> this as well? Like by having some benchmarks that process X
> > >>>>> amount of data, and then detailing in the matrix latency,
> > >>>>> throughput, and (where possible) cost, etc., numbers for each of
> > >>>>> the given runners? Semantic support is one thing, but there are
> > >>>>> other differences between runners that aren't captured by just
> > >>>>> checking feature boxes. I'd be curious if anyone has other ideas
> > >>>>> in this vein as well. The benchmark idea might not be the best
> > >>>>> way to go about it.
> > >>>>>
> > >>>>> -Tyler
> > >>>>>
> > >>>>> On Sun, Aug 20, 2017 at 9:43 AM Jesse Anderson
> > >>>>> <je...@bigdatainstitute.io> wrote:
> > >>>>>
> > >>>>>> It'd be awesome to see these updated. I'd add two more:
> > >>>>>>
> > >>>>>>   1. A plain English summary of the runner's support in Beam.
> > >>>>>>   People who are new to Beam won't understand the in-depth
> > >>>>>>   coverage and need a general idea of how it is supported.
> > >>>>>>   2. The production readiness of the runner. Does the
> > >>>>>>   maintainer think this runner is production ready?
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> On Sun, Aug 20, 2017 at 8:03 AM Kenneth Knowles
> > >>>>>> <k...@google.com.invalid> wrote:
> > >>>>>>
> > >>>>>>> Hi all,
> > >>>>>>>
> > >>>>>>> I want to revamp
> > >>>>>>> https://beam.apache.org/documentation/runners/capability-matrix/
> > >>>>>>>
> > >>>>>>> When Beam first started, we didn't work on feature branches
> > >>>>>>> for the core runners, and they had a lot more gaps compared to
> > >>>>>>> what goes on `master` today, so this tracked our progress in a
> > >>>>>>> way that was easy for users to read. Now it is still our
> > >>>>>>> best/only comparison page for users, but I think we could
> > >>>>>>> improve its usefulness.
> > >>>>>>>
> > >>>>>>> For the benefit of the thread, let me inline all the
> > >>>>>>> capabilities fully here:
> > >>>>>>>
> > >>>>>>> ========================
> > >>>>>>>
> > >>>>>>> "What is being computed?"
> > >>>>>>> - ParDo
> > >>>>>>> - GroupByKey
> > >>>>>>> - Flatten
> > >>>>>>> - Combine
> > >>>>>>> - Composite Transforms
> > >>>>>>> - Side Inputs
> > >>>>>>> - Source API
> > >>>>>>> - Splittable DoFn
> > >>>>>>> - Metrics
> > >>>>>>> - Stateful Processing
> > >>>>>>>
> > >>>>>>> "Where in event time?"
> > >>>>>>> - Global windows
> > >>>>>>> - Fixed windows
> > >>>>>>> - Sliding windows
> > >>>>>>> - Session windows
> > >>>>>>> - Custom windows
> > >>>>>>> - Custom merging windows
> > >>>>>>> - Timestamp control
> > >>>>>>>
> > >>>>>>> "When in processing time?"
> > >>>>>>> - Configurable triggering
> > >>>>>>> - Event-time triggers
> > >>>>>>> - Processing-time triggers
> > >>>>>>> - Count triggers
> > >>>>>>> - [Meta]data driven triggers
> > >>>>>>> - Composite triggers
> > >>>>>>> - Allowed lateness
> > >>>>>>> - Timers
> > >>>>>>>
> > >>>>>>> "How do refinements relate?"
> > >>>>>>> - Discarding
> > >>>>>>> - Accumulating
> > >>>>>>> - Accumulating & Retracting
> > >>>>>>>
> > >>>>>>> ========================
> > >>>>>>>
> > >>>>>>> Here are some issues I'd like to improve:
> > >>>>>>>
> > >>>>>>> - Rows that are impossible to not support (ParDo)
> > >>>>>>> - Rows where "support" doesn't really make sense (Composite
> > >>>>>>>   transforms)
> > >>>>>>> - Rows that are actually the same model feature (non-merging
> > >>>>>>>   WindowFns)
> > >>>>>>> - Rows that represent optimizations (Combine)
> > >>>>>>> - Rows in the wrong place (Timers)
> > >>>>>>> - Rows for features that have not been designed yet ([Meta]data
> > >>>>>>>   driven triggers)
> > >>>>>>> - Rows with names that appear nowhere else (Timestamp control)
> > >>>>>>> - No place to compare non-model differences between runners
> > >>>>>>>
> > >>>>>>> I'm still pondering how to improve this, but I thought I'd
> > >>>>>>> send the notion out for discussion. Some imperfect ideas I've
> > >>>>>>> had:
> > >>>>>>>
> > >>>>>>> 1. Lump all the basic stuff (ParDo, GroupByKey, Read, Window)
> > >>>>>>>    into one row.
> > >>>>>>> 2. Make sections as users see them, like "ParDo" / "Side
> > >>>>>>>    Inputs", not "What?" / "side inputs".
> > >>>>>>> 3. Add rows for non-model things, like portability framework
> > >>>>>>>    support, metrics backends, etc.
> > >>>>>>> 4. Drop rows that are not informative, like Composite
> > >>>>>>>    transforms, or rows that are not designed yet.
> > >>>>>>> 5. Reorganize the windowing section to be just support for
> > >>>>>>>    merging / non-merging windowing.
> > >>>>>>> 6. Switch to a more distinct color scheme than the solid vs.
> > >>>>>>>    faded colors currently used.
> > >>>>>>> 7. Find a web design to get short descriptions into the
> > >>>>>>>    foreground to make it easier to grok.
> > >>>>>>>
> > >>>>>>> These are just a few thoughts, and not necessarily compatible
> > >>>>>>> with each other. What do you think?
> > >>>>>>> other. What do you think?
> > >>>>>>>
> > >>>>>>> Kenn
> > >>>>>>>
> > >>>>>> --
> > >>>>>> Thanks,
> > >>>>>>
> > >>>>>> Jesse
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>>
> > >>
> > >
> > >
> > >
> > > --
> > > ----
> > > Mingmin
> >
> >
>
