I like where this is going!

Regarding benchmarking, I think we could do this if we had common benchmarking 
infrastructure and pipelines that regularly run on different Runners so that we 
have up-to-date data.
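
To make that concrete: the numbers Tyler asked for could come from benchmark
pipelines instrumented with the Metrics API, which the shared infrastructure
would then query per Runner after every run. A rough sketch (the class and
metric names are made up, and Metrics support itself still varies by Runner,
which is sort of the point):

    import org.apache.beam.sdk.metrics.Counter;
    import org.apache.beam.sdk.metrics.Distribution;
    import org.apache.beam.sdk.metrics.Metrics;
    import org.apache.beam.sdk.transforms.DoFn;

    /** Hypothetical benchmark step that reports element counts and sizes. */
    class InstrumentedFn extends DoFn<String, String> {
      private final Counter elements =
          Metrics.counter(InstrumentedFn.class, "elementsProcessed");
      private final Distribution sizes =
          Metrics.distribution(InstrumentedFn.class, "elementSizeBytes");

      @ProcessElement
      public void process(ProcessContext c) {
        elements.inc();
        sizes.update(c.element().length());
        c.output(c.element());
      }
    }

The harness would run such a pipeline on each Runner, read the counters back
via PipelineResult.metrics() and combine them with wall-clock time to get
throughput figures that stay up to date.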

I think we can also have a more technical section where we show stats on the 
level of support via the excluded ValidatesRunner tests. This is hard data that 
we have on every Runner and we can annotate it to explain why a certain Runner 
has a given restriction. This is a bit different from what Kenn initially 
suggested but I think we should have both. Plus, this very clearly specifies 
which features are (somewhat) validated to work in a given Runner.
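
To illustrate what that data looks like: a ValidatesRunner test is just a
JUnit test tagged with the ValidatesRunner category, and every Runner's
integration build either runs it or explicitly excludes it (those exclusions,
which live in the Runners' build configuration, are the stats I mean). A
minimal sketch of such a test, with a made-up class name:

    import org.apache.beam.sdk.testing.PAssert;
    import org.apache.beam.sdk.testing.TestPipeline;
    import org.apache.beam.sdk.testing.ValidatesRunner;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.PCollection;
    import org.junit.Rule;
    import org.junit.Test;
    import org.junit.experimental.categories.Category;
    import org.junit.runner.RunWith;
    import org.junit.runners.JUnit4;

    @RunWith(JUnit4.class)
    public class ExampleValidatesRunnerTest {

      @Rule public final transient TestPipeline p = TestPipeline.create();

      // Tagged with ValidatesRunner, so a Runner's integration build picks it
      // up unless the build configuration excludes the category.
      @Test
      @Category(ValidatesRunner.class)
      public void testGlobalCount() {
        PCollection<Long> count =
            p.apply(Create.of("a", "b", "c")).apply(Count.<String>globally());
        PAssert.thatSingleton(count).isEqualTo(3L);
        p.run();
      }
    }

A table of the excluded categories per Runner, annotated with the reason for
each exclusion, could even be generated from the build files.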

Regarding PCollectionView support in Flink, I think this actually works and the 
ValidatesRunner tests pass for this. Not sure what is going on in that test 
case yet. For reference, this is the Issue: 
https://issues.apache.org/jira/browse/BEAM-2806 
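
In case someone wants a minimal reproduction to check against the FlinkRunner,
the pattern in question is roughly the following (just a sketch of the general
side-input pattern, not Mingmin's actual job):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionView;

    public class SideInputExample {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Materialize a PCollection as a singleton side input.
        final PCollectionView<Integer> threshold =
            p.apply("Threshold", Create.of(42)).apply(View.<Integer>asSingleton());

        PCollection<Integer> mainInput = p.apply("Main", Create.of(1, 50, 7, 99));

        mainInput.apply("FilterBySideInput",
            ParDo.of(new DoFn<Integer, Integer>() {
              @ProcessElement
              public void process(ProcessContext c) {
                // Reading the side input is where Runner support matters.
                if (c.element() > c.sideInput(threshold)) {
                  c.output(c.element());
                }
              }
            }).withSideInputs(threshold));

        p.run().waitUntilFinish();
      }
    }

Running it with something like --runner=FlinkRunner --streaming=true (with the
Flink Runner on the classpath) should exercise the streaming side-input path.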

Best,
Aljoscha

> On 23. Aug 2017, at 21:24, Mingmin Xu <mingm...@gmail.com> wrote:
> 
> I would like to have API compatibility testing. AFAIK there's still a gap
> to achieving our goal (one job for any runner), which means developers should
> be aware of the limitations when writing a job. For example, PCollectionView
> is not well supported in FlinkRunner (not quite sure of the current status, as
> my test job is broken) or SparkRunner streaming.
> 
>> 5. Reorganize the windowing section to be just support for merging /
> non-merging windowing.
> sliding/fixed_window/session is more straightforward to me;
> merging/non-merging is more about the backend implementation.
> 
> 
> On Tue, Aug 22, 2017 at 7:28 PM, Kenneth Knowles <k...@google.com.invalid>
> wrote:
> 
>> Oh, I missed
>> 
>> 11. Quantitative properties. This seems like an interesting and important
>> project all on its own. Since Beam is so generic, we need pretty diverse
>> measurements for a user to have a hope of extrapolating to their use case.
>> 
>> Kenn
>> 
>> On Tue, Aug 22, 2017 at 7:22 PM, Kenneth Knowles <k...@google.com> wrote:
>> 
>>> OK, so adding these good ideas to the list:
>>> 
>>> 8. Plain-English summary that comes before the nitty-gritty.
>>> 9. Comment on production readiness from maintainers. Maybe testimonials
>>> are helpful if they can be obtained?
>>> 10. Versioning of all of the above
>>> 
>>> Any more thoughts? I'll summarize in a JIRA in a bit.
>>> 
>>> Kenn
>>> 
>>> On Tue, Aug 22, 2017 at 10:45 AM, Griselda Cuevas
>> <g...@google.com.invalid
>>>> wrote:
>>> 
>>>> Hi, I'd also like to ask if versioning as proposed in BEAM-166 <
>>>> https://issues.apache.org/jira/browse/BEAM-166> is still relevant? If
>> it
>>>> is, would this be something we want to add to this proposal?
>>>> 
>>>> G
>>>> 
>>>> On 21 August 2017 at 08:31, Tyler Akidau <taki...@google.com.invalid>
>>>> wrote:
>>>> 
>>>>> Is there any way we could add quantitative runner metrics to this as
>>>> well?
>>>>> Like by having some benchmarks that process X amount of data, and then
>>>>> detailing in the matrix latency, throughput, and (where possible)
>> cost,
>>>>> etc, numbers for each of the given runners? Semantic support is one
>>>> thing,
>>>>> but there are other differences between runners that aren't captured
>> by
>>>>> just checking feature boxes. I'd be curious if anyone has other ideas
>> in
>>>>> this vein as well. The benchmark idea might not be the best way to go
>>>> about
>>>>> it.
>>>>> 
>>>>> -Tyler
>>>>> 
>>>>> On Sun, Aug 20, 2017 at 9:43 AM Jesse Anderson <
>>>> je...@bigdatainstitute.io>
>>>>> wrote:
>>>>> 
>>>>>> It'd be awesome to see these updated. I'd add two more:
>>>>>> 
>>>>>>   1. A plain English summary of the runner's support in Beam.
>> People
>>>> who
>>>>>>   are new to Beam won't understand the in-depth coverage and need a
>>>>>> general
>>>>>>   idea of how it is supported.
>>>>>>   2. The production readiness of the runner. Does the maintainer
>>>> think
>>>>>>   this runner is production ready?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Sun, Aug 20, 2017 at 8:03 AM Kenneth Knowles
>>>> <k...@google.com.invalid>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> I want to revamp
>>>>>>> https://beam.apache.org/documentation/runners/capability-matrix/
>>>>>>> 
>>>>>>> When Beam first started, we didn't work on feature branches for
>> the
>>>>> core
>>>>>>> runners, and they had a lot more gaps compared to what goes on
>>>> `master`
>>>>>>> today, so this tracked our progress in a way that was easy for
>>>> users to
>>>>>>> read. Now it is still our best/only comparison page for users,
>> but I
>>>>>> think
>>>>>>> we could improve its usefulness.
>>>>>>> 
>>>>>>> For the benefit of the thread, let me inline all the capabilities
>>>> fully
>>>>>>> here:
>>>>>>> 
>>>>>>> ========================
>>>>>>> 
>>>>>>> "What is being computed?"
>>>>>>> - ParDo
>>>>>>> - GroupByKey
>>>>>>> - Flatten
>>>>>>> - Combine
>>>>>>> - Composite Transforms
>>>>>>> - Side Inputs
>>>>>>> - Source API
>>>>>>> - Splittable DoFn
>>>>>>> - Metrics
>>>>>>> - Stateful Processing
>>>>>>> 
>>>>>>> "Where in event time?"
>>>>>>> - Global windows
>>>>>>> - Fixed windows
>>>>>>> - Sliding windows
>>>>>>> - Session windows
>>>>>>> - Custom windows
>>>>>>> - Custom merging windows
>>>>>>> - Timestamp control
>>>>>>> 
>>>>>>> "When in processing time?"
>>>>>>> - Configurable triggering
>>>>>>> - Event-time triggers
>>>>>>> - Processing-time triggers
>>>>>>> - Count triggers
>>>>>>> - [Meta]data driven triggers
>>>>>>> - Composite triggers
>>>>>>> - Allowed lateness
>>>>>>> - Timers
>>>>>>> 
>>>>>>> "How do refinements relate?"
>>>>>>> - Discarding
>>>>>>> - Accumulating
>>>>>>> - Accumulating & Retracting
>>>>>>> 
>>>>>>> ========================
>>>>>>> 
>>>>>>> Here are some issues I'd like to improve:
>>>>>>> 
>>>>>>> - Rows that are impossible to not support (ParDo)
>>>>>>> - Rows where "support" doesn't really make sense (Composite transforms)
>>>>>>> - Rows that are actually the same model feature (non-merging windowfns)
>>>>>>> - Rows that represent optimizations (Combine)
>>>>>>> - Rows that are in the wrong place (Timers)
>>>>>>> - Rows that have not been designed ([Meta]data driven triggers)
>>>>>>> - Rows with names that appear nowhere else (Timestamp control)
>>>>>>> - No place to compare non-model differences between runners
>>>>>>> 
>>>>>>> I'm still pondering how to improve this, but I thought I'd send
>> the
>>>>>> notion
>>>>>>> out for discussion. Some imperfect ideas I've had:
>>>>>>> 
>>>>>>> 1. Lump all the basic stuff (ParDo, GroupByKey, Read, Window) into
>>>> one
>>>>>> row
>>>>>>> 2. Make sections as users see them, like "ParDo" / "side Inputs"
>> not
>>>>>>> "What?" / "side inputs"
>>>>>>> 3. Add rows for non-model things, like portability framework
>>>> support,
>>>>>>> metrics backends, etc
>>>>>>> 4. Drop rows that are not informative, like Composite transforms,
>> or
>>>>> not
>>>>>>> designed
>>>>>>> 5. Reorganize the windowing section to be just support for
>> merging /
>>>>>>> non-merging windowing.
>>>>>>> 6. Switch to a more distinct color scheme than the solid vs faded
>>>>> colors
>>>>>>> currently used.
>>>>>>> 7. Find a web design to get short descriptions into the foreground
>>>> to
>>>>>> make
>>>>>>> it easier to grok.
>>>>>>> 
>>>>>>> These are just a few thoughts, and not necessarily compatible with
>>>> each
>>>>>>> other. What do you think?
>>>>>>> 
>>>>>>> Kenn
>>>>>>> 
>>>>>> --
>>>>>> Thanks,
>>>>>> 
>>>>>> Jesse
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>> 
> 
> 
> 
> -- 
> ----
> Mingmin
