Hi,

I think Nexmark (https://github.com/apache/beam/tree/master/sdks/java/nexmark) could help in getting quantitative benchmark metrics for all the runners, as Tyler suggested.

Another thing: the current matrix might be wrong on custom window merging. I think it should be an X for Spark and Gearpump because of the tickets below (though I haven't tested it lately, so the status may have changed):

https://issues.apache.org/jira/browse/BEAM-2759

https://issues.apache.org/jira/browse/BEAM-2499
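
For anyone who isn't sure what that row covers: below is a minimal sketch of a user-defined merging WindowFn in the Java SDK, roughly session-style merging of overlapping interval windows. The class name, the 10-minute gap, and the merge loop are made up for illustration; they are not from the SDK or from those tickets.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.apache.beam.sdk.transforms.windowing.WindowFn;
import org.apache.beam.sdk.transforms.windowing.WindowMappingFn;
import org.joda.time.Duration;

/**
 * A session-style custom merging WindowFn: each element starts in its own
 * [timestamp, timestamp + gap) window, and overlapping windows are merged.
 * This is the kind of thing the "Custom merging windows" row exercises.
 */
public class OverlapMergingWindowFn extends WindowFn<Object, IntervalWindow> {

  private final Duration gap = Duration.standardMinutes(10); // illustrative gap

  @Override
  public Collection<IntervalWindow> assignWindows(AssignContext c) {
    // Every element starts in its own window anchored at its event timestamp.
    return Arrays.asList(new IntervalWindow(c.timestamp(), gap));
  }

  @Override
  public void mergeWindows(MergeContext c) throws Exception {
    // Sort the current windows by start time and merge runs of overlapping windows.
    List<IntervalWindow> sorted = new ArrayList<>(c.windows());
    sorted.sort(Comparator.comparing(IntervalWindow::start));

    List<IntervalWindow> run = new ArrayList<>();
    IntervalWindow union = null;
    for (IntervalWindow w : sorted) {
      if (union != null && union.intersects(w)) {
        run.add(w);
        union = union.span(w);
      } else {
        if (run.size() > 1) {
          c.merge(run, union); // collapse the previous overlapping run
        }
        run = new ArrayList<>(Arrays.asList(w));
        union = w;
      }
    }
    if (run.size() > 1) {
      c.merge(run, union);
    }
  }

  @Override
  public boolean isCompatible(WindowFn<?, ?> other) {
    return other instanceof OverlapMergingWindowFn;
  }

  @Override
  public Coder<IntervalWindow> windowCoder() {
    return IntervalWindow.getCoder();
  }

  @Override
  public WindowMappingFn<IntervalWindow> getDefaultWindowMappingFn() {
    throw new UnsupportedOperationException("Not usable as a side-input window");
  }
}

It would be applied with Window.into(new OverlapMergingWindowFn()), and whether a runner actually executes the mergeWindows step is exactly what the custom merging row is meant to capture.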

But since Kenn suggested grouping all the windowing rows into merging and non-merging sections, maybe this level of detail does not make sense anymore.

Best

Etienne



On 23/08/2017 at 04:28, Kenneth Knowles wrote:
Oh, I missed

11. Quantitative properties. This seems like an interesting and important
project all on its own. Since Beam is so generic, we need pretty diverse
measurements for a user to have a hope of extrapolating to their use case.

Kenn

On Tue, Aug 22, 2017 at 7:22 PM, Kenneth Knowles <k...@google.com> wrote:

OK, so adding these good ideas to the list:

8. Plain-English summary that comes before the nitty-gritty.
9. Comment on production readiness from maintainers. Maybe testimonials
are helpful if they can be obtained?
10. Versioning of all of the above

Any more thoughts? I'll summarize in a JIRA in a bit.

Kenn

On Tue, Aug 22, 2017 at 10:45 AM, Griselda Cuevas <g...@google.com.invalid> wrote:
Hi, I'd also like to ask if versioning as proposed in BEAM-166 <https://issues.apache.org/jira/browse/BEAM-166> is still relevant? If it is, would this be something we want to add to this proposal?

G

On 21 August 2017 at 08:31, Tyler Akidau <taki...@google.com.invalid> wrote:

Is there any way we could add quantitative runner metrics to this as well? Like by having some benchmarks that process X amount of data, and then detailing in the matrix the latency, throughput, and (where possible) cost numbers for each of the given runners? Semantic support is one thing, but there are other differences between runners that aren't captured by just checking feature boxes. I'd be curious if anyone has other ideas in this vein as well. The benchmark idea might not be the best way to go about it.

-Tyler

On Sun, Aug 20, 2017 at 9:43 AM Jesse Anderson <je...@bigdatainstitute.io> wrote:

It'd be awesome to see these updated. I'd add two more:

    1. A plain English summary of the runner's support in Beam. People who are new to Beam won't understand the in-depth coverage and need a general idea of how it is supported.
    2. The production readiness of the runner. Does the maintainer think this runner is production ready?



On Sun, Aug 20, 2017 at 8:03 AM Kenneth Knowles <k...@google.com.invalid> wrote:

Hi all,

I want to revamp
https://beam.apache.org/documentation/runners/capability-matrix/

When Beam first started, we didn't work on feature branches for the core runners, and they had a lot more gaps compared to what goes on `master` today, so this tracked our progress in a way that was easy for users to read. Now it is still our best/only comparison page for users, but I think we could improve its usefulness.

For the benefit of the thread, let me inline all the capabilities fully here:

========================

"What is being computed?"
  - ParDo
  - GroupByKey
  - Flatten
  - Combine
  - Composite Transforms
  - Side Inputs
  - Source API
  - Splittable DoFn
  - Metrics
  - Stateful Processing

"Where in event time?"
  - Global windows
  - Fixed windows
  - Sliding windows
  - Session windows
  - Custom windows
  - Custom merging windows
  - Timestamp control

"When in processing time?"
  - Configurable triggering
  - Event-time triggers
  - Processing-time triggers
  - Count triggers
  - [Meta]data driven triggers
  - Composite triggers
  - Allowed lateness
  - Timers

"How do refinements relate?"
  - Discarding
  - Accumulating
  - Accumulating & Retracting

========================
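
To keep those four categories concrete, here is a hedged sketch of where each one surfaces in a Java SDK pipeline. The one-minute windows, the late firings, the 30-minute allowed lateness, and the windowedCounts helper are illustrative choices, not anything prescribed by the matrix.

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class CapabilityCategoriesDemo {
  /** Counts elements per key, touching all four sections of the matrix. */
  static PCollection<KV<String, Long>> windowedCounts(PCollection<String> events) {
    return events
        // "Where in event time?": fixed one-minute windows.
        .apply(
            Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
                // "When in processing time?": fire at the watermark, then again for late data.
                .triggering(
                    AfterWatermark.pastEndOfWindow()
                        .withLateFirings(
                            AfterProcessingTime.pastFirstElementInPane()
                                .plusDelayOf(Duration.standardMinutes(1))))
                .withAllowedLateness(Duration.standardMinutes(30))
                // "How do refinements relate?": panes accumulate across firings.
                .accumulatingFiredPanes())
        // "What is being computed?": a GroupByKey/Combine under the hood.
        .apply(Count.perElement());
  }
}

Swapping accumulatingFiredPanes() for discardingFiredPanes() is the difference the last section of the matrix tracks.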

Here are some issues I'd like to address:

  - Rows that are impossible to not support (ParDo)
  - Rows where "support" doesn't really make sense (Composite transforms)
  - Rows that are actually the same model feature (non-merging WindowFns)
  - Rows that represent optimizations (Combine)
  - Rows in the wrong place (Timers; see the stateful DoFn sketch after this list)
  - Rows that have not been designed ([Meta]data driven triggers)
  - Rows with names that appear nowhere else (Timestamp control)
  - No place to compare non-model differences between runners
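
On the Timers row in particular, a hedged sketch of why it sits more naturally next to stateful processing than next to triggering: in the Java SDK a timer is declared and set inside a stateful DoFn, alongside the state cells. The buffer/flush names and the one-minute delay below are made up for illustration.

import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

/** Buffers values per key in state and flushes them on an event-time timer. */
public class BufferAndFlushFn extends DoFn<KV<String, Integer>, Integer> {

  @StateId("buffer")
  private final StateSpec<BagState<Integer>> bufferSpec = StateSpecs.bag(VarIntCoder.of());

  @TimerId("flush")
  private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void process(
      ProcessContext c,
      @StateId("buffer") BagState<Integer> buffer,
      @TimerId("flush") Timer flush) {
    buffer.add(c.element().getValue());
    // The timer is set relative to the element's event timestamp.
    flush.set(c.timestamp().plus(Duration.standardMinutes(1)));
  }

  @OnTimer("flush")
  public void onFlush(OnTimerContext c, @StateId("buffer") BagState<Integer> buffer) {
    for (Integer value : buffer.read()) {
      c.output(value);
    }
    buffer.clear();
  }
}

Since the timer fires per key and window right next to the state it drains, it reads as part of the stateful processing story rather than the triggering story.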

I'm still pondering how to improve this, but I thought I'd send the notion out for discussion. Some imperfect ideas I've had:

1. Lump all the basic stuff (ParDo, GroupByKey, Read, Window) into one row
2. Make sections as users see them, like "ParDo" / "Side inputs", not "What?" / "side inputs"
3. Add rows for non-model things, like portability framework support, metrics backends, etc.
4. Drop rows that are not informative, like Composite transforms, or rows not yet designed
5. Reorganize the windowing section to be just support for merging / non-merging windowing
6. Switch to a more distinct color scheme than the solid vs. faded colors currently used
7. Find a web design that gets short descriptions into the foreground, to make it easier to grok

These are just a few thoughts, and not necessarily compatible with each other. What do you think?

Kenn

--
Thanks,

Jesse


