Hi,

I think Nexmark (https://github.com/apache/beam/tree/master/sdks/java/nexmark) could help in getting quantitative benchmark metrics for all the runners, as Tyler suggested.

Another thing: the current matrix might be wrong on custom window merging. I think it should be an X for Spark and Gearpump because of the tickets below (though I haven't tested it lately, so the status may have changed):

https://issues.apache.org/jira/browse/BEAM-2759

https://issues.apache.org/jira/browse/BEAM-2499
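
For anyone who isn't sure what that row covers: below is a minimal sketch of a user-defined merging WindowFn in the Java SDK, roughly session-style merging of overlapping interval windows. The class name, the 10-minute gap, and the merge loop are made up for illustration; they are not from the SDK or from those tickets.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.apache.beam.sdk.transforms.windowing.WindowFn;
import org.apache.beam.sdk.transforms.windowing.WindowMappingFn;
import org.joda.time.Duration;

/**
 * A session-style custom merging WindowFn: each element starts in its own
 * [timestamp, timestamp + gap) window, and overlapping windows are merged.
 * This is the kind of thing the "Custom merging windows" row exercises.
 */
public class OverlapMergingWindowFn extends WindowFn<Object, IntervalWindow> {

  private final Duration gap = Duration.standardMinutes(10); // illustrative gap

  @Override
  public Collection<IntervalWindow> assignWindows(AssignContext c) {
    // Every element starts in its own window anchored at its event timestamp.
    return Arrays.asList(new IntervalWindow(c.timestamp(), gap));
  }

  @Override
  public void mergeWindows(MergeContext c) throws Exception {
    // Sort the current windows by start time and merge runs of overlapping windows.
    List<IntervalWindow> sorted = new ArrayList<>(c.windows());
    sorted.sort(Comparator.comparing(IntervalWindow::start));

    List<IntervalWindow> run = new ArrayList<>();
    IntervalWindow union = null;
    for (IntervalWindow w : sorted) {
      if (union != null && union.intersects(w)) {
        run.add(w);
        union = union.span(w);
      } else {
        if (run.size() > 1) {
          c.merge(run, union); // collapse the previous overlapping run
        }
        run = new ArrayList<>(Arrays.asList(w));
        union = w;
      }
    }
    if (run.size() > 1) {
      c.merge(run, union);
    }
  }

  @Override
  public boolean isCompatible(WindowFn<?, ?> other) {
    return other instanceof OverlapMergingWindowFn;
  }

  @Override
  public Coder<IntervalWindow> windowCoder() {
    return IntervalWindow.getCoder();
  }

  @Override
  public WindowMappingFn<IntervalWindow> getDefaultWindowMappingFn() {
    throw new UnsupportedOperationException("Not usable as a side-input window");
  }
}

It would be applied with Window.into(new OverlapMergingWindowFn()), and whether a runner actually executes the mergeWindows step is exactly what the custom merging row is meant to capture.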

But since Kenn suggested grouping all the windowing rows into merging and non-merging sections, maybe this level of detail does not make sense anymore.

Best

Etienne



On 23/08/2017 at 04:28, Kenneth Knowles wrote:
Oh, I missed

11. Quantitative properties. This seems like an interesting and important
project all on its own. Since Beam is so generic, we need pretty diverse
measurements for a user to have a hope of extrapolating to their use case.

Kenn

On Tue, Aug 22, 2017 at 7:22 PM, Kenneth Knowles <k...@google.com> wrote:

OK, so adding these good ideas to the list:

8. Plain-English summary that comes before the nitty-gritty.
9. Comment on production readiness from maintainers. Maybe testimonials
are helpful if they can be obtained?
10. Versioning of all of the above

Any more thoughts? I'll summarize in a JIRA in a bit.

Kenn

On Tue, Aug 22, 2017 at 10:45 AM, Griselda Cuevas <g...@google.com.invalid> wrote:
Hi, I'd also like to ask if versioning as proposed in BEAM-166 <https://issues.apache.org/jira/browse/BEAM-166> is still relevant? If it is, would this be something we want to add to this proposal?

G

On 21 August 2017 at 08:31, Tyler Akidau <taki...@google.com.invalid> wrote:

Is there any way we could add quantitative runner metrics to this as well? Like by having some benchmarks that process X amount of data, and then detailing in the matrix the latency, throughput, and (where possible) cost numbers for each of the given runners? Semantic support is one thing, but there are other differences between runners that aren't captured by just checking feature boxes. I'd be curious if anyone has other ideas in this vein as well. The benchmark idea might not be the best way to go about it.

-Tyler

On Sun, Aug 20, 2017 at 9:43 AM Jesse Anderson <je...@bigdatainstitute.io> wrote:

It'd be awesome to see these updated. I'd add two more:

    1. A plain English summary of the runner's support in Beam. People who are new to Beam won't understand the in-depth coverage and need a general idea of how it is supported.
    2. The production readiness of the runner. Does the maintainer think this runner is production ready?



On Sun, Aug 20, 2017 at 8:03 AM Kenneth Knowles <k...@google.com.invalid> wrote:

Hi all,

I want to revamp
https://beam.apache.org/documentation/runners/capability-matrix/

When Beam first started, we didn't work on feature branches for the core runners, and they had a lot more gaps compared to what goes on `master` today, so this tracked our progress in a way that was easy for users to read. Now it is still our best/only comparison page for users, but I think we could improve its usefulness.

For the benefit of the thread, let me inline all the capabilities fully here:

========================

"What is being computed?"
  - ParDo
  - GroupByKey
  - Flatten
  - Combine
  - Composite Transforms
  - Side Inputs
  - Source API
  - Splittable DoFn
  - Metrics
  - Stateful Processing

"Where in event time?"
  - Global windows
  - Fixed windows
  - Sliding windows
  - Session windows
  - Custom windows
  - Custom merging windows
  - Timestamp control

"When in processing time?"
  - Configurable triggering
  - Event-time triggers
  - Processing-time triggers
  - Count triggers
  - [Meta]data driven triggers
  - Composite triggers
  - Allowed lateness
  - Timers

"How do refinements relate?"
  - Discarding
  - Accumulating
  - Accumulating & Retracting

========================
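
To keep those four categories concrete, here is a hedged sketch of where each one surfaces in a Java SDK pipeline. The one-minute windows, the late firings, the 30-minute allowed lateness, and the windowedCounts helper are illustrative choices, not anything prescribed by the matrix.

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class CapabilityCategoriesDemo {
  /** Counts elements per key, touching all four sections of the matrix. */
  static PCollection<KV<String, Long>> windowedCounts(PCollection<String> events) {
    return events
        // "Where in event time?": fixed one-minute windows.
        .apply(
            Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
                // "When in processing time?": fire at the watermark, then again for late data.
                .triggering(
                    AfterWatermark.pastEndOfWindow()
                        .withLateFirings(
                            AfterProcessingTime.pastFirstElementInPane()
                                .plusDelayOf(Duration.standardMinutes(1))))
                .withAllowedLateness(Duration.standardMinutes(30))
                // "How do refinements relate?": panes accumulate across firings.
                .accumulatingFiredPanes())
        // "What is being computed?": a GroupByKey/Combine under the hood.
        .apply(Count.perElement());
  }
}

Swapping accumulatingFiredPanes() for discardingFiredPanes() is the difference the last section of the matrix tracks.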

Here are some issues I'd like to address:

  - Rows that are impossible to not support (ParDo)
  - Rows where "support" doesn't really make sense (Composite transforms)
  - Rows that are actually the same model feature (non-merging WindowFns)
  - Rows that represent optimizations (Combine)
  - Rows in the wrong place (Timers; see the stateful DoFn sketch after this list)
  - Rows that have not been designed ([Meta]data driven triggers)
  - Rows with names that appear nowhere else (Timestamp control)
  - No place to compare non-model differences between runners
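
On the Timers row in particular, a hedged sketch of why it sits more naturally next to stateful processing than next to triggering: in the Java SDK a timer is declared and set inside a stateful DoFn, alongside the state cells. The buffer/flush names and the one-minute delay below are made up for illustration.

import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

/** Buffers values per key in state and flushes them on an event-time timer. */
public class BufferAndFlushFn extends DoFn<KV<String, Integer>, Integer> {

  @StateId("buffer")
  private final StateSpec<BagState<Integer>> bufferSpec = StateSpecs.bag(VarIntCoder.of());

  @TimerId("flush")
  private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void process(
      ProcessContext c,
      @StateId("buffer") BagState<Integer> buffer,
      @TimerId("flush") Timer flush) {
    buffer.add(c.element().getValue());
    // The timer is set relative to the element's event timestamp.
    flush.set(c.timestamp().plus(Duration.standardMinutes(1)));
  }

  @OnTimer("flush")
  public void onFlush(OnTimerContext c, @StateId("buffer") BagState<Integer> buffer) {
    for (Integer value : buffer.read()) {
      c.output(value);
    }
    buffer.clear();
  }
}

Since the timer fires per key and window right next to the state it drains, it reads as part of the stateful processing story rather than the triggering story.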

I'm still pondering how to improve this, but I thought I'd send the notion out for discussion. Some imperfect ideas I've had:

1. Lump all the basic stuff (ParDo, GroupByKey, Read, Window) into one row
2. Make sections as users see them, like "ParDo" / "Side inputs", not "What?" / "side inputs"
3. Add rows for non-model things, like portability framework support, metrics backends, etc.
4. Drop rows that are not informative, like Composite transforms, or rows not yet designed
5. Reorganize the windowing section to be just support for merging / non-merging windowing
6. Switch to a more distinct color scheme than the solid vs. faded colors currently used
7. Find a web design that gets short descriptions into the foreground, to make it easier to grok

These are just a few thoughts, and not necessarily compatible with each other. What do you think?

Kenn

--
Thanks,

Jesse


