Re: [DISCUSS] Beam 3.0: Paving the Path to the Next Generation Data Processing Framework

Valentyn Tymofieiev via dev Thu, 22 Aug 2024 14:07:54 -0700

>  Key to this will be a push to producing/consuming structured data (as
has been mentioned) and also well-structured,
language-agnostic configuration.


> Unstructured data (aka "everything is bytes with coders") is overrated
and should be an exception not the default. Structured data everywhere,
with specialized bytes columns.

+1.

I am seeing a tendency in distributed data processing engines to heavily
recommend and use relational APIs to express data-processing cases on
structured data. For example:

Flink has introduced the Table API:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/tableapi/

Spark has recently evolved their Dataframe API into a language-agnostic
portability layer:
https://spark.apache.org/docs/latest/spark-connect-overview.html
Some lesser-known and more recent data processing engines also offer a
subset of Dataframe or SQL APIs, or a Dataframe API that is later
translated into SQL.

In contrast, in Beam, the SQL and Dataframe APIs are more limited add-ons,
natively available in the Java and Python SDKs respectively. It might be
worthwhile to consider whether introducing a first-class relational API
would make sense in Beam 3, and how it would impact Beam's cross-runner
portability story.
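
To make the contrast between "everything is bytes with coders" and
structured data concrete, here is a minimal sketch in plain Python. It does
not use any Beam, Flink, or Spark API; the Purchase record, the JSON-based
encode/decode "coder", and the total_by helper are all hypothetical names
chosen for illustration. It only shows why a relational-style operation is
easy to express by field name on schema'd rows, while opaque bytes must be
decoded first, and how a single bytes column can remain the exception.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Purchase(NamedTuple):
    """Hypothetical schema'd record; `receipt` is the one opaque bytes column."""
    user: str
    amount: float
    receipt: bytes


def encode(p: Purchase) -> bytes:
    """'Everything is bytes' style: the structure is hidden behind a coder."""
    payload = {"user": p.user, "amount": p.amount, "receipt": p.receipt.hex()}
    return json.dumps(payload).encode("utf-8")


def decode(b: bytes) -> Purchase:
    """Inverse of encode(); needed before any field can be inspected."""
    d = json.loads(b.decode("utf-8"))
    return Purchase(d["user"], d["amount"], bytes.fromhex(d["receipt"]))


def total_by(rows: Iterable[Purchase], key: str, value: str) -> Dict[str, float]:
    """Relational-style aggregation addressed by field name, roughly
    SELECT key, SUM(value) ... GROUP BY key."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        totals[getattr(row, key)] += getattr(row, value)
    return dict(totals)


rows = [
    Purchase("ann", 3.0, b"\x01"),
    Purchase("bob", 5.0, b"\x02"),
    Purchase("ann", 4.5, b"\x03"),
]

# Structured path: the aggregation is expressed directly on named fields.
totals_from_rows = total_by(rows, "user", "amount")

# Bytes path: every element has to go through the coder before it is usable.
opaque = [encode(r) for r in rows]
totals_from_bytes = total_by((decode(b) for b in opaque), "user", "amount")
```

Both paths produce the same totals. The difference is that in the bytes
path nothing outside the coder can see the fields, which is the kind of
information a relational API or a cross-language transform would need.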

On Thu, Aug 22, 2024 at 12:21 PM Robert Bradshaw via dev <
dev@beam.apache.org> wrote:

> Echoing many of the comments here, but organizing them under a single
> theme, I would say a good focus for Beam 3.0 could be centering around
> being more "transform-centric." Specifically:
>
> - Make it easy to mix and match transforms across pipelines and
> environments (SDKs). Key to this will be a push to producing/consuming
> structured data (as has been mentioned) and also well-structured,
> language-agnostic configuration.
> - Better encapsulation for transforms. The main culprit here is update
> compatibility, but there may be other issues as well. Let's try to
> actually solve that for both primitives and composites.
> - Somewhat related to the above, I would love to actually solve the
> early/late output issue, and I think retractions and sink triggers are
> powerful paradigms we could develop to actually solve this issue in a
> novel way.
> - Continue to refine the idea of "best practices." This includes the
> points above, as well as things like robust error handling,
> monitoring, etc.
>
> Once we have these in place we are in a position to offer a powerful
> catalogue of easy-to-use, well-focused transforms, both first and
> third party.
>
> Note everything here can be backwards compatible. As a concrete
> milestone for when we "reach" 3.0 I would say that our core set of
> transforms have been updated to all reflect best practices (by
> default?) and we have a way for third parties to also publish such
> transforms.
>
> (One more bullet point, I would love to see us complete the migration
> to 100% portable runners, including local runners, which will help
> with the testing and development story, but will also be key to making
> the above vision work.)
>
> On Thu, Aug 22, 2024 at 8:00 AM Kenneth Knowles <k...@apache.org> wrote:
> >
> > I think this is a good idea. Fun fact - I think the first time we talked
> about "3.0" was 2018.
> >
> > I don't want to break users with 3.0 TBH, despite that being what a
> major version bump suggests. But I also don't want a triple-digit minor
> version. I think 3.0 is worthwhile if we have a new emphasis that is very
> meaningful to users and contributors.
> >
> >
> > A couple things I would say from experience with 2.0:
> >
> >  - A lot of new model features are dropped before completion. Can we
> make it easier to evolve? Maybe not, since in a way it is our "instruction
> set".
> >
> >  - Transforms that provide straightforward functionality have a big
> impact: RunInference, IOs, etc. I like that these get more discussion now,
> whereas early in the project a lot of focus was on primitives and runners.
> >
> >  - Integrations like YAML (and there will be plenty more I'm sure) that
> rely on transforms as true no-code black boxes with non-UDF configuration
> seem like the next step in abstraction and ease of use.
> >
> >  - Update compatibility needs, which break through all our abstractions,
> have blocked innovative changes and UX improvements, and had a chilling
> effect on refactoring and the things that make software continue to
> approach Quality.
> >
> >
> > And a few ideas I have about the future of the space, agreeing with XQ
> and Jan
> >
> >  - Unstructured data (aka "everything is bytes with coders") is
> overrated and should be an exception not the default. Structured data
> everywhere, with specialized bytes columns. We can make small steps in this
> direction (and we are already).
> >
> >  - Triggers are really not a great construct. "Sink triggers" map better
> to use cases but how to implement them is a long adventure. But we really
> can't live without *something* to manage early output / late input, and the
> options in all other systems I am aware of are even worse.
> >
> > And a last thought is that we shouldn't continue to work on last
> decade's problems, if we can avoid it. Maybe there is a core to Beam that
> is imperfect but good enough (unification of batch & streaming; integration
> of many languages; core primitives that apply to any engine capable of
> handling our use cases) and what we want to do is focus on what we can
> build on top of it. I think this is implied by everything in this thread so
> far but I just wanted to say it explicitly.
> >
> > Kenn
> >
> > On Tue, Aug 20, 2024 at 9:03 AM Jan Lukavský <je...@seznam.cz> wrote:
> >>
> >> Formatting and coloring. :)
> >>
> >> ----
> >>
> >> Hi XQ,
> >>
> >> thanks for starting this discussion!
> >>
> >> I agree we are getting to a point where discussing a major update of
> Apache Beam might be a good idea. Because such a window of opportunity
> opens only once in (quite many) years, I think we should try to use our
> current experience with the Beam model itself and check if there is any room
> for improvement there. First of all, we have some parts of the model itself
> that are not implemented in Beam 2.0, e.g. retractions. Second, there are
> parts that are known to be error-prone, e.g. triggers. Another topic is
> features that are missing in the current model, e.g. iterations (yes, I
> know, general iterations might not even be possible, but it seems we can
> create reasonable constraints for them to work for the cases that really
> matter). Last but not least, we might want to rethink how we structure
> transforms, because that has a direct impact on how expensive it is to
> implement a new runner (GBK/Combine vs stateful ParDo).
> >>
> >> Having said that, my suggestion would be to take a higher-level look
> first, define which parts of the model are battle-tested enough that we trust
> them as a definite part of the 3.0 model, question all the others, and then
> iterate over this to come up with a new proposal for the model, with a focus
> on what you emphasize: use cases, user-friendly APIs, and concepts that
> contain as little unexpected behavior as possible. A key part of this should
> be a discussion about how we position Beam on the market. Simplicity and
> correctness should be the key points, because practice shows people tend to
> misunderstand the streaming concepts (which is absolutely understandable!).
> >>
> >> Best,
> >>
> >>  Jan
> >>
> >> On 8/20/24 14:38, Jan Lukavský wrote:
> >>
> >> Hi XQ,
> >>
> >> thanks for starting this discussion!
> >>
> >> I agree we are getting to a point where discussing a major update of
> Apache Beam might be a good idea. Because such a window of opportunity
> opens only once in (quite many) years, I think we should try to use our
> current experience with the Beam model itself and check if there is any room
> for improvement there. First of all, we have some parts of the model itself
> that are not implemented in Beam 2.0, e.g. retractions. Second, there are
> parts that are known to be error-prone, e.g. triggers. Another topic is
> features that are missing in the current model, e.g. iterations (yes, I
> know, general iterations might not even be possible, but it seems we can
> create reasonable constraints for them to work for the cases that really
> matter). Last but not least, we might want to rethink how we structure
> transforms, because that has a direct impact on how expensive it is to
> implement a new runner (GBK/Combine vs stateful ParDo).
> >>
> >> Having said that, my suggestion would be to take a higher-level look
> first, define which parts of the model are battle-tested enough that we trust
> them as a definite part of the 3.0 model, question all the others, and then
> iterate over this to come up with a new proposal for the model, with a focus
> on what you emphasize: use cases, user-friendly APIs, and concepts that
> contain as little unexpected behavior as possible. A key part of this should
> be a discussion about how we position Beam on the market. Simplicity and
> correctness should be the key points, because practice shows people tend to
> misunderstand the streaming concepts (which is absolutely understandable!).
> >>
> >> Best,
> >>
> >>  Jan
> >>
> >> On 8/19/24 23:17, XQ Hu via dev wrote:
> >>
> >> Hi Beam Community,
> >>
> >> Lately, I have been thinking about the future of Beam and the potential
> roadmap towards Beam 3.0. After discussing this with my colleagues at
> Google, I would like to open a discussion about the path for us to move
> towards Beam 3.0. As we continue to enhance Beam 2 with new features and
> improvements, it's important to look ahead and consider the long-term
> vision for the project.
> >>
> >> Why Beam 3.0?
> >>
> >> I think there are several compelling reasons to start planning for Beam
> 3.0:
> >>
> >> Opportunity for Major Enhancements: We can introduce significant
> improvements and innovations.
> >>
> >> Mature Beam Primitives: We can re-evaluate and refine the core
> primitives, ensuring their maturity, stability, and ease of use for
> developers.
> >>
> >> Enhanced User Experience: We can introduce new features and APIs that
> significantly improve the developer experience and cater to evolving use
> cases, particularly in the machine learning domain.
> >>
> >>
> >> Potential Vision for Beam 3
> >>
> >> Best-in-Class for ML: Empower machine learning users with intuitive
> Python interfaces for data processing, model deployment, and evaluation.
> >>
> >> Rich, Portable Transforms: A cross-language library of standardized
> transforms, easily configured and managed via YAML.
> >>
> >> Streamlined Core: Simplified Beam primitives with clear semantics for
> easier development and maintenance.
> >>
> >> Turnkey Solutions: A curated set of powerful transforms for common data
> and ML tasks, including use-case-specific solutions.
> >>
> >> Simplified Streaming: Intuitive interfaces for streaming data with
> robust support for time-sorted input, metrics, and notifications.
> >>
> >> Enhanced Single Runner capabilities: For use cases where a single large
> box that can be kept effectively busy can meet the user's needs.
> >>
> >> Key Themes
> >>
> >> User-Centric Design: Enhance the overall developer experience with
> simplified APIs and streamlined workflows.
> >>
> >> Runner Consistency: Ensure identical functionality between local and
> remote runners for seamless development and deployment.
> >>
> >> Ubiquitous Data Schema: Standardize data schemas for improved
> interoperability and robustness.
> >>
> >> Expanded SDK Capabilities: Enrich SDKs with powerful new features like
> splittable DataFrames, stable input guarantees, and time-sorted input
> processing.
> >>
> >> Thriving Transform Ecosystem: Foster a rich ecosystem of portable,
> managed turnkey transforms, available across all SDKs.
> >>
> >> Minimized Operational Overhead: Reduce complexity and maintenance
> burden by splitting Beam into smaller, more focused repositories.
> >>
> >> Next Steps:
> >>
> >> I propose we start by discussing the following:
> >>
> >> High-Level Goals/Vision/Themes: What are the most important goals and
> priorities for Beam 3.0?
> >>
> >> Potential Challenges: What are the biggest challenges we might face
> during the transition to Beam 3.0?
> >>
> >> Timeline: What would be a realistic timeline for planning, developing,
> and releasing Beam 3.0?
> >>
> >> This email thread is primarily meant to spark conversations about the
> anticipated features of Beam 3.0; however, there is currently no official
> timeline commitment. To facilitate the discussions, I created a public doc
> that we can collaborate on.
> >>
> >> I am excited to work with all of you to shape the future of Beam and
> make it an even more powerful and user-friendly data processing framework!
> >>
> >> Meanwhile, I hope to see many of you at Beam Summit 2024 (
> https://beamsummit.org/), where we can have more in-depth conversations
> about the future of Beam.
> >>
> >> Thanks,
> >>
> >> XQ Hu (GitHub: liferoad)
> >>
> >> Public Doc for gathering feedback: [Public] Beam 3.0: a discussion doc
> (PTAL)
>
