Thanks a lot for the discussions so far! I really like all of these thoughts. If you have some time, please add them to this public doc: https://docs.google.com/document/d/13r4NvuvFdysqjCTzMHLuUUXjKTIEY3d7oDNIHT6guww/ Everyone should have write permission. Feel free to add/edit themes as well. Again, thanks a lot! For anyone attending Beam Summit 2024, see you all there, and let's have more casual chats during the summit!
On Thu, Aug 22, 2024 at 5:07 PM Valentyn Tymofieiev via dev <dev@beam.apache.org> wrote:
>
> > Key to this will be a push to producing/consuming structured data (as has been mentioned) and also well-structured, language-agnostic configuration.
>
> > Unstructured data (aka "everything is bytes with coders") is overrated and should be an exception, not the default. Structured data everywhere, with specialized bytes columns.
>
> +1.
>
> I am seeing a tendency in distributed data processing engines to heavily recommend and use relational APIs to express data-processing cases on structured data. For example:
>
> Flink has introduced the Table API: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/tableapi/
>
> Spark has recently evolved their Dataframe API into a language-agnostic portability layer: https://spark.apache.org/docs/latest/spark-connect-overview.html
>
> Some lesser-known and more recent data processing engines also offer a subset of a Dataframe or SQL API, and/or a Dataframe API that is later translated into SQL.
>
> In contrast, in Beam, the SQL and Dataframe APIs are more limited add-ons, natively available in the Java and Python SDKs respectively. It might be worthwhile to consider whether introducing a first-class relational API would make sense in Beam 3, and how it would impact Beam's cross-runner portability story.
>
> On Thu, Aug 22, 2024 at 12:21 PM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
>
>> Echoing many of the comments here, but organizing them under a single theme, I would say a good focus for Beam 3.0 could be to center on being more "transform-centric." Specifically:
>>
>> - Make it easy to mix and match transforms across pipelines and environments (SDKs). Key to this will be a push to producing/consuming structured data (as has been mentioned) and also well-structured, language-agnostic configuration.
>> - Better encapsulation for transforms.
>> The main culprit here is update compatibility, but there may be other issues as well. Let's try to actually solve that for both primitives and composites.
>> - Somewhat related to the above, I would love to actually solve the early/late output issue, and I think retractions and sink triggers are powerful paradigms we could develop to actually solve this issue in a novel way.
>> - Continue to refine the idea of "best practices." This includes the points above, as well as things like robust error handling, monitoring, etc.
>>
>> Once we have these in place, we are in a position to offer a powerful catalogue of easy-to-use, well-focused transforms, both first and third party.
>>
>> Note that everything here can be backwards compatible. As a concrete milestone for when we "reach" 3.0, I would say that our core set of transforms has all been updated to reflect best practices (by default?) and we have a way for third parties to also publish such transforms.
>>
>> (One more bullet point: I would love to see us complete the migration to 100% portable runners, including local runners, which will help with the testing and development story, but will also be key to making the above vision work.)
>>
>> On Thu, Aug 22, 2024 at 8:00 AM Kenneth Knowles <k...@apache.org> wrote:
>> >
>> > I think this is a good idea. Fun fact - I think the first time we talked about "3.0" was 2018.
>> >
>> > I don't want to break users with 3.0, TBH, despite that being what a major version bump suggests. But I also don't want a triple-digit minor version. I think 3.0 is worthwhile if we have a new emphasis that is very meaningful to users and contributors.
>> >
>> > A couple of things I would say from experience with 2.0:
>> >
>> > - A lot of new model features are dropped before completion. Can we make it easier to evolve? Maybe not, since in a way it is our "instruction set".
>> >
>> > - Transforms that provide straightforward functionality have a big impact: RunInference, IOs, etc. I like that these get more discussion now, whereas early in the project a lot of focus was on primitives and runners.
>> >
>> > - Integrations like YAML (and there will be plenty more, I'm sure) that rely on transforms as true no-code black boxes with non-UDF configuration seem like the next step in abstraction and ease of use.
>> >
>> > - Update compatibility needs, which break through all our abstractions, have blocked innovative changes and UX improvements, and have had a chilling effect on refactoring and the things that make software continue to approach Quality.
>> >
>> > And a few ideas I have about the future of the space, agreeing with XQ and Jan:
>> >
>> > - Unstructured data (aka "everything is bytes with coders") is overrated and should be an exception, not the default. Structured data everywhere, with specialized bytes columns. We can make small steps in this direction (and we already are).
>> >
>> > - Triggers are really not a great construct. "Sink triggers" map better to use cases, but how to implement them is a long adventure. But we really can't live without *something* to manage early output / late input, and the options in all other systems I am aware of are even worse.
>> >
>> > And a last thought is that we shouldn't continue to work on last decade's problems, if we can avoid it. Maybe there is a core to Beam that is imperfect but good enough (unification of batch & streaming; integration of many languages; core primitives that apply to any engine capable of handling our use cases) and what we want to do is focus on what we can build on top of it. I think this is implied by everything in this thread so far, but I just wanted to say it explicitly.
>> >
>> > Kenn
>> >
>> > On Tue, Aug 20, 2024 at 9:03 AM Jan Lukavský <je...@seznam.cz> wrote:
>> >>
>> >> Formatting and coloring.
:)
>> >>
>> >> ----
>> >>
>> >> Hi XQ,
>> >>
>> >> thanks for starting this discussion!
>> >>
>> >> I agree we are getting to a point where discussing a major update of Apache Beam might be a good idea. Because such a window of opportunity happens only once in (quite many) years, I think we should try to use our current experience with the Beam model itself and check if there is any room for improvement there. First of all, we have some parts of the model itself that are not implemented in Beam 2.0, e.g. retractions. Second, there are parts that are known to be error-prone, e.g. triggers. Another topic is features that are missing in the current model, e.g. iterations (yes, I know, general iterations might not even be possible, but it seems we can create reasonable constraints for them to work for the cases that really matter). Last but not least, we might want to re-think how we structure transforms, because that has a direct impact on how expensive it is to implement a new runner (GBK/Combine vs stateful ParDo).
>> >>
>> >> Having said that, my suggestion would be to take a higher-level look first, define which parts of the model are battle-tested enough that we trust them as a definite part of the 3.0 model, question all the others, and then iterate over this to come up with a new proposition of the model, with a focus on what you emphasize - use cases, user-friendly APIs, and concepts that contain as little unexpected behavior as possible. A key part of this should be a discussion about how we position Beam on the market - simplicity and correctness should be the key points, because practice shows people tend to misunderstand streaming concepts (which is absolutely understandable!).
>> >>
>> >> Best,
>> >>
>> >> Jan
>> >>
>> >> On 8/20/24 14:38, Jan Lukavský wrote:
>> >>
>> >> Hi XQ,
>> >>
>> >> thanks for starting this discussion!
>> >> On 8/19/24 23:17, XQ Hu via dev wrote:
>> >>
>> >> Hi Beam Community,
>> >>
>> >> Lately, I have been thinking about the future of Beam and the potential roadmap towards Beam 3.0.
>> >> After discussing this with my colleagues at Google, I would like to open a discussion about the path for us to move towards Beam 3.0. As we continue to enhance Beam 2 with new features and improvements, it's important to look ahead and consider the long-term vision for the project.
>> >>
>> >> Why Beam 3.0?
>> >>
>> >> I think there are several compelling reasons to start planning for Beam 3.0:
>> >>
>> >> Opportunity for Major Enhancements: We can introduce significant improvements and innovations.
>> >>
>> >> Mature Beam Primitives: We can re-evaluate and refine the core primitives, ensuring their maturity, stability, and ease of use for developers.
>> >>
>> >> Enhanced User Experience: We can introduce new features and APIs that significantly improve the developer experience and cater to evolving use cases, particularly in the machine learning domain.
>> >>
>> >> Potential Vision for Beam 3
>> >>
>> >> Best-in-Class for ML: Empower machine learning users with intuitive Python interfaces for data processing, model deployment, and evaluation.
>> >>
>> >> Rich, Portable Transforms: A cross-language library of standardized transforms, easily configured and managed via YAML.
>> >>
>> >> Streamlined Core: Simplified Beam primitives with clear semantics for easier development and maintenance.
>> >>
>> >> Turnkey Solutions: A curated set of powerful transforms for common data and ML tasks, including use-case-specific solutions.
>> >>
>> >> Simplified Streaming: Intuitive interfaces for streaming data with robust support for time-sorted input, metrics, and notifications.
>> >>
>> >> Enhanced Single Runner Capabilities: For use cases where a single large box that can be kept effectively busy can serve the user's needs.
>> >>
>> >> Key Themes
>> >>
>> >> User-Centric Design: Enhance the overall developer experience with simplified APIs and streamlined workflows.
>> >>
>> >> Runner Consistency: Ensure identical functionality between local and remote runners for seamless development and deployment.
>> >>
>> >> Ubiquitous Data Schema: Standardize data schemas for improved interoperability and robustness.
>> >>
>> >> Expanded SDK Capabilities: Enrich SDKs with powerful new features like splittable DataFrames, stable input guarantees, and time-sorted input processing.
>> >>
>> >> Thriving Transform Ecosystem: Foster a rich ecosystem of portable, managed turnkey transforms, available across all SDKs.
>> >>
>> >> Minimized Operational Overhead: Reduce complexity and maintenance burden by splitting Beam into smaller, more focused repositories.
>> >>
>> >> Next Steps:
>> >>
>> >> I propose we start by discussing the following:
>> >>
>> >> High-Level Goals/Vision/Themes: What are the most important goals and priorities for Beam 3.0?
>> >>
>> >> Potential Challenges: What are the biggest challenges we might face during the transition to Beam 3.0?
>> >>
>> >> Timeline: What would be a realistic timeline for planning, developing, and releasing Beam 3.0?
>> >>
>> >> This email thread is primarily meant to spark conversations about the anticipated features of Beam 3.0; however, there is currently no official timeline commitment. To facilitate the discussions, I created a public doc that we can collaborate on.
>> >>
>> >> I am excited to work with all of you to shape the future of Beam and make it an even more powerful and user-friendly data processing framework!
>> >>
>> >> Meanwhile, I hope to see many of you at Beam Summit 2024 (https://beamsummit.org/), where we can have more in-depth conversations about the future of Beam.
>> >>
>> >> Thanks,
>> >>
>> >> XQ Hu (GitHub: liferoad)
>> >>
>> >> Public Doc for gathering feedback: [Public] Beam 3.0: a discussion doc (PTAL)
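P.S. To make the "Rich, Portable Transforms ... configured and managed via YAML" theme in this thread a bit more concrete, here is a rough sketch of what such a no-code pipeline spec could look like. This is only an illustration modeled on Beam's current YAML prototype; the transform names (ReadFromCsv, Filter, WriteToJson), the config keys, and the paths are assumptions, not a committed Beam 3 API.

```yaml
# Hypothetical turnkey pipeline: transform names, config fields, and
# paths are illustrative, not a committed Beam 3 API.
pipeline:
  type: chain
  transforms:
    - type: ReadFromCsv
      config:
        path: gs://my-bucket/events-*.csv
    - type: Filter
      config:
        language: python
        keep: "score > 0.5"
    - type: WriteToJson
      config:
        path: gs://my-bucket/filtered-events
```

A spec like this treats each transform as a true black box with non-UDF configuration, which is exactly what makes it portable across SDKs and runners.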
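P.P.S. The "Ubiquitous Data Schema" and "structured data everywhere" points can also be illustrated with a tiny plain-Python sketch. This is not Beam's actual schema API; Event, to_bytes, and project_scores are hypothetical names invented for illustration. The point is that with a declared schema, a transform can address fields by name, whereas with "everything is bytes with coders" every consumer must first agree on and apply the matching decoder.

```python
# Toy contrast between schema'd rows and opaque "bytes + coder" elements.
# Plain Python, not Beam's actual API; all names here are illustrative.
import json
from typing import List, NamedTuple


class Event(NamedTuple):
    """The 'schema': named, typed fields visible to any transform."""
    user: str
    score: float


def to_bytes(e: Event) -> bytes:
    """Stand-in for a coder: the structure disappears into bytes."""
    return json.dumps(e._asdict()).encode("utf-8")


def from_bytes(b: bytes) -> Event:
    """Every consumer of the bytes needs this matching decoder."""
    return Event(**json.loads(b))


def project_scores(events: List[Event]) -> List[float]:
    """A schema-aware 'transform' can project a field by name."""
    return [e.score for e in events]


events = [Event("alice", 0.9), Event("bob", 0.4)]

# With opaque bytes, even a simple projection must decode first:
opaque = [to_bytes(e) for e in events]
decoded = [from_bytes(b) for b in opaque]

assert decoded == events
assert project_scores(events) == [0.9, 0.4]
```

With schemas as the default, the decode step (and the per-pipeline custom coder it implies) becomes the exception rather than the rule, which is the direction this thread is arguing for.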