Re: [DISCUSS] Beam 3.0: Paving the Path to the Next Generation Data Processing Framework

Jan Lukavský Tue, 20 Aug 2024 05:39:30 -0700

Hi XQ,

thanks for starting this discussion!

I agree we are getting to a point where discussing a major update of Apache Beam might be a good idea. Because such a window of opportunity happens only once in (quite many) years, I think we should try to use our current experience with the Beam model itself and check if there is any room for improvement there. First of all, we have some parts of the model itself that are not implemented in Beam 2.0, e.g. retractions. Second, there are parts that are known to be error-prone, e.g. triggers. Another topic is features that are missing in the current model, e.g. iterations (yes, I know, general iterations might not even be possible, but it seems we can create reasonable constraints for them to work for cases that really matter). Last but not least, we might want to re-think how we structure transforms, because that has a direct impact on how expensive it is to implement a new runner (GBK/Combine vs stateful ParDo).

Having said that, my suggestion would be to take a higher-level look first, define which parts of the model are battle-tested enough that we trust them as a definite part of the 3.0 model, question all the others, and then iterate over this to come up with a new proposition of the model, with a focus on what you emphasize: use cases, user-friendly APIs, and concepts that contain as little unexpected behavior as possible. A key part of this should be a discussion about how we position Beam on the market. Simplicity and correctness should be the key points, because practice shows people tend to misunderstand the streaming concepts (which is absolutely understandable!).


Best,

 Jan

On 8/19/24 23:17, XQ Hu via dev wrote:

Hi Beam Community,
Lately, I have been thinking about the future of Beam and the potential roadmap towards Beam 3.0. After discussing this with my colleagues at Google, I would like to open a discussion about the path for us to move towards Beam 3.0. As we continue to enhance Beam 2 with new features and improvements, it's important to look ahead and consider the long-term vision for the project.
Why Beam 3.0?
I think there are several compelling reasons to start planning for Beam 3.0:
 *  Opportunity for Major Enhancements: We can introduce significant
    improvements and innovations.

 *  Mature Beam Primitives: We can re-evaluate and refine the core
    primitives, ensuring their maturity, stability, and ease of use
    for developers.

 *  Enhanced User Experience: We can introduce new features and APIs
    that significantly improve the developer experience and cater to
    evolving use cases, particularly in the machine learning domain.


Potential Vision for Beam 3

 *  Best-in-Class for ML: Empower machine learning users with intuitive
    Python interfaces for data processing, model deployment, and
    evaluation.

 *  Rich, Portable Transforms: A cross-language library of standardized
    transforms, easily configured and managed via YAML.

 *  Streamlined Core: Simplified Beam primitives with clear semantics
    for easier development and maintenance.

 *  Turnkey Solutions: A curated set of powerful transforms for common
    data and ML tasks, including use-case-specific solutions.

 *  Simplified Streaming: Intuitive interfaces for streaming data with
    robust support for time-sorted input, metrics, and notifications.

 *  Enhanced Single Runner Capabilities: For use cases where a single
    large box that can be kept effectively busy can solve the user's
    needs.

Key Themes

 *  User-Centric Design: Enhance the overall developer experience with
    simplified APIs and streamlined workflows.

 *  Runner Consistency: Ensure identical functionality between local
    and remote runners for seamless development and deployment.

 *  Ubiquitous Data Schema: Standardize data schemas for improved
    interoperability and robustness.

 *  Expanded SDK Capabilities: Enrich SDKs with powerful new features
    like splittable DataFrames, stable input guarantees, and
    time-sorted input processing.

 *  Thriving Transform Ecosystem: Foster a rich ecosystem of portable,
    managed turnkey transforms, available across all SDKs.

 *  Minimized Operational Overhead: Reduce complexity and maintenance
    burden by splitting Beam into smaller, more focused repositories.

Next Steps:

I propose we start by discussing the following:

 *  High-Level Goals/Vision/Themes: What are the most important goals
    and priorities for Beam 3.0?

 *  Potential Challenges: What are the biggest challenges we might face
    during the transition to Beam 3.0?

 *  Timeline: What would be a realistic timeline for planning,
    developing, and releasing Beam 3.0?

This email thread is primarily meant to spark conversations about the anticipated features of Beam 3.0; there is currently no official timeline commitment. To facilitate the discussions, I created a public doc <https://docs.google.com/document/d/13r4NvuvFdysqjCTzMHLuUUXjKTIEY3d7oDNIHT6guww/edit> that we can collaborate on.
I am excited to work with all of you to shape the future of Beam and make it an even more powerful and user-friendly data processing framework!
Meanwhile, I hope to see many of you at Beam Summit 2024 (https://beamsummit.org/), where we can have more in-depth conversations about the future of Beam.
Thanks,

XQ Hu (GitHub: liferoad <https://github.com/liferoad>)
Public Doc for gathering feedback: [Public] Beam 3.0: a discussion doc <https://docs.google.com/document/d/13r4NvuvFdysqjCTzMHLuUUXjKTIEY3d7oDNIHT6guww/edit> (PTAL)
