Re: [DISCUSS] Beam 3.0: Paving the Path to the Next Generation Data Processing Framework

Robert Bradshaw via dev Thu, 22 Aug 2024 12:22:03 -0700

Echoing many of the comments here, but organizing them under a single
theme, I would say a good focus for Beam 3.0 could be centering around
being more "transform-centric." Specifically:


- Make it easy to mix and match transforms across pipelines and
environments (SDKs). Key to this will be a push to producing/consuming
structured data (as has been mentioned) and also well-structured,
language-agnostic configuration.
- Better encapsulation for transforms. The main culprit here is update
compatibility, but there may be other issues as well. Let's try to
actually solve that for both primitives and composites.
- Somewhat related to the above, I would love to actually solve the
early/late output issue, and I think retractions and sink triggers are
powerful paradigms we could develop to actually solve this issue in a
novel way.
- Continue to refine the idea of "best practices." This includes the
points above, as well as things like robust error handling,
monitoring, etc.
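
To make the early/late output bullet a bit more concrete, here is a minimal sketch of what a retraction-style output stream could look like. It is plain Python, not Beam code: the names `sum_with_retractions` and `apply_updates` and the `("emit" | "retract", key, value)` tuple shape are assumptions made up for illustration, and Beam 2 has no such API.

```python
from typing import Dict, Iterable, List, Tuple

# (kind, key, value): "emit" publishes a result, "retract" withdraws an
# earlier one. This tuple shape is a hypothetical convention for this sketch.
Update = Tuple[str, str, int]


def sum_with_retractions(events: Iterable[Tuple[str, int]]) -> List[Update]:
    """Per-key running sum that withdraws each superseded early result."""
    totals: Dict[str, int] = {}
    updates: List[Update] = []
    for key, value in events:
        if key in totals:
            # A refinement arrived: take back what was previously emitted.
            updates.append(("retract", key, totals[key]))
        totals[key] = totals.get(key, 0) + value
        updates.append(("emit", key, totals[key]))
    return updates


def apply_updates(updates: Iterable[Update]) -> Dict[str, int]:
    """Downstream consumer that materializes the current view."""
    view: Dict[str, int] = {}
    for kind, key, value in updates:
        if kind == "retract":
            # Only drop the entry if it still holds the retracted value.
            if view.get(key) == value:
                del view[key]
        else:
            view[key] = value
    return view
```

The point of the sketch is that the consumer never has to guess whether an output is final: every refinement is paired with an explicit withdrawal of the value it replaces.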

Once we have these in place we are in a position to offer a powerful
catalogue of easy-to-use, well-focused transforms, both first and
third party.
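
As a rough illustration of such a catalogue, the sketch below describes transforms purely by a name plus JSON-serializable configuration and applies them to structured rows. Everything in it is hypothetical (`TransformSpec`, `REGISTRY`, `register`, `run`, and the two example transforms); it is plain Python meant to show the shape of the idea, not how Beam or its SDKs implement it.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]                      # one structured record
RowsFn = Callable[[Iterable[Row]], List[Row]]


@dataclass(frozen=True)
class TransformSpec:
    """Hypothetical language-agnostic description of a transform."""
    name: str                 # key looked up in the registry
    config: Dict[str, Any]    # plain data only, no user-defined functions

    def to_json(self) -> str:
        return json.dumps({"name": self.name, "config": self.config}, sort_keys=True)

    @staticmethod
    def from_json(payload: str) -> "TransformSpec":
        data = json.loads(payload)
        return TransformSpec(data["name"], data["config"])


# Hypothetical catalogue: transform name -> factory taking the config.
REGISTRY: Dict[str, Callable[[Dict[str, Any]], RowsFn]] = {}


def register(name: str):
    """Decorator that adds a transform factory to the catalogue."""
    def decorator(factory):
        REGISTRY[name] = factory
        return factory
    return decorator


@register("filter_min")
def filter_min(config: Dict[str, Any]) -> RowsFn:
    field, minimum = config["field"], config["min"]
    return lambda rows: [row for row in rows if row[field] >= minimum]


@register("select_fields")
def select_fields(config: Dict[str, Any]) -> RowsFn:
    fields = config["fields"]
    return lambda rows: [{f: row[f] for f in fields} for row in rows]


def run(specs_json: List[str], rows: Iterable[Row]) -> List[Row]:
    """Apply serialized transform specs in order to the given rows."""
    result = list(rows)
    for payload in specs_json:
        spec = TransformSpec.from_json(payload)
        result = REGISTRY[spec.name](spec.config)(result)
    return result
```

Because a spec is only a name and data, it could in principle be written in YAML or produced by a different SDK than the one that executes it, which is the property the "mix and match" bullet is after.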

Note everything here can be backwards compatible. As a concrete
milestone for when we "reach" 3.0, I would say that our core set of
transforms has been updated to reflect best practices (by
default?) and that we have a way for third parties to also publish such
transforms.

(One more bullet point: I would love to see us complete the migration
to 100% portable runners, including local runners, which will help
with the testing and development story, but will also be key to making
the above vision work.)

On Thu, Aug 22, 2024 at 8:00 AM Kenneth Knowles <k...@apache.org> wrote:
>
> I think this is a good idea. Fun fact - I think the first time we talked 
> about "3.0" was 2018.
>
> I don't want to break users with 3.0 TBH, despite that being what a major 
> version bump suggests. But I also don't want a triple-digit minor version. I 
> think 3.0 is worthwhile if we have a new emphasis that is very meaningful to 
> users and contributors.
>
>
> A couple things I would say from experience with 2.0:
>
>  - A lot of new model features are dropped before completion. Can we make it 
> easier to evolve? Maybe not, since in a way it is our "instruction set".
>
>  - Transforms that provide straightforward functionality have a big impact: 
> RunInference, IOs, etc. I like that these get more discussion now, whereas 
> early in the project a lot of focus was on primitives and runners.
>
>  - Integrations like YAML (and there will be plenty more I'm sure) that rely 
> on transforms as true no-code black boxes with non-UDF configuration seem 
> like the next step in abstraction and ease of use.
>
>  - Update compatibility needs, which break through all our abstractions, have 
> blocked innovative changes and UX improvements, and had a chilling effect on 
> refactoring and the things that make software continue to approach Quality.
>
>
> And a few ideas I have about the future of the space, agreeing with XQ and Jan:
>
>  - Unstructured data (aka "everything is bytes with coders") is overrated and 
> should be an exception not the default. Structured data everywhere, with 
> specialized bytes columns. We can make small steps in this direction (and we 
> are already).
>
>  - Triggers are really not a great construct. "Sink triggers" map better to 
> use cases but how to implement them is a long adventure. But we really can't 
> live without *something* to manage early output / late input, and the options 
> in all other systems I am aware of are even worse.
>
> And a last thought is that we shouldn't continue to work on last decade's 
> problems, if we can avoid it. Maybe there is a core to Beam that is imperfect 
> but good enough (unification of batch & streaming; integration of many 
> languages; core primitives that apply to any engine capable of handling our 
> use cases) and what we want to do is focus on what we can build on top of it. 
> I think this is implied by everything in this thread so far but I just wanted 
> to say it explicitly.
>
> Kenn
>
> On Tue, Aug 20, 2024 at 9:03 AM Jan Lukavský <je...@seznam.cz> wrote:
>>
>> Formatting and coloring. :)
>>
>> ----
>>
>> Hi XQ,
>>
>> thanks for starting this discussion!
>>
>> I agree we are getting to a point where discussing a major update of Apache
>> Beam might be a good idea. Because such a window of opportunity happens only
>> once in (quite many) years, I think we should try to use our current
>> experience with the Beam model itself and check if there is any room for
>> improvement there. First of all, we have some parts of the model itself that
>> are not implemented in Beam 2.0, e.g. retractions. Second, there are parts
>> that are known to be error-prone, e.g. triggers. Another topic is features
>> that are missing in the current model, e.g. iterations (yes, I know, general
>> iterations might not even be possible, but it seems we can create
>> reasonable constraints for them to work for cases that really matter). Last
>> but not least, we might want to re-think how we structure transforms, because
>> that has a direct impact on how expensive it is to implement a new runner
>> (GBK/Combine vs stateful ParDo).
>>
>> Having said that, my suggestion would be to take a higher-level look first,
>> define which parts of the model are battle-tested enough that we trust them as a
>> definite part of the 3.0 model, question all the others and then iterate
>> over this to come up with a new proposition of the model, with a focus on what
>> you emphasize - use cases, user-friendly APIs and concepts that contain as
>> few unexpected behaviors as possible. A key part of this should be a discussion
>> about how we position Beam on the market - simplicity and correctness should
>> be the key points, because practice shows people tend to misunderstand the
>> streaming concepts (which is absolutely understandable!).
>>
>> Best,
>>
>>  Jan
>>
>> On 8/20/24 14:38, Jan Lukavský wrote:
>>
>> On 8/19/24 23:17, XQ Hu via dev wrote:
>>
>> Hi Beam Community,
>>
>> Lately, I have been thinking about the future of Beam and the potential 
>> roadmap towards Beam 3.0. After discussing this with my colleagues at 
>> Google, I would like to open a discussion about the path for us to move 
>> towards Beam 3.0. As we continue to enhance Beam 2 with new features and 
>> improvements, it's important to look ahead and consider the long-term vision 
>> for the project.
>>
>> Why Beam 3.0?
>>
>> I think there are several compelling reasons to start planning for Beam 3.0:
>>
>> Opportunity for Major Enhancements: We can introduce significant 
>> improvements and innovations.
>>
>> Mature Beam Primitives: We can re-evaluate and refine the core primitives, 
>> ensuring their maturity, stability, and ease of use for developers.
>>
>> Enhanced User Experience: We can introduce new features and APIs that 
>> significantly improve the developer experience and cater to evolving use 
>> cases, particularly in the machine learning domain.
>>
>>
>> Potential Vision for Beam 3
>>
>> Best-in-Class for ML: Empower machine learning users with intuitive Python 
>> interfaces for data processing, model deployment, and evaluation.
>>
>> Rich, Portable Transforms: A cross-language library of standardized 
>> transforms, easily configured and managed via YAML.
>>
>> Streamlined Core: Simplified Beam primitives with clear semantics for easier 
>> development and maintenance.
>>
>> Turnkey Solutions: A curated set of powerful transforms for common data and 
>> ML tasks, including use-case-specific solutions.
>>
>> Simplified Streaming: Intuitive interfaces for streaming data with robust 
>> support for time-sorted input, metrics, and notifications.
>>
>> Enhanced Single Runner capabilities: For use cases where a single large box
>> that can be kept effectively busy can solve the user's needs.
>>
>> Key Themes
>>
>> User-Centric Design: Enhance the overall developer experience with 
>> simplified APIs and streamlined workflows.
>>
>> Runner Consistency: Ensure identical functionality between local and remote 
>> runners for seamless development and deployment.
>>
>> Ubiquitous Data Schema: Standardize data schemas for improved 
>> interoperability and robustness.
>>
>> Expanded SDK Capabilities: Enrich SDKs with powerful new features like 
>> splittable DataFrames, stable input guarantees, and time-sorted input 
>> processing.
>>
>> Thriving Transform Ecosystem: Foster a rich ecosystem of portable, managed 
>> turnkey transforms, available across all SDKs.
>>
>> Minimized Operational Overhead: Reduce complexity and maintenance burden by 
>> splitting Beam into smaller, more focused repositories.
>>
>> Next Steps:
>>
>> I propose we start by discussing the following:
>>
>> High-Level Goals/Vision/Themes: What are the most important goals and 
>> priorities for Beam 3.0?
>>
>> Potential Challenges: What are the biggest challenges we might face during 
>> the transition to Beam 3.0?
>>
>> Timeline: What would be a realistic timeline for planning, developing, and 
>> releasing Beam 3.0?
>>
>> This email thread is primarily meant to spark conversations about the anticipated
>> features of Beam 3.0; however, there is currently no official timeline
>> commitment. To facilitate the discussions, I created a public doc that we 
>> can collaborate on.
>>
>> I am excited to work with all of you to shape the future of Beam and make it 
>> an even more powerful and user-friendly data processing framework!
>>
>> Meanwhile, I hope to see many of you at Beam Summit 2024 
>> (https://beamsummit.org/), where we can have more in-depth conversations 
>> about the future of Beam.
>>
>> Thanks,
>>
>> XQ Hu (GitHub: liferoad)
>>
>> Public Doc for gathering feedback: [Public] Beam 3.0: a discussion doc (PTAL)
