> Key to this will be a push to producing/consuming structured data (as has been mentioned) and also well-structured, language-agnostic configuration.
> Unstructured data (aka "everything is bytes with coders") is overrated
> and should be an exception not the default. Structured data everywhere,
> with specialized bytes columns.

+1.

I am seeing a tendency in distributed data processing engines to heavily
recommend and use relational APIs to express data-processing use cases on
structured data. For example, Flink has introduced the Table API:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/tableapi/

Spark has recently evolved its DataFrame API into a language-agnostic
portability layer:
https://spark.apache.org/docs/latest/spark-connect-overview.html

Some lesser-known and more recent data processing engines also offer a
subset of the DataFrame or SQL APIs, and/or a DataFrame API that is later
translated into SQL.

In contrast, in Beam, the SQL and DataFrame APIs are more limited add-ons,
natively available in the Java and Python SDKs respectively. It might be
worthwhile to consider whether introducing a first-class relational API
would make sense in Beam 3, and how it would affect Beam's cross-runner
portability story.

On Thu, Aug 22, 2024 at 12:21 PM Robert Bradshaw via dev
<dev@beam.apache.org> wrote:

> Echoing many of the comments here, but organizing them under a single
> theme, I would say a good focus for Beam 3.0 could be centering around
> being more "transform-centric." Specifically:
>
> - Make it easy to mix and match transforms across pipelines and
>   environments (SDKs). Key to this will be a push to producing/consuming
>   structured data (as has been mentioned) and also well-structured,
>   language-agnostic configuration.
> - Better encapsulation for transforms. The main culprit here is update
>   compatibility, but there may be other issues as well. Let's try to
>   actually solve that for both primitives and composites.
> - Somewhat related to the above, I would love to actually solve the
>   early/late output issue, and I think retractions and sink triggers are
>   powerful paradigms we could develop to actually solve this issue in a
>   novel way.
> - Continue to refine the idea of "best practices." This includes the
>   points above, as well as things like robust error handling,
>   monitoring, etc.
>
> Once we have these in place we are in a position to offer a powerful
> catalogue of easy-to-use, well-focused transforms, both first and
> third party.
>
> Note everything here can be backwards compatible. As a concrete
> milestone for when we "reach" 3.0 I would say that our core set of
> transforms have been updated to all reflect best practices (by
> default?) and we have a way for third parties to also publish such
> transforms.
>
> (One more bullet point, I would love to see us complete the migration
> to 100% portable runners, including local runners, which will help
> with the testing and development story, but will also be key to making
> the above vision work.)
>
> On Thu, Aug 22, 2024 at 8:00 AM Kenneth Knowles <k...@apache.org> wrote:
> >
> > I think this is a good idea. Fun fact - I think the first time we
> > talked about "3.0" was 2018.
> >
> > I don't want to break users with 3.0 TBH, despite that being what a
> > major version bump suggests. But I also don't want a triple-digit
> > minor version. I think 3.0 is worthwhile if we have a new emphasis
> > that is very meaningful to users and contributors.
> >
> > A couple things I would say from experience with 2.0:
> >
> > - A lot of new model features are dropped before completion. Can we
> >   make it easier to evolve? Maybe not, since in a way it is our
> >   "instruction set".
> >
> > - Transforms that provide straightforward functionality have a big
> >   impact: RunInference, IOs, etc. I like that these get more
> >   discussion now, whereas early in the project a lot of focus was on
> >   primitives and runners.
> >
> > - Integrations like YAML (and there will be plenty more I'm sure)
> >   that rely on transforms as true no-code black boxes with non-UDF
> >   configuration seem like the next step in abstraction and ease of
> >   use.
> >
> > - Update compatibility needs, which break through all our
> >   abstractions, have blocked innovative changes and UX improvements,
> >   and had a chilling effect on refactoring and the things that make
> >   software continue to approach Quality.
> >
> > And a few ideas I have about the future of the space, agreeing with
> > XQ and Jan:
> >
> > - Unstructured data (aka "everything is bytes with coders") is
> >   overrated and should be an exception not the default. Structured
> >   data everywhere, with specialized bytes columns. We can make small
> >   steps in this direction (and we are already).
> >
> > - Triggers are really not a great construct. "Sink triggers" map
> >   better to use cases but how to implement them is a long adventure.
> >   But we really can't live without *something* to manage early
> >   output / late input, and the options in all other systems I am
> >   aware of are even worse.
> >
> > And a last thought is that we shouldn't continue to work on last
> > decade's problems, if we can avoid it. Maybe there is a core to Beam
> > that is imperfect but good enough (unification of batch & streaming;
> > integration of many languages; core primitives that apply to any
> > engine capable of handling our use cases) and what we want to do is
> > focus on what we can build on top of it. I think this is implied by
> > everything in this thread so far but I just wanted to say it
> > explicitly.
> >
> > Kenn
> >
> > On Tue, Aug 20, 2024 at 9:03 AM Jan Lukavský <je...@seznam.cz> wrote:
> >>
> >> Formatting and coloring. :)
> >>
> >> ----
> >>
> >> Hi XQ,
> >>
> >> thanks for starting this discussion!
> >>
> >> I agree we are getting to a point when discussing a major update of
> >> Apache Beam might be a good idea.
> >> Because such a window of opportunity happens only once in (quite
> >> many) years, I think we should try to use our current experience
> >> with the Beam model itself and check if there is any room for
> >> improvement there. First of all, we have some parts of the model
> >> itself that are not implemented in Beam 2.0, e.g. retractions.
> >> Second, there are parts that are known to be error-prone, e.g.
> >> triggers. Another topic is features that are missing in the current
> >> model, e.g. iterations (yes, I know, general iterations might not
> >> even be possible, but it seems we can create reasonable constraints
> >> for them to work for cases that really matter). Last but not least,
> >> we might want to re-think how we structure transforms, because that
> >> has a direct impact on how expensive it is to implement a new runner
> >> (GBK/Combine vs stateful ParDo).
> >>
> >> Having said that, my suggestion would be to take a higher-level look
> >> first, define which parts of the model are battle-tested enough that
> >> we trust them as a definite part of the 3.0 model, question all the
> >> others and then iterate over this to come up with a new proposition
> >> of the model, with focus on what you emphasize - use cases,
> >> user-friendly APIs and concepts that contain as little unexpected
> >> behavior as possible. A key part of this should be a discussion
> >> about how we position Beam on the market - simplicity and
> >> correctness should be the key points, because practice shows people
> >> tend to misunderstand the streaming concepts (which is absolutely
> >> understandable!).
> >>
> >> Best,
> >>
> >> Jan
> >>
> >> On 8/19/24 23:17, XQ Hu via dev wrote:
> >>
> >> Hi Beam Community,
> >>
> >> Lately, I have been thinking about the future of Beam and the
> >> potential roadmap towards Beam 3.0. After discussing this with my
> >> colleagues at Google, I would like to open a discussion about the
> >> path for us to move towards Beam 3.0. As we continue to enhance
> >> Beam 2 with new features and improvements, it's important to look
> >> ahead and consider the long-term vision for the project.
> >>
> >> Why Beam 3.0?
> >>
> >> I think there are several compelling reasons to start planning for
> >> Beam 3.0:
> >>
> >> - Opportunity for Major Enhancements: We can introduce significant
> >>   improvements and innovations.
> >>
> >> - Mature Beam Primitives: We can re-evaluate and refine the core
> >>   primitives, ensuring their maturity, stability, and ease of use
> >>   for developers.
> >>
> >> - Enhanced User Experience: We can introduce new features and APIs
> >>   that significantly improve the developer experience and cater to
> >>   evolving use cases, particularly in the machine learning domain.
> >>
> >> Potential Vision for Beam 3
> >>
> >> - Best-in-Class for ML: Empower machine learning users with
> >>   intuitive Python interfaces for data processing, model
> >>   deployment, and evaluation.
> >>
> >> - Rich, Portable Transforms: A cross-language library of
> >>   standardized transforms, easily configured and managed via YAML.
> >>
> >> - Streamlined Core: Simplified Beam primitives with clear semantics
> >>   for easier development and maintenance.
> >>
> >> - Turnkey Solutions: A curated set of powerful transforms for
> >>   common data and ML tasks, including use-case-specific solutions.
> >>
> >> - Simplified Streaming: Intuitive interfaces for streaming data
> >>   with robust support for time-sorted input, metrics, and
> >>   notifications.
> >>
> >> - Enhanced Single Runner capabilities: For use cases where a single
> >>   large box that can be kept effectively busy can meet the user's
> >>   needs.
> >>
> >> Key Themes
> >>
> >> - User-Centric Design: Enhance the overall developer experience
> >>   with simplified APIs and streamlined workflows.
> >>
> >> - Runner Consistency: Ensure identical functionality between local
> >>   and remote runners for seamless development and deployment.
> >>
> >> - Ubiquitous Data Schema: Standardize data schemas for improved
> >>   interoperability and robustness.
> >>
> >> - Expanded SDK Capabilities: Enrich SDKs with powerful new features
> >>   like splittable DataFrames, stable input guarantees, and
> >>   time-sorted input processing.
> >>
> >> - Thriving Transform Ecosystem: Foster a rich ecosystem of
> >>   portable, managed turnkey transforms, available across all SDKs.
> >>
> >> - Minimized Operational Overhead: Reduce complexity and maintenance
> >>   burden by splitting Beam into smaller, more focused repositories.
> >>
> >> Next Steps:
> >>
> >> I propose we start by discussing the following:
> >>
> >> - High-Level Goals/Vision/Themes: What are the most important goals
> >>   and priorities for Beam 3.0?
> >>
> >> - Potential Challenges: What are the biggest challenges we might
> >>   face during the transition to Beam 3.0?
> >>
> >> - Timeline: What would be a realistic timeline for planning,
> >>   developing, and releasing Beam 3.0?
> >>
> >> This email thread is primarily meant to spark conversations about
> >> the anticipated features of Beam 3.0; however, there is currently
> >> no official timeline commitment. To facilitate the discussions, I
> >> created a public doc that we can collaborate on.
> >>
> >> I am excited to work with all of you to shape the future of Beam
> >> and make it an even more powerful and user-friendly data processing
> >> framework!
> >>
> >> Meanwhile, I hope to see many of you at Beam Summit 2024
> >> (https://beamsummit.org/), where we can have more in-depth
> >> conversations about the future of Beam.
> >>
> >> Thanks,
> >>
> >> XQ Hu (GitHub: liferoad)
> >>
> >> Public Doc for gathering feedback: [Public] Beam 3.0: a discussion
> >> doc (PTAL)