Thanks a lot for the discussions so far! I really like all of these thoughts. If you have some time, please add them to this public doc: https://docs.google.com/document/d/13r4NvuvFdysqjCTzMHLuUUXjKTIEY3d7oDNIHT6guww/ Everyone should have write permission. Feel free to add/edit themes as well. Again, thanks a lot! For anyone attending Beam Summit 2024, see you all there, and let's have more casual chats during the summit!
On Thu, Aug 22, 2024 at 5:07 PM Valentyn Tymofieiev via dev <dev@beam.apache.org> wrote:
>
> > Key to this will be a push to producing/consuming structured data (as has been mentioned) and also well-structured, language-agnostic configuration.
>
> > Unstructured data (aka "everything is bytes with coders") is overrated and should be an exception, not the default. Structured data everywhere, with specialized bytes columns.
>
> +1.
>
> I am seeing a tendency in distributed data processing engines to heavily recommend and use relational APIs to express data-processing cases on structured data. For example:
>
> Flink has introduced the Table API: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/tableapi/
>
> Spark has recently evolved their Dataframe API into a language-agnostic portability layer: https://spark.apache.org/docs/latest/spark-connect-overview.html
>
> Some lesser-known and more recent data processing engines also offer a subset of a Dataframe or SQL API, and/or a Dataframe API that is later translated into SQL.
>
> In contrast, in Beam, the SQL and Dataframe APIs are more limited add-ons, natively available in the Java and Python SDKs respectively. It might be worthwhile to consider whether introducing a first-class relational API would make sense in Beam 3, and how it would impact Beam's cross-runner portability story.
>
> On Thu, Aug 22, 2024 at 12:21 PM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
>
>> Echoing many of the comments here, but organizing them under a single theme, I would say a good focus for Beam 3.0 could be to center on being more "transform-centric." Specifically:
>>
>> - Make it easy to mix and match transforms across pipelines and environments (SDKs). Key to this will be a push to producing/consuming structured data (as has been mentioned) and also well-structured, language-agnostic configuration.
>> - Better encapsulation for transforms.
>> The main culprit here is update compatibility, but there may be other issues as well. Let's try to actually solve that for both primitives and composites.
>> - Somewhat related to the above, I would love to actually solve the early/late output issue, and I think retractions and sink triggers are powerful paradigms we could develop to actually solve this issue in a novel way.
>> - Continue to refine the idea of "best practices." This includes the points above, as well as things like robust error handling, monitoring, etc.
>>
>> Once we have these in place, we are in a position to offer a powerful catalogue of easy-to-use, well-focused transforms, both first and third party.
>>
>> Note that everything here can be backwards compatible. As a concrete milestone for when we "reach" 3.0, I would say that our core set of transforms has all been updated to reflect best practices (by default?) and we have a way for third parties to also publish such transforms.
>>
>> (One more bullet point: I would love to see us complete the migration to 100% portable runners, including local runners, which will help with the testing and development story, but will also be key to making the above vision work.)
>>
>> On Thu, Aug 22, 2024 at 8:00 AM Kenneth Knowles <k...@apache.org> wrote:
>> >
>> > I think this is a good idea. Fun fact - I think the first time we talked about "3.0" was 2018.
>> >
>> > I don't want to break users with 3.0, TBH, despite that being what a major version bump suggests. But I also don't want a triple-digit minor version. I think 3.0 is worthwhile if we have a new emphasis that is very meaningful to users and contributors.
>> >
>> > A couple of things I would say from experience with 2.0:
>> >
>> > - A lot of new model features are dropped before completion. Can we make it easier to evolve? Maybe not, since in a way it is our "instruction set".
>> >
>> > - Transforms that provide straightforward functionality have a big impact: RunInference, IOs, etc. I like that these get more discussion now, whereas early in the project a lot of focus was on primitives and runners.
>> >
>> > - Integrations like YAML (and there will be plenty more, I'm sure) that rely on transforms as true no-code black boxes with non-UDF configuration seem like the next step in abstraction and ease of use.
>> >
>> > - Update compatibility needs, which break through all our abstractions, have blocked innovative changes and UX improvements, and have had a chilling effect on refactoring and the things that make software continue to approach Quality.
>> >
>> > And a few ideas I have about the future of the space, agreeing with XQ and Jan:
>> >
>> > - Unstructured data (aka "everything is bytes with coders") is overrated and should be an exception, not the default. Structured data everywhere, with specialized bytes columns. We can make small steps in this direction (and we already are).
>> >
>> > - Triggers are really not a great construct. "Sink triggers" map better to use cases, but how to implement them is a long adventure. But we really can't live without *something* to manage early output / late input, and the options in all other systems I am aware of are even worse.
>> >
>> > And a last thought is that we shouldn't continue to work on last decade's problems, if we can avoid it. Maybe there is a core to Beam that is imperfect but good enough (unification of batch & streaming; integration of many languages; core primitives that apply to any engine capable of handling our use cases) and what we want to do is focus on what we can build on top of it. I think this is implied by everything in this thread so far, but I just wanted to say it explicitly.
>> >
>> > Kenn
>> >
>> > On Tue, Aug 20, 2024 at 9:03 AM Jan Lukavský <je...@seznam.cz> wrote:
>> >>
>> >> Formatting and coloring.
:)
>> >>
>> >> ----
>> >>
>> >> Hi XQ,
>> >>
>> >> thanks for starting this discussion!
>> >>
>> >> I agree we are getting to a point where discussing a major update of Apache Beam might be a good idea. Because such a window of opportunity happens only once in (quite many) years, I think we should try to use our current experience with the Beam model itself and check if there is any room for improvement there. First of all, we have some parts of the model itself that are not implemented in Beam 2.0, e.g. retractions. Second, there are parts that are known to be error-prone, e.g. triggers. Another topic is features that are missing in the current model, e.g. iterations (yes, I know, general iterations might not even be possible, but it seems we can create reasonable constraints for them to work for the cases that really matter). Last but not least, we might want to re-think how we structure transforms, because that has a direct impact on how expensive it is to implement a new runner (GBK/Combine vs stateful ParDo).
>> >>
>> >> Having said that, my suggestion would be to take a higher-level look first, define which parts of the model are battle-tested enough that we trust them as a definite part of the 3.0 model, question all the others, and then iterate over this to come up with a new proposition of the model, with a focus on what you emphasize - use cases, user-friendly APIs, and concepts that contain as little unexpected behavior as possible. A key part of this should be a discussion about how we position Beam on the market - simplicity and correctness should be the key points, because practice shows people tend to misunderstand streaming concepts (which is absolutely understandable!).
>> >>
>> >> Best,
>> >>
>> >> Jan
>> >>
>> >> On 8/20/24 14:38, Jan Lukavský wrote:
>> >>
>> >> Hi XQ,
>> >>
>> >> thanks for starting this discussion!
>> >> On 8/19/24 23:17, XQ Hu via dev wrote:
>> >>
>> >> Hi Beam Community,
>> >>
>> >> Lately, I have been thinking about the future of Beam and the potential roadmap towards Beam 3.0.
>> >> After discussing this with my colleagues at Google, I would like to open a discussion about the path for us to move towards Beam 3.0. As we continue to enhance Beam 2 with new features and improvements, it's important to look ahead and consider the long-term vision for the project.
>> >>
>> >> Why Beam 3.0?
>> >>
>> >> I think there are several compelling reasons to start planning for Beam 3.0:
>> >>
>> >> Opportunity for Major Enhancements: We can introduce significant improvements and innovations.
>> >>
>> >> Mature Beam Primitives: We can re-evaluate and refine the core primitives, ensuring their maturity, stability, and ease of use for developers.
>> >>
>> >> Enhanced User Experience: We can introduce new features and APIs that significantly improve the developer experience and cater to evolving use cases, particularly in the machine learning domain.
>> >>
>> >> Potential Vision for Beam 3
>> >>
>> >> Best-in-Class for ML: Empower machine learning users with intuitive Python interfaces for data processing, model deployment, and evaluation.
>> >>
>> >> Rich, Portable Transforms: A cross-language library of standardized transforms, easily configured and managed via YAML.
>> >>
>> >> Streamlined Core: Simplified Beam primitives with clear semantics for easier development and maintenance.
>> >>
>> >> Turnkey Solutions: A curated set of powerful transforms for common data and ML tasks, including use-case-specific solutions.
>> >>
>> >> Simplified Streaming: Intuitive interfaces for streaming data with robust support for time-sorted input, metrics, and notifications.
>> >>
>> >> Enhanced Single Runner Capabilities: For use cases where a single large box that can be kept effectively busy can serve the user's needs.
>> >>
>> >> Key Themes
>> >>
>> >> User-Centric Design: Enhance the overall developer experience with simplified APIs and streamlined workflows.
>> >>
>> >> Runner Consistency: Ensure identical functionality between local and remote runners for seamless development and deployment.
>> >>
>> >> Ubiquitous Data Schema: Standardize data schemas for improved interoperability and robustness.
>> >>
>> >> Expanded SDK Capabilities: Enrich SDKs with powerful new features like splittable DataFrames, stable input guarantees, and time-sorted input processing.
>> >>
>> >> Thriving Transform Ecosystem: Foster a rich ecosystem of portable, managed turnkey transforms, available across all SDKs.
>> >>
>> >> Minimized Operational Overhead: Reduce complexity and maintenance burden by splitting Beam into smaller, more focused repositories.
>> >>
>> >> Next Steps:
>> >>
>> >> I propose we start by discussing the following:
>> >>
>> >> High-Level Goals/Vision/Themes: What are the most important goals and priorities for Beam 3.0?
>> >>
>> >> Potential Challenges: What are the biggest challenges we might face during the transition to Beam 3.0?
>> >>
>> >> Timeline: What would be a realistic timeline for planning, developing, and releasing Beam 3.0?
>> >>
>> >> This email thread is primarily meant to spark conversations about the anticipated features of Beam 3.0; however, there is currently no official timeline commitment. To facilitate the discussions, I created a public doc that we can collaborate on.
>> >>
>> >> I am excited to work with all of you to shape the future of Beam and make it an even more powerful and user-friendly data processing framework!
>> >>
>> >> Meanwhile, I hope to see many of you at Beam Summit 2024 (https://beamsummit.org/), where we can have more in-depth conversations about the future of Beam.
>> >>
>> >> Thanks,
>> >>
>> >> XQ Hu (GitHub: liferoad)
>> >>
>> >> Public Doc for gathering feedback: [Public] Beam 3.0: a discussion doc (PTAL)
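P.S. To make the "Rich, Portable Transforms ... configured and managed via YAML" theme in this thread a bit more concrete, here is a rough sketch of what such a no-code pipeline spec could look like. This is only an illustration modeled on Beam's current YAML prototype; the transform names (ReadFromCsv, Filter, WriteToJson), the config keys, and the paths are assumptions, not a committed Beam 3 API.

```yaml
# Hypothetical turnkey pipeline: transform names, config fields, and
# paths are illustrative, not a committed Beam 3 API.
pipeline:
  type: chain
  transforms:
    - type: ReadFromCsv
      config:
        path: gs://my-bucket/events-*.csv
    - type: Filter
      config:
        language: python
        keep: "score > 0.5"
    - type: WriteToJson
      config:
        path: gs://my-bucket/filtered-events
```

A spec like this treats each transform as a true black box with non-UDF configuration, which is exactly what makes it portable across SDKs and runners.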
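P.P.S. The "Ubiquitous Data Schema" and "structured data everywhere" points can also be illustrated with a tiny plain-Python sketch. This is not Beam's actual schema API; Event, to_bytes, and project_scores are hypothetical names invented for illustration. The point is that with a declared schema, a transform can address fields by name, whereas with "everything is bytes with coders" every consumer must first agree on and apply the matching decoder.

```python
# Toy contrast between schema'd rows and opaque "bytes + coder" elements.
# Plain Python, not Beam's actual API; all names here are illustrative.
import json
from typing import List, NamedTuple


class Event(NamedTuple):
    """The 'schema': named, typed fields visible to any transform."""
    user: str
    score: float


def to_bytes(e: Event) -> bytes:
    """Stand-in for a coder: the structure disappears into bytes."""
    return json.dumps(e._asdict()).encode("utf-8")


def from_bytes(b: bytes) -> Event:
    """Every consumer of the bytes needs this matching decoder."""
    return Event(**json.loads(b))


def project_scores(events: List[Event]) -> List[float]:
    """A schema-aware 'transform' can project a field by name."""
    return [e.score for e in events]


events = [Event("alice", 0.9), Event("bob", 0.4)]

# With opaque bytes, even a simple projection must decode first:
opaque = [to_bytes(e) for e in events]
decoded = [from_bytes(b) for b in opaque]

assert decoded == events
assert project_scores(events) == [0.9, 0.4]
```

With schemas as the default, the decode step (and the per-pipeline custom coder it implies) becomes the exception rather than the rule, which is the direction this thread is arguing for.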