Ning, thank you for the heads up. All, this is a proposal for improving the interactive Beam experience. As mentioned in Ning's email, it introduces new concepts, and it also uses iBeam as a new name to refer to interactive Beam. I hope that bringing the discussion to the mailing list gives it additional visibility, so that more people can share their feedback.
(cc'ing a few folks that might be interested +Robert Bradshaw <rober...@google.com> +Valentyn Tymofieiev <valen...@google.com> +Sindy Li <qiny...@google.com> +Brian Hulette <bhule...@google.com> )

Ahmet

On Wed, Aug 7, 2019 at 12:36 PM Ning Kang <ni...@google.com> wrote:

> To whom it may concern,
>
> This is Ning from Google. We are currently making efforts to leverage an
> interactive runner under the Python Beam SDK.
>
> There is already an interactive Beam (iBeam for short) runner with Jupyter
> notebook in the repo
> <https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive>.
> Following the instructions on that page, one can set up an interactive
> environment to develop and execute Beam pipelines interactively.
>
> However, there are many issues with the existing iBeam. One issue is that
> it uses a concept of leaf PCollection to cache and materialize intermediate
> PCollections. If the user wants to reuse/introspect a non-leaf PCollection,
> the interactive runner will run into errors.
>
> Our initial effort will be fixing the existing issues. We also want to
> make iBeam easy to use. Since iBeam uses the same model Beam uses, there
> isn't really any difference for users between creating a pipeline with the
> interactive runner and with other runners.
> So we want to minimize the interfaces a user needs to learn while giving
> the user some capability to interact with the interactive environment.
>
> See this initial PR <https://github.com/apache/beam/pull/9278>. The
> interactive_beam module will provide mainly 4 interfaces:
>
>    - For advanced users who define a pipeline outside __main__, let them
>    tell the current interactive environment where they define their
>    pipeline: watch()
>       - This is very useful for tests where the pipeline can be defined in
>       test methods.
>       - If the user simply creates a pipeline in a Jupyter notebook or a
>       plain Python script, they don't have to know/use this feature at all.
>    - Let users create an interactive pipeline: create_pipeline()
>       - By invoking create_pipeline(), the user gets a Pipeline object that
>       works like any other Pipeline object created from
>       apache_beam.Pipeline()
>       - However, the pipeline object p, when invoking p.run(), does some
>       extra interactive magic.
>       - We'll support interactive execution for DirectRunner at this
>       moment.
>    - Let users run the interactive pipeline as a normal pipeline:
>    run_pipeline()
>       - In an interactive environment, a user only needs to add and
>       execute 1 line of code, run_pipeline(pipeline), to execute any
>       existing interactive pipeline object as a normal pipeline on any
>       selected platform.
>       - We'll probably support Dataflow only. Other implementations can
>       be added though.
>    - Let users introspect any intermediate PCollection they have a handle
>    to: visualize()
>       - If a user ever writes
>       pcoll = p | "Some Transform" >> some_transform() ...,
>       they can visualize(pcoll) once the pipeline p is executed.
>       - p can be batch or streaming.
>       - The visualization will be some plot graph of data for the given
>       PCollection as if it's materialized. If the PCollection is unbounded,
>       the graph is dynamic.
>
> The PR will implement 1 and 2.
>
> We'll use https://issues.apache.org/jira/browse/BEAM-7923 as the top
> level JIRA and add blocking JIRAs as development goes.
>
> External Beam users will not need to worry about any of the underlying
> implementation details.
> Apart from the 4 interfaces above, they learn and write normal Beam code
> and can execute the pipeline immediately when they are done with
> prototyping.
>
> Ning.
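
To make the leaf-PCollection issue mentioned in Ning's message a bit more concrete, here is a toy sketch in plain Python. It is not Beam code and does not reflect how the interactive runner is actually implemented. The graph, the names `words`, `counts` and `top_counts`, and the helpers `leaf_pcollections` and `can_introspect` are all made up for illustration. It only assumes that a cache holding just the leaves cannot serve a request for a non-leaf PCollection.

```python
# Toy model only: a pipeline as a mapping from each PCollection name to the
# names of the PCollections derived from it. None of this is Beam code.
from typing import Dict, List, Set


def leaf_pcollections(graph: Dict[str, List[str]]) -> Set[str]:
    """Return the PCollections that no downstream transform consumes."""
    return {name for name, consumers in graph.items() if not consumers}


def can_introspect(cached: Set[str], name: str) -> bool:
    """In this toy model, only cached PCollections can be inspected."""
    return name in cached


# Hypothetical linear pipeline: words -> counts -> top_counts
graph = {
    "words": ["counts"],
    "counts": ["top_counts"],
    "top_counts": [],
}

# Caching only the leaves keeps just the final PCollection.
leaf_only_cache = leaf_pcollections(graph)

# Caching every PCollection the user holds a reference to keeps all three.
full_cache = set(graph)
```

With the leaf-only cache, `can_introspect(leaf_only_cache, "counts")` is false because `counts` is consumed by a later step. With `full_cache` the same lookup succeeds, which is roughly the behavior the proposal seems to aim for when it talks about introspecting any intermediate PCollection.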