Q1: The document is shared at https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing. If you are inside Google, you can use the short link go/ibeam-external. I cannot share internal documents, but you can reach out to me if you need the internal engineering plan.

Q2: Yes, watch() is an optimization that supports visualize() and building further on the pipeline. The user doesn't need to call it if they simply define the pipeline in the notebook.

Q3 and Q4: I'm only focusing on the direct runner as the underlying runner. We'll get rid of much of the existing interactive Beam implementation. We can't provide portability for interactivity. Users can still run the pipeline with other runners, though, thanks to pipeline portability. Our work is to reduce the number of new concepts a user needs to know when they want to run interactive Beam. The implementation could be arbitrarily complicated and open sourced, though. Currently, the interactive runner looks as if it supports all kinds of underlying runners. We want to get rid of that too.

On 2019/08/08 00:01:06, Ahmet Altay <[email protected]> wrote:
> Ning, thank you for the heads up.
>
> All, this is proposed work for improving the interactive Beam experience. As
> mentioned in Ning's email, new concepts are being introduced. In
> addition, iBeam is being used as a new name. I hope that bringing
> the discussion to the mailing list will give it additional
> visibility so that more people can share their feedback.
>
> (cc'ing a few folks that might be interested +Robert Bradshaw
> <[email protected]> +Valentyn Tymofieiev <[email protected]> +Sindy Li
> <[email protected]> +Brian Hulette <[email protected]> )
>
> Ahmet
>
>
> On Wed, Aug 7, 2019 at 12:36 PM Ning Kang <[email protected]> wrote:
>
> > To whom it may concern,
> >
> > This is Ning from Google. We are currently making efforts to leverage an
> > interactive runner under the Python Beam SDK.
> >
> > There is already an interactive Beam (iBeam for short) runner with Jupyter
> > notebook in the repo
> > <https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive>.
> > Following the instructions on that page, one can set up an interactive
> > environment to develop and execute Beam pipelines interactively.
> >
> > However, there are many issues with the existing iBeam. One issue is that it
> > uses a concept of leaf PCollection to cache and materialize intermediate
> > PCollections. If the user wants to reuse/introspect a non-leaf PCollection,
> > the interactive runner will run into errors.
> >
> > Our initial effort will be fixing the existing issues. We also want to
> > make iBeam easy to use. Since iBeam uses the same model Beam uses, there
> > isn't really any difference for users between creating a pipeline with
> > the interactive runner and with other runners.
> > So we want to minimize the interfaces a user needs to learn while giving
> > the user some capability to interact with the interactive environment.
> >
> > See this initial PR <https://github.com/apache/beam/pull/9278>. The
> > interactive_beam module will provide mainly 4 interfaces:
> >
> >    - For advanced users who define a pipeline outside __main__, let them
> >    tell the current interactive environment where they define their pipeline:
> >    watch()
> >       - This is very useful for tests, where a pipeline can be defined in
> >       test methods.
> >       - If the user simply creates a pipeline in a Jupyter notebook or a
> >       plain Python script, they don't have to know/use this feature at all.
> >    - Let users create an interactive pipeline: create_pipeline()
> >       - By invoking create_pipeline(), the user gets a Pipeline object that
> >       works like any other Pipeline object created from apache_beam.Pipeline()
> >       - However, the pipeline object p, when invoking p.run(), does some
> >       extra interactive magic.
> >       - We'll support interactive execution for DirectRunner at this
> >       moment.
> >    - Let users run the interactive pipeline as a normal pipeline:
> >    run_pipeline()
> >       - In an interactive environment, a user only needs to add and
> >       execute 1 line of code, run_pipeline(pipeline), to execute any existing
> >       interactive pipeline object as a normal pipeline on any selected
> >       platform.
> >       - We'll probably support Dataflow only.
> >       Other implementations can be added though.
> >    - Let users introspect any intermediate PCollection they have a handle
> >    to: visualize()
> >       - If a user ever writes pcoll = p | "Some Transform" >>
> >       some_transform() ..., they can visualize(pcoll) once the pipeline p is
> >       executed.
> >       - p can be batch or streaming
> >       - The visualization will be a plot of the data for the given
> >       PCollection as if it's materialized. If the PCollection is unbounded,
> >       the graph is dynamic.
> >
> > The PR will implement 1 and 2 (watch() and create_pipeline()).
> >
> > We'll use https://issues.apache.org/jira/browse/BEAM-7923 as the top
> > level JIRA and add blocking JIRAs as development goes.
> >
> > External Beam users will not need to worry about any of the underlying
> > implementation details.
> > Apart from the 4 interfaces above, they learn and write normal Beam code and
> > can execute the pipeline immediately when they are done with prototyping.
> >
> > Ning.
> >
>
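
To make the watch() idea above a bit more concrete, here is a minimal toy sketch in plain Python. It is not the interactive_beam implementation and does not import Beam; ToyPCollection, ToyEnvironment, variable_name and build_pipeline_in_function are hypothetical names invented for illustration only. It assumes that "watching" a scope simply means remembering that scope's variables so the environment can later find a PCollection-like object by identity, which is why a pipeline defined inside a function or test method would need an explicit watch(locals()) call.

```python
# Toy model only: NOT the interactive_beam implementation.
# All class and function names here are hypothetical.


class ToyPCollection:
    """Stand-in for a PCollection; it only carries a label."""

    def __init__(self, label):
        self.label = label


class ToyEnvironment:
    """Remembers scopes (dicts of name -> object) the user asked it to watch."""

    def __init__(self):
        self._watched_scopes = []

    def watch(self, scope):
        # Store the mapping of variable names to objects for later lookup.
        self._watched_scopes.append(scope)

    def variable_name(self, obj):
        # Find the variable name bound to this exact object, if any.
        for scope in self._watched_scopes:
            for name, value in scope.items():
                if value is obj:
                    return name
        return None


def build_pipeline_in_function(env):
    # Variables defined here are invisible to the environment
    # unless this local scope is explicitly watched.
    words = ToyPCollection("Read words")
    counts = ToyPCollection("Count words")
    env.watch(locals())
    return words, counts


env = ToyEnvironment()
words, counts = build_pipeline_in_function(env)
print(env.variable_name(words))
print(env.variable_name(ToyPCollection("never watched")))
```

In this toy model, the first lookup finds the name because the function's scope was watched, while the second object was never bound in a watched scope and so cannot be resolved.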
