Ning, thank you for the heads up.

All, this is proposed work to improve the interactive Beam experience. As
mentioned in Ning's email, new concepts are being introduced, and the name
iBeam is being used as a new reference. I hope that bringing the
discussion to the mailing list will give it additional visibility and
let more people share their feedback.

(cc'ing a few folks that might be interested +Robert Bradshaw
<rober...@google.com> +Valentyn Tymofieiev <valen...@google.com> +Sindy Li
<qiny...@google.com> +Brian Hulette <bhule...@google.com> )

Ahmet


On Wed, Aug 7, 2019 at 12:36 PM Ning Kang <ni...@google.com> wrote:

> To whom it may concern,
>
> This is Ning from Google. We are currently working to leverage an
> interactive runner in the Beam Python SDK.
>
> There is already an interactive Beam (iBeam for short) runner that works
> with Jupyter notebooks in the repo
> <https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive>.
> Following the instructions on that page, one can set up an interactive
> environment to develop and execute Beam pipelines interactively.
>
> However, there are many issues with the existing iBeam. One issue is that it
> uses the concept of a leaf PCollection to cache and materialize intermediate
> PCollections. If the user wants to reuse or introspect a non-leaf PCollection,
> the interactive runner runs into errors.
>
> Our initial effort will be to fix the existing issues. We also want to
> make iBeam easy to use. Since iBeam uses the same model as Beam, there
> isn't really any difference for users between creating a pipeline with
> the interactive runner and with other runners.
> So we want to minimize the interfaces a user needs to learn while giving
> the user some capability to interact with the interactive environment.
>
> See this initial PR <https://github.com/apache/beam/pull/9278>. The
> interactive_beam module will mainly provide 4 interfaces:
>
>    - For advanced users who define pipelines outside __main__, let them
>    tell the current interactive environment where they define their
>    pipelines: watch()
>       - This is very useful for tests, where pipelines can be defined in
>       test methods.
>       - If the user simply creates a pipeline in a Jupyter notebook or a
>       plain Python script, they don't have to know or use this feature at all.
>    - Let users create an interactive pipeline: create_pipeline()
>       - By invoking create_pipeline(), the user gets a Pipeline object that
>       works like any other Pipeline object created from apache_beam.Pipeline().
>       - However, the pipeline object p, when invoking p.run(), does some
>       extra interactive magic.
>       - For now, we'll support interactive execution on DirectRunner.
>    - Let users run the interactive pipeline as a normal pipeline:
>    run_pipeline()
>       - In an interactive environment, a user only needs to add and
>       execute 1 line of code, run_pipeline(pipeline), to execute any existing
>       interactive pipeline object as a normal pipeline on any selected platform.
>       - We'll probably support Dataflow only. Other implementations can
>       be added though.
>    - Let users introspect any intermediate PCollection they have a handle
>    to: visualize()
>       - If a user ever writes pcoll = p | "Some Transform" >>
>       some_transform() ..., they can visualize(pcoll) once the pipeline p is
>       executed.
>       - p can be batch or streaming.
>       - The visualization will be a plot of the data for the given
>       PCollection as if it's materialized. If the PCollection is unbounded,
>       the graph is dynamic.
>
> The PR will implement the first two, watch() and create_pipeline().
>
> We'll use https://issues.apache.org/jira/browse/BEAM-7923 as the top
> level JIRA and add blocking JIRAs as development goes.
>
> External Beam users will not need to worry about any of the underlying
> implementation details.
> Apart from the 4 interfaces above, they learn and write normal Beam code and
> can execute the pipeline immediately when they are done with prototyping.
>
> Ning.
>
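
As an illustration of the leaf-PCollection issue described in the quoted
message, here is a toy sketch in plain Python. It is not Beam code and not
how the interactive runner is implemented: the dict-based graph, the step
names, and the helpers find_leaves, cache_leaves_only and introspect are all
hypothetical, made up only to show why caching leaves alone makes a lookup
of a non-leaf result fail.

```python
# Toy model only: plain dicts stand in for a pipeline graph.
# Nothing here is Beam API; every name is hypothetical.

def find_leaves(consumers):
    """Return the nodes that no other node consumes."""
    return {node for node, downstream in consumers.items() if not downstream}


def cache_leaves_only(consumers, results):
    """Mimic a cache that keeps results for leaf nodes only."""
    return {node: results[node] for node in find_leaves(consumers)}


# Hypothetical three-step pipeline: read -> parsed -> counted.
consumers = {
    "read": ["parsed"],
    "parsed": ["counted"],
    "counted": [],
}
results = {
    "read": ["a b", "a"],
    "parsed": ["a", "b", "a"],
    "counted": {"a": 2, "b": 1},
}

cache = cache_leaves_only(consumers, results)


def introspect(name):
    """Look up a cached result; fail for anything that was not a leaf."""
    if name not in cache:
        raise KeyError(f"{name!r} was not cached because it is not a leaf")
    return cache[name]
```

In this toy model, introspect("counted") succeeds because "counted" is the
only leaf, while introspect("parsed") raises KeyError even though the user
may well hold a reference to that intermediate result.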
