To whom may concern,

This is Ning from Google. We are currently making efforts to leverage an
interactive runner under python beam sdk.

There is already an interactive Beam (iBeam for short) runner with jupyter
notebook in the repo
<https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive>
.
Following the instructions on that page, one can set up an interactive
environment to develop and execute Beam pipeline interactively.

However, there are many issues with existing iBeam. One issue is that it
uses a concept of leaf PCollection to cache and materialize intermediate
PCollection. If the user wants to reuse/introspect a non-leaf PCollection,
the interactive runner will run into errors.

Our initial effort will be fixing the existing issues. And we also want to
make iBeam easy to use. Since iBeam uses the same model Beam uses, there
isn't really any difference for users between creating a pipeline with
interactive runner and other runners.
So we want to minimize the interfaces a user needs to learn while giving
the user some capability to interact with the interactive environment.

See this initial PR <https://github.com/apache/beam/pull/9278>, the
interactive_beam module will provide mainly 4 interfaces:

   - For advanced users who define pipeline outside __main__, let them tell
   current interactive environment where they define their pipeline: watch()
      - This is very useful for tests where pipeline can be defined in test
      methods.
      - If the user simply creates pipeline in a Jupyter notebook or a
      plain Python script, they don't have to know/use this feature at all.
   - Let users create an interactive pipeline: create_pipeline()
      - invoking create_pipeline(), the user gets a Pipeline object that
      works as any other Pipeline object created from apache_beam.Pipeline()
      - However, the pipeline object p, when invoking p.run(), does some
      extra interactive magic.
      - We'll support interactive execution for DirectRunner at this moment.
   - Let users run the interactive pipeline as a normal pipeline:
   run_pipeline()
      - In an interactive environment, a user only needs to add and execute
      1 line of code run_pipeline(pipeline) to execute any existing interactive
      pipeline object as normal pipeline in any selected platform.
      - We'll probably support Dataflow only. Other implementations can be
      added though.
   - Let users introspect any intermediate PCollection they have handler
   to: visualize()
      - If a user ever writes pcoll = p | "Some Transform" >>
      some_transform() ..., they can visualize(pcoll) once the pipeline p is
      executed.
      - p can be batch or streaming
      - The visualization will be some plot graph of data for the given
      PCollection as if it's materialized. If the PCollection is unbounded, the
      graph is dynamic.

The PR will implement 1 and 2.

We'll use https://issues.apache.org/jira/browse/BEAM-7923 as the top level
JIRA and add blocking JIRAs as development goes.

External Beam users will not worry about any of the underlying
implementation details.
Except the 4 interfaces above, they learn and write normal Beam code and
can execute the pipeline immediately when they are done with prototyping.

Ning.

Reply via email to