Hi Pablo,
Thanks for reviewing the doc.
> I think I can grasp some of the concepts, but it is not yet 100% clear to
> me why it's necessary to define a new abstraction to have interactivity.
> Could you elaborate?
>
It's not clear to me which "new abstraction" you are referring to, but if
you mean the new interactive_beam module, yes, I can elaborate.
Currently, the interactive_runner module holds everything that makes a
pipeline interactive:
   - A cache manager instance that is shared within an interactive session.
     This is how past pipeline executions preserve hidden state for future
     pipelines, until either
        - the pipeline object itself is re-evaluated (e.g., the user reruns
          the cell or creates a new "p = beam.Pipeline(...)"), or
        - a new interactive session is started (e.g., the user restarts the
          IPython kernel in a notebook).
   - A pipeline graph renderer that uses graphviz to render a DOT
     representation of the pipeline.
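For context, this is roughly how a user has to wire things up with that
module today (illustrative only; the exact constructor arguments may
differ):

  import apache_beam as beam
  from apache_beam.runners.direct import direct_runner
  from apache_beam.runners.interactive import interactive_runner

  # The user has to know about, import and construct the runner themselves.
  p = beam.Pipeline(interactive_runner.InteractiveRunner(
      underlying_runner=direct_runner.DirectRunner()))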
The disadvantages of doing this are:
   - The interactive_runner module, which an Interactive Beam user has to
     learn and create pipelines with, contains all the implementation
     details of the runner, details a Beam user shouldn't have to care
     about.
        - If the user just wants to create a Beam pipeline with
          interactivity, we should simply provide a factory, instead of
          throwing a new runner with 200+ lines of code at them and asking
          them to initialize it with its constructor.
   - The runner module contains too many components irrelevant to a runner,
     such as the cache manager or the pipeline graph renderer.
        - These components have nothing to do with Beam itself. They should
          be decoupled from the pipeline runner logic into utilities.
        - These components need a scope that spans multiple pipeline runs,
          so they are better suited to be components of the interactive
          session rather than of a runner attached to a single pipeline
          instance.
        - E.g., if I run a pipeline object with a different interactive
          runner instance, the interactivity should still work.
        - The relationship should look like: session --> interactivity +
          pipeline --> runner.
        - But the current design heavily relies on: pipeline-as-session -->
          runner --> interactivity.
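As a rough illustration of the decoupling we are after (all names below are
hypothetical placeholders, not a proposed API):

  # Session-scoped state lives outside any runner, so swapping runner
  # instances does not lose interactivity.
  class CacheManager(object):
    """Placeholder for the component that caches PCollection results."""

  class PipelineGraphRenderer(object):
    """Placeholder for the graphviz-based DOT renderer."""

  class InteractiveSession(object):
    def __init__(self):
      self.cache_manager = CacheManager()      # spans multiple pipeline runs
      self.renderer = PipelineGraphRenderer()  # likewise

  _session = InteractiveSession()  # one per interactive (IPython) session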
And we want to add more features:
   - Visualization of PCollections
   - Running the pipeline with other runners
If we put these into the interactive runner:
   - They don't really belong in a Beam runner's functionality.
   - They don't fit a runner's scope in an interactive environment.
   - They add complexity to the implementation details of the interactive
     runner.
   - Neither interactive nor normal Beam users care about those details,
     but we are throwing the code at them anyway.
What we want to achieve:
   - Ease of use.
        - Interactive Beam users can immediately learn the features by
          reading the interactive_beam module code and pydoc, even if they
          skip the user guides and examples (which users always do).
        - Users only need to learn this one module. The interactive_beam
          module hides all implementation details from end users. Users
          will see several static methods with plenty of documentation but
          only a few lines of code.
   - We only need to maintain the usability of these exposed features, no
     matter how the underlying implementation iterates.
What we are doing will reduce the things a user needs to learn when using
Interactive Beam.
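To make this concrete, here is a minimal sketch of what that single module
surface could look like, based on the four interfaces listed in the PR. The
names follow the PR, but the signatures and docstrings are illustrative
only, not the final API:

  # interactive_beam module surface (sketch)
  def watch(watchable):
    """Tells Interactive Beam where pipelines/PCollections are defined,
    e.g. watch(locals()) inside a function that builds a pipeline."""

  def create_pipeline():
    """Returns a Pipeline that works like apache_beam.Pipeline() but whose
    run() does the extra interactive magic (caching, etc.)."""

  def run_pipeline(pipeline):
    """Runs an interactive pipeline as a normal pipeline on a selected
    platform."""

  def visualize(pcoll):
    """Plots the data of a PCollection once its pipeline has been run."""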
> I can see what's the need for the watch.
As explained in the document, code and unit tests, if you define a pipeline
in __main__, you don't need to use watch() at all. So if a user writes the
pipeline directly in notebook cells, they don't need to call watch().
However, we need a way for users to tell us where their pipelines are if
they define them somewhere else.
If I'm a user who defines a function and runs it in __main__:

  def run_pipeline():
    watch(locals())  # A user could put this line anywhere within this
                     # function to apply interactivity even in a local scope.
    p = beam.Pipeline()
    ...
    x = p | "TX" >> tx()
    ...
    p.run()
    ...

  run_pipeline()
There is no way for us to make this interactive automatically because the
pipeline is defined in a local scope. (It's even worse with the original
interactive runner implementation, because the state lives in a local scope
buried inside the p object.)
However, in an interactive environment, users can add code anywhere and
re-execute arbitrary subsets of it from anywhere.
If users add more statements to transform or examine data at the "..."
spots above and only re-execute those parts, watch() is the least amount of
code they need to tell Interactive Beam to support their pipeline, because
all the hidden state is then stored within the global scope of this
interactive session.
It's already unintuitive to define something in a function and expect
interactivity in the outer scope; in plain interactive Python, you don't
really have access to p or x there. But with watch(), we can support
interactive pipelines defined anywhere in user code.
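For example (sketch only; whether watch() accepts a module object in
addition to a locals() dict is an assumption here, my_pipelines is a
hypothetical module, and the import path is assumed from where the
interactive modules live in the repo):

  from apache_beam.runners.interactive import interactive_beam as ib
  import my_pipelines  # hypothetical module that builds `p` and `pcoll`

  ib.watch(my_pipelines)        # tell Interactive Beam where the pipeline lives
  my_pipelines.build_and_run()  # hidden state is now tracked across re-executions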
> Can you also tell us more about how a user would use visualize? Do they
> pass the kind of plot to have?
There will be a separate design doc for that. I can give you an internal
link to it, and an external link will be available once it has been
reviewed internally.
But we do want to support the minimal-argument usage:
  visualize(pcoll)
This will auto-magically plot something for the PCollection object pcoll.
Additionally, we can have a set of optional arguments specifying:
1. What columns of data from pcoll a user wants to render
2. What the dimensions of the rendered data will be, so that multiple
columns can be aggregated into a single dimension
3. Schema and metadata that can be supplied to help render legends, axes,
titles, etc.
4. An enum of plotting types to select from
5. By default, columns, dimensions, schema and plotting type (which can be
a composite of scatter, line and bar charts) are deduced automatically from
the pcoll element type.
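For illustration only, here is one possible shape of that signature; every
parameter name below is hypothetical and the real interface will come from
the separate design doc:

  # Hypothetical signature, not the final API.
  def visualize(pcoll,
                columns=None,     # which columns of the element data to render
                dimensions=None,  # how columns aggregate into plot dimensions
                schema=None,      # schema/metadata for legends, axes, titles
                plot_type=None):  # e.g. scatter, line, bar; deduced if None
    """Plots the given PCollection once its pipeline has been executed."""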
Ning.
On Wed, Aug 14, 2019 at 7:10 PM Pablo Estrada <[email protected]> wrote:
> Hi Ning!
> Thanks for the design doc and the explanations.
>
> I think I can grasp some of the concepts, but it is not yet 100% clear to
> me why it's necessary to define a new abstraction to have interactivity.
> Could you elaborate? Perhaps as a section in the doc? : )
>
> A lot of the motivation for this doc seems related to how we decide which
> PCollections to cache - so as to avoid rerunning parts of a pipeline
> whenever a user decides to visualize specific parts. I think that makes
> sense (and probably helps to have interactivity on streaming).
>
> I agree that it's a little odd that InteractiveRunner receives an
> underlying runner. That certainly suggests that the functionality is
> orthogonal.
>
> So, in short: I think my feedback is similar to others: Can you justify
> further (or reconsider) why pipeline creation and execution need to be
> different?
>
> I can see what's the need for the watch. Can you also tell us more about
> how a user would use visualize? Do they pass the kind of plot to have?
>
> Thanks!
> -P.
>
> On Wed, Aug 14, 2019 at 12:03 PM Ning Kang <[email protected]> wrote:
>
>> Q1:
>> The document is shared (
>> https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing).
>> If inside Google, short link (go/ibeam-external
>> <https://goto.google.com/ibeam-external>). I cannot share internal
>> documents, but you can reach out if you need internal engineering plan.
>>
>> Q2:
>> Yes, watch() is an optimization used for visualize() and for building
>> further on the pipeline. And the user doesn't need to call it if they
>> simply define the pipeline in the notebook.
>>
>> Q3 and Q4:
>> I'm only focusing on the direct runner as the underlying runner. We'll
>> get rid of much of the existing interactive Beam implementation. We can't
>> provide portability for interactivity. Users can run the pipeline with
>> other runners though, thanks to pipeline portability.
>> Our work is to reduce the new concepts a user needs to know when they
>> want to run interactive Beam. The implementation could be arbitrarily
>> complicated and open sourced though. Currently, the interactive runner
>> looks as if it supports all kinds of underlying runners. We want to get
>> rid of that too.
>>
>> On 2019/08/08 00:01:06, Ahmet Altay <[email protected]> wrote:
>> > Ning, thank you for the heads up.
>> >
>> > All, this is a proposed work for improving interactive Beam experience.
>> As
>> > mentioned in Ning's email, new concepts are being introduced. And in
>> > addition iBeam as a name is used as a new reference. I hope that
>> bringing
>> > the discussion to the mailing list will give it the additional
>> > visibility and more people could share their feedback.
>> >
>> > (cc'ing a few folks that might be interested +Robert Bradshaw
>> > <[email protected]> +Valentyn Tymofieiev <[email protected]>
>> +Sindy Li
>> > <[email protected]> +Brian Hulette <[email protected]> )
>> >
>> > Ahmet
>> >
>> >
>> > On Wed, Aug 7, 2019 at 12:36 PM Ning Kang <[email protected]> wrote:
>> >
>> > > To whom it may concern,
>> > >
>> > > This is Ning from Google. We are currently making efforts to leverage
>> an
>> > > interactive runner under python beam sdk.
>> > >
>> > > There is already an interactive Beam (iBeam for short) runner with
>> jupyter
>> > > notebook in the repo
>> > > <
>> https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive
>> >
>> > > .
>> > > Following the instructions on that page, one can set up an interactive
>> > > environment to develop and execute Beam pipeline interactively.
>> > >
>> > > However, there are many issues with existing iBeam. One issue is that
>> it
>> > > uses a concept of leaf PCollection to cache and materialize
>> intermediate
>> > > PCollection. If the user wants to reuse/introspect a non-leaf
>> PCollection,
>> > > the interactive runner will run into errors.
>> > >
>> > > Our initial effort will be fixing the existing issues. And we also
>> want to
>> > > make iBeam easy to use. Since iBeam uses the same model Beam uses,
>> there
>> > > isn't really any difference for users between creating a pipeline with
>> > > interactive runner and other runners.
>> > > So we want to minimize the interfaces a user needs to learn while
>> giving
>> > > the user some capability to interact with the interactive environment.
>> > >
>> > > See this initial PR <https://github.com/apache/beam/pull/9278>, the
>> > > interactive_beam module will provide mainly 4 interfaces:
>> > >
>> > > - For advanced users who define pipeline outside __main__, let them
>> > > tell current interactive environment where they define their
>> pipeline:
>> > > watch()
>> > > - This is very useful for tests where pipeline can be defined in
>> > > test methods.
>> > > - If the user simply creates pipeline in a Jupyter notebook or a
>> > > plain Python script, they don't have to know/use this feature
>> at all.
>> > > - Let users create an interactive pipeline: create_pipeline()
>> > > - invoking create_pipeline(), the user gets a Pipeline object
>> that
>> > > works as any other Pipeline object created from
>> apache_beam.Pipeline()
>> > > - However, the pipeline object p, when invoking p.run(), does
>> some
>> > > extra interactive magic.
>> > > - We'll support interactive execution for DirectRunner at this
>> > > moment.
>> > > - Let users run the interactive pipeline as a normal pipeline:
>> > > run_pipeline()
>> > > - In an interactive environment, a user only needs to add and
>> > > execute 1 line of code run_pipeline(pipeline) to execute any
>> existing
>> > > interactive pipeline object as normal pipeline in any selected
>> platform.
>> > > - We'll probably support Dataflow only. Other implementations
>> can
>> > > be added though.
>> > > - Let users introspect any intermediate PCollection they have
>> handler
>> > > to: visualize()
>> > > - If a user ever writes pcoll = p | "Some Transform" >>
>> > > some_transform() ..., they can visualize(pcoll) once the
>> pipeline p is
>> > > executed.
>> > > - p can be batch or streaming
>> > > - The visualization will be some plot graph of data for the
>> given
>> > > PCollection as if it's materialized. If the PCollection is
>> unbounded, the
>> > > graph is dynamic.
>> > >
>> > > The PR will implement 1 and 2.
>> > >
>> > > We'll use https://issues.apache.org/jira/browse/BEAM-7923 as the top
>> > > level JIRA and add blocking JIRAs as development goes.
>> > >
>> > > External Beam users will not worry about any of the underlying
>> > > implementation details.
>> > > Except the 4 interfaces above, they learn and write normal Beam code
>> and
>> > > can execute the pipeline immediately when they are done with
>> prototyping.
>> > >
>> > > Ning.
>> > >
>> >
>>
>