On Fri, Aug 23, 2019 at 4:25 PM Ning Kang <[email protected]> wrote:

> On Aug 23, 2019, at 3:09 PM, Robert Bradshaw <[email protected]> wrote:

> Cool, sounds like we're getting closer to the same page. Some more replies below.

> On Fri, Aug 23, 2019 at 1:47 PM Ning Kang <[email protected]> wrote:

>> Thanks for the feedback, Robert! I think I got your idea. Let me summarize it to see if it's correct:

>> 1. You want everything about "standard Beam concepts" to follow the existing pattern: we can shoot down create_pipeline() and keep the InteractiveRunner notion when constructing a pipeline. I agree with that. A runner can delegate to another runner, also agreed. Let's keep it that way.

> Despite everything I've written, I'm not convinced that exposing this as a Runner is the most intuitive way to get interactivity either. Given that the "magic" of interactivity is being able to watch PCollections (for inspection and further construction), and if no PCollections are watched execution proceeds as normal, what are your thoughts about making all pipelines "interactive" and just doing the magic iff there are PCollections to watch? (The opt-in incantation here would be ibeam.watch(globals()) or similar.)

> FWIW, Flume has something similar (called marking collections as to be materialized). It has its pros and cons.

> By default __main__ is watched, similar to watch(globals()). If no PCollection variable is being watched, it's not doing any magic. I'm not sure about making all pipelines "interactive", such as by adding an "interactive=True/False" option when constructing the pipeline.

My point was that watch(globals()) (or anything else) would be the explicit opt-in to interactivity, instead of doing interactive=True or manually constructing an InteractiveRunner or anything else.
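To make the two opt-in styles concrete, a minimal sketch: style A uses the InteractiveRunner constructor that exists today; style B uses the ibeam.watch() incantation floated above, which is a hypothetical name, not a settled API:

    import apache_beam as beam
    from apache_beam.runners.direct.direct_runner import DirectRunner
    from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

    # Style A (exists today): opt in by constructing the pipeline with an
    # InteractiveRunner, which delegates execution to an underlying runner.
    p = beam.Pipeline(InteractiveRunner(underlying_runner=DirectRunner()))

    # Style B (proposed): every pipeline is an ordinary pipeline, and the
    # explicit opt-in is watching the variables that hold PCollections.
    import interactive_beam as ibeam  # hypothetical module
    p = beam.Pipeline(DirectRunner())
    ibeam.watch(globals())  # no watched PCollections, no magic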
> Since we couldn't decide which one is more intuitive, I would stick to the existing InteractiveRunner constructor that is open sourced. And we try to avoid changing any code outside …/runners/interactive/.

Yes, we can stick with what's already there for now to avoid blocking any implementation work.

>> 2. watch() and visualize() can be in the independent interactive beam module, since they are "concepts that are unique to being interactive".

>> 3. I'll add some examples for run_pipeline() in the design doc. The short answer is run_pipeline() != p.run(). Thanks for sharing the doc (https://s.apache.org/no-beam-pipeline). As described in the doc, when constructing the pipeline, we still want to bundle a runner and options into the constructed pipeline, even in the future. So if the runner is InteractiveRunner, the interactivity instrumentation (implicitly applied read/write-cache PTransforms and input/output wiring) is only applied when the run_pipeline() of that runner implementation is invoked; p.run() will apply the instrumentation. However, the static function run_pipeline() takes in a new runner and options, invokes the run_pipeline() implementation of the new runner, and so wouldn't have the instrumentation, thus no interactivity.

>> Because you cannot change the runner easily without re-executing all the notebook cells (and don't want to; as seen in the doc, users cannot access the bundled pipeline/options in the future), this shorthand function allows a user to run a pipeline without interactivity immediately, anywhere in a notebook. In the meantime, the pipeline is still bundled with the original InteractiveRunner, and the user can keep developing further pipelines.

>> The usage of this function is not intuitive until you put it in a notebook user scenario where users develop, test in a prod-like env, and develop further. And it's equivalent to users writing "from_runner_api(to_runner_api(pipeline))" in their notebook. It's just a shorthand.

> What you're trying to work around here is the flaw in the existing API that a user binds the choice of Runner before pipeline construction, rather than at the point of execution. I propose we look at fixing this in Beam itself.

> Then I would propose not exposing this. If late runner binding is supported, we wouldn't even need this. We can write it in an example notebook rather than exposing it.

Sounds good.
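For concreteness, the proposed shorthand would amount to roughly the following sketch (run_pipeline is the name from the design doc, not an existing API; to_runner_api/from_runner_api are real Pipeline methods):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def run_pipeline(pipeline, runner, options=None):
      # Round-tripping through the portable proto yields a copy bundled
      # with the new runner/options instead of the original
      # InteractiveRunner, so it runs without interactivity instrumentation.
      copy = beam.Pipeline.from_runner_api(
          pipeline.to_runner_api(), runner, options or PipelineOptions())
      return copy.run()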
>> 4. And we both agree that the implicit cache is palatable and should be the only thing we use to support interactivity. The cache and the watched pipeline definition (which tells us what to cache) are the main "hidden state" I meant, because the cache mechanism is totally implicit and hidden from the user. A cache is either read or written in a p.run(). If an existing cache is not used in a p.run(), it expires. If the user restarts the IPython kernel, all caches should expire too.

> Depending on how we label items in the cache, they could survive kernel restarts as well. This relates to another useful feature in Beam where, if a batch pipeline fails towards the end, one may want to resume/rerun it from there after fixing the bug, without redoing all the work.

> I would suggest a kernel restart resetting everything. The lifespan of the cache, or of any interactivity, shouldn't exceed the kernel session. And the lifespan of a PCollection cache shouldn't even exceed 2 consecutive pipeline runs if the second run doesn't use or produce it. After all, the resources of a running notebook are limited. We might even need cache eviction or full pipeline re-execution when memory or disk space is not enough.

There are pros and cons, but generally the user experience will be better the more that is cached (even across sessions--history preserved across sessions is one of the big wins of IPython vs. the built-in prompt) up to the point where resource constraints become prohibitive.
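To illustrate the labeling point, one hypothetical scheme keys each cache entry by a fingerprint of the pipeline proto plus the producing transform, so an unchanged producer could be recognized, and its cache reused, even after a kernel restart. A sketch only; cache_key and this keying scheme are not an existing API:

    import hashlib
    import apache_beam as beam

    def cache_key(pipeline, pcoll):
      # Fingerprint the portable pipeline proto plus the label of the
      # transform producing this PCollection. The key is stable across
      # kernel restarts; a real scheme would hash only the subgraph
      # upstream of pcoll so unrelated edits don't invalidate the entry.
      proto = pipeline.to_runner_api().SerializeToString(deterministic=True)
      label = pcoll.producer.full_label.encode('utf-8')
      return hashlib.sha256(proto + label).hexdigest()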
>> 5. Existing InteractiveRunner has the following portability <https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive#portability>. That's why I said the interactivity (implementation) needs to be tailored for different underlying runners. If we allow users to pass in all kinds of underlying runners (even their in-house ones), we have to support the interactivity for all of them, which we probably won't. That's why we wanted a create_pipeline() wrapper, so that in a notebook, a pipeline being built is bundled with the DirectRunner by default. The focus on the Direct Runner is also related to our objective: we want to provide an easy-to-use notebook, and a notebook environment where users can interactively execute pipelines without worrying about setup (especially when the setup is not Beam but Interactive Beam related).

>> 6. We don't fix typos for user-defined transforms.

>> I'm talking about pruning, like having a cell with

>>   pcoll | beam.Map(lambda x: expression_with_typo)

>> and then fixing it (and re-evaluating) with

>>   pcoll | beam.Map(lambda x: expression_with_typo_fixed)

>> where the former Map would *always* fail and never get removed from the pipeline.

>> We never change the pipeline defined by the user. Interactivity is applied to a copy of the user-defined pipeline.

> Sure. But does the executed copy of the pipeline contain the bad Map operation in it? If so, it in essence "poisons" the entire pipeline, forcing a user to re-create and re-define it from the start to make forward progress (which results in quite a poor user experience--the errors in cell N manifest in cell M, but worse, fixing and re-executing cell N doesn't fix cell M). If not, how is it intelligently excluded (in a way that is not too dissimilar from non-interactive mode, and doesn't cause surprises with the p.run vs. run_pipeline difference)?

> The copy is always one-shot. When p.run() is called, a copy is instrumented with additional PTransforms and some re-wiring, it's executed by the underlying runner, and it's gone.

> I think I know what you are saying here. It's similar to re-executing a cell with an anonymous PTransform: should the user expect transforms in parallel (no pruning) or the original transforms replaced (with pruning)? If the bad transform is in the middle of several statements in a cell, there is no rollback. The user could re-execute the cell but cannot replace the failed/bad PTransform or remove succeeded/good PTransforms that are now duplicated. Re-executing such cells after fixing the typo would "append" several PTransforms again, but those not needed anymore are still dangling branches in the pipeline. It might not affect the correctness of the pipeline, but it does change the shape of the pipeline from the user's expectation if they expect a replacement.

> It could be a great feature. I would propose:

> 1. Either not implementing it for now, because the behavior is consistent with building a pipeline in non-interactive mode, and such a replacement might not always be valid (see the next answer).

> 2. Or, if we decide that interactive mode in this situation will behave differently from non-interactive mode, we can implement it: try replacing the PTransforms (by appending the new and pruning the old) if the input/output wiring is compatible, and print some warning (on success or failure) saying it's interactive mode.

> I prefer 1, because we shouldn't try too hard to implicitly change the user's pipeline when a user might expect either no pruning (those familiar with Beam and IPython) or pruning (those not familiar with Beam or IPython), while we cannot guarantee that a pruning is going to happen when the new transform is not compatible with the one with the typo. A user could re-execute a cell 5 times, but we cannot determine whether the user wants the same transform applied 5 times to the same input (might be useless, but valid) or has made some valid change and changed their mind 5 times. I still like the behavior where "executing a cell" is equivalent to "appending the code for execution" in the IPython session. What's executed is executed. We shouldn't take the metadata of source-code changes in an executed cell into the pipeline construction process. But I'm open to 2, with the try-our-best-and-notify-users route.

I don't see any good ways to resolve the ambiguity with (2), so I think we're stuck with (1), at least for v1.
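For reference, (1) is the behavior Beam's pipeline.apply() already gives today. A minimal sketch of the append-vs-replace asymmetry, with re-executed notebook cells approximated by repeated statements:

    import apache_beam as beam

    p = beam.Pipeline()
    pcoll = p | beam.Create([1, 2, 3])

    # Anonymous transforms get auto-generated labels (in a notebook the
    # label encodes where the lambda was defined), so re-executing a cell
    # appends a second, parallel Map; the old one dangles in the graph.
    pcoll | beam.Map(lambda x: x + 1)
    pcoll | beam.Map(lambda x: x + 1)  # appended, not replaced

    # Named transforms collide on their full label instead: re-applying
    # raises rather than silently appending or replacing.
    pcoll | 'AddOne' >> beam.Map(lambda x: x + 1)
    pcoll | 'AddOne' >> beam.Map(lambda x: x + 1)  # RuntimeError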
>> 7. One approach was that the pipeline construction is re-executed every time (i.e., the "pipeline" object to run is really a callback, like a callable or a PTransform) and then there's no ambiguity here.

>> I didn't quite get it. Pipeline construction only happens when a user executes a cell with the pipeline construction code. Are you suggesting changing the logic in pipeline.apply() to always reapply/replace a named PTransform? I don't think we (Interactive Beam) can decide that, because it changes the behavior of Beam. We had some thought of subclassing Pipeline, using the create_pipeline() method to create the subclassed pipeline object, and then intercepting pipeline.apply() to always replace a PTransform with an existing full label while otherwise applying the parent Pipeline's apply() logic. It seems to be a no-go to me now.

> Sorry I wasn't clear. I'm referring to the style in the above doc about getting rid of the Pipeline object (as a long-lived thing, at least). In this case the actual execution of pipeline construction never spans multiple cells (though its implementation might, via function calls), so one never has out-of-date transforms dangling off the pipeline object.

> I think the desired experience is allowing users to define their pipelines across multiple cells or even multiple notebooks, because the users can build a pipeline, run it, add more PTransforms, and run again. And the pipeline object is long-lived, like any other variable in a notebook, until it's re-evaluated.

I'm not convinced a long-lived pipeline is going to be a good experience due to the append-only, poisoning issues described above, and would be interested in an experience that avoids this.

> The scenario we both feel is not intuitive is "re-execute some previous cells with modified PTransforms and run". It isn't always valid to replace a PTransform in a fully grown pipeline (so I think the existing behavior--raising an error for a named PTransform and appending more for an anonymous PTransform--is still the best option). Re-executing cells is going to append more PTransforms. In that case, users have to re-execute all cells from the beginning to construct a new pipeline object. It's like defining, in a notebook, a class Foo with a string field x and other fields lazily initialized from x; then, in a later cell, the user sets x to an integer. Syntactically, the code still works (in Python). But when the lazily initialized fields of the Foo object are evaluated, they fail.

Yep, and in this case a good design would be to not make x a public(ly settable) member, and definitely not to encourage patterns that require setting x after construction. I'm trying to similarly avoid error-prone patterns here. But it's possible that this issue won't be resolved until we have enough of the infrastructure to play around with it and get even more hands-on and third-party experience.
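For reference, the Foo analogy above sketched in code (purely illustrative):

    class Foo:
      def __init__(self, x: str):
        self.x = x  # publicly settable: the error-prone part

      @property
      def shout(self):
        # Lazily derived from x; only evaluated on access.
        return self.x.upper() + '!'

    foo = Foo('hello')
    foo.shout   # 'HELLO!'
    foo.x = 42  # a later cell re-binds x; syntactically fine
    foo.shout   # AttributeError: 'int' object has no attribute 'upper'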
