On Fri, Aug 23, 2019 at 4:25 PM Ning Kang <[email protected]> wrote:

> On Aug 23, 2019, at 3:09 PM, Robert Bradshaw <[email protected]> wrote:

> Cool, sounds like we're getting closer to the same page. Some more replies below.

> On Fri, Aug 23, 2019 at 1:47 PM Ning Kang <[email protected]> wrote:

>> Thanks for the feedback, Robert! I think I got your idea. Let me summarize it to see if it's correct:

>> 1. You want everything about "standard Beam concepts" to follow the existing pattern: we can shoot down create_pipeline() and keep the InteractiveRunner notion when constructing a pipeline. I agree with that. A runner can delegate to another runner, also agreed. Let's keep it that way.

> Despite everything I've written, I'm not convinced that exposing this as a Runner is the most intuitive way to get interactivity either. Given that the "magic" of interactivity is being able to watch PCollections (for inspection and further construction), and if no PCollections are watched execution proceeds as normal, what are your thoughts about making all pipelines "interactive" and just doing the magic iff there are PCollections to watch? (The opt-in incantation here would be ibeam.watch(globals()) or similar.)

> FWIW, Flume has something similar (called marking collections as to be materialized). It has its pros and cons.

> By default __main__ is watched, similar to watch(globals()). If no PCollection variable is being watched, it's not doing any magic. I'm not sure about making all pipelines "interactive", such as by adding an "interactive=True/False" option when constructing the pipeline.

My point was that watch(globals()) (or anything else) would be the explicit opt-in to interactivity, instead of doing interactive=True or manually constructing an InteractiveRunner or anything else.
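To make the two opt-in styles concrete, a minimal sketch: style A uses the InteractiveRunner constructor that exists today; style B uses the ibeam.watch() incantation floated above, which is a hypothetical name, not a settled API:

    import apache_beam as beam
    from apache_beam.runners.direct.direct_runner import DirectRunner
    from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

    # Style A (exists today): opt in by constructing the pipeline with an
    # InteractiveRunner, which delegates execution to an underlying runner.
    p = beam.Pipeline(InteractiveRunner(underlying_runner=DirectRunner()))

    # Style B (proposed): every pipeline is an ordinary pipeline, and the
    # explicit opt-in is watching the variables that hold PCollections.
    import interactive_beam as ibeam  # hypothetical module
    p = beam.Pipeline(DirectRunner())
    ibeam.watch(globals())  # no watched PCollections, no magic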
> Since we couldn't decide which one is more intuitive, I would stick to the existing InteractiveRunner constructor that is open sourced. And we try to avoid changing any code outside …/runners/interactive/.

Yes, we can stick with what's already there for now to avoid blocking any implementation work.

>> 2. watch() and visualize() can be in the independent interactive beam module, since they are "concepts that are unique to being interactive".

>> 3. I'll add some examples for run_pipeline() in the design doc. The short answer is run_pipeline() != p.run(). Thanks for sharing the doc (https://s.apache.org/no-beam-pipeline). As described in the doc, when constructing the pipeline, we still want to bundle a runner and options into the constructed pipeline, even in the future. So if the runner is InteractiveRunner, the interactivity instrumentation (implicitly applied read/write-cache PTransforms and input/output wiring) is only applied when the run_pipeline() of that runner implementation is invoked; p.run() will apply the instrumentation. However, the static function run_pipeline() takes in a new runner and options, invokes the run_pipeline() implementation of the new runner, and so wouldn't have the instrumentation, thus no interactivity.

>> Because you cannot change the runner easily without re-executing all the notebook cells (and don't want to; as seen in the doc, users cannot access the bundled pipeline/options in the future), this shorthand function allows a user to run a pipeline without interactivity immediately, anywhere in a notebook. In the meantime, the pipeline is still bundled with the original InteractiveRunner, and the user can keep developing further pipelines.

>> The usage of this function is not intuitive until you put it in a notebook user scenario where users develop, test in a prod-like env, and develop further. And it's equivalent to users writing "from_runner_api(to_runner_api(pipeline))" in their notebook. It's just a shorthand.

> What you're trying to work around here is the flaw in the existing API that a user binds the choice of Runner before pipeline construction, rather than at the point of execution. I propose we look at fixing this in Beam itself.

> Then I would propose not exposing this. If late runner binding is supported, we wouldn't even need this. We can write it in an example notebook rather than exposing it.

Sounds good.
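For concreteness, the proposed shorthand would amount to roughly the following sketch (run_pipeline is the name from the design doc, not an existing API; to_runner_api/from_runner_api are real Pipeline methods):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def run_pipeline(pipeline, runner, options=None):
      # Round-tripping through the portable proto yields a copy bundled
      # with the new runner/options instead of the original
      # InteractiveRunner, so it runs without interactivity instrumentation.
      copy = beam.Pipeline.from_runner_api(
          pipeline.to_runner_api(), runner, options or PipelineOptions())
      return copy.run()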
>> 4. And we both agree that the implicit cache is palatable and should be the only thing we use to support interactivity. The cache and the watched pipeline definition (which tells us what to cache) are the main "hidden state" I meant, because the cache mechanism is totally implicit and hidden from the user. A cache is either read or written in a p.run(). If an existing cache is not used in a p.run(), it expires. If the user restarts the IPython kernel, all caches should expire too.

> Depending on how we label items in the cache, they could survive kernel restarts as well. This relates to another useful feature in Beam where, if a batch pipeline fails towards the end, one may want to resume/rerun it from there after fixing the bug, without redoing all the work.

> I would suggest a kernel restart resetting everything. The lifespan of the cache, or of any interactivity, shouldn't exceed the kernel session. And the lifespan of a PCollection cache shouldn't even exceed 2 consecutive pipeline runs if the second run doesn't use or produce it. After all, the resources of a running notebook are limited. We might even need cache eviction or full pipeline re-execution when memory or disk space is not enough.

There are pros and cons, but generally the user experience will be better the more that is cached (even across sessions--history preserved across sessions is one of the big wins of IPython vs. the built-in prompt) up to the point where resource constraints become prohibitive.
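To illustrate the labeling point, one hypothetical scheme keys each cache entry by a fingerprint of the pipeline proto plus the producing transform, so an unchanged producer could be recognized, and its cache reused, even after a kernel restart. A sketch only; cache_key and this keying scheme are not an existing API:

    import hashlib
    import apache_beam as beam

    def cache_key(pipeline, pcoll):
      # Fingerprint the portable pipeline proto plus the label of the
      # transform producing this PCollection. The key is stable across
      # kernel restarts; a real scheme would hash only the subgraph
      # upstream of pcoll so unrelated edits don't invalidate the entry.
      proto = pipeline.to_runner_api().SerializeToString(deterministic=True)
      label = pcoll.producer.full_label.encode('utf-8')
      return hashlib.sha256(proto + label).hexdigest()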
>> 5. Existing InteractiveRunner has the following portability <https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive#portability>. That's why I said the interactivity (implementation) needs to be tailored for different underlying runners. If we allow users to pass in all kinds of underlying runners (even their in-house ones), we have to support the interactivity for all of them, which we probably won't. That's why we wanted a create_pipeline() wrapper, so that in a notebook, a pipeline being built is bundled with the DirectRunner by default. The focus on the Direct Runner is also related to our objective: we want to provide an easy-to-use notebook, and a notebook environment where users can interactively execute pipelines without worrying about setup (especially when the setup is not Beam but Interactive Beam related).

>> 6. We don't fix typos for user-defined transforms.

>> I'm talking about pruning, like having a cell with

>>   pcoll | beam.Map(lambda x: expression_with_typo)

>> and then fixing it (and re-evaluating) with

>>   pcoll | beam.Map(lambda x: expression_with_typo_fixed)

>> where the former Map would *always* fail and never get removed from the pipeline.

>> We never change the pipeline defined by the user. Interactivity is applied to a copy of the user-defined pipeline.

> Sure. But does the executed copy of the pipeline contain the bad Map operation in it? If so, it in essence "poisons" the entire pipeline, forcing a user to re-create and re-define it from the start to make forward progress (which results in quite a poor user experience--the errors in cell N manifest in cell M, but worse, fixing and re-executing cell N doesn't fix cell M). If not, how is it intelligently excluded (in a way that is not too dissimilar from non-interactive mode, and doesn't cause surprises with the p.run vs. run_pipeline difference)?

> The copy is always one-shot. When p.run() is called, a copy is instrumented with additional PTransforms and some re-wiring, it's executed by the underlying runner, and it's gone.

> I think I know what you are saying here. It's similar to re-executing a cell with an anonymous PTransform: should the user expect transforms in parallel (no pruning) or the original transforms replaced (with pruning)? If the bad transform is in the middle of several statements in a cell, there is no rollback. The user could re-execute the cell but cannot replace the failed/bad PTransform or remove succeeded/good PTransforms that are now duplicated. Re-executing such cells after fixing the typo would "append" several PTransforms again, but those not needed anymore are still dangling branches in the pipeline. It might not affect the correctness of the pipeline, but it does change the shape of the pipeline from the user's expectation if they expect a replacement.

> It could be a great feature. I would propose:

> 1. Either not implementing it for now, because the behavior is consistent with building a pipeline in non-interactive mode, and such a replacement might not always be valid (see the next answer).

> 2. Or, if we decide that interactive mode in this situation will behave differently from non-interactive mode, we can implement it: try replacing the PTransforms (by appending the new and pruning the old) if the input/output wiring is compatible, and print some warning (on success or failure) saying it's interactive mode.

> I prefer 1, because we shouldn't try too hard to implicitly change the user's pipeline when a user might expect either no pruning (those familiar with Beam and IPython) or pruning (those not familiar with Beam or IPython), while we cannot guarantee that a pruning is going to happen when the new transform is not compatible with the one with the typo. A user could re-execute a cell 5 times, but we cannot determine whether the user wants the same transform applied 5 times to the same input (might be useless, but valid) or has made some valid change and changed their mind 5 times. I still like the behavior where "executing a cell" is equivalent to "appending the code for execution" in the IPython session. What's executed is executed. We shouldn't take the metadata of source-code changes in an executed cell into the pipeline construction process. But I'm open to 2, with the try-our-best-and-notify-users route.

I don't see any good ways to resolve the ambiguity with (2), so I think we're stuck with (1), at least for v1.
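For reference, (1) is the behavior Beam's pipeline.apply() already gives today. A minimal sketch of the append-vs-replace asymmetry, with re-executed notebook cells approximated by repeated statements:

    import apache_beam as beam

    p = beam.Pipeline()
    pcoll = p | beam.Create([1, 2, 3])

    # Anonymous transforms get auto-generated labels (in a notebook the
    # label encodes where the lambda was defined), so re-executing a cell
    # appends a second, parallel Map; the old one dangles in the graph.
    pcoll | beam.Map(lambda x: x + 1)
    pcoll | beam.Map(lambda x: x + 1)  # appended, not replaced

    # Named transforms collide on their full label instead: re-applying
    # raises rather than silently appending or replacing.
    pcoll | 'AddOne' >> beam.Map(lambda x: x + 1)
    pcoll | 'AddOne' >> beam.Map(lambda x: x + 1)  # RuntimeError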
>> 7. One approach was that the pipeline construction is re-executed every time (i.e., the "pipeline" object to run is really a callback, like a callable or a PTransform) and then there's no ambiguity here.

>> I didn't quite get it. Pipeline construction only happens when a user executes a cell with the pipeline construction code. Are you suggesting changing the logic in pipeline.apply() to always reapply/replace a named PTransform? I don't think we (Interactive Beam) can decide that, because it changes the behavior of Beam. We had some thought of subclassing Pipeline, using the create_pipeline() method to create the subclassed pipeline object, and then intercepting pipeline.apply() to always replace a PTransform with an existing full label while otherwise applying the parent Pipeline's apply() logic. It seems to be a no-go to me now.

> Sorry I wasn't clear. I'm referring to the style in the above doc about getting rid of the Pipeline object (as a long-lived thing, at least). In this case the actual execution of pipeline construction never spans multiple cells (though its implementation might, via function calls), so one never has out-of-date transforms dangling off the pipeline object.

> I think the desired experience is allowing users to define their pipelines across multiple cells or even multiple notebooks, because the users can build a pipeline, run it, add more PTransforms, and run again. And the pipeline object is long-lived, like any other variable in a notebook, until it's re-evaluated.

I'm not convinced a long-lived pipeline is going to be a good experience due to the append-only, poisoning issues described above, and would be interested in an experience that avoids this.

> The scenario we both feel is not intuitive is "re-execute some previous cells with modified PTransforms and run". It isn't always valid to replace a PTransform in a fully grown pipeline (so I think the existing behavior--raising an error for a named PTransform and appending more for an anonymous PTransform--is still the best option). Re-executing cells is going to append more PTransforms. In that case, users have to re-execute all cells from the beginning to construct a new pipeline object. It's like defining, in a notebook, a class Foo with a string field x and other fields lazily initialized from x; then, in a later cell, the user sets x to an integer. Syntactically, the code still works (in Python). But when the lazily initialized fields of the Foo object are evaluated, they fail.

Yep, and in this case a good design would be to not make x a public(ly settable) member, and definitely not to encourage patterns that require setting x after construction. I'm trying to similarly avoid error-prone patterns here. But it's possible that this issue won't be resolved until we have enough of the infrastructure to play around with it and get even more hands-on and third-party experience.
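For reference, the Foo analogy above sketched in code (purely illustrative):

    class Foo:
      def __init__(self, x: str):
        self.x = x  # publicly settable: the error-prone part

      @property
      def shout(self):
        # Lazily derived from x; only evaluated on access.
        return self.x.upper() + '!'

    foo = Foo('hello')
    foo.shout   # 'HELLO!'
    foo.x = 42  # a later cell re-binds x; syntactically fine
    foo.shout   # AttributeError: 'int' object has no attribute 'upper'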
