Hi All!


I am happy to announce that an improved Interactive Runner is now available
on master. This Python runner allows for the interactive development of
Beam pipelines in a notebook (and IPython) environment.



The runner still has some bugs that need to be fixed as well as some
refactoring, but it is in a good enough shape to start using it.



Here are the new things you can do with the Interactive Runner:

   -

   Create and execute pipelines within a REPL
   -

   Visualize elements as the pipeline is running
   -

   Materialize PCollections to DataFrames
   -

   Record unbounded sources for deterministic replay
   -

   Replay cached unbounded sources including watermark advancements

The code lives in sdks/python/apache_beam/runners/interactive
<https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive>
and example notebooks are in
sdks/python/apache_beam/runners/interactive/examples
<https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive/examples>
.



To install, use `pip install -e .[interactive]` in your <project
root>/sdks/python directory.

To run, here’s a quick example:

```

import apache_beam as beam

from apache_beam.runners.interactive.interactive_runner import
InteractiveRunner

import apache_beam.runners.interactive.interactive_beam as ib



p = beam.Pipeline(InteractiveRunner())

words = p | beam.Create(['To', 'be', 'or', 'not', 'to', 'be'])

counts = words | 'count' >> beam.combiners.Count.PerElement()



# Shows a dynamically updating display of the PCollection elements

ib.show(counts)



# We can now visualize the data using standard pandas operations.

df = ib.collect(counts)

print(df.info())

print(df.describe())



# Plot the top-10 counted words

df = df.sort_values(by=1, ascending=False)

df.head(n=10).plot(x=0, y=1)

```



Currently, Batch is supported on any runner. Streaming is only supported on
the DirectRunner (non-FnAPI).



I would like to thank the great work of Sindy (@sindyli) and Harsh
(@ananvay) for the initial implementation,

David Yan (@davidyan) who led the project, Ning (@ningk) and myself
(@srohde) for the implementation and design, and Ahmet (@altay), Daniel
(@millsd), Pablo (@pabloem), and Robert (@robertwb) who all contributed a
lot of their time to help with the design and code reviews.



It was a team effort and we wouldn't have been able to complete it without
the help of everyone involved.



Regards,

Sam

Reply via email to