Hello team,

As a PySpark user, I've been following the development of Apache Beam with some interest. My interest was particularly piqued when I saw the investment in the DataFrame API, as well as the notebook-based InteractiveRunner.
I had a few questions that I would love to understand better, so any pointers would be appreciated.

1. For the InteractiveRunner <https://beam.apache.org/releases/pydoc/2.6.0/_modules/apache_beam/runners/interactive/interactive_runner.html#InteractiveRunner>, the default is the direct runner, but can we use Dataflow instead? I ask because I would love to interactively process large amounts of data that won't fit on my notebook's machine, and then inspect the result. Specifically, I'm trying to replicate this Spark functionality in Beam:

    # read some data from GCS that won't fit in memory
    df = spark.read.parquet(...)

    # group by and summarize; the shuffle is distributed, because otherwise
    # the notebook machine would OOM
    result_df = df.groupby(...).agg(...)

    # interactively inspect a random sample of rows from the dataframe;
    # they need not be in order
    result_df.show(...)

2. Are there any demo notebooks planned combining the InteractiveRunner and the DataFrame API? I ask this somewhat leadingly: given the large number of interactive-notebook users out there who primarily deal in dataframes, they would be a natural audience to market these APIs to.

I appreciate any discussion and thoughts.

Thanks,
Sayan

--
Sayan Sanyal
Data Scientist on Notifications
