Hello team,

As a user of PySpark, I've been following the development of Apache
Beam with some interest. My interest was particularly piqued when I saw the
investment in the DataFrame API, as well as the notebook-based Interactive
Runner.

I had a few questions that I would love to understand better, so any
pointers would be appreciated.

1. For the interactive runner
<https://beam.apache.org/releases/pydoc/2.6.0/_modules/apache_beam/runners/interactive/interactive_runner.html#InteractiveRunner>,
the default is the direct runner — is it possible to use Dataflow here
instead? I ask because I would love to interactively process large amounts
of data that won't fit on my notebook's machine, and then inspect the
results. Specifically, I'm trying to replicate this functionality from
Spark in Beam:

# read some data from GCS that won't fit in memory
df = spark.read.parquet(...)

# groupby and summarize the data; the shuffle is distributed, because
# otherwise the notebook machine would OOM
result_df = df.groupby(...).agg(...)

# interactively inspect a random sample of rows from the dataframe;
# they need not be in order
result_df.show(...)
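For concreteness, here is roughly what I'd hope to be able to write in Beam. This is only a sketch based on my reading of the docs — the bucket path is a placeholder, and I may well be wrong about the API names (read_parquet, ib.show) or about whether InteractiveRunner's underlying_runner accepts DataflowRunner at all — corrections welcome:

```python
import apache_beam as beam
from apache_beam.dataframe.io import read_parquet
from apache_beam.runners.interactive import interactive_beam as ib
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

# Hypothetical: back the interactive session with Dataflow instead of
# the direct runner, so the shuffle happens on the service.
p = beam.Pipeline(
    InteractiveRunner(underlying_runner=beam.runners.DataflowRunner()))

# read parquet data from GCS into a deferred dataframe
# (gs://my-bucket/... is a made-up path for illustration)
df = p | read_parquet('gs://my-bucket/data/*.parquet')

# distributed groupby/aggregate, analogous to the Spark snippet above
result_df = df.groupby('key').agg('sum')

# interactively inspect the result back in the notebook
ib.show(result_df)
```

If something along these lines is already supported, a pointer to the relevant docs would be great.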

2. Are there any demo notebooks planned that combine the Interactive
Runner and the DataFrame API? I ask this somewhat leadingly: given the
large number of interactive-notebook users out there who primarily deal
in dataframes, this seems like a natural audience to market the APIs to.

I appreciate any discussion and thoughts.

Thanks,
Sayan

-- 

Sayan Sanyal

Data Scientist on Notifications
