Sorry, I might have misunderstood the second topic. The idea is that the Dataframe API "should" naturally work with the InteractiveRunner inside a notebook by converting the data set between DeferredDataFrame and PCollection (a rough sketch of that conversion is below). Still, additional work is needed to fully support it: how is the pipeline defined and used in the REPL environment? Are all the APIs and transforms supported? We'll have those notebooks out once that work is done.
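For concreteness, here is a minimal sketch of that round trip inside a notebook cell. It assumes apache-beam[interactive] is installed; the sample rows and column names are purely illustrative:

import apache_beam as beam
import apache_beam.runners.interactive.interactive_beam as ib
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
from apache_beam.dataframe.convert import to_dataframe, to_pcollection

p = beam.Pipeline(InteractiveRunner())

# A schema-aware PCollection (beam.Row attaches the schema that the
# DataFrame API needs).
rows = p | beam.Create([
    beam.Row(word='cat', count=3),
    beam.Row(word='dog', count=1),
    beam.Row(word='cat', count=2),
])

# PCollection -> DeferredDataFrame, then pandas-like operations...
df = to_dataframe(rows)
totals = df.groupby('word').sum()

# ...and back to a PCollection so interactive_beam can materialize it.
totals_pcoll = to_pcollection(totals, include_indexes=True)
ib.show(totals_pcoll)  # renders the computed result in the notebook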
On Tue, Jan 5, 2021 at 1:55 PM Ning Kang <[email protected]> wrote:

> Hi Sayan,
>
> 1. It's not yet officially supported to use the DataflowRunner as the
> underlying runner with InteractiveRunner. (It's possible to set up GCS
> buckets so that the underlying source recording and PCollection cache
> mechanism work with the DataflowRunner, but it's not recommended.)
> You can use the default DirectRunner with a sample of the data when
> building the pipeline, then run the pipeline with a DataflowRunner over
> the full set of data.
>
> 2. Beam Dataframes
> <https://beam.apache.org/documentation/dsls/dataframes/overview/> has
> been announced.
> You should be able to use the Dataframe APIs and convert the results to
> PCollections with `from apache_beam.dataframe.convert import
> to_pcollection`.
>
> On Tue, Jan 5, 2021 at 8:49 AM Sayan Sanyal <[email protected]> wrote:
>
>> Hello team,
>>
>> As a user of PySpark, I've been following the development of Apache
>> Beam with some interest. My interest was especially piqued when I saw the
>> investment in the Dataframe API, as well as the notebook-based Interactive
>> Runner.
>>
>> I have a few questions that I would love to understand better, so any
>> pointers would be appreciated.
>>
>> 1. For the interactive runner
>> <https://beam.apache.org/releases/pydoc/2.6.0/_modules/apache_beam/runners/interactive/interactive_runner.html#InteractiveRunner>,
>> while the default is the direct runner, can we use Dataflow here
>> instead? I ask because I would love to interactively process large amounts
>> of data that won't fit on my notebook's machine and then inspect it.
>> Specifically, I'm trying to replicate this functionality from Spark in Beam:
>>
>> # read some data from GCS that won't fit in memory
>> df = spark.read.parquet(...)
>>
>> # group by and summarize the data; the shuffle is distributed, because
>> # otherwise the notebook machine would OOM
>> result_df = df.groupby(...).agg(...)
>>
>> # interactively inspect a random sample of rows from the dataframe;
>> # they need not be in order
>> result_df.show(...)
>>
>> 2. Are there any demo notebooks planned combining the Interactive
>> Runner and the Dataframe API? I ask this somewhat leadingly, as I hope that,
>> given the large number of interactive notebook users out there who primarily
>> deal in dataframes, this would be a natural audience to market the
>> APIs to.
>>
>> I appreciate any discussion and thoughts.
>>
>> Thanks,
>> Sayan
>>
>> --
>>
>> Sayan Sanyal
>>
>> Data Scientist on Notifications
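To make point 1 above concrete, here's a rough sketch of prototyping locally on a sample with the default DirectRunner and then submitting the same pipeline code to Dataflow over the full data set. The bucket, project, and file patterns are placeholders, and the Dataflow options are trimmed to the essentials:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def build(p, input_pattern):
    # Same transforms in both runs; only the runner and the input size change.
    return (
        p
        | 'Read' >> beam.io.ReadFromParquet(input_pattern)
        | 'KeyByGroup' >> beam.Map(lambda rec: (rec['key'], rec['value']))
        | 'SumPerKey' >> beam.CombinePerKey(sum)
    )

# Prototype on a small sample with the default DirectRunner.
with beam.Pipeline() as p:
    build(p, 'gs://my-bucket/sample/*.parquet')

# Then run the full data set on Dataflow.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-gcp-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
)
with beam.Pipeline(options=options) as p:
    build(p, 'gs://my-bucket/full/*.parquet')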
