I think that would be a Reshuffle <https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/Reshuffle.html>, but only within the context of the same job (e.g. if there is a failure and a retry, the retry starts from the checkpoint created by the Reshuffle). In Dataflow, a GroupByKey, a Combine per key, a CoGroupByKey, stateful DoFns and, I think, splittable DoFns have the same effect of creating a checkpoint (any shuffling operation always creates a checkpoint).
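
For illustration, a minimal sketch of that in the Java SDK (the Create source and the MapElements steps are just stand-ins for your expensive computation, so adjust to your pipeline):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ReshuffleCheckpointExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    // Stand-in for an expensive computation (e.g. a complex join of two BQ tables).
    PCollection<String> expensive =
        p.apply("Source", Create.of("a", "b", "c"))
         .apply("ExpensiveStep",
             MapElements.into(TypeDescriptors.strings()).via((String x) -> x.toUpperCase()));

    // The Reshuffle introduces a shuffle boundary: within this job, a downstream
    // retry resumes from the materialized output of ExpensiveStep instead of
    // recomputing it.
    PCollection<String> checkpointed = expensive.apply("Checkpoint", Reshuffle.viaRandomKey());

    checkpointed.apply("Downstream",
        MapElements.into(TypeDescriptors.strings()).via((String x) -> x + "-processed"));

    p.run().waitUntilFinish();
  }
}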
If you want to start a different job (slightly updated code, starting from a previous point of a previous job), in Dataflow that would be a snapshot <https://cloud.google.com/dataflow/docs/guides/using-snapshots>, I think; there is a rough sketch of the commands after the quoted message below. Note that snapshots only work in streaming pipelines.

On Wed, 19 Oct 2022 at 08:45, Ravi Kapoor <kapoorrav...@gmail.com> wrote:

> Hi Team,
> Can we stage PCollection<TableRow> or PCollection<Row> data? Let's say
> to save the expensive operations between two complex BQ tables time and
> again and materialize it in some temp view which will be deleted after the
> session.
>
> Is it possible to do that in a Beam pipeline?
> We can later use the temp view in another pipeline to read the data from
> and do processing.
>
> Or, in general, I would like to know: do we ever stage the PCollection?
> Let's say I want to create another instance of the same job which has
> complex processing.
> Does the pipeline re-perform the computation, or would it pick up the already
> processed data from the previous instance that must be staged somewhere?
>
> Like in Spark, we do have the notion of createOrReplaceTempView, which is used
> to create a temp table from a Spark DataFrame or Dataset.
>
> Please advise.
>
> --
> Thanks,
> Ravi Kapoor
> +91-9818764564
> kapoorrav...@gmail.com
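
For the snapshot route, roughly the following (from memory, so please double-check the flags against the doc linked above; JOB_ID and the region are placeholders):

# Take a snapshot of the running streaming job.
gcloud dataflow snapshots create \
    --job-id=JOB_ID \
    --snapshot-ttl=7d \
    --snapshot-sources=true \
    --region=us-central1

# Then launch the updated pipeline from it, e.g. with the Dataflow runner's
# --createFromSnapshot=SNAPSHOT_ID pipeline option (if I remember correctly).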