I am talking about the batch context. Can we do checkpointing in batch mode
as well?
I am looking for any failure/retry algorithm.
The requirement is simply to materialize a PCollection that can be used
across jobs, or within the same job, in some view/temp table that is
auto-deleted.
I believe Reshuffle
<https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/Reshuffle.html>
is
for streaming. Right?
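
For concreteness, here is the kind of usage I have in mind, as a sketch
(Java SDK). The class name and the "expensive" step are made up for
illustration; this assumes Reshuffle also acts as a checkpoint in batch:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ReshuffleCheckpointSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<String> expensive =
        p.apply(Create.of("a", "b", "c"))
         .apply("ExpensiveStep",
             MapElements.into(TypeDescriptors.strings())
                 .via((String s) -> s.toUpperCase())); // stand-in for costly work

    // The reshuffle materializes the upstream result; on a retry the
    // runner should replay from here instead of re-running ExpensiveStep.
    expensive
        .apply(Reshuffle.viaRandomKey())
        .apply("DownstreamStep",
            MapElements.into(TypeDescriptors.strings())
                .via((String s) -> s + "!"));

    p.run();
  }
}
```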

Thanks,
Ravi

On Wed, Oct 19, 2022 at 1:32 PM Israel Herraiz via dev <dev@beam.apache.org>
wrote:

> I think that would be a Reshuffle
> <https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/Reshuffle.html>,
> but only within the context of the same job (e.g. if there is a failure and
> a retry, the retry would start from the checkpoint created by the
> reshuffle). In Dataflow, a GroupByKey, a per-key Combine, a CoGroupByKey,
> stateful DoFns, and I think splittable DoFns will also have the same
> effect of creating a checkpoint (any shuffling operation will always create
> a checkpoint).
>
> If you want to start a different job (slightly updated code, starting from
> a previous point of a previous job), in Dataflow that would be a snapshot
> <https://cloud.google.com/dataflow/docs/guides/using-snapshots>, I think.
> Snapshots only work in streaming pipelines.
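
For reference, creating a snapshot of a running streaming Dataflow job looks
roughly like this with the gcloud CLI (job ID and region are placeholders):

```shell
# Snapshot a running streaming Dataflow job (IDs below are placeholders).
gcloud dataflow snapshots create \
    --job-id=2022-10-19_00_00_00-1234567890123456789 \
    --region=us-central1
```

A new job can then be started from that snapshot, which is what allows the
"slightly updated code, starting from a previous point" scenario above.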
>
> On Wed, 19 Oct 2022 at 08:45, Ravi Kapoor <kapoorrav...@gmail.com> wrote:
>
>> Hi Team,
>> Can we stage PCollection<TableRow> or PCollection<Row> data? Let's say we
>> want to avoid repeating the expensive operations between two complex BQ
>> tables time and again, and instead materialize the result in some temp
>> view that is deleted after the session.
>>
>> Is it possible to do that in the Beam Pipeline?
>> We can later use the temp view in another pipeline to read the data from
>> and do processing.
>>
>> Or, in general, I would like to know: do we ever stage the PCollection?
>> Let's say I want to create another instance of the same job, which has
>> complex processing.
>> Does the pipeline re-perform the computation, or would it pick up the
>> already-processed data from the previous instance, which must be staged
>> somewhere?
>>
>> In Spark, for example, we have the notion of createOrReplaceTempView,
>> which is used to create a temp view from a Spark DataFrame or Dataset.
>>
>> Please advise.
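
Beam has no session-scoped temp view, but the closest pattern I know of is
to persist the PCollection explicitly at the end of one pipeline and read it
back in the next. A sketch (Java SDK); the gs:// paths and class name are
hypothetical, and unlike a temp view the staged files are not auto-deleted
(you would clean them up yourself, e.g. with a bucket lifecycle rule):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class StagePCollectionSketch {
  public static void main(String[] args) {
    // Pipeline 1: run the expensive computation once and stage the result.
    Pipeline writer = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    writer
        .apply(TextIO.read().from("gs://my-bucket/input/*"))      // hypothetical input
        // ... expensive transforms here ...
        .apply(TextIO.write().to("gs://my-bucket/staging/result")); // hypothetical staging path
    writer.run().waitUntilFinish();

    // Pipeline 2: pick up the staged data instead of recomputing it.
    Pipeline reader = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    reader.apply(TextIO.read().from("gs://my-bucket/staging/result*"));
    reader.run().waitUntilFinish();
  }
}
```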
>>
>> --
>> Thanks,
>> Ravi Kapoor
>> +91-9818764564
>> kapoorrav...@gmail.com
>>
>

-- 
Thanks,
Ravi Kapoor
+91-9818764564
kapoorrav...@gmail.com
