On Wed, Oct 19, 2022 at 2:43 PM Ravi Kapoor <kapoorrav...@gmail.com> wrote:
> I am talking about the batch context. Can we do checkpointing in batch
> mode as well?
> I am *not* looking for any failure or retry algorithm.
> The requirement is simply to materialize a PCollection that can be used
> across jobs / within the job, in some view/temp table that is auto-deleted.
> I believe Reshuffle
> <https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/Reshuffle.html>
> is for streaming. Right?
>
> Thanks,
> Ravi
>
> On Wed, Oct 19, 2022 at 1:32 PM Israel Herraiz via dev <
> dev@beam.apache.org> wrote:
>
>> I think that would be a Reshuffle
>> <https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/Reshuffle.html>,
>> but only within the context of the same job (e.g. if there is a failure
>> and a retry, the retry would start from the checkpoint created by the
>> reshuffle). In Dataflow, a group by key, a combine per key, a cogroup by
>> key, stateful DoFns and, I think, splittable DoFns will also have the
>> same effect of creating a checkpoint (any shuffling operation will always
>> create a checkpoint).
>>
>> If you want to start a different job (slightly updated code, starting
>> from a previous point of a previous job), in Dataflow that would be a
>> snapshot <https://cloud.google.com/dataflow/docs/guides/using-snapshots>,
>> I think. Snapshots only work in streaming pipelines.
>>
>> On Wed, 19 Oct 2022 at 08:45, Ravi Kapoor <kapoorrav...@gmail.com> wrote:
>>
>>> Hi Team,
>>> Can we stage PCollection<TableRow> or PCollection<Row> data? Let's say,
>>> to save the expensive operations between two complex BQ tables time and
>>> again, we materialize the result in some temp view that is deleted after
>>> the session.
>>>
>>> Is it possible to do that in a Beam pipeline?
>>> We could later read the data from the temp view in another pipeline and
>>> do further processing.
>>>
>>> Or, in general, I would like to know: do we ever stage a PCollection?
>>> Let's say I want to create another instance of the same job, which has
>>> complex processing.
>>> Does the pipeline re-perform the computation, or would it pick up the
>>> already processed data from the previous instance, which must have been
>>> staged somewhere?
>>>
>>> In Spark we have the notion of createOrReplaceTempView, which is used to
>>> create a temp table from a Spark DataFrame or Dataset.
>>>
>>> Please advise.
>>>
>>> --
>>> Thanks,
>>> Ravi Kapoor
>>> +91-9818764564
>>> kapoorrav...@gmail.com
>>
>
> --
> Thanks,
> Ravi Kapoor
> +91-9818764564
> kapoorrav...@gmail.com

--
Thanks,
Ravi Kapoor
+91-9818764564
kapoorrav...@gmail.com
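
A minimal Java SDK sketch of the within-job checkpoint Israel describes.
Reshuffle is not streaming-only: in batch it also forces a shuffle boundary,
so the runner materializes the elements and downstream retries restart from
the shuffled data rather than re-running the expensive step. The transform
names and the "ExpensiveStep" placeholder below are illustrative, not from
the thread:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ReshuffleCheckpointSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Stand-in for the expensive computation between the two BQ tables.
    PCollection<String> expensiveResult =
        p.apply("Source", Create.of("a", "b", "c"))
            .apply("ExpensiveStep",
                MapElements.into(TypeDescriptors.strings())
                    .via(x -> x.toUpperCase()));

    // Reshuffle forces a shuffle boundary: the elements are materialized
    // by the runner, fusion with ExpensiveStep is broken, and downstream
    // retries re-read the shuffled data instead of recomputing it.
    PCollection<String> checkpointed =
        expensiveResult.apply("Checkpoint", Reshuffle.viaRandomKey());

    // ... further transforms consume `checkpointed` ...
    p.run().waitUntilFinish();
  }
}

Note this only helps within a single job, as Israel says; nothing survives
the job for another pipeline to pick up.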
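For the cross-job staging in the original question, one sketch of
approximating Spark's createOrReplaceTempView is to write the intermediate
PCollection<TableRow> to a BigQuery staging table in one job and read it
back in the next. The table "my-project:staging.result" is hypothetical,
and the "auto deleted" behaviour would come from putting it in a dataset
with a default table expiration, not from Beam itself:

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class StageToBigQuerySketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Stand-in for the output of the complex join between the two BQ tables.
    PCollection<TableRow> expensiveResult =
        p.apply("FakeResult",
            Create.of(new TableRow().set("id", "r1"))
                .withCoder(TableRowJsonCoder.of()));

    TableSchema schema =
        new TableSchema()
            .setFields(Collections.singletonList(
                new TableFieldSchema().setName("id").setType("STRING")));

    // Job 1: persist the intermediate result to the staging table.
    expensiveResult.apply("Stage",
        BigQueryIO.writeTableRows()
            .to("my-project:staging.result") // hypothetical table
            .withSchema(schema)
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE));

    p.run().waitUntilFinish();

    // Job 2, built as a separate pipeline, can then start from the staged
    // rows instead of redoing the computation:
    //   p2.apply(BigQueryIO.readTableRows().from("my-project:staging.result"));
  }
}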