Re: Staging a PCollection in Beam | Dataflow Runner

2022-10-19 Thread Reuven Lax via dev
PCollections are usually persistent within a pipeline, so you can reuse
them in other parts of the same pipeline with no problem.

There is no notion of state across pipelines - every pipeline is
independent. If you want state across pipelines, you can write the
PCollection out to a set of files and read them back in the new
pipeline.
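
A minimal Java sketch of that hand-off, assuming a PCollection of
strings and a hypothetical staging path (the Create step stands in for
the real expensive computation):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class StagePCollection {
  public static void main(String[] args) {
    // Pipeline 1: run the expensive computation once and persist the result.
    Pipeline writer = Pipeline.create();
    PCollection<String> expensive =
        writer.apply("ExpensiveResult", Create.of("row1", "row2"));
    expensive.apply("StageToFiles",
        TextIO.write().to("gs://my-bucket/staging/result").withSuffix(".txt"));
    writer.run().waitUntilFinish();

    // Pipeline 2: read the staged files back instead of recomputing.
    Pipeline reader = Pipeline.create();
    PCollection<String> staged =
        reader.apply("ReadStaged",
            TextIO.read().from("gs://my-bucket/staging/result*.txt"));
    // ...further processing on `staged`...
    reader.run().waitUntilFinish();
  }
}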

On Tue, Oct 18, 2022 at 11:45 PM Ravi Kapoor  wrote:

> Hi Team,
> Can we stage a PCollection or PCollection data? Let's say we want to
> avoid re-running the expensive operations between two complex BQ tables
> time and again, and materialize the result in some temp view which will
> be deleted after the session.
>
> Is it possible to do that in a Beam pipeline?
> We could later read the data from the temp view in another pipeline and
> do further processing.
>
> Or, in general, I would like to know: do we ever stage the PCollection?
> Let's say I want to create another instance of the same job, which has
> complex processing.
> Does the pipeline re-perform the computation, or would it pick up the
> already processed data from the previous instance, which must be staged
> somewhere?
>
> In Spark, for example, we have the notion of createOrReplaceTempView,
> which is used to create a temp table from a Spark DataFrame or Dataset.
>
> Please advise.
>
> --
> Thanks,
> Ravi Kapoor
> +91-9818764564
> kapoorrav...@gmail.com
>


Re: Staging a PCollection in Beam | Dataflow Runner

2022-10-19 Thread Ravi Kapoor
On Wed, Oct 19, 2022 at 2:43 PM Ravi Kapoor  wrote:

> I am talking about a batch context. Can we do checkpointing in batch mode
> as well?
> I am *not* looking for any failure or retry algorithm.
> The requirement is simply to materialize a PCollection which can be used
> across jobs / within the job, in some view/temp table which is
> auto-deleted.
> I believe Reshuffle is for streaming. Right?
>
> Thanks,
> Ravi
>
> On Wed, Oct 19, 2022 at 1:32 PM Israel Herraiz via dev <
> dev@beam.apache.org> wrote:
>
>> I think that would be a Reshuffle, but only within the context of the
>> same job (e.g. if there is a failure and a retry, the retry would start
>> from the checkpoint created by the reshuffle). In Dataflow, a group by
>> key, a combiner per key, a cogroup by key, stateful DoFns, and I think
>> splittable DoFns will also have the same effect of creating a checkpoint
>> (any shuffling operation will always create a checkpoint).
>>
>> If you want to start a different job (slightly updated code, starting
>> from a previous point of a previous job), in Dataflow that would be a
>> snapshot, I think. Snapshots only work in streaming pipelines.
>>
>> On Wed, 19 Oct 2022 at 08:45, Ravi Kapoor  wrote:
>>
>>> Hi Team,
>>> Can we stage a PCollection or PCollection data? Let's say we want to
>>> avoid re-running the expensive operations between two complex BQ tables
>>> time and again, and materialize the result in some temp view which will
>>> be deleted after the session.
>>>
>>> Is it possible to do that in a Beam pipeline?
>>> We could later read the data from the temp view in another pipeline
>>> and do further processing.
>>>
>>> Or, in general, I would like to know: do we ever stage the PCollection?
>>> Let's say I want to create another instance of the same job, which has
>>> complex processing.
>>> Does the pipeline re-perform the computation, or would it pick up the
>>> already processed data from the previous instance, which must be staged
>>> somewhere?
>>>
>>> In Spark, for example, we have the notion of createOrReplaceTempView,
>>> which is used to create a temp table from a Spark DataFrame or Dataset.
>>>
>>> Please advise.
>>>
>>> --
>>> Thanks,
>>> Ravi Kapoor
>>> +91-9818764564
>>> kapoorrav...@gmail.com
>>>
>>
>
> --
> Thanks,
> Ravi Kapoor
> +91-9818764564
> kapoorrav...@gmail.com
>


-- 
Thanks,
Ravi Kapoor
+91-9818764564
kapoorrav...@gmail.com


Re: Staging a PCollection in Beam | Dataflow Runner

2022-10-19 Thread Ravi Kapoor
I am talking about a batch context. Can we do checkpointing in batch mode
as well?
I am not looking for any failure or retry algorithm.
The requirement is simply to materialize a PCollection which can be used
across jobs / within the job, in some view/temp table which is
auto-deleted.
I believe Reshuffle is for streaming. Right?

Thanks,
Ravi

On Wed, Oct 19, 2022 at 1:32 PM Israel Herraiz via dev 
wrote:

> I think that would be a Reshuffle, but only within the context of the
> same job (e.g. if there is a failure and a retry, the retry would start
> from the checkpoint created by the reshuffle). In Dataflow, a group by
> key, a combiner per key, a cogroup by key, stateful DoFns, and I think
> splittable DoFns will also have the same effect of creating a checkpoint
> (any shuffling operation will always create a checkpoint).
>
> If you want to start a different job (slightly updated code, starting from
> a previous point of a previous job), in Dataflow that would be a snapshot,
> I think. Snapshots only work in streaming pipelines.
>
> On Wed, 19 Oct 2022 at 08:45, Ravi Kapoor  wrote:
>
>> Hi Team,
>> Can we stage a PCollection or PCollection data? Let's say we want to
>> avoid re-running the expensive operations between two complex BQ tables
>> time and again, and materialize the result in some temp view which will
>> be deleted after the session.
>>
>> Is it possible to do that in a Beam pipeline?
>> We could later read the data from the temp view in another pipeline and
>> do further processing.
>>
>> Or, in general, I would like to know: do we ever stage the PCollection?
>> Let's say I want to create another instance of the same job, which has
>> complex processing.
>> Does the pipeline re-perform the computation, or would it pick up the
>> already processed data from the previous instance, which must be staged
>> somewhere?
>>
>> In Spark, for example, we have the notion of createOrReplaceTempView,
>> which is used to create a temp table from a Spark DataFrame or Dataset.
>>
>> Please advise.
>>
>> --
>> Thanks,
>> Ravi Kapoor
>> +91-9818764564
>> kapoorrav...@gmail.com
>>
>

-- 
Thanks,
Ravi Kapoor
+91-9818764564
kapoorrav...@gmail.com


Re: Staging a PCollection in Beam | Dataflow Runner

2022-10-19 Thread Israel Herraiz via dev
I think that would be a Reshuffle, but only within the context of the
same job (e.g. if there is a failure and a retry, the retry would start
from the checkpoint created by the reshuffle). In Dataflow, a group by
key, a combiner per key, a cogroup by key, stateful DoFns, and I think
splittable DoFns will also have the same effect of creating a checkpoint
(any shuffling operation will always create a checkpoint).
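
For illustration, a minimal Java sketch of placing Reshuffle after an
expensive step so the runner can materialize its output rather than
recompute it on downstream retries (the surrounding transforms are
hypothetical stand-ins):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ReshuffleCheckpoint {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    PCollection<String> expensive =
        p.apply(Create.of("a", "b", "c"))
         .apply("ExpensiveStep", MapElements
             .into(TypeDescriptors.strings())
             .via((String x) -> x.toUpperCase()));
    // The shuffle materializes the PCollection at this point, so retries
    // of downstream stages do not re-run ExpensiveStep.
    PCollection<String> checkpointed =
        expensive.apply("Checkpoint", Reshuffle.viaRandomKey());
    checkpointed.apply("Downstream", MapElements
        .into(TypeDescriptors.strings())
        .via((String x) -> x + "!"));
    p.run().waitUntilFinish();
  }
}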

If you want to start a different job (slightly updated code, starting from
a previous point of a previous job), in Dataflow that would be a snapshot,
I think. Snapshots only work in streaming pipelines.

On Wed, 19 Oct 2022 at 08:45, Ravi Kapoor  wrote:

> Hi Team,
> Can we stage a PCollection or PCollection data? Let's say we want to
> avoid re-running the expensive operations between two complex BQ tables
> time and again, and materialize the result in some temp view which will
> be deleted after the session.
>
> Is it possible to do that in a Beam pipeline?
> We could later read the data from the temp view in another pipeline and
> do further processing.
>
> Or, in general, I would like to know: do we ever stage the PCollection?
> Let's say I want to create another instance of the same job, which has
> complex processing.
> Does the pipeline re-perform the computation, or would it pick up the
> already processed data from the previous instance, which must be staged
> somewhere?
>
> In Spark, for example, we have the notion of createOrReplaceTempView,
> which is used to create a temp table from a Spark DataFrame or Dataset.
>
> Please advise.
>
> --
> Thanks,
> Ravi Kapoor
> +91-9818764564
> kapoorrav...@gmail.com
>


Staging a PCollection in Beam | Dataflow Runner

2022-10-19 Thread Ravi Kapoor
Hi Team,
Can we stage a PCollection or PCollection data? Let's say we want to
avoid re-running the expensive operations between two complex BQ tables
time and again, and materialize the result in some temp view which will
be deleted after the session.

Is it possible to do that in a Beam pipeline?
We could later read the data from the temp view in another pipeline and
do further processing.

Or, in general, I would like to know: do we ever stage the PCollection?
Let's say I want to create another instance of the same job, which has
complex processing.
Does the pipeline re-perform the computation, or would it pick up the
already processed data from the previous instance, which must be staged
somewhere?

In Spark, for example, we have the notion of createOrReplaceTempView,
which is used to create a temp table from a Spark DataFrame or Dataset.
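
For reference, a minimal Java sketch of that Spark pattern (the input
path and view name are hypothetical):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TempViewExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("TempViewExample")
        .master("local[*]") // local run, for illustration only
        .getOrCreate();
    Dataset<Row> df = spark.read().parquet("/tmp/expensive_join_result");
    // Session-scoped temp view: queryable by name until the session ends.
    df.createOrReplaceTempView("staged_result");
    Dataset<Row> reused = spark.sql("SELECT * FROM staged_result");
    reused.show();
    spark.stop();
  }
}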

Please advise.

-- 
Thanks,
Ravi Kapoor
+91-9818764564
kapoorrav...@gmail.com