On Wed, Oct 19, 2022 at 2:43 PM Ravi Kapoor <kapoorrav...@gmail.com> wrote:
> I am talking about the batch context. Can we do checkpointing in batch
> mode as well?
> I am *not* looking for any failure or retry algorithm.
> The requirement is simply to materialize a PCollection that can be used
> across jobs / within the job, in some view/temp table that is auto-deleted.
> I believe Reshuffle
> <https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/Reshuffle.html>
> is for streaming. Right?
>
> Thanks,
> Ravi
>
> On Wed, Oct 19, 2022 at 1:32 PM Israel Herraiz via dev <
> dev@beam.apache.org> wrote:
>
>> I think that would be a Reshuffle
>> <https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/Reshuffle.html>,
>> but only within the context of the same job (e.g. if there is a failure
>> and a retry, the retry would start from the checkpoint created by the
>> reshuffle). In Dataflow, a group by key, a combine per key, a cogroup by
>> key, stateful DoFns and, I think, splittable DoFns will also have the
>> same effect of creating a checkpoint (any shuffling operation will always
>> create a checkpoint).
>>
>> If you want to start a different job (slightly updated code, starting
>> from a previous point of a previous job), in Dataflow that would be a
>> snapshot <https://cloud.google.com/dataflow/docs/guides/using-snapshots>,
>> I think. Snapshots only work in streaming pipelines.
>>
>> On Wed, 19 Oct 2022 at 08:45, Ravi Kapoor <kapoorrav...@gmail.com> wrote:
>>
>>> Hi Team,
>>> Can we stage PCollection<TableRow> or PCollection<Row> data? Let's say,
>>> to save the expensive operations between two complex BQ tables time and
>>> again, we materialize the result in some temp view that is deleted after
>>> the session.
>>>
>>> Is it possible to do that in a Beam pipeline?
>>> We could later read the data from the temp view in another pipeline and
>>> do further processing.
>>>
>>> Or, in general, I would like to know: do we ever stage a PCollection?
>>> Let's say I want to create another instance of the same job, which has
>>> complex processing.
>>> Does the pipeline re-perform the computation, or would it pick up the
>>> already processed data from the previous instance, which must have been
>>> staged somewhere?
>>>
>>> In Spark we have the notion of createOrReplaceTempView, which is used to
>>> create a temp table from a Spark DataFrame or Dataset.
>>>
>>> Please advise.
>>>
>>> --
>>> Thanks,
>>> Ravi Kapoor
>>> +91-9818764564
>>> kapoorrav...@gmail.com
>>
>
> --
> Thanks,
> Ravi Kapoor
> +91-9818764564
> kapoorrav...@gmail.com

--
Thanks,
Ravi Kapoor
+91-9818764564
kapoorrav...@gmail.com
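
A minimal Java SDK sketch of the within-job checkpoint Israel describes.
Reshuffle is not streaming-only: in batch it also forces a shuffle boundary,
so the runner materializes the elements and downstream retries restart from
the shuffled data rather than re-running the expensive step. The transform
names and the "ExpensiveStep" placeholder below are illustrative, not from
the thread:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ReshuffleCheckpointSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Stand-in for the expensive computation between the two BQ tables.
    PCollection<String> expensiveResult =
        p.apply("Source", Create.of("a", "b", "c"))
            .apply("ExpensiveStep",
                MapElements.into(TypeDescriptors.strings())
                    .via(x -> x.toUpperCase()));

    // Reshuffle forces a shuffle boundary: the elements are materialized
    // by the runner, fusion with ExpensiveStep is broken, and downstream
    // retries re-read the shuffled data instead of recomputing it.
    PCollection<String> checkpointed =
        expensiveResult.apply("Checkpoint", Reshuffle.viaRandomKey());

    // ... further transforms consume `checkpointed` ...
    p.run().waitUntilFinish();
  }
}

Note this only helps within a single job, as Israel says; nothing survives
the job for another pipeline to pick up.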
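For the cross-job staging in the original question, one sketch of
approximating Spark's createOrReplaceTempView is to write the intermediate
PCollection<TableRow> to a BigQuery staging table in one job and read it
back in the next. The table "my-project:staging.result" is hypothetical,
and the "auto deleted" behaviour would come from putting it in a dataset
with a default table expiration, not from Beam itself:

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class StageToBigQuerySketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Stand-in for the output of the complex join between the two BQ tables.
    PCollection<TableRow> expensiveResult =
        p.apply("FakeResult",
            Create.of(new TableRow().set("id", "r1"))
                .withCoder(TableRowJsonCoder.of()));

    TableSchema schema =
        new TableSchema()
            .setFields(Collections.singletonList(
                new TableFieldSchema().setName("id").setType("STRING")));

    // Job 1: persist the intermediate result to the staging table.
    expensiveResult.apply("Stage",
        BigQueryIO.writeTableRows()
            .to("my-project:staging.result") // hypothetical table
            .withSchema(schema)
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE));

    p.run().waitUntilFinish();

    // Job 2, built as a separate pipeline, can then start from the staged
    // rows instead of redoing the computation:
    //   p2.apply(BigQueryIO.readTableRows().from("my-project:staging.result"));
  }
}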