I am talking about the batch context. Can we do checkpointing in batch mode as well? I am looking for any failure/retry algorithm. The requirement is simply to materialize a PCollection that can be used across jobs, or within a job, as some view/temp table that is auto-deleted.

I believe Reshuffle
<https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/Reshuffle.html>
is for streaming. Right?
Thanks,
Ravi

On Wed, Oct 19, 2022 at 1:32 PM Israel Herraiz via dev <dev@beam.apache.org> wrote:

> I think that would be a Reshuffle
> <https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/Reshuffle.html>,
> but only within the context of the same job (e.g. if there is a failure and
> a retry, the retry would start from the checkpoint created by the
> reshuffle). In Dataflow, a group by key, a combine per key, a cogroup by
> key, stateful DoFns and, I think, splittable DoFns will also have the same
> effect of creating a checkpoint (any shuffling operation will always create
> a checkpoint).
>
> If you want to start a different job (slightly updated code, starting from
> a previous point of a previous job), in Dataflow that would be a snapshot
> <https://cloud.google.com/dataflow/docs/guides/using-snapshots>, I think.
> Snapshots only work in streaming pipelines.
>
> On Wed, 19 Oct 2022 at 08:45, Ravi Kapoor <kapoorrav...@gmail.com> wrote:
>
>> Hi Team,
>> Can we stage PCollection<TableRow> or PCollection<Row> data? Let's say we
>> want to avoid re-running the expensive operations between two complex BQ
>> tables time and again, and instead materialize the result in some temp
>> view that is deleted after the session.
>>
>> Is it possible to do that in a Beam pipeline? We could later use the temp
>> view in another pipeline to read the data from and do further processing.
>>
>> In general, I would like to know: do we ever stage a PCollection? Let's
>> say I want to create another instance of the same job, which has complex
>> processing. Does the pipeline re-perform the computation, or would it pick
>> up the already processed data from the previous instance, which must be
>> staged somewhere?
>>
>> In Spark we have createOrReplaceTempView, which creates a temp table from
>> a Spark DataFrame or Dataset.
>>
>> Please advise.
>>
>> --
>> Thanks,
>> Ravi Kapoor
>> +91-9818764564
>> kapoorrav...@gmail.com
>

--
Thanks,
Ravi Kapoor
+91-9818764564
kapoorrav...@gmail.com