I think that would be a Reshuffle <https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/Reshuffle.html>, but only within the context of the same job (e.g. if there is a failure and a retry, the retry starts from the checkpoint created by the Reshuffle). In Dataflow, a GroupByKey, a Combine per key, a CoGroupByKey, stateful DoFns and, I think, splittable DoFns have the same effect of creating a checkpoint (any shuffling operation always creates a checkpoint).
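
For illustration, a minimal sketch of that in the Java SDK (the Create source and the MapElements steps are just stand-ins for your expensive computation, so adjust to your pipeline):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ReshuffleCheckpointExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    // Stand-in for an expensive computation (e.g. a complex join of two BQ tables).
    PCollection<String> expensive =
        p.apply("Source", Create.of("a", "b", "c"))
         .apply("ExpensiveStep",
             MapElements.into(TypeDescriptors.strings()).via((String x) -> x.toUpperCase()));

    // The Reshuffle introduces a shuffle boundary: within this job, a downstream
    // retry resumes from the materialized output of ExpensiveStep instead of
    // recomputing it.
    PCollection<String> checkpointed = expensive.apply("Checkpoint", Reshuffle.viaRandomKey());

    checkpointed.apply("Downstream",
        MapElements.into(TypeDescriptors.strings()).via((String x) -> x + "-processed"));

    p.run().waitUntilFinish();
  }
}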
If you want to start a different job (slightly updated code, starting from a previous point of a previous job), in Dataflow that would be a snapshot <https://cloud.google.com/dataflow/docs/guides/using-snapshots>, I think; there is a rough sketch of the commands after the quoted message below. Note that snapshots only work in streaming pipelines.

On Wed, 19 Oct 2022 at 08:45, Ravi Kapoor <kapoorrav...@gmail.com> wrote:

> Hi Team,
> Can we stage PCollection<TableRow> or PCollection<Row> data? Let's say
> to save the expensive operations between two complex BQ tables time and
> again and materialize it in some temp view which will be deleted after the
> session.
>
> Is it possible to do that in a Beam pipeline?
> We can later use the temp view in another pipeline to read the data from
> and do processing.
>
> Or, in general, I would like to know: do we ever stage the PCollection?
> Let's say I want to create another instance of the same job which has
> complex processing.
> Does the pipeline re-perform the computation, or would it pick up the already
> processed data from the previous instance that must be staged somewhere?
>
> Like in Spark, we do have the notion of createOrReplaceTempView, which is used
> to create a temp table from a Spark DataFrame or Dataset.
>
> Please advise.
>
> --
> Thanks,
> Ravi Kapoor
> +91-9818764564
> kapoorrav...@gmail.com
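
For the snapshot route, roughly the following (from memory, so please double-check the flags against the doc linked above; JOB_ID and the region are placeholders):

# Take a snapshot of the running streaming job.
gcloud dataflow snapshots create \
    --job-id=JOB_ID \
    --snapshot-ttl=7d \
    --snapshot-sources=true \
    --region=us-central1

# Then launch the updated pipeline from it, e.g. with the Dataflow runner's
# --createFromSnapshot=SNAPSHOT_ID pipeline option (if I remember correctly).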