Cheers for the reply, that's helpful.
Can anyone tell me whether it's okay for a pipeline to be run more than once? As near as I can tell that's what I need: I have some code that I want to run after the pipeline executes, and I want to wait for the pipeline to finish before it runs. I don't want to just create a new pipeline, as I have PCollections I still want to use (roughly what I have in mind is sketched at the bottom of this message).

thanks

On Sunday, July 11, 2021 10:50 PM, Alex Koay <[email protected]> wrote:

> From my understanding, you need the Pipeline for mainly two things:
>
> 1. Marking the start of any processing flows (it serves as the PBegin "PCollection"), so any sources that follow it will run.
>
> 2. Running / executing / deploying the pipeline -- this happens automatically with the context manager in your example, but otherwise you can run pipeline.run() to get the same effect.
>
> On Mon, Jul 12, 2021 at 10:04 AM <[email protected]> wrote:
>
>> Hi,
>>
>> When using the Python SDK I'm a little confused as to when the pipeline object is actually needed. I gather one needs it initially to create a PCollection, because that is where I most often see it consistently used, e.g.:
>>
>> with beam.Pipeline() as pipeline:
>>     dict_pc = (
>>         pipeline
>>         | beam.io.fileio.MatchFiles("./*.csv")
>>         | 'Read matched files' >> beam.io.fileio.ReadMatches()
>>         | 'Get CSV data as a dict' >> beam.FlatMap(my_csv_reader)
>>     )
>>
>>     # do stuff with dict_pc and other operations
>>
>> But beyond this, when does one need the pipeline object? The transforms seem to expect a PCollection and output a PCollection, so I'm confused and not finding documentation that addresses this. Thank you.
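P.S. For concreteness, here is roughly the pattern I have in mind. This is only a sketch: my_csv_reader here is just a stand-in for the helper from my earlier message, and I'm assuming pipeline.run() returns a PipelineResult whose wait_until_finish() blocks until the run completes. The second run() call on the same pipeline object is exactly the part I'm not sure is allowed.

import csv

import apache_beam as beam


def my_csv_reader(readable_file):
    # Stand-in for the helper in my earlier message: yield each CSV row as a dict.
    with readable_file.open() as handle:
        yield from csv.DictReader(handle.read().decode('utf-8').splitlines())


# Build the pipeline without the context manager so it is not run automatically.
pipeline = beam.Pipeline()

dict_pc = (
    pipeline
    | beam.io.fileio.MatchFiles("./*.csv")
    | 'Read matched files' >> beam.io.fileio.ReadMatches()
    | 'Get CSV data as a dict' >> beam.FlatMap(my_csv_reader)
)

# First run: execute what has been defined so far and wait for it to finish.
result = pipeline.run()
result.wait_until_finish()

# ...the code I want to run only after the pipeline has completed goes here...

# Then keep using dict_pc and run the same pipeline object a second time.
# This second run() is the part I'm asking about.
_ = dict_pc | 'Print rows' >> beam.Map(print)
pipeline.run().wait_until_finish()

If that second run() call isn't supported, any pointer to the recommended pattern would be appreciated.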
