Cheers for the reply, that's helpful.  

  

Can anyone tell me whether it's okay for a pipeline to be run more than once?
The way I see it, I have some code that I want to run after the pipeline
executes, so I wait for it to finish... I don't want to just create a new
pipeline, as I have PCollections I still want to use.  
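
Roughly what I have in mind is something like the sketch below (my_csv_reader
is the same helper as in my original mail; whether that second run() is allowed
is exactly what I'm asking):

    import apache_beam as beam

    pipeline = beam.Pipeline()
    dict_pc = (
        pipeline
        | beam.io.fileio.MatchFiles("./*.csv")
        | 'Read matched files' >> beam.io.fileio.ReadMatches()
        | 'Get CSV data as a dict' >> beam.FlatMap(my_csv_reader)
    )

    result = pipeline.run()
    result.wait_until_finish()   # wait for the first run to complete

    # ...code I want to run once the pipeline has finished...

    pipeline.run()               # is re-running the same pipeline object okay?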

  

thanks  

  

  

On Sunday, July 11, 2021 10:50 PM, Alex Koay <[email protected]> wrote:  

> From my understanding, you mainly need the Pipeline for two things:  
>
>
> 1. Marking the start of any processing flows (it serves as the PBegin
> "PCollection"), so any sources that follow it will run.  
>
>
> 2. Running / executing / deploying the pipeline -- this happens
> automatically with the context manager in your example, but otherwise you
> can call pipeline.run() to get the same effect.  
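>
> For example, these two are roughly equivalent (a minimal sketch; Create and
> Map just stand in for whatever transforms you actually use):
>
> import apache_beam as beam
>
> # context manager: run() is invoked for you when the block exits
> with beam.Pipeline() as pipeline:
>     pipeline | beam.Create([1, 2, 3]) | beam.Map(print)
>
> # explicit: build the pipeline, then run it yourself
> pipeline = beam.Pipeline()
> pipeline | beam.Create([1, 2, 3]) | beam.Map(print)
> result = pipeline.run()
> result.wait_until_finish()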
>
>
>  
>
>
> On Mon, Jul 12, 2021 at 10:04 AM <[email protected]> wrote:  
>
>
>

>> Hi,  
>
>>

>>  
>
>>

>> When using the Python SDK I'm a little confused as to when the pipeline
>> object is actually needed. I gather one needs it initially to create a
>> PCollection, just because this is where I most often see it used
>> consistently, e.g.:  
>>
>> with beam.Pipeline() as pipeline:
>>     dict_pc = (
>>         pipeline
>>         | beam.io.fileio.MatchFiles("./*.csv")
>>         | 'Read matched files' >> beam.io.fileio.ReadMatches()
>>         | 'Get CSV data as a dict' >> beam.FlatMap(my_csv_reader)
>>     )
>>
>>     # do stuff with dict_pc and other operations
>>
>> But beyond this, when does one need the pipeline object? It seems like the
>> transforms expect a PCollection and output a PCollection, so I'm confused
>> and not finding documentation that addresses this. Thank you.  
>
>>

>>  
>

  
