Re: Environmental variables not accessible in Dataflow pipeline

2023-12-22 Thread Evan Galpin
I assume from the previous messages that GCP Dataflow is being used as the pipeline runner. Even without Flex Templates, Runner v2 can use Docker containers to install all dependencies from various sources[1]. I have used Docker containers to solve the same problem you mention: installing a p…
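For context, a custom SDK container for Dataflow Runner v2 is just a Dockerfile based on the Beam SDK image. A minimal sketch — the base image tag and package name are placeholders, not from the thread:

```dockerfile
# Minimal custom worker image for Dataflow Runner v2.
# Base tag and package name are placeholders.
FROM apache/beam_python3.11_sdk:2.52.0
RUN pip install --no-cache-dir some-internal-package
```

The image is then referenced with the `--sdk_container_image` pipeline option when launching the job.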

Re: Can apache beam be used for control flow (ETL workflow)

2023-12-22 Thread Chad Dombrova
Hi, I'm the guy who gave the Movie Magic talk. Since it's possible to write stateful transforms with Beam, it is capable of some very sophisticated flow control. I've not seen a Python framework that combines this with streaming data nearly as well. That said, there aren't a lot of great workin…

Re: [External Sender] Re: [Question] S3 Token Expiration during Read Step

2023-12-22 Thread Ramya Prasad via user
Oops, not sure I replied to all, but I'm using ParquetIO: PCollection&lt;GenericRecord&gt; records = pipeline.apply("Read parquet file in as Generic Records", ParquetIO.read(finalSchema).from(beamReadPath).withConfiguration(configuration)); The variable beamReadPath starts with the s3 prefix, and I set the initial cre…

Re: [External Sender] Re: [Question] S3 Token Expiration during Read Step

2023-12-22 Thread Ramya Prasad via user
Yes, I'm using ParquetIO as below: PCollection&lt;GenericRecord&gt; records = pipeline.apply("Read parquet file in as Generic Records", ParquetIO.read(finalSchema).from(beamReadPath).withConfiguration(configuration)); On Fri, Dec 22, 2023 at 10:39 AM XQ Hu via user wrote: > Can you share some code snippets about h…

Re: [Question] S3 Token Expiration during Read Step

2023-12-22 Thread XQ Hu via user
Can you share some code snippets about how you read from S3? Do you use the built-in TextIO? On Fri, Dec 22, 2023 at 11:28 AM Ramya Prasad via user wrote: > Hello, > > I am a developer trying to use Apache Beam, and I have a nuanced problem I > need help with. I have a pipeline which has to read i…

Re: [Question] WaitOn for Reading Step

2023-12-22 Thread XQ Hu via user
When I search the Beam code base, there are plenty of places that use Wait.on. You could check that code for some insights. If this doesn't work, it would be better to create a small test case to reproduce the problem and open a GitHub issue. Sorry, I cannot help too much with this. On Fri, Dec…

[Question] WaitOn for Reading Step

2023-12-22 Thread Ramya Prasad via user
Hello, I am a developer trying to use Apache Beam, and I am running into an issue where my Wait.on step is not working as expected. I want my pipeline to read all the data from an S3 bucket using ParquetIO before moving on to the rest of the steps in my pipeline. However, I see in my DAG that even…

[Question] S3 Token Expiration during Read Step

2023-12-22 Thread Ramya Prasad via user
Hello, I am a developer trying to use Apache Beam, and I have a nuanced problem I need help with. I have a pipeline which has to read in 40 million records from multiple Parquet files from AWS S3. The only way I can get the credentials I need for this particular bucket is to call an API, which I d…
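In Beam's Java AWS IO, the usual remedy is to give S3Options a credentials provider that re-fetches tokens on demand (the AWS SDK's AwsCredentialsProvider interface), instead of baking in one-shot credentials at pipeline construction. The refresh-on-expiry pattern itself can be sketched language-neutrally; the `fetch` callable below is a hypothetical stand-in for the credential API call:

```python
import time

class RefreshingCredentials:
    """Refresh-on-expiry cache around a credential-fetching callable.

    `fetch` is a hypothetical stand-in for the API call; it returns a
    (credentials, expires_at_epoch_seconds) pair and is re-invoked
    shortly before the previous credentials expire."""

    def __init__(self, fetch, margin_s=300, clock=time.time):
        self._fetch = fetch
        self._margin = margin_s
        self._clock = clock
        self._creds = None
        self._expires = 0.0

    def get(self):
        # Re-fetch when inside the safety margin, so a long-running read
        # never presents an expired token.
        if self._clock() >= self._expires - self._margin:
            self._creds, self._expires = self._fetch()
        return self._creds
```

Each read of the credentials goes through `get()`, so a multi-hour read step picks up fresh tokens transparently.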

Re: Environmental variables not accessible in Dataflow pipeline

2023-12-22 Thread XQ Hu via user
You can use the same Docker image for both the template launcher and the Dataflow job. Here is one example: https://github.com/google/dataflow-ml-starter/blob/main/tensorflow_gpu.flex.Dockerfile#L60 On Fri, Dec 22, 2023 at 8:04 AM Sumit Desai wrote: > Yes, I will have to try it out. > > Regards > Sumit
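The linked Dockerfile bakes configuration into the image with ENV, making it visible to the launcher process and, when the same image also serves as the SDK container, to the workers. A minimal sketch with placeholder names (for real secrets, something like Secret Manager is usually preferable to ENV):

```dockerfile
# Flex template launcher image; base tag and variable names are placeholders.
FROM gcr.io/dataflow-templates-base/python311-template-launcher-base:latest

# Baked-in environment variables are present before any pipeline code runs.
ENV UPLIGHT_TELEMETRY_ENDPOINT="https://example.invalid/telemetry"

COPY . /template
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="/template/main.py"
```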

Re: Environmental variables not accessible in Dataflow pipeline

2023-12-22 Thread Sumit Desai via user
Yes, I will have to try it out. Regards, Sumit Desai On Fri, Dec 22, 2023 at 3:53 PM Sofia’s World wrote: > I guess so, I am not an expert on using env variables in Dataflow > pipelines, as any config dependencies I need, I pass them as job input > params > > But perhaps you can configure variab…

Re: Environmental variables not accessible in Dataflow pipeline

2023-12-22 Thread Sofia’s World
I guess so, I am not an expert on using env variables in Dataflow pipelines; any config dependencies I need, I pass as job input params. But perhaps you can configure variables in your Dockerfile (I am not an expert in this either), as flex templates use Docker? https://cloud.google.com…

Re: Environmental variables not accessible in Dataflow pipeline

2023-12-22 Thread Sumit Desai via user
We are using an external non-public package which expects environmental variables only. If the environmental variables are not found, it will throw an error. We can't change the source of this package. Does this mean we will face the same problem with flex templates also? On Fri, 22 Dec 2023, 3:39 pm Sofia’s…

Re: Environmental variables not accessible in Dataflow pipeline

2023-12-22 Thread Sofia’s World
The flex template will allow you to pass input params with dynamic values to your Dataflow job, so you could replace the env variable with that input? That is, unless you have to have env vars.. but from your snippets it appears you are just using them to configure one of your components? Hth On Fr…
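One way to reconcile "the package only reads environment variables" with "flex templates pass job input params" is a small bridge that exports the param into the environment before the package is imported. A sketch with hypothetical option and variable names:

```python
import argparse
import os

def export_env_from_args(argv, env=None):
    """Bridge a job input parameter to the environment variable a
    third-party package expects. The option and variable names here are
    hypothetical placeholders, not from the thread."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("--uplight_api_key", default=None)
    known, _ = parser.parse_known_args(argv)
    if known.uplight_api_key:
        # Export before importing the package that reads the variable.
        env.setdefault("UPLIGHT_API_KEY", known.uplight_api_key)
    return env.get("UPLIGHT_API_KEY")
```

Called at the very top of the pipeline's main module (before the package import), this makes a job input param behave like a pre-set env variable.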

Re: Environmental variables not accessible in Dataflow pipeline

2023-12-22 Thread Sumit Desai via user
Hi Sofia and XQ, The application is failing because I have loggers defined in every file and the method to create a logger tries to create an object of UplightTelemetry. If I use flex templates, will the environmental variables I supply be loaded before the application gets loaded? If not, it woul…
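Variables set via ENV in the flex template's Dockerfile belong to the container's process environment, so they are visible in os.environ before any module-level logger code runs. A cheap guard can confirm this at startup and fail with a clear message, rather than deep inside an import chain when UplightTelemetry is constructed. Variable names below are hypothetical:

```python
import os

# Hypothetical variable names; substitute whatever the package reads.
REQUIRED_VARS = ("UPLIGHT_CLIENT_ID", "UPLIGHT_CLIENT_SECRET")

def missing_env_vars(env=None):
    """Return the required variables that are absent or empty, so startup
    can fail early with a clear message."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling this first thing in main (and raising if the list is non-empty) pinpoints whether the flex template delivered the variables before the application loaded.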