Re: Using Dataflow with Pubsub input connector in batch mode

2024-01-21 Thread Sumit Desai via user
Thanks Reuven and Alex. Yes, we are considering specifying a max read time on the Pub/Sub input connector first. If that doesn't work out for some reason, we will consider the approach with GCS. Thanks for your inputs. Regards, Sumit Desai On Mon, Jan 22, 2024 at 4:13 AM Reuven Lax via user
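
(For reference, a minimal sketch of the "max read time" idea in Beam Java. The core SDK's Read.Unbounded wrapper exposes withMaxNumRecords/withMaxReadTime, which turn an unbounded read into a BoundedReadFromUnboundedSource so the pipeline can terminate. Note that PubsubIO does not publicly expose its UnboundedSource, so the createPubsubSource() helper below is a hypothetical placeholder; treat this as an illustration of the mechanism, not a drop-in snippet.)

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.io.Read;
  import org.apache.beam.sdk.io.UnboundedSource;
  import org.apache.beam.sdk.values.PCollection;
  import org.joda.time.Duration;

  public class BoundedPubsubSketch {
    public static void main(String[] args) {
      Pipeline p = Pipeline.create();

      // Hypothetical: an UnboundedSource<String, ?> backed by the Pub/Sub
      // subscription. PubsubIO does not expose one publicly, which is part
      // of why the bounded-read route is discouraged on this thread.
      UnboundedSource<String, ?> pubsubSource = createPubsubSource();

      // withMaxNumRecords wraps the read in a BoundedReadFromUnboundedSource;
      // withMaxReadTime additionally stops reading after a fixed duration,
      // so the pipeline terminates instead of running forever.
      PCollection<String> messages =
          p.apply(
              Read.from(pubsubSource)
                  .withMaxNumRecords(Long.MAX_VALUE)
                  .withMaxReadTime(Duration.standardMinutes(10)));

      // ... process `messages` with ordinary batch transforms ...
      p.run().waitUntilFinish();
    }

    // Placeholder for illustration only.
    private static UnboundedSource<String, ?> createPubsubSource() {
      throw new UnsupportedOperationException("supply a real source");
    }
  }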

Re: Using Dataflow with Pubsub input connector in batch mode

2024-01-21 Thread Reuven Lax via user
Cloud Storage subscriptions are a reasonable way to back up data to storage, and you can then run a batch pipeline over the GCS files. Keep in mind that these files might contain duplicates (the storage subscriptions do not guarantee exactly-once writes). If this is a problem, you should add a deduplication step.
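
(A minimal sketch of that batch-over-GCS pattern, assuming newline-delimited text objects under a hypothetical gs://my-backup-bucket/pubsub/ prefix. The core SDK's Distinct transform handles the deduplication Reuven mentions, for exact payload-level duplicates.)

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.io.TextIO;
  import org.apache.beam.sdk.transforms.Distinct;
  import org.apache.beam.sdk.values.PCollection;

  public class BatchOverGcsBackup {
    public static void main(String[] args) {
      Pipeline p = Pipeline.create();

      // Bounded read over the files the Cloud Storage subscription wrote.
      // Bucket and prefix are hypothetical placeholders.
      PCollection<String> lines =
          p.apply("ReadBackupFiles",
              TextIO.read().from("gs://my-backup-bucket/pubsub/*"));

      // Cloud Storage subscriptions are at-least-once, so the same message
      // can appear in more than one file; drop exact duplicate payloads.
      PCollection<String> deduped = lines.apply("Dedup", Distinct.create());

      // ... downstream batch processing over `deduped` ...
      p.run().waitUntilFinish();
    }
  }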

Re: Using Dataflow with Pubsub input connector in batch mode

2024-01-21 Thread Alex Van Boxel
There are some valid use cases where you want to take data coming over Pub/Sub and handle it in batch. It's way too expensive to run a simple daily extract over Pub/Sub data as a streaming job; batch is a lot cheaper. What we do is back up the data to Cloud Storage; Pub/Sub recently added a nice feature that can write messages directly to Cloud Storage.
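
(The feature Alex refers to is Pub/Sub's Cloud Storage subscription type, which writes messages straight to a bucket with no pipeline in between. Below is a hedged sketch using the Java admin client; all resource names are placeholders, and the CloudStorageConfig fields shown are assumptions to verify against your client library version.)

  import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
  import com.google.protobuf.Duration;
  import com.google.pubsub.v1.CloudStorageConfig;
  import com.google.pubsub.v1.Subscription;

  public class CreateGcsSubscription {
    public static void main(String[] args) throws Exception {
      try (SubscriptionAdminClient admin = SubscriptionAdminClient.create()) {
        // All names below are placeholders.
        Subscription request =
            Subscription.newBuilder()
                .setName("projects/my-project/subscriptions/my-gcs-sub")
                .setTopic("projects/my-project/topics/my-topic")
                .setCloudStorageConfig(
                    CloudStorageConfig.newBuilder()
                        .setBucket("my-backup-bucket")
                        .setFilenamePrefix("pubsub/")
                        // Roll over to a new file at most every 5 minutes.
                        .setMaxDuration(Duration.newBuilder().setSeconds(300))
                        .build())
                .build();
        Subscription created = admin.createSubscription(request);
        System.out.println("Created: " + created.getName());
      }
    }
  }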

Re: Using Dataflow with Pubsub input connector in batch mode

2024-01-18 Thread Reuven Lax via user
Some comments here:
1. "All messages in a PubSub topic" is not a well-defined statement, as there can always be more messages published. You may know that nobody will publish any more messages, but the pipeline does not.
2. While it's possible to read from Pub/Sub in batch, it's usually not recommended.

Using Dataflow with Pubsub input connector in batch mode

2024-01-18 Thread Sumit Desai via user
Hi all, I want to create a Dataflow pipeline using Pub/Sub as the input connector, but I want to run it in batch mode and not streaming mode. I know it's not possible in Python, but how can I achieve this in Java? Basically, I want my pipeline to read all messages in a Pub/Sub topic, process them, and terminate.
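
(For context, the standard Java read looks like the following. PubsubIO is unbounded by design, so this pipeline never terminates on its own, which is what the replies above address; the topic name is a placeholder.)

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
  import org.apache.beam.sdk.values.PCollection;

  public class StreamingPubsubRead {
    public static void main(String[] args) {
      Pipeline p = Pipeline.create();

      // Unbounded by design: the pipeline keeps waiting for new messages,
      // so it runs in streaming mode and does not terminate by itself.
      PCollection<String> messages =
          p.apply("ReadFromPubsub",
              PubsubIO.readStrings()
                  .fromTopic("projects/my-project/topics/my-topic"));

      // ... processing ...
      p.run().waitUntilFinish();
    }
  }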