Re: Any recommendation for key for GroupIntoBatches

2024-04-15 Thread Ruben Vargas
That one looks interesting. What is not clear to me is the advantage of using it. Is it only the error/retry handling, or is there anything in terms of performance? My PCollection is unbounded, but I was thinking of sending my messages in batches to the external API in order to gain some performance (do

Re: Any recommendation for key for GroupIntoBatches

2024-04-15 Thread XQ Hu via user
For the new web API IO, the page lists these features:
- developers provide minimal code that invokes the Web API endpoint
- delegate to the transform to handle request retries and exponential backoff
- optional caching of request and response associations
- optional metrics

Re: Any recommendation for key for GroupIntoBatches

2024-04-15 Thread Ruben Vargas
Interesting. I think the cache feature could be useful for some use cases I have.

Re: Any recommendation for key for GroupIntoBatches

2024-04-15 Thread Ruben Vargas
Is there a way to do batching in that transformation? I'm assuming for now there isn't, or maybe it could be used in conjunction with GroupIntoBatches.

Re: Any recommendation for key for GroupIntoBatches

2024-04-15 Thread XQ Hu via user
I am not sure you still need to do batching, since the Web API can handle caching. If you really need it, I think GroupIntoBatches is a good way to go.
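
The per-key grouping that GroupIntoBatches performs can be sketched in plain Python (a simplified illustration only, not Beam's implementation; in an actual pipeline you would apply `beam.GroupIntoBatches(batch_size)` to a keyed PCollection):

```python
from itertools import islice


def batch_elements(elements, batch_size):
    """Yield lists of up to batch_size items, mimicking the batching
    GroupIntoBatches performs per key before a bulk external API call."""
    it = iter(elements)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Example: 7 messages sent to a (hypothetical) external API, 3 at a time,
# so the API is called 3 times instead of 7.
batches = list(batch_elements(range(7), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Note that in an unbounded pipeline, Beam also flushes incomplete batches on window expiry (or via `withMaxBufferingDuration` in the Java SDK), which this sketch does not model.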

Re: Any recommendation for key for GroupIntoBatches

2024-04-15 Thread Ruben Vargas
Yeah, unfortunately the data on the endpoint could change at any point in time and I need to make sure I have the latest version :/ That limits my options here. But I also have other sources that can benefit from this caching :) Thank you very much!

Hot update in Dataflow without losing messages

2024-04-15 Thread Juan Romero
Hi guys, good morning. I have done some tests in Apache Beam on Dataflow to see whether I can do a hot update or hot swap while the pipeline is processing a bunch of messages that fall in a time window of 10 minutes. What I saw is that when I do a hot update over the pipeline and cur

Re: Any recommendation for key for GroupIntoBatches

2024-04-15 Thread Damon Douglas
Good day, Ruben, Would you be able to compute a shasum on the group of IDs to use as the key? Best, Damon On 2024/04/12 19:22:45 Ruben Vargas wrote: > Hello guys > > Maybe this question was already answered, but I cannot find it and > want some more input on this topic. > > I have some messa
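
Damon's shasum suggestion could look something like the following (a sketch under assumptions: the IDs are strings, the sort order and separator are my choices; any stable serialization of the group works, as long as every worker produces the same bytes for the same group):

```python
import hashlib


def key_for_group(ids):
    """Derive a deterministic grouping key from a set of IDs by hashing
    their sorted, joined representation. Sorting makes the key
    order-insensitive; the unit-separator character avoids accidental
    collisions between e.g. ["ab", "c"] and ["a", "bc"]."""
    serialized = "\x1f".join(sorted(ids)).encode("utf-8")
    return hashlib.sha256(serialized).hexdigest()


k1 = key_for_group(["b", "a", "c"])
k2 = key_for_group(["c", "b", "a"])
assert k1 == k2  # same group in any order -> same key
```

The resulting hex digest distributes uniformly, so it also spreads groups evenly across workers when used as a shuffle key.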

Re: Any recommendation for key for GroupIntoBatches

2024-04-15 Thread Reuven Lax via user
There are various strategies. Here is an example of how Beam does it (taken from Reshuffle.viaRandomKey().withNumBuckets(N)). Note that this does some extra hashing to work around issues with the Spark runner. If you don't care about that, you could implement something simpler (e.g. initialize shard
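
The "simpler" variant Reuven alludes to can be sketched as follows (my illustration, not Beam's code: it assigns each element a random bucket in [0, N), which is the core idea behind Reshuffle.viaRandomKey().withNumBuckets(N), minus the extra hashing for the Spark runner workaround):

```python
import random

NUM_BUCKETS = 100  # assumption: pick N to match the parallelism you want


def assign_random_shard(element, num_buckets=NUM_BUCKETS):
    """Pair an element with a uniformly random shard key so that
    downstream grouping (e.g. GroupIntoBatches) spreads work evenly
    across workers regardless of the data's natural key distribution."""
    return (random.randrange(num_buckets), element)


# Keying 1000 elements yields keys spread across the bucket range.
keyed = [assign_random_shard(e) for e in range(1000)]
```

In a DoFn you would typically draw the key per element (or initialize one shard counter per bundle and rotate it) rather than seed anything globally.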

Re: Hot update in Dataflow without losing messages

2024-04-15 Thread Robert Bradshaw via user
Are you draining[1] your pipeline or simply canceling it and starting a new one? Draining should close open windows and attempt to flush all in-flight data before shutting down. For PubSub you may also need to read from subscriptions rather than topics to ensure messages are processed by either one

Re: Any recommendation for key for GroupIntoBatches

2024-04-15 Thread Robert Bradshaw via user
On Fri, Apr 12, 2024 at 1:39 PM Ruben Vargas wrote:
> On Fri, Apr 12, 2024 at 2:17 PM Jaehyeon Kim wrote:
> >
> > Here is an example from a book that I'm reading now and it may be applicable:
> >
> > JAVA - (id.hashCode() & Integer.MAX_VALUE) % 100
> > PYTHON - ord(id[0]) % 100 or abs(hash(
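
The two Python variants quoted above differ in an important way, sketched here (NUM_SHARDS = 100 mirrors the book's `% 100`; the stable-digest alternative is my addition):

```python
import hashlib

NUM_SHARDS = 100


def key_by_first_char(id_str):
    # The book's example: cheap, but badly skewed if many IDs share a
    # first character (e.g. all IDs starting with "u" land together).
    return ord(id_str[0]) % NUM_SHARDS


def key_by_stable_hash(id_str):
    # Caveat for abs(hash(id)) % NUM_SHARDS: Python's built-in hash() is
    # salted per process (PYTHONHASHSEED), so two workers can disagree on
    # the key for the same ID. A fixed digest avoids that.
    digest = hashlib.md5(id_str.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


print(key_by_first_char("user-123"))   # 17 (ord("u") == 117)
print(key_by_stable_hash("user-123"))  # same value in every process
```

For a key that merely spreads load, the per-process instability of `hash()` may be acceptable; for a key that must be reproducible across workers or runs, prefer the digest-based form.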

Re: Hot update in Dataflow without losing messages

2024-04-15 Thread Juan Romero
The deployment of the job is managed by Terraform. I verified, and it seems that Terraform does it incorrectly under the hood, because it stops the current job and starts a new one. Thanks for the information!