Re: Any recommendation for key for GroupIntoBatches

XQ Hu via user Mon, 15 Apr 2024 08:38:36 -0700

I am not sure you still need to do batching since the Web API IO can handle
caching.


If you really need it, I think GroupIntoBatches is a good way to go.
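
For what it's worth, here is a minimal sketch of one way to derive the key, assuming string IDs and a fixed number of shards (the names `shard_key` and `NUM_SHARDS` are made up for illustration). It hashes the whole ID rather than only its first character, so the number of distinct keys is not capped by the number of distinct first characters, as it is with the `ord(id[0]) % 100` idea quoted below.

```python
import zlib
from collections import Counter

NUM_SHARDS = 100  # assumed shard count; tune to the parallelism you want


def shard_key(message_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a message ID to a stable shard number in [0, num_shards)."""
    # zlib.crc32 is deterministic across processes and machines, unlike the
    # built-in hash() for str, which is salted per interpreter by default.
    return zlib.crc32(message_id.encode("utf-8")) % num_shards


if __name__ == "__main__":
    ids = [f"msg-{i}" for i in range(10_000)]  # made-up IDs for illustration
    counts = Counter(shard_key(i) for i in ids)
    print(len(counts), "shards used; largest shard holds", max(counts.values()))
```

In a Beam Python pipeline this would presumably be wired in as `beam.WithKeys(shard_key)` followed by `beam.GroupIntoBatches(100)`. For an unbounded input, the `max_buffering_duration_secs` argument of `GroupIntoBatches` can be set so that a partially filled batch is not held indefinitely.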

On Mon, Apr 15, 2024 at 11:30 AM Ruben Vargas <ruben.var...@metova.com>
wrote:

> Is there a way to do batching in that transformation? I'm assuming not,
> for now. Or maybe by using it in conjunction with GroupIntoBatches?
>
> On Mon, Apr 15, 2024 at 9:29 AM Ruben Vargas <ruben.var...@metova.com>
> wrote:
> >
> > Interesting
> >
> > I think the cache feature could be interesting for some use cases I have.
> >
> > On Mon, Apr 15, 2024 at 9:18 AM XQ Hu <x...@google.com> wrote:
> > >
> > > For the new web API IO, the page lists these features:
> > >
> > > > - developers provide minimal code that invokes Web API endpoint
> > > > - delegate to the transform to handle request retries and exponential backoff
> > > > - optional caching of request and response associations
> > > > - optional metrics
> > >
> > >
> > > > On Mon, Apr 15, 2024 at 10:38 AM Ruben Vargas <ruben.var...@metova.com> wrote:
> > >>
> > >> That one looks interesting
> > >>
> > > >> What is not clear to me is the advantage of using it. Is it
> > > >> only the error/retry handling? Anything in terms of performance?
> > > >>
> > > >> My PCollection is unbounded, but I was thinking of sending my messages
> > > >> in batches to the external API in order to gain some performance
> > > >> (I'd rather not send 1 HTTP request per message).
> > >>
> > >> Thank you very much for all your responses!
> > >>
> > >>
> > > >> On Sun, Apr 14, 2024 at 8:28 AM XQ Hu via user <user@beam.apache.org> wrote:
> > > >> >
> > > >> > To enrich your data, have you checked https://cloud.google.com/dataflow/docs/guides/enrichment?
> > > >> >
> > > >> > This transform is built on top of https://beam.apache.org/documentation/io/built-in/webapis/
> > >> >
> > > >> > On Fri, Apr 12, 2024 at 4:38 PM Ruben Vargas <ruben.var...@metova.com> wrote:
> > > >> >>
> > > >> >> On Fri, Apr 12, 2024 at 2:17 PM Jaehyeon Kim <dott...@gmail.com> wrote:
> > > >> >> >
> > > >> >> > Here is an example from a book that I'm reading now and it may be applicable.
> > >> >> >
> > >> >> > JAVA - (id.hashCode() & Integer.MAX_VALUE) % 100
> > >> >> > PYTHON - ord(id[0]) % 100
> > >> >>
> > >> >> Maybe this is what I'm looking for. I'll give it a try. Thanks!
> > >> >>
> > >> >> >
> > > >> >> > On Sat, 13 Apr 2024 at 06:12, George Dekermenjian <ged1...@gmail.com> wrote:
> > > >> >> >>
> > > >> >> >> How about just keeping track of a buffer and flushing it after 100 messages, and also in finish_bundle if anything is left in the buffer?
> > >> >> >>
> > >> >> >>
> > >> >>
> > > >> >> If this is in memory, it could lead to potential loss of data. That is
> > > >> >> why state is used, or at least that is my understanding. But maybe
> > > >> >> there is a way to do this in the state?
> > >> >>
> > >> >>
> > > >> >> >> On Fri, Apr 12, 2024 at 21.23 Ruben Vargas <ruben.var...@metova.com> wrote:
> > > >> >> >>>
> > > >> >> >>> Hello guys
> > > >> >> >>>
> > > >> >> >>> Maybe this question was already answered, but I cannot find it and
> > > >> >> >>> want some more input on this topic.
> > > >> >> >>>
> > > >> >> >>> I have some messages that don't have any particular key candidate
> > > >> >> >>> except the ID, but I don't want to use it because the idea is to
> > > >> >> >>> group multiple IDs in the same batch.
> > >> >> >>>
> > >> >> >>> This is my use case:
> > >> >> >>>
> > > >> >> >>> I have an endpoint to which I'm going to send the message ID; this
> > > >> >> >>> endpoint is going to return certain information that I will use to
> > > >> >> >>> enrich my message. To avoid calling the endpoint once per message, I
> > > >> >> >>> want to batch the messages in groups of 100 and send the 100 IDs in
> > > >> >> >>> one request (the endpoint supports it). I was thinking of using
> > > >> >> >>> GroupIntoBatches.
> > >> >> >>>
> > > >> >> >>> - If I choose the ID as the key, my understanding is that it won't
> > > >> >> >>> work the way I want (because it will form batches of the same ID).
> > > >> >> >>> - Using a constant will be a problem for parallelism, is that correct?
> > > >> >> >>>
> > > >> >> >>> Then my question is, what should I use as a key? Maybe something
> > > >> >> >>> based on the timestamp, so I can have groups of messages that arrive
> > > >> >> >>> in a certain second?
> > >> >> >>>
> > >> >> >>> Any suggestions would be appreciated
> > >> >> >>>
> > >> >> >>> Thanks.
>
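
The buffer-and-flush idea suggested in the thread (flush after 100 messages, and again in finish_bundle if anything is left) can be sketched roughly as below. This is a plain-Python stand-in that only mirrors the DoFn lifecycle method names (`start_bundle`, `process`, `finish_bundle`); it does not import Beam, and `BufferingBatcher` and `send_batch` are hypothetical names used for illustration.

```python
from typing import Callable, List

BATCH_SIZE = 100  # batch size mentioned in the thread


class BufferingBatcher:
    """Plain-Python stand-in for a DoFn that buffers elements per bundle."""

    def __init__(self, send_batch: Callable[[List[str]], None],
                 batch_size: int = BATCH_SIZE) -> None:
        self._send_batch = send_batch  # hypothetical callable that hits the endpoint
        self._batch_size = batch_size
        self._buffer: List[str] = []

    def start_bundle(self) -> None:
        # Start every bundle with an empty buffer.
        self._buffer = []

    def process(self, message_id: str) -> None:
        # Buffer the element and flush once the batch is full.
        self._buffer.append(message_id)
        if len(self._buffer) >= self._batch_size:
            self._flush()

    def finish_bundle(self) -> None:
        # Flush whatever is left so nothing stays behind in memory.
        if self._buffer:
            self._flush()

    def _flush(self) -> None:
        self._send_batch(self._buffer)
        self._buffer = []  # new list, so the batch just sent is not mutated


if __name__ == "__main__":
    sent: List[List[str]] = []
    batcher = BufferingBatcher(sent.append)
    batcher.start_bundle()
    for i in range(250):
        batcher.process(f"msg-{i}")
    batcher.finish_bundle()
    print([len(b) for b in sent])  # prints [100, 100, 50]
```

With this approach a batch never spans two bundles, so when bundles are small the batches will be smaller than 100 as well.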
