On Thu, Oct 7, 2021 at 10:07 AM Daniel Collins <[email protected]> wrote:
> Hi Luke,
>
> The whole point here is actually to write to Pubsub on the other side of the PTransform, so what you suggested was exactly my intent. Although you could conceivably publish directly to Pub/Sub within the SDF, this is not super extensible.
>
> > once the client retries you'll publish a duplicate
>
> This is acceptable, since the sink is probably Pubsub anyway. The client could conceivably attach a deduplication ID to the source requests before sending them.
>
> > Does the order in which the requests you get from a client or across clients matter?
>
> No, not at all, since the sink would be Pubsub (which, given beam doesn't support ordering internally or ordering keys, I wouldn't be concerned with).
>
> I think the main question is still outstanding though: is there a way to ensure that, on all tasks the pipeline JAR is loaded on, it will actually run, to avoid stranding user messages?

Not that I'm aware of.

> -Dan
>
> On Thu, Oct 7, 2021 at 12:53 PM Luke Cwik <[email protected]> wrote:
>
>> I would suggest that you instead write the requests received within the splittable DoFn directly to a queue-based sink, and in another part of the pipeline read from that queue. For example, if you were using Pubsub for the queue, your pipeline would look like:
>>
>> Create(LB config + pubsub topic A) -> ParDo(SDF get request from client and write to pubsub and then ack client)
>> Pubsub(Read from A) -> ?Deduplicate? -> ... downstream processing ...
>>
>> Since the SDF will write to pubsub before it acknowledges the message, you may write data that is not acked, and once the client retries you'll publish a duplicate. If downstream processing is not resilient to duplicates then you'll want to have some unique piece of information to deduplicate on.
>>
>> Does the order in which the requests you get from a client or across clients matter? If yes, then you need to be aware that the parallel processing will impact the order in which you see things, and you might need to have data sorted/ordered within the pipeline.
>>
>> On Wed, Oct 6, 2021 at 3:56 PM Daniel Collins <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> Bear with me, this is a bit of a weird one. I've been toying around with an idea to do http ingestion using a beam (specifically dataflow) pipeline. The concept would be that you spin up an HTTP server on each running task, on a well-known port, as a static member of some class in the JAR (or upon initialization of an SDF the first time), then accept requests but don't acknowledge them back to the client until the bundle finalizer <https://javadoc.io/static/org.apache.beam/beam-sdks-java-core/2.29.0/org/apache/beam/sdk/transforms/DoFn.BundleFinalizer.html> runs, so you know they're persisted / have moved down the pipeline. You could then use a load balancer pointed at the instance group created by dataflow as the target for incoming requests, and create a PCollection from incoming user requests.
>>>
>>> The only part of this I don't think would work is preventing user requests from being stranded on a server that will never run the SDF that will complete them, due to load balancing constraints. So my question is: is there a way to force an SDF to be run on every task where the JAR is loaded?
>>>
>>> Thanks!
>>>
>>> -Dan
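
For what it's worth, here is a rough, untested sketch (Java SDK) of what the SDF half of this could look like: a static HTTP server per worker JVM that just parks incoming exchanges, a @ProcessElement loop that claims and outputs them, and a BundleFinalizer callback that only sends the 200s once the bundle has committed. The port, the /ingest path, and the class/field names are placeholders, not anything Beam or Dataflow provides, and a real version would also need a watermark estimator, request timeouts, and error handling.

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import org.apache.beam.sdk.io.range.OffsetRange;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;
import org.joda.time.Duration;
import org.joda.time.Instant;

/** Sketch: per-worker HTTP server inside an SDF; clients are acked only after the bundle commits. */
@DoFn.UnboundedPerElement
public class HttpIngestFn extends DoFn<String, String> {

  private static final int PORT = 8080; // the "well known port"; placeholder value
  private static final BlockingQueue<HttpExchange> PENDING = new LinkedBlockingQueue<>();
  private static volatile HttpServer server;

  // One server per JVM, started lazily the first time the SDF runs on a worker.
  private static synchronized void ensureServerStarted() throws IOException {
    if (server == null) {
      server = HttpServer.create(new InetSocketAddress(PORT), 0);
      // The handler only parks the exchange; it is answered later, from the bundle finalizer.
      server.createContext("/ingest", PENDING::add);
      server.start();
    }
  }

  @GetInitialRestriction
  public OffsetRange initialRestriction(@Element String unused) {
    return new OffsetRange(0, Long.MAX_VALUE); // effectively "run forever"
  }

  @ProcessElement
  public ProcessContinuation process(
      RestrictionTracker<OffsetRange, Long> tracker,
      OutputReceiver<String> out,
      BundleFinalizer finalizer)
      throws Exception {
    ensureServerStarted();

    List<HttpExchange> inThisBundle = new ArrayList<>();
    long offset = tracker.currentRestriction().getFrom();
    boolean claimFailed = false;

    // Drain whatever arrives for a few seconds, then hand the bundle back to the runner.
    long deadline = System.currentTimeMillis() + 5_000;
    while (System.currentTimeMillis() < deadline) {
      HttpExchange exchange = PENDING.poll(500, TimeUnit.MILLISECONDS);
      if (exchange == null) {
        continue;
      }
      if (!tracker.tryClaim(offset)) {
        PENDING.add(exchange); // give it back; a later invocation will pick it up
        claimFailed = true;
        break;
      }
      offset++;
      String body = new String(exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
      out.output(body);
      inThisBundle.add(exchange);
    }

    // Respond 200 only once the runner has durably committed this bundle's outputs.
    if (!inThisBundle.isEmpty()) {
      finalizer.afterBundleCommit(
          Instant.now().plus(Duration.standardMinutes(5)),
          () -> {
            byte[] ack = "ok".getBytes(StandardCharsets.UTF_8);
            for (HttpExchange exchange : inThisBundle) {
              exchange.sendResponseHeaders(200, ack.length);
              try (OutputStream os = exchange.getResponseBody()) {
                os.write(ack);
              }
            }
          });
    }

    return claimFailed
        ? ProcessContinuation.stop()
        : ProcessContinuation.resume().withResumeDelay(Duration.millis(100));
  }
}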
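
The two-stage shape Luke describes could then be wired up roughly like this; the project/topic names are placeholders, and Deduplicate.values() is only a stand-in for the "?Deduplicate?" step (it dedups on the whole payload over its default period), where a client-supplied deduplication ID would be more robust.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Deduplicate;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class HttpIngestPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Stage 1: a single seed element keeps the HttpIngestFn loop alive on a worker;
    // everything it emits goes straight to the topic, and clients are acked only after commit.
    p.apply("Seed", Create.of("lb config + topic A"))
        .apply("ServeHttp", ParDo.of(new HttpIngestFn()))
        .apply("ToPubsub", PubsubIO.writeStrings().to("projects/my-project/topics/A"));

    // Stage 2: read the same topic back, drop retry duplicates, then do the real processing.
    PCollection<String> deduped =
        p.apply("FromPubsub", PubsubIO.readStrings().fromTopic("projects/my-project/topics/A"))
            .apply("Dedup", Deduplicate.<String>values());
    // ... downstream processing on `deduped` ...

    p.run();
  }
}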
