On Thu, Oct 7, 2021 at 10:07 AM Daniel Collins <[email protected]> wrote:

> Hi Luke,
>
> The whole point here is actually to write to Pubsub on the other side of
> the PTransform, so what you suggested was exactly my intent. Although you
> could conceivably publish directly to Pub/Sub within the SDF, this is not
> super extensible.
>
> > once the client retries you'll publish a duplicate
>
> This is acceptable, since the sink is probably Pubsub anyway. The client
> could conceivably attach a deduplication ID to the source requests before
> sending them.
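>
> To make that concrete (attribute and topic names below are made up): if
> whatever publishes to Pubsub copies that client-supplied ID into a message
> attribute, the read side can hand deduplication to the runner via PubsubIO's
> withIdAttribute, e.g., given a Pipeline p:
>
> // org.apache.beam.sdk.io.gcp.pubsub.PubsubIO
> PCollection<String> requests =
>     p.apply(
>         PubsubIO.readStrings()
>             .fromTopic("projects/my-project/topics/A") // hypothetical topic
>             .withIdAttribute("dedup-id"));             // hypothetical attribute carrying the client's ID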
>
> > Does the order in which you get requests from a client, or across
> > clients, matter?
>
> No, not at all, since the sink would be Pubsub (and given that Beam doesn't
> support ordering or ordering keys internally, I wouldn't be concerned with
> order anyway).
>
> I think the main question is still outstanding, though: is there a way to
> ensure that the SDF actually runs on every task the pipeline JAR is loaded
> on, so that user messages aren't stranded?
>
>
Not that I'm aware of.


> -Dan
>
> On Thu, Oct 7, 2021 at 12:53 PM Luke Cwik <[email protected]> wrote:
>
>> I would suggest that you instead write the requests received within the
>> splittable DoFn directly to a queue-based sink and, in another part of the
>> pipeline, read from that queue. For example, if you were using Pubsub for
>> the queue, your pipeline would look like:
>> Create(LB config + pubsub topic A) -> ParDo(SDF: get request from client,
>> write to pubsub, then ack client)
>> Pubsub(Read from A) -> ?Deduplicate? -> ... downstream processing ...
>> Since the SDF will write to pubsub before it acknowledges the message, you
>> may write data that is not acked, and once the client retries you'll publish
>> a duplicate. If downstream processing is not resilient to duplicates then
>> you'll want to have some unique piece of information to deduplicate on.
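>>
>> Roughly, in code (the topic name and the HttpIngestFn / DownstreamFn DoFns
>> are made up; PubsubIO comes from the GCP IO module and Deduplicate from the
>> core SDK, and this variant hands the publish off to PubsubIO.Write rather
>> than publishing inside the SDF):
>>
>> import org.apache.beam.sdk.Pipeline;
>> import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
>> import org.apache.beam.sdk.transforms.Create;
>> import org.apache.beam.sdk.transforms.Deduplicate;
>> import org.apache.beam.sdk.transforms.ParDo;
>>
>> public class HttpIngestPipeline {
>>   public static void main(String[] args) {
>>     Pipeline p = Pipeline.create();
>>
>>     // Branch 1: serve HTTP requests, output each request body, publish to
>>     // topic A, then ack the client.
>>     p.apply("Seed", Create.of("lb-config"))            // stands in for "LB config + topic A"
>>      .apply("ServeHttp", ParDo.of(new HttpIngestFn())) // hypothetical request-serving DoFn/SDF
>>      .apply("PublishToA", PubsubIO.writeStrings().to("projects/my-project/topics/A"));
>>
>>     // Branch 2: read topic A back, drop client retries, then do the real work.
>>     p.apply("ReadFromA", PubsubIO.readStrings().fromTopic("projects/my-project/topics/A"))
>>      .apply("Dedup", Deduplicate.<String>values())
>>      .apply("Process", ParDo.of(new DownstreamFn()));  // hypothetical downstream processing
>>
>>     p.run();
>>   }
>> }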
>>
>> Does the order in which you get requests from a client, or across clients,
>> matter?
>> If yes, then you need to be aware that parallel processing will impact the
>> order in which you see things, and you might need to have the data
>> sorted/ordered within the pipeline.
>>
>>
>>
>> On Wed, Oct 6, 2021 at 3:56 PM Daniel Collins <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> Bear with me, this is a bit of a weird one. I've been toying around with
>>> an idea to do HTTP ingestion using a Beam (specifically Dataflow) pipeline.
>>> The concept would be that you spin up an HTTP server on each running task
>>> on a well-known port, as a static member of some class in the JAR (or upon
>>> initialization of an SDF the first time), then accept requests, but don't
>>> acknowledge them back to the client until the bundle finalizer
>>> <https://javadoc.io/static/org.apache.beam/beam-sdks-java-core/2.29.0/org/apache/beam/sdk/transforms/DoFn.BundleFinalizer.html>
>>> runs, so you know they're persisted / have moved down the pipeline. You
>>> could then use a load balancer pointed at the instance group created by
>>> Dataflow as the target for incoming requests, and create a PCollection
>>> from incoming user requests.
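>>>
>>> To make that concrete, here's a rough, untested sketch of just the
>>> accept/ack handshake, leaving out the SDF restriction plumbing (the port,
>>> path, and queue are made up; assumes a JDK 11+ worker for readAllBytes):
>>>
>>> import com.sun.net.httpserver.HttpExchange;
>>> import com.sun.net.httpserver.HttpServer;
>>> import java.io.IOException;
>>> import java.net.InetSocketAddress;
>>> import java.nio.charset.StandardCharsets;
>>> import java.util.concurrent.BlockingQueue;
>>> import java.util.concurrent.LinkedBlockingQueue;
>>> import java.util.concurrent.TimeUnit;
>>> import org.apache.beam.sdk.transforms.DoFn;
>>> import org.joda.time.Duration;
>>> import org.joda.time.Instant;
>>>
>>> // Shown as a plain DoFn for brevity; the real thing would be an unbounded
>>> // SDF that loops and returns ProcessContinuation.resume().
>>> class HttpIngestFn extends DoFn<String, String> {
>>>   private static final int PORT = 8080; // hypothetical well-known port
>>>   private static final BlockingQueue<HttpExchange> PENDING = new LinkedBlockingQueue<>();
>>>   private static volatile HttpServer server;
>>>
>>>   // One JDK HttpServer per JVM, started lazily the first time the DoFn runs.
>>>   private static synchronized void startServerOnce() throws IOException {
>>>     if (server == null) {
>>>       server = HttpServer.create(new InetSocketAddress(PORT), 0);
>>>       // Hold each exchange open; the response is sent only after bundle commit.
>>>       server.createContext("/ingest", PENDING::add);
>>>       server.start();
>>>     }
>>>   }
>>>
>>>   @ProcessElement
>>>   public void processElement(OutputReceiver<String> out, BundleFinalizer finalizer)
>>>       throws Exception {
>>>     startServerOnce();
>>>     HttpExchange exchange = PENDING.poll(10, TimeUnit.SECONDS);
>>>     if (exchange == null) {
>>>       return; // nothing arrived; a real SDF would resume() here
>>>     }
>>>     String body =
>>>         new String(exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
>>>     out.output(body);
>>>     // Ack the client only once the runner has durably committed this bundle.
>>>     finalizer.afterBundleCommit(
>>>         Instant.now().plus(Duration.standardMinutes(1)),
>>>         () -> {
>>>           exchange.sendResponseHeaders(200, -1);
>>>           exchange.close();
>>>         });
>>>   }
>>> }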
>>>
>>> The only part of this I don't think would work is preventing user
>>> requests from being stranded on a server that, due to load balancing
>>> constraints, will never run the SDF that would complete them. So my
>>> question is: is there a way to force an SDF to be run on every task where
>>> the JAR is loaded?
>>>
>>> Thanks!
>>>
>>> -Dan
>>>
>>
