> Not sure what firewall rules you're referring to

If you SSH into a random Dataflow worker and run `sudo iptables -L -v` you
can see that the INPUT policy is DROP, and only a few ports are allowed in.

On Thu, Oct 7, 2021 at 4:22 PM Daniel Collins <dpcoll...@google.com> wrote:

> Yeah that could work, I'll give it a try.
>
> Not sure what firewall rules you're referring to - Luke seemed to think
> there weren't any? But I'll see if I hit them. Dataflow workers are now
> deployed with managed instance groups
> <https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#workers>,
> so I assume I can always ignore the warning there and mess around with the
> instance group settings in a totally unsupported way if I have to.
>
> Thanks for the help!
>
>
> On Thu, Oct 7, 2021 at 2:15 PM Steve Niemitz <sniem...@apache.org> wrote:
>
>> hm, how about this?  Start/stop the server in @Setup/@Teardown (probably
>> with ref counting since you can have multiple instances on one host) and
>> then use the health checks on the load balancer to prevent routing to
>> instances w/o the SDF running?
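The ref counting is the fiddly part of that suggestion. Here is a minimal, Beam-free sketch of the idea (class and method names are hypothetical; in a real DoFn, acquire/release would be called from @Setup/@Teardown, and this uses the JDK's built-in com.sun.net.httpserver rather than any particular server):

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;

// Hypothetical per-JVM holder: one shared server, ref-counted so that the
// last DoFn instance torn down on the host is the one that stops it.
final class SharedHttpServer {
  private static HttpServer server;
  private static int refCount = 0;

  // Would be called from @Setup.
  static synchronized void acquire(int port) {
    if (refCount++ == 0) {
      try {
        server = HttpServer.create(new InetSocketAddress(port), 0);
        server.start();
      } catch (IOException e) {
        refCount--;
        throw new UncheckedIOException(e);
      }
    }
  }

  // Would be called from @Teardown.
  static synchronized void release() {
    if (--refCount == 0) {
      server.stop(0); // a real implementation might drain in-flight requests
      server = null;
    }
  }

  static synchronized boolean isRunning() {
    return refCount > 0;
  }
}
```

With port 0 the JDK picks an ephemeral port, which is handy for testing; the actual pipeline would use the well-known port the load balancer targets.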
>>
>> Curious how you'll get around the firewall rules though, I've wanted to
>> do something similar in the past and that was the biggest issue :)
>>
>> On Thu, Oct 7, 2021 at 1:50 PM Daniel Collins <dpcoll...@google.com>
>> wrote:
>>
>>> > Yes a JvmInitializer would work great as well.
>>>
>>> To be clear, I don't think there's an issue with JvmInitializer instead
>>> of a java static. However, it doesn't seem to solve the problem, which is
>>> ensuring messages received by the server are not stranded on a machine that
>>> will never accept them for processing (i.e. will never run the SDF
>>> implementation which pulls them out of the buffer).
>>>
>>> I actually am not super concerned about splitting: since there is no
>>> concept of progress through the restriction of "all messages the user may
>>> send in the future", you can divide the restriction infinitely without
>>> issue (i.e. always allow a split), as long as there is some way to ensure
>>> the runner will run the SDF on every JVM it is loaded on.
>>>
>>> On Thu, Oct 7, 2021 at 1:38 PM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> Yes a JvmInitializer would work great as well.
>>>>
>>>> The difference with an SDF would be that you could use other pipeline
>>>> constructs before the SDF to add more http listeners with different
>>>> configurations. Also, once there is a runner that supports dynamic
>>>> splitting you would be able to scale up and down based upon incoming
>>>> requests and might be able to reduce the burden/load on the load balancers
>>>> themselves.
>>>>
>>>> On Thu, Oct 7, 2021 at 10:16 AM Steve Niemitz <sniem...@apache.org>
>>>> wrote:
>>>>
>>>>> Unrelated to the actual question, but IIRC Dataflow workers have
>>>>> iptables rules that drop all inbound traffic (other than a few
>>>>> exceptions).
>>>>>
>>>>> In any case, do you actually need the server part to be "inside" the
>>>>> pipeline? Could you just use a JvmInitializer to launch the HTTP server
>>>>> and do the pubsub publishing there?
>>>>>
>>>>> On Thu, Oct 7, 2021 at 12:53 PM Luke Cwik <lc...@google.com> wrote:
>>>>>
>>>>>> I would suggest that you instead write the requests received within
>>>>>> the splittable DoFn directly to a queue-based sink and, in another part
>>>>>> of the pipeline, read from that queue. For example, if you were using
>>>>>> Pubsub for the queue, your pipeline would look like:
>>>>>> Create(LB config + pubsub topic A) -> ParDo(SDF get request from
>>>>>> client and write to pubsub and then ack client)
>>>>>> Pubsub(Read from A) -> ?Deduplicate? -> ... downstream processing ...
>>>>>> Since the SDF writes to Pubsub before it acknowledges the request,
>>>>>> you may publish data that was never acked, and once the client retries
>>>>>> you'll publish a duplicate. If downstream processing is not resilient to
>>>>>> duplicates then you'll want to have some unique piece of information to
>>>>>> deduplicate on.
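A minimal sketch of that dedup step, assuming each client request carries a stable unique id (Beam-free and illustrative only; in a real pipeline Beam's Deduplicate transform would do this with keyed state and a TTL):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical request type: the client attaches a stable unique id so that
// retries (which cause duplicate publishes) collapse to one element.
record Request(String id, String payload) {}

final class Dedup {
  // Keeps the first occurrence of each id. A real pipeline would bound this
  // state with a TTL rather than letting the seen-set grow forever.
  static List<Request> dedupe(List<Request> in) {
    Set<String> seen = new HashSet<>();
    List<Request> out = new ArrayList<>();
    for (Request r : in) {
      if (seen.add(r.id())) {
        out.add(r);
      }
    }
    return out;
  }
}
```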
>>>>>>
>>>>>> Does the order in which the requests you get from a client or across
>>>>>> clients matter?
>>>>>> If yes, then you need to be aware that the parallel processing will
>>>>>> impact the order in which you see things and you might need to have data
>>>>>> sorted/ordered within the pipeline.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Oct 6, 2021 at 3:56 PM Daniel Collins <dpcoll...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Bear with me, this is a bit of a weird one. I've been toying around
>>>>>>> with an idea to do HTTP ingestion using a Beam (specifically Dataflow)
>>>>>>> pipeline. The concept is that you spin up an HTTP server on a
>>>>>>> well-known port on each running task, as a static member of some class
>>>>>>> in the JAR (or upon first initialization of an SDF), then accept
>>>>>>> requests but don't acknowledge them back to the client until the bundle
>>>>>>> finalizer
>>>>>>> <https://javadoc.io/static/org.apache.beam/beam-sdks-java-core/2.29.0/org/apache/beam/sdk/transforms/DoFn.BundleFinalizer.html>
>>>>>>> runs, so you know they're persisted / have moved down the pipeline. You
>>>>>>> could then use a load balancer pointed at the instance group created by
>>>>>>> Dataflow as the target for incoming requests, and create a PCollection
>>>>>>> from the incoming user requests.
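The "hold the ack until the bundle commits" piece reduces to parking a completion per request until a commit callback fires. A Beam-free sketch of the plumbing (hypothetical names; in a real SDF the commit callback would be registered via BundleFinalizer.afterBundleCommit):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical buffer: the HTTP handler registers a future per request and
// only responds 200 once it completes; the SDF completes all of them when
// the runner reports that the bundle containing those requests committed.
final class PendingAcks {
  private final List<CompletableFuture<Void>> pending = new ArrayList<>();

  // Called by the HTTP handler when a request arrives.
  synchronized CompletableFuture<Void> register() {
    CompletableFuture<Void> ack = new CompletableFuture<>();
    pending.add(ack);
    return ack;
  }

  // Would be invoked from the bundle-commit callback.
  synchronized void commitAll() {
    pending.forEach(f -> f.complete(null));
    pending.clear();
  }
}
```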
>>>>>>>
>>>>>>> The only part of this I don't think would work is preventing user
>>>>>>> requests from being stranded on a server that will never run the SDF 
>>>>>>> that
>>>>>>> will complete them due to load balancing constraints. So my question 
>>>>>>> is: is
>>>>>>> there a way to force an SDF to be run on every task where the JAR is 
>>>>>>> loaded?
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> -Dan
>>>>>>>
>>>>>>
