Hi Lokesh, Abhijeet, Alex,

First, thanks for jumping into this thread. The purpose of the deprecation is in large part to try to collect the requirements of possibly existing users. Mind you that the rare times we hear about SPOE are only because of problems, so it's difficult to figure out what to keep and what to cut from the existing design.
More on that below:

On Fri, Mar 15, 2024 at 05:15:14PM +0000, Lokesh Jindal wrote:
> Hey Christopher
>
> Adding to what my colleague, Abhijeet, said.
>
>
> 1. We plan to ramp traffic to HAProxy very soon where we will heavily rely
> on SPOA. In our testing, we are satisfied with SPOE in terms of
> performance. The flexibility to write SPOA in any language not only allows
> us to handle "complex use cases" like connecting to non-http downstreams,
> but also helps in observability - metrics and logging.

Interesting, the initial internal e-mail in 2016 that ignited the SPOE design was driven by the same observations: in web environments it's quite rare to find developers who are at ease with system-level languages, yet they are the ones most likely to request to extend the proxy. For this reason we wanted to offer the possibility to call code written in other languages.

In addition it was estimated that the ability to connect to the agent over the network and using secure connections was absolutely essential. It brings the ability to scale the processing engine without adding more LB nodes, and even to offload that to another DC or infrastructure, or even to delegate it to a 3rd party.

Among the use cases which immediately came to mind were authentication, database accesses, name resolution, "remote maps", request classification, IP reputation, etc. In addition we thought that restricting ourselves to short request/response exchanges like this was probably too limiting and that it would have been useful to support streaming so that we could implement image recompression, caching, WAFs etc.

The first PoC implementation that was merged in version 1.7 lacked the streaming abilities, and it's still the current implementation. It took a while before we received feedback on it. Since then, caching was implemented, the demand for image recompression is close to non-existent, and WAF users seem to have accommodated well to dealing with extra layers by now.

So basically we're left with something ultra-complex that deals with short request-responses most of the time, and that suffers from an initial design which was way bigger than the use cases.

> 2. What is the best alternative to SPOE?

I don't know yet, and the purpose of this deprecation precisely is to engage discussion with users. One could think about various HTTP-based protocols for which it is easier to implement a server, some gRPC maybe, possibly even a stripped-down version of the SPOP protocol if we figure that everyone is running on a reasonable subset that's much easier to deal with than the whole stuff we have now.

> Two options that we are aware of
> - Write fetchers/converters in lua or write filters in other languages
> using the Lua API. In your experience, how do they compare to SPOE in terms
> of:
> * Performance
> * Fault isolation

The benefits of SPOE, as you've found, clearly are in terms of flexibility, as it allows scaling the number of analysers independently of the number of LBs, and it makes problems almost unnoticeable. For example, if your code relies on unstable 3rd party libraries, a crash of your SPOA doesn't bring down the whole proxy. Similarly, in terms of added latency, all the latency is in the external component; the rest of the traffic is not affected as it would be by placing some heavy processing directly inside the haproxy process.

> 3. As Abhijeet said, can you share a list of issues with SPOE that make it
> hard to maintain?

Off the top of my head I can enumerate a few:

- the load balancing is integrally reimplemented at the SPOE level between applets

- the load balancing is affected by the support of response-less messages, which prevent haproxy from having an estimate of the server's load, which means that an algo such as leastconn would not make sense in such a condition

- the idle connections management is integrally reimplemented at the SPOE level using applets as well.
  An idle SPOE connection is in fact a full application-layer stream with one applet on one side and a connection on the other side. These cannot be migrated between threads, which requires even more complex stream-to-stream thread-safe communication mechanisms and synchronisation.

- it's possible to receive a response on a different connection than the one that saw the request. This adds complexity in the matching request lookup, needs extra synchronisation so that the other stream doesn't vanish while we're delivering the response, and also makes it harder to keep track of the number of in-flight requests. Some "hacks" were developed so that a server doesn't try to respond over a connection from another process or thread (by naming the connections) but that further adds complexity on both sides.

- the protocol supports pipelining without having an idea of the servers' load nor the number of outstanding requests since it's not mandatory to respond, so the default use case is to flood a given connection. But if you don't do that you can end up with tons of connections to the servers and it's difficult to decide how many to use. There are many hard-coded heuristics that were implemented based on feedback from production deployments that were experiencing difficulties.

- these "idle" connections cannot be easily shrunk when FDs are missing, yet it's difficult to consider them when enforcing resource limits since their number will depend on the apparent server load.

I'm sure there are other ones, but to be honest, the SPOE design pre-dates the shared idle conns and threads support, and when such features appeared, it didn't appear really feasible to retrofit them into that, so since it was merged, it has only been a source of more suffering for the maintainers.
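
To illustrate the leastconn point from the list above, here is a toy sketch in Python. It is not haproxy code: the `Agent` class and the function names are made up for the example. It only shows that once some messages never get a response, a counter of outstanding requests no longer reflects the real load, so picking the "least loaded" agent becomes meaningless:

```python
# Toy model, for illustration only: not how haproxy implements leastconn.

class Agent:
    def __init__(self, name):
        self.name = name
        self.in_flight = 0  # requests sent and still awaiting a response


def pick_leastconn(agents):
    # Choose the agent with the fewest outstanding requests.
    return min(agents, key=lambda a: a.in_flight)


def send(agent, expects_response):
    # A message that will never be answered cannot be counted: nothing
    # would ever decrement the counter again.
    if expects_response:
        agent.in_flight += 1


def on_response(agent):
    agent.in_flight -= 1


def demo():
    a, b = Agent("a"), Agent("b")
    # 100 response-less messages go to "a": invisible to the balancer.
    for _ in range(100):
        send(a, expects_response=False)
    # One regular request goes to "b".
    send(b, expects_response=True)
    # leastconn still prefers "a", although it received far more work.
    return pick_leastconn([a, b]).name
```

Counting the response-less messages instead would not help either, since without a response there is no event telling when to decrement the counter.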

I just made the exercise of classifying the commits affecting the subsystem since 1.7 between "DEV" (non-bug) and "BUG":

                  DEV  BUG
  v1.7.0..v1.8.0   29    6
  v1.8.0..v2.0.0   23   25
  v2.0.0..v2.2.0    3    7
  v2.2.0..v2.4.0    4    5
  v2.4.0..v2.6.0    4    4
  v2.6.0..v2.8.0    3    5
  v2.8.0..master    0    4

It's pretty obvious that the dev finished in 1.8 with the addition of fragmentation/pipelining and I don't remember what, and after that it has essentially been bugs.

> Be it SPOE or an alternate solution that allows us to handle complex use
> cases with good performance, fault isolation (as much as possible) and
> observability, we will be happy to help develop/maintain it.

That's the purpose of this discussion. I'm pretty sure that we need to just give up on the original streaming idea. Let's face it, there are no users, and it adds some tremendous complexity everywhere, including at the config level where it's hard to express rules that would apply to the input or output traffic. But once we can figure out what's used and what's not used, what users like and what they don't like, it will be easier to design something that suits everyone's needs better.

I don't want to remove the feature before we have an alternative and as long as users need it. But marking it as deprecated will at least encourage users to ask about alternative options and jump into the design discussion before they start to implement code for something that might disappear.

I tend to think that something HTTP-based could possibly be easier to implement on the server side in various languages (Python, Go, NodeJS and whatever else). It would automatically benefit from our native and fast support for shared idle connections, queuing with maxconn, the ability to use existing LB algorithms etc. But maybe HTTP is not the panacea and other solutions are better. Maybe some other mechanisms exist and are quite popular among some communities, I don't know.
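
To make the HTTP idea a bit more concrete, here is a minimal sketch of what such an agent could look like in Python, using only the standard library. Everything specific in it is an assumption made for the example and not an existing or planned haproxy interface: the JSON exchange, the `src` and `ip_score` names, the `score()` helper, the reputation table and the listening address.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical reputation table; a real agent would query a database.
BAD_NETS = ("192.0.2.",)


def score(args):
    """Map the values sent by the proxy to the values to return."""
    src = str(args.get("src", ""))
    return {"ip_score": 0 if src.startswith(BAD_NETS) else 100}


class AgentHandler(BaseHTTPRequestHandler):
    # HTTP/1.1 enables keep-alive so the caller can reuse idle connections.
    protocol_version = "HTTP/1.1"

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            args = json.loads(self.rfile.read(length) or b"{}")
            body = json.dumps(score(args)).encode()
            status = 200
        except (ValueError, AttributeError):
            # Invalid JSON, or valid JSON that is not an object.
            body, status = b'{"error": "bad request"}', 400
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8500):
    # Address and port are arbitrary choices for the example.
    ThreadingHTTPServer((host, port), AgentHandler).serve_forever()
```

Every request gets exactly one response on the connection it arrived on, which is the property that would let ordinary connection reuse, maxconn queuing and the usual LB algorithms apply without anything agent-specific.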

Can you tell us for example if your agents are using pipelining, fragmentation, the ability to respond on another connection, or if they're consuming data without ever responding? Just this would already be a good start.

Thanks!
Willy