On Mon, Apr 4, 2022 at 11:50 AM Keith Turner <[email protected]> wrote:
>
> On Mon, Apr 4, 2022 at 11:17 AM Christopher <[email protected]> wrote:
> >
> > However, I'm reluctant to include #2422, because I don't think it's near
> > ready enough, and by the time it is, it will be very last minute, and I
> > don't want to delay 2.1 further for it. Even if it's included as an
> > experimental feature, I think it has huge potential to be disruptive, or to
> > have a lot of churn by the time people actually have a chance to review it
> > thoroughly. Furthermore, I think there are possible alternatives (like a
> > fully client-side implementation, based on offline scanners) that would
> > avoid the tight coupling of a new service to Accumulo's core code. This
>
> There are some advantages to scan servers over direct file access to
> consider. One is scalability of computation, if a web server is
> serving N client queries with scan servers those can potentially go to
> different scan servers. With direct file access, all N queries and
> their iterator stacks would have to run in the web server. Another is
> scalability of caching/memory. When web servers send queries to scan
> servers using a sticky algorithm for assigning tablets to groups of
> scan servers, it could lead to good cache utilization and sharing that
> may not be possible when running scans directly in the web server. So
> scan servers allow scaling cache and computations for queries
> independently of web servers in way that may not be possible with
> direct file access.
>
> Another advantage to consider is isolation. With direct file access
> and queries running directly in a web server, a bad query could bring
> down a web server and lots of unrelated queries. Having a bad query
> bring down a scan server may be less disruptive.
>

I've forked this thread into its own discussion with a new subject line because, as I suggested in my original reply, my intent was not to hijack the 2.1 planning thread with a discussion of the ScanServer implementation details.

I'm fine with all those benefits (even if all the "could" and "may" were turned into concrete "will"). My objection is not an objection to the feature. It's an objection to including the feature in 2.1, based on:

* the readiness of the feature branch,
* the availability of time to review/test such a big feature without delaying 2.1,
* its tight coupling to the core code in the implementation, and
* the fact that possible solutions with the above benefits but less tight coupling have not yet been explored.

I would be more okay with including it if:

* it is ready,
* it has been tested and reviewed by the wider community, and
* its coupling to the core Accumulo code is loosened, ideally so that it uses only API/SPI and could be released as a separate, optional add-on. This might require improvements to API/SPI to expose the features needed to help it function. This could also be done by sub-classing the AccumuloClient.

My concern here is the risk of technical debt and the extra maintenance costs of increased complexity for optional features that go unmaintained. We've been hurt before by the premature inclusion of optional/experimental features that were rushed to release. No matter how awesome the feature is, if it's niche and optional, we should consider these risks and work to mitigate them. Otherwise, we'll be stuck with the technical debt for years to come. With a little bit of caution, we can make the feature available, without rushing, to satisfy the use case while reducing the risks.

Also, one point of clarification: when I say "fully client side", I only mean relative to Accumulo, not necessarily in the client process. I'm lacking vocabulary to describe what I mean.
As I understand it, the current client code has been modified to connect to ScanServers sitting off to the side of TabletServers, and the ScanServers are basically modified TabletServers with less functionality. What I mean is that instead of coupling the ScanServer to the TabletServer implementation, and coupling the ScanServer client to the AccumuloClient, there could be less coupling. The ScanServer itself could behave like a client to Accumulo and/or HDFS (and maybe even share some library code that we make public API, like RFile readers) and it could have its own client (this is just one very rough outline of an idea that could be explored). That way, the entire thing could be removed without any change in Accumulo's code, to make it truly optional (as in, optional to even have on the class path).
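
To make that rough outline slightly more concrete, here is a minimal Java sketch of the layering I have in mind. Every name in it (PublicReadApi, StandaloneScanService, InMemoryReadApi) is hypothetical and invented for illustration; none of it is existing Accumulo API, and the in-memory stub only stands in for whatever public API/SPI (RFile readers, for example) we would actually expose. The only point is the dependency direction: the scan service sees the narrow public interface and nothing else.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** HYPOTHETICAL: stands in for the public API/SPI Accumulo would expose. */
interface PublicReadApi {
    /** Returns entries, sorted by row, for rows in [startRow, endRow]. */
    Iterable<Map.Entry<String, String>> read(String startRow, String endRow);
}

/**
 * HYPOTHETICAL scan service. It depends only on PublicReadApi, never on
 * TabletServer internals, so removing it would require no change to core code.
 */
final class StandaloneScanService {
    private final PublicReadApi api;

    StandaloneScanService(PublicReadApi api) {
        this.api = api;
    }

    /** Runs a range scan and renders each entry as "row=value". */
    List<String> scan(String startRow, String endRow) {
        List<String> results = new ArrayList<>();
        for (Map.Entry<String, String> entry : api.read(startRow, endRow)) {
            results.add(entry.getKey() + "=" + entry.getValue());
        }
        return results;
    }
}

/** In-memory stub used only by this sketch; real data would live in HDFS files. */
final class InMemoryReadApi implements PublicReadApi {
    private final TreeMap<String, String> data = new TreeMap<>();

    void put(String row, String value) {
        data.put(row, value);
    }

    @Override
    public Iterable<Map.Entry<String, String>> read(String startRow, String endRow) {
        // Inclusive on both ends; TreeMap keeps rows in sorted order.
        return data.subMap(startRow, true, endRow, true).entrySet();
    }
}
```

In this arrangement, a separate client for the scan service would talk to StandaloneScanService rather than going through AccumuloClient, and the whole add-on could be dropped from the class path without touching Accumulo itself.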
