I defer to those with more operational experience of ken and smoosh, but wouldn't those new subsystems have a radical impact on performance if IOQ is completely bypassed (assuming ken and smoosh are enabled by default)?
On Wed, 11 Sep 2019 at 22:04, Adam Kocoloski <kocol...@apache.org> wrote:

> A few months ago a bunch of code landed on master around IO QoS and
> prioritization. I think we need to have a conversation about the defaults
> for that system and what we want to allow users to enable.
>
> First topic - there are actually two different generations of the IOQ
> system: IOQ and IOQ2. Only one can be active at a given time, and the
> configurations are not compatible. The best use case for this queueing
> system is to de-prioritize IO for bookkeeping tasks like internal
> replication and compaction in favor of IO to respond to client requests.
>
> The original and currently default IOQ system primarily works by
> classifying the IO based on whether it’s serving an interactive read or
> write request, an index build, a compaction job, etc. It builds queues for
> each of these IO classes and allows for relative prioritization of the
> different classes of IO. The main downside of this system is that it can
> only sustain a total throughput of about 20,000 operations/sec/node.
> Heavily-loaded systems frequently have to configure “bypasses” for certain
> classes of IO to keep latencies low.
>
> IOQ2 was conceived to deliver higher throughput without resorting to
> bypasses and thus defeating the QoS. It’s a significantly more complex
> system. Tenants are a first-class concept in IOQ2, but of course they’re
> not in the rest of the CouchDB, so some of the code in there that computes
> per-user priorities will not work correctly. As far as I can tell it will
> fail gracefully (i.e., it will bucket every database as belonging to the
> same “user”), but I doubt this has been tested. IOQ2 definitely can sustain
> higher throughputs, though it has been known to enqueue so many more IO
> requests than it can issue that it effectively led to an outage anyway. It
> is still a material overhead compared to bypassing the QoS entirely.
>
> I think there are a few possible paths forward:
>
> 1) Switch to IOQ2 and only document that one.
> 2) Document IOQ, installing bypasses across the board by default to avoid
> a big performance regression on upgrade
> 3) Just bypass the whole thing and don’t document it, to avoid introducing
> a big new admin capability in 3.0 and removing it in 4.0
>
> Personally I think I’m leaning towards 3) at this point, but could be
> convinced otherwise. Regards,
>
> Adam
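
For anyone following the thread who wants a concrete picture of the trade-off in the quoted message, here is a minimal Python sketch of class-based IO queueing with bypasses. It is only an illustration of the general idea, not CouchDB's IOQ code (which is written in Erlang). The class name `ClassQueueScheduler`, the IO class names, the weights, and the smooth weighted round-robin selection are all assumptions made for this example.

```python
from collections import deque


class ClassQueueScheduler:
    """Toy per-class IO queue with relative priorities and bypasses.

    Hypothetical illustration only; names and policy are not taken from IOQ.
    """

    def __init__(self, weights, bypass=()):
        # weights: mapping of IO class name -> relative priority (positive int)
        self.weights = dict(weights)
        # bypass: IO classes that skip queueing entirely
        self.bypass = set(bypass)
        self.queues = {name: deque() for name in self.weights}
        self.credit = {name: 0 for name in self.weights}

    def submit(self, io_class, request):
        """Return the request at once if its class is bypassed.

        Otherwise enqueue it under its class and return None.
        """
        if io_class in self.bypass:
            return request
        self.queues[io_class].append(request)
        return None

    def next_request(self):
        """Pick the next queued request by smooth weighted round-robin."""
        ready = [name for name, queue in self.queues.items() if queue]
        if not ready:
            return None
        # Every non-empty class earns credit in proportion to its weight.
        for name in ready:
            self.credit[name] += self.weights[name]
        # The class with the most credit is served and pays the total back.
        chosen = max(ready, key=lambda name: self.credit[name])
        self.credit[chosen] -= sum(self.weights[name] for name in ready)
        return self.queues[chosen].popleft()


# Interactive requests get three turns for every compaction turn.
scheduler = ClassQueueScheduler({"interactive": 3, "compaction": 1})
```

In this toy model a bypassed class never touches a queue, so it adds no scheduling overhead but also gets no de-prioritization. That is the tension behind my question: if everything is bypassed by default, background work such as compaction competes directly with client requests.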