notfound 986749 3.6.4-1
notfound 986749 3.7-1
severity 986749 serious
thanks

On Thu, 15 Apr 2021 13:49:22 +0000 FUSTE Emmanuel <emmanuel.fu...@thalesgroup.com> wrote:
> Hello,
>
> I have similar problems exacerbated by tree level of "forcemanaged=1" of
> apt cacher servers behind a blucoat proxy.
> Somes are VM, somes are physical. All machines / OS are ok.
> My conf use VfileUseRangeOps:-1 and ResuseConnections:0
> Trashing all the caches on all the servers even does not completely cure
> the problem which reappear shortly.
> Client concurrency activity worsen/trigguer the problem very very fast.
> Smell like treading problems.
> Will activate Debug: 7 and report here if I see something interesting.

After extensive testing, I am pretty sure that the root cause of this problem was fixed in the commit
https://salsa.debian.org/blade/apt-cacher-ng/-/commit/c333cf3829e6373bcad07c831436317a7c34fac1
or, for Sid (hopefully unblocked...):
https://salsa.debian.org/blade/apt-cacher-ng/-/commit/2afc3d384b2c051f2754730ed392ea5381f854f1

The other aspects involving stale storage items (file recreation) were already tackled in versions 3.6.2 and 3.6.3.

Your guess was not bad: the Bad-File-Descriptor problem was related to concurrency issues, but the error path was not trivial.

First, there was buggy usage of a RAII helper (unique_fd) which was added as an afterthought in the commit:

0c02c1a0 (Eduard Bloch 2019-11-23 11:46:20 +0100

This was never used correctly, though. The extra member in the class was only there for "design beauty" (uniformity) and is basically unused, but it interfered with the existing method for graceful connection shutdown (see the destructor). So after that change the socket was actually closed ASAP and NOT gracefully (risking loss of the final bytes of the active TCP stream), which is an issue of its own. Then the delayed closer code (see sockio.cc) came along and tried to close this socket again, which killed random streams depending on the timing.

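To illustrate the double-close hazard described above, here is a minimal sketch. It is not the apt-cacher-ng code: FdOwner and unrelated_stream_survives() are hypothetical names invented for this example, pipes stand in for sockets, and it assumes a POSIX system where new descriptors get the lowest free number (as on Linux). When the RAII owner closes the descriptor while a delayed closer still remembers the raw number, the later close() hits whatever stream has reused that number in the meantime.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <utility>

// Hypothetical stand-in for a RAII descriptor owner (not the real unique_fd).
struct FdOwner {
    int fd = -1;
    explicit FdOwner(int f) : fd(f) {}
    FdOwner(const FdOwner&) = delete;
    FdOwner& operator=(const FdOwner&) = delete;
    ~FdOwner() { if (fd >= 0) ::close(fd); }
    // Give up ownership so that someone else closes the descriptor later.
    int release() { return std::exchange(fd, -1); }
};

// Returns true if an unrelated stream, opened after the connection object
// is gone, is still open once the "delayed closer" has run.
bool unrelated_stream_survives(bool hand_over_ownership) {
    int pending = -1;  // raw number remembered by the delayed closer
    {
        int p[2];
        if (::pipe(p) != 0) return false;
        ::close(p[1]);
        FdOwner conn(p[0]);
        // Buggy variant: the closer copies the number, the owner keeps it too.
        // Fixed variant: ownership is handed over exactly once.
        pending = hand_over_ownership ? conn.release() : conn.fd;
    }  // the buggy variant closes the descriptor here, immediately

    // An unrelated stream appears; it may reuse the number that was just freed.
    int q[2];
    if (::pipe(q) != 0) return false;
    int victim = q[0];

    ::close(pending);  // the delayed closer finally runs

    bool alive = ::fcntl(victim, F_GETFD) != -1;  // fails with EBADF if closed
    if (alive) ::close(victim);
    ::close(q[1]);
    return alive;
}
```

Under these assumptions, unrelated_stream_survives(false) reports that the unrelated stream was killed, while unrelated_stream_survives(true) leaves it intact. In a real server the reuse depends on timing, which matches the random failures seen under load.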
This was not obvious with a fast server and a few clients, but under some load it becomes a real problem.

Then, another problem was the graceful-closing code itself. It was not thread-safe, but it was called from a multi-threaded context via the FinishConnection method in conserver.cc. This is now fixed by posting the scheduling task to the IO thread. I'd also consider the code inefficient and error-prone because it used a hashmap for a purpose where simply allocating the metadata nodes and releasing them is entirely sufficient and probably cheaper. So I rewrote this mess in sockio.cc some weeks ago, and the current code seems to behave stably, i.e. no socket or memory leaks have been spotted since then.

Another minor issue which caught my eye was the forceclose() helper method, which was written in a sloppy way many years ago and which might call close(-1) once in a while. That is no drama, but it is pointless. The method is now dropped in the Unstable commit (see above); it was hardly used anyway.

Best regards,
Eduard.

--
<leichenwagen> Only when the last programmer has been locked up and the last idea has been patented will you realize that lawyers cannot program