Bug#983170: s3ql: High load causes "Transport endpoint is not connected"
Package: s3ql
Version: 3.7.0+dfsg-2
Followup-For: Bug #983170

Now that bullseye has shipped, and I have moved on to bookworm, I am keen to do anything I can to help resolve this. Is there anything I can do? For example testing with packages? Or is there an upstream fix available for testing?

-- System Information:
Debian Release: bookworm/sid
  APT prefers testing
  APT policy: (900, 'testing')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.10.0-8-amd64 (SMP w/8 CPU threads)
Locale: LANG=en_IE.utf8, LC_CTYPE=en_IE.utf8 (charmap=UTF-8) (ignored: LC_ALL set to en_IE.utf8), LANGUAGE not set
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages s3ql depends on:
ii  fuse3 [fuse]                  3.10.4-1
ii  libc6                         2.31-17
ii  libjs-sphinxdoc               3.5.4-2
ii  libsqlite3-0                  3.36.0-2
ii  procps                        2:3.3.17-5
ii  psmisc                        23.4-2
ii  python3                       3.9.2-3
ii  python3-apsw                  3.36.0-r1-1
ii  python3-cryptography          3.3.2-1
ii  python3-defusedxml            0.6.0-2
ii  python3-dugong                3.8.1+dfsg-1
ii  python3-google-auth           1.5.1-3
ii  python3-google-auth-oauthlib  0.4.2-1
ii  python3-pkg-resources         52.0.0-4
ii  python3-pyfuse3               3.2.0-2
ii  python3-requests              2.25.1+dfsg-2
ii  python3-systemd               234-3+b4
ii  python3-trio                  0.13.0-2

s3ql recommends no packages.

s3ql suggests no packages.

-- debconf-show failed
Bug#983170: s3ql: High load causes "Transport endpoint is not connected"
Package: s3ql
Version: 3.7.0+dfsg-2
Followup-For: Bug #983170

The mount.log consists of minor variations on the following...

2021-02-20 13:21:46.208 238604:MainThread s3ql.mount.determine_threads: Using 10 upload threads.
2021-02-20 13:21:46.210 238604:MainThread s3ql.mount.main: Autodetected 1048514 file descriptors available for cache entries
2021-02-20 13:21:46.982 238604:MainThread s3ql.mount.get_metadata: Using cached metadata.
2021-02-20 13:21:47.001 238604:MainThread s3ql.mount.main_async: Setting cache size to 17754 MB
2021-02-20 13:21:47.004 238604:MainThread s3ql.block_cache.__init__: Loaded 0 entries from cache
2021-02-20 13:21:47.040 238604:MainThread s3ql.mount.main_async: Mounting s3://eu-west-1/xxx/s3ql/yyy/ at /mnt/a...
2021-02-20 13:21:47.050 238624:MainThread s3ql.daemonize.detach_process_context: Daemonizing, new PID is 238625
2021-02-20 13:22:56.691 238625:MainThread s3ql.mount.unmount: Unmounting file system...
2021-02-20 13:23:01.703 238625:MainThread s3ql.block_cache.destroy: Could not complete object removals, no removal threads left alive
2021-02-20 13:23:01.710 238625:MainThread root.excepthook: Uncaught top-level exception:
Traceback (most recent call last):
  File "/usr/bin/mount.s3ql", line 33, in <module>
    sys.exit(load_entry_point('s3ql==3.7.0', 'console_scripts', 'mount.s3ql')())
  File "/usr/lib/s3ql/s3ql/mount.py", line 131, in main
    trio.run(main_async, options, stdout_log_handler)
  File "/usr/local/lib/python3.9/dist-packages/trio/_core/_run.py", line 1932, in run
    raise runner.main_task_outcome.error
  File "/usr/lib/s3ql/s3ql/mount.py", line 274, in main_async
    await pyfuse3.main()
  File "/usr/lib/python3/dist-packages/_pyfuse3.py", line 30, in wrapper
    await fn(*args, **kwargs)
  File "src/pyfuse3.pyx", line 776, in main
  File "/usr/local/lib/python3.9/dist-packages/trio/_core/_run.py", line 815, in __aexit__
    raise combined_error_from_nursery
trio.MultiError: NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads')

Details of embedded exception 1:

  Traceback (most recent call last):
    File "/usr/lib/s3ql/s3ql/block_cache.py", line 598, in _deref_block
      self.to_remove.put(obj_id, block=False)
    File "/usr/lib/python3.9/queue.py", line 137, in put
      raise Full
  queue.Full

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "/usr/lib/python3/dist-packages/_pyfuse3.py", line 30, in wrapper
      await fn(*args, **kwargs)
    File "src/internal.pxi", line 278, in _session_loop
    File "/usr/lib/s3ql/s3ql/fs.py", line 1172, in forget
      await self.cache.remove(id_, 0, inode.size // self.max_obj_size + 1)
    File "/usr/lib/s3ql/s3ql/block_cache.py", line 847, in remove
      await self._deref_block(block_id)
    File "/usr/lib/s3ql/s3ql/block_cache.py", line 600, in _deref_block
      await trio.to_thread.run_sync(self._queue_removal, obj_id)
    File "/usr/local/lib/python3.9/dist-packages/trio/_threads.py", line 207, in to_thread_run_sync
      return await trio.lowlevel.wait_task_rescheduled(abort)
    File "/usr/local/lib/python3.9/dist-packages/trio/_core/_traps.py", line 166, in wait_task_rescheduled
      return (await _async_yield(WaitTaskRescheduled(abort_func))).unwrap()
    File "/usr/lib/python3/dist-packages/outcome/_sync.py", line 111, in unwrap
      raise captured_error
    File "/usr/local/lib/python3.9/dist-packages/trio/_threads.py", line 157, in do_release_then_return_result
      return result.unwrap()
    File "/usr/lib/python3/dist-packages/outcome/_sync.py", line 111, in unwrap
      raise captured_error
    File "/usr/local/lib/python3.9/dist-packages/trio/_threads.py", line 170, in worker_fn
      ret = sync_fn(*args)
    File "/usr/lib/s3ql/s3ql/block_cache.py", line 553, in _queue_removal
      raise NoWorkerThreads('no removal threads')
  s3ql.block_cache.NoWorkerThreads: no removal threads

Details of embedded exception 2:

... embedded exception 2 is a copy of embedded exception 1, and there are another 10 identical embedded exceptions.

The complete log is available at http://www.cobb.uk.net/s3ql-983170-mount.log.gz

-- System Information:
Debian Release: bullseye/sid
  APT prefers testing
  APT policy: (990, 'testing')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.10.0-3-amd64 (SMP w/12 CPU threads)
Locale: LANG=en_IE.utf8, LC_CTYPE=en_IE.utf8 (charmap=UTF-8) (ignored: LC_ALL set to en_IE.utf8), LANGUAGE=en_GB
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via
Bug#983170: s3ql: High load causes "Transport endpoint is not connected"
On Sat, Feb 20, 2021 at 01:24:16PM +, Graham Cobb wrote:
> Package: s3ql
> Version: 3.7.0+dfsg-2
> Severity: important
>
> Dear Maintainer,
>
> *** Reporter, please consider answering these questions, where appropriate ***
>
>    * What led up to the situation?
>    * What exactly did you do (or not do) that was effective (or ineffective)?
>    * What was the outcome of this action?
>    * What outcome did you expect instead?
>
> *** End of the template - remove these template lines ***
>
> After upgrading s3ql from 3.3.2+dfsg-1 I suffered from bug #982381 (trio) so I tried manually installing trio 0.15 (as mentioned in that thread). Although it allowed some files to be created, any long or complex operation (such as a backup, or even an `rm -rf` of a large directory) causes a 'Software caused connection abort' error followed by enormous numbers of 'Transport endpoint is not connected' errors.
>
> In case it was some transient network problem, I used fsck.s3ql to fix the filesystem and retried - same errors. And the same errors if I (fsck again and) just try an `rm -rf` on a large directory.
>
> I then tried installing trio 0.18. Same problem.
>
> Mounting is working. fsck is working. Simple file operations are working. But heavy load causes 'Software caused connection abort'. Completely repeatable.
>
> The following commands reproduce the problem for me:
>
>     cd /mnt/mountpoint
>     count=100
>     mkdir testdir ; for f in `seq 1 $count` ; do mkdir testdir/$f ; dd if=/dev/urandom bs=1000 count=1 of=testdir/$f/test status=none ; done
>     rm -rf testdir
>     umount /mnt/mountpoint
>
> With the count at 100 the problem occurs when the unmount happens. If the count is increased to 2000 the problem occurs during the run.
>
> This is using the S3 backend.
>
> By the way, this workload has been working for many years with no problems, and was working with 3.3.2+dfsg-1 before I decided to try testing 3.7.0+dfsg-2.

Could you please follow up with your ~/.s3ql/mount.log log about the error?

--
Francesco P. Lovergine
Bug#983170: s3ql: High load causes "Transport endpoint is not connected"
severity 983170 grave
tags 983170 + upstream help
thanks

It seems to me that this version cannot simply be distributed as is. Moreover, the wrong assumption about trio version compatibility makes it incompatible with bullseye in its current state.

On Sat, Feb 20, 2021 at 01:24:16PM +, Graham Cobb wrote:
> Package: s3ql
> Version: 3.7.0+dfsg-2
> Severity: important
>
> Dear Maintainer,
>
> After upgrading s3ql from 3.3.2+dfsg-1 I suffered from bug #982381 (trio) so I tried manually installing trio 0.15 (as mentioned in that thread). Although it allowed some files to be created, any long or complex operation (such as a backup, or even an `rm -rf` of a large directory) causes a 'Software caused connection abort' error followed by enormous numbers of 'Transport endpoint is not connected' errors.
>
> In case it was some transient network problem, I used fsck.s3ql to fix the filesystem and retried - same errors. And the same errors if I (fsck again and) just try an `rm -rf` on a large directory.
>
> I then tried installing trio 0.18. Same problem.
>
> Mounting is working. fsck is working. Simple file operations are working. But heavy load causes 'Software caused connection abort'. Completely repeatable.
>
> The following commands reproduce the problem for me:
>
>     cd /mnt/mountpoint
>     count=100
>     mkdir testdir ; for f in `seq 1 $count` ; do mkdir testdir/$f ; dd if=/dev/urandom bs=1000 count=1 of=testdir/$f/test status=none ; done
>     rm -rf testdir
>     umount /mnt/mountpoint
>
> With the count at 100 the problem occurs when the unmount happens. If the count is increased to 2000 the problem occurs during the run.
>
> This is using the S3 backend.
>
> By the way, this workload has been working for many years with no problems, and was working with 3.3.2+dfsg-1 before I decided to try testing 3.7.0+dfsg-2.

--
Francesco P. Lovergine
Bug#983170: s3ql: High load causes "Transport endpoint is not connected"
Package: s3ql
Version: 3.7.0+dfsg-2
Severity: important

Dear Maintainer,

*** Reporter, please consider answering these questions, where appropriate ***

   * What led up to the situation?
   * What exactly did you do (or not do) that was effective (or ineffective)?
   * What was the outcome of this action?
   * What outcome did you expect instead?

*** End of the template - remove these template lines ***

After upgrading s3ql from 3.3.2+dfsg-1 I suffered from bug #982381 (trio) so I tried manually installing trio 0.15 (as mentioned in that thread). Although it allowed some files to be created, any long or complex operation (such as a backup, or even an `rm -rf` of a large directory) causes a 'Software caused connection abort' error followed by enormous numbers of 'Transport endpoint is not connected' errors.

In case it was some transient network problem, I used fsck.s3ql to fix the filesystem and retried - same errors. And the same errors if I (fsck again and) just try an `rm -rf` on a large directory.

I then tried installing trio 0.18. Same problem.

Mounting is working. fsck is working. Simple file operations are working. But heavy load causes 'Software caused connection abort'. Completely repeatable.

The following commands reproduce the problem for me:

    cd /mnt/mountpoint
    count=100
    mkdir testdir ; for f in `seq 1 $count` ; do mkdir testdir/$f ; dd if=/dev/urandom bs=1000 count=1 of=testdir/$f/test status=none ; done
    rm -rf testdir
    umount /mnt/mountpoint

With the count at 100 the problem occurs when the unmount happens. If the count is increased to 2000 the problem occurs during the run.

This is using the S3 backend.

By the way, this workload has been working for many years with no problems, and was working with 3.3.2+dfsg-1 before I decided to try testing 3.7.0+dfsg-2.
-- System Information:
Debian Release: bullseye/sid
  APT prefers testing
  APT policy: (990, 'testing')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.10.0-3-amd64 (SMP w/12 CPU threads)
Locale: LANG=en_IE.utf8, LC_CTYPE=en_IE.utf8 (charmap=UTF-8) (ignored: LC_ALL set to en_IE.utf8), LANGUAGE=en_GB
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages s3ql depends on:
ii  fuse3 [fuse]                  3.10.2-1
ii  libc6                         2.31-9
ii  libjs-sphinxdoc               3.4.3-1
ii  libsqlite3-0                  3.34.1-1
ii  procps                        2:3.3.16-5
ii  psmisc                        23.3-1
ii  python3                       3.9.1-1
ii  python3-apsw                  3.32.2-r1-1+b2
ii  python3-cryptography          3.3.1-1
ii  python3-defusedxml            0.6.0-2
ii  python3-dugong                3.8.1+dfsg-1
ii  python3-google-auth           1.5.1-3
ii  python3-google-auth-oauthlib  0.4.2-1
ii  python3-pkg-resources         52.0.0-1
ii  python3-pyfuse3               3.2.0-2
ii  python3-requests              2.25.1+dfsg-2
ii  python3-systemd               234-3+b4
ii  python3-trio                  0.13.0-2

s3ql recommends no packages.

s3ql suggests no packages.

-- no debconf information
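The reproduction commands in this report can also be expressed as a small Python script, which may be convenient when varying the count. This is only a sketch: `reproduce` and `dry_run` are hypothetical names, it assumes the file system is already mounted at the path passed in, and it leaves the final `umount` step to the caller.

```python
import os
import shutil
import tempfile


def reproduce(mountpoint, count=100):
    """Create `count` directories with one 1000-byte file each, then delete them.

    Returns the number of files that were created before the removal.
    """
    testdir = os.path.join(mountpoint, 'testdir')
    os.mkdir(testdir)
    for i in range(1, count + 1):
        subdir = os.path.join(testdir, str(i))
        os.mkdir(subdir)
        with open(os.path.join(subdir, 'test'), 'wb') as fh:
            fh.write(os.urandom(1000))  # same size as dd bs=1000 count=1
    created = sum(len(files) for _, _, files in os.walk(testdir))
    shutil.rmtree(testdir)  # counterpart of `rm -rf testdir`
    return created


def dry_run(count=5):
    """Run the same steps against a local temporary directory.

    Useful only to check the script itself; it does not touch any mount.
    """
    with tempfile.TemporaryDirectory() as tmp:
        return reproduce(tmp, count)
```

Calling `reproduce('/mnt/mountpoint', count=100)` and then unmounting corresponds to the case where the error appears at unmount; `count=2000` corresponds to the case where it appears during the run.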