Bug#983170: s3ql: High load causes "Transport endpoint is not connected"

2021-09-17 Thread Graham Cobb
Package: s3ql
Version: 3.7.0+dfsg-2
Followup-For: Bug #983170

Now that bullseye has shipped and I have moved on to bookworm, I am keen to do
anything I can to help resolve this. Is there anything I can do, for example
testing packages? Or is there an upstream fix available for testing?

-- System Information:
Debian Release: bookworm/sid
  APT prefers testing
  APT policy: (900, 'testing')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.10.0-8-amd64 (SMP w/8 CPU threads)
Locale: LANG=en_IE.utf8, LC_CTYPE=en_IE.utf8 (charmap=UTF-8) (ignored: LC_ALL set to en_IE.utf8), LANGUAGE not set
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages s3ql depends on:
ii  fuse3 [fuse]  3.10.4-1
ii  libc6 2.31-17
ii  libjs-sphinxdoc   3.5.4-2
ii  libsqlite3-0  3.36.0-2
ii  procps2:3.3.17-5
ii  psmisc23.4-2
ii  python3   3.9.2-3
ii  python3-apsw  3.36.0-r1-1
ii  python3-cryptography  3.3.2-1
ii  python3-defusedxml0.6.0-2
ii  python3-dugong3.8.1+dfsg-1
ii  python3-google-auth   1.5.1-3
ii  python3-google-auth-oauthlib  0.4.2-1
ii  python3-pkg-resources 52.0.0-4
ii  python3-pyfuse3   3.2.0-2
ii  python3-requests  2.25.1+dfsg-2
ii  python3-systemd   234-3+b4
ii  python3-trio  0.13.0-2

s3ql recommends no packages.

s3ql suggests no packages.

-- debconf-show failed



Bug#983170: s3ql: High load causes "Transport endpoint is not connected"

2021-02-20 Thread Graham Cobb
Package: s3ql
Version: 3.7.0+dfsg-2
Followup-For: Bug #983170

The mount.log consists of minor variations on the following...

2021-02-20 13:21:46.208 238604:MainThread s3ql.mount.determine_threads: Using 10 upload threads.
2021-02-20 13:21:46.210 238604:MainThread s3ql.mount.main: Autodetected 1048514 file descriptors available for cache entries
2021-02-20 13:21:46.982 238604:MainThread s3ql.mount.get_metadata: Using cached metadata.
2021-02-20 13:21:47.001 238604:MainThread s3ql.mount.main_async: Setting cache size to 17754 MB
2021-02-20 13:21:47.004 238604:MainThread s3ql.block_cache.__init__: Loaded 0 entries from cache
2021-02-20 13:21:47.040 238604:MainThread s3ql.mount.main_async: Mounting s3://eu-west-1/xxx/s3ql/yyy/ at /mnt/a...
2021-02-20 13:21:47.050 238624:MainThread s3ql.daemonize.detach_process_context: Daemonizing, new PID is 238625
2021-02-20 13:22:56.691 238625:MainThread s3ql.mount.unmount: Unmounting file system...
2021-02-20 13:23:01.703 238625:MainThread s3ql.block_cache.destroy: Could not complete object removals, no removal threads left alive
2021-02-20 13:23:01.710 238625:MainThread root.excepthook: Uncaught top-level exception:
Traceback (most recent call last):
  File "/usr/bin/mount.s3ql", line 33, in <module>
    sys.exit(load_entry_point('s3ql==3.7.0', 'console_scripts', 'mount.s3ql')())
  File "/usr/lib/s3ql/s3ql/mount.py", line 131, in main
    trio.run(main_async, options, stdout_log_handler)
  File "/usr/local/lib/python3.9/dist-packages/trio/_core/_run.py", line 1932, in run
    raise runner.main_task_outcome.error
  File "/usr/lib/s3ql/s3ql/mount.py", line 274, in main_async
    await pyfuse3.main()
  File "/usr/lib/python3/dist-packages/_pyfuse3.py", line 30, in wrapper
    await fn(*args, **kwargs)
  File "src/pyfuse3.pyx", line 776, in main
  File "/usr/local/lib/python3.9/dist-packages/trio/_core/_run.py", line 815, in __aexit__
    raise combined_error_from_nursery
trio.MultiError: NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads')

Details of embedded exception 1:

  Traceback (most recent call last):
    File "/usr/lib/s3ql/s3ql/block_cache.py", line 598, in _deref_block
      self.to_remove.put(obj_id, block=False)
    File "/usr/lib/python3.9/queue.py", line 137, in put
      raise Full
  queue.Full

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "/usr/lib/python3/dist-packages/_pyfuse3.py", line 30, in wrapper
      await fn(*args, **kwargs)
    File "src/internal.pxi", line 278, in _session_loop
    File "/usr/lib/s3ql/s3ql/fs.py", line 1172, in forget
      await self.cache.remove(id_, 0, inode.size // self.max_obj_size + 1)
    File "/usr/lib/s3ql/s3ql/block_cache.py", line 847, in remove
      await self._deref_block(block_id)
    File "/usr/lib/s3ql/s3ql/block_cache.py", line 600, in _deref_block
      await trio.to_thread.run_sync(self._queue_removal, obj_id)
    File "/usr/local/lib/python3.9/dist-packages/trio/_threads.py", line 207, in to_thread_run_sync
      return await trio.lowlevel.wait_task_rescheduled(abort)
    File "/usr/local/lib/python3.9/dist-packages/trio/_core/_traps.py", line 166, in wait_task_rescheduled
      return (await _async_yield(WaitTaskRescheduled(abort_func))).unwrap()
    File "/usr/lib/python3/dist-packages/outcome/_sync.py", line 111, in unwrap
      raise captured_error
    File "/usr/local/lib/python3.9/dist-packages/trio/_threads.py", line 157, in do_release_then_return_result
      return result.unwrap()
    File "/usr/lib/python3/dist-packages/outcome/_sync.py", line 111, in unwrap
      raise captured_error
    File "/usr/local/lib/python3.9/dist-packages/trio/_threads.py", line 170, in worker_fn
      ret = sync_fn(*args)
    File "/usr/lib/s3ql/s3ql/block_cache.py", line 553, in _queue_removal
      raise NoWorkerThreads('no removal threads')
  s3ql.block_cache.NoWorkerThreads: no removal threads

Details of embedded exception 2:

...

Embedded exception 2 is a copy of embedded exception 1, and there are another
10 identical embedded exceptions.
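
To make the failure mode in the embedded exceptions easier to follow, here is a
minimal sketch of the logic they point at. This is not the actual s3ql code: the
names to_remove, _queue_removal and NoWorkerThreads are taken from the traceback,
everything else is assumed. The non-blocking put() onto the bounded removal queue
raises queue.Full, and the fallback path then fails outright because the removal
worker threads have already exited.

    import queue

    class NoWorkerThreads(Exception):
        """Raised when the removal queue is full and nobody is left to drain it."""

    # Bounded queue of object ids waiting to be deleted from the backend,
    # mirroring BlockCache.to_remove from the traceback (the size is illustrative).
    to_remove = queue.Queue(maxsize=1000)
    removal_threads = []   # threading.Thread objects; stopped during unmount

    def queue_removal(obj_id):
        # Rough equivalent of block_cache.py's _queue_removal: if every removal
        # thread has already terminated, a blocking put would hang forever, so
        # bail out instead.
        if not any(t.is_alive() for t in removal_threads):
            raise NoWorkerThreads('no removal threads')
        to_remove.put(obj_id, block=True)

    def deref_block(obj_id):
        # Rough, synchronous equivalent of _deref_block: try a non-blocking put
        # first; when the queue is full, fall back to queue_removal (the real
        # code does this via trio.to_thread.run_sync), which is where the
        # uncaught NoWorkerThreads comes from once the removal threads are gone.
        try:
            to_remove.put(obj_id, block=False)
        except queue.Full:
            queue_removal(obj_id)

That matches the log above, where the unmount message is followed by "no removal
threads left alive" and then the twelve identical exceptions.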

The complete log is available at http://www.cobb.uk.net/s3ql-983170-mount.log.gz

-- System Information:
Debian Release: bullseye/sid
  APT prefers testing
  APT policy: (990, 'testing')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.10.0-3-amd64 (SMP w/12 CPU threads)
Locale: LANG=en_IE.utf8, LC_CTYPE=en_IE.utf8 (charmap=UTF-8) (ignored: LC_ALL set to en_IE.utf8), LANGUAGE=en_GB
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via 

Bug#983170: s3ql: High load causes "Transport endpoint is not connected"

2021-02-20 Thread Francesco P. Lovergine

On Sat, Feb 20, 2021 at 01:24:16PM +, Graham Cobb wrote:

Package: s3ql
Version: 3.7.0+dfsg-2
Severity: important

Dear Maintainer,

After upgrading s3ql from 3.3.2+dfsg-1 I suffered from bug #982381 (trio), so
I tried manually installing trio 0.15 (as mentioned in that thread).
Although that allowed some files to be created, any long or complex operation
(such as a backup, or even an `rm -rf` of a large directory) causes a
'Software caused connection abort' error followed by
enormous numbers of 'Transport endpoint is not connected' errors.
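
For reference, 'Software caused connection abort' is errno ECONNABORTED and
'Transport endpoint is not connected' is errno ENOTCONN: once the FUSE daemon has
died, the kernel keeps the mount in place but every operation on it fails with the
latter. A small sketch for spotting that state from a script (the mountpoint path
is only an example):

    import errno
    import os

    def fuse_daemon_is_gone(mountpoint):
        """Return True if the mountpoint is stuck in the ENOTCONN state."""
        try:
            os.stat(mountpoint)
        except OSError as exc:
            # errno 107: "Transport endpoint is not connected"
            return exc.errno == errno.ENOTCONN
        return False

    if fuse_daemon_is_gone('/mnt/mountpoint'):
        print('mount.s3ql has died; consider fusermount -u and fsck.s3ql before remounting')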

In case it was some transient network problem, I used fsck.s3ql to fix the
filesystem and retried: the same errors. The same errors also occur if I (after
another fsck) just try an `rm -rf` on a large directory.

I then tried installing trio 0.18. Same problem.

Mounting is working. fsck is working. Simple file operations are working.
But heavy load causes 'Software caused connection abort'. Completely repeatable.

The following commands reproduce the problem for me:

cd /mnt/mountpoint
count=100
mkdir testdir ; for f in `seq 1 $count` ; do mkdir testdir/$f ; dd if=/dev/urandom bs=1000 count=1 of=testdir/$f/test status=none ; done
rm -rf testdir
umount /mnt/mountpoint

With the count at 100 the problem occurs when the unmount happens. If the count
is increased to 2000 the problem occurs during the run.
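
For what it is worth, the same workload can also be driven from Python (the
mountpoint and sizes are copied from the shell commands above; the final umount
still has to be done separately, and whether this variant triggers the bug just as
reliably is untested):

    import os
    import shutil

    mountpoint = '/mnt/mountpoint'   # adjust to the real s3ql mountpoint
    count = 100                      # 2000 made the failure happen mid-run

    testdir = os.path.join(mountpoint, 'testdir')
    for i in range(1, count + 1):
        d = os.path.join(testdir, str(i))
        os.makedirs(d)
        # 1000 random bytes per file, matching the dd invocation above
        with open(os.path.join(d, 'test'), 'wb') as fh:
            fh.write(os.urandom(1000))

    # Removing the whole tree is what hammers the block-removal queue
    shutil.rmtree(testdir)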

This is using the S3 backend.

By the way, this workload has been working for many years with no problems,
and was working with 3.3.2+dfsg-1 before I decided to try testing 3.7.0+dfsg-2.



Could you please follow up with your ~/.s3ql/mount.log about the error?

--
Francesco P. Lovergine



Bug#983170: s3ql: High load causes "Transport endpoint is not connected"

2021-02-20 Thread Francesco P. Lovergine

severity 983170 grave
tags 983170 + upstream help
thanks

It seems to me that this version simply cannot be distributed as is.
Moreover, the wrong assumption about trio version compatibility makes it
incompatible with the current state of bullseye.
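
One thing worth checking here is which trio actually gets imported at runtime:
the traceback in this bug shows trio being loaded from /usr/local/lib rather than
from the python3-trio package, so a quick, s3ql-agnostic check along these lines
shows what mount.s3ql really runs against:

    # Show where trio is imported from and which version it is; on the
    # reporter's system this points at the manually installed copy under
    # /usr/local/lib instead of the packaged python3-trio 0.13.0.
    import trio
    print(trio.__file__)
    print(trio.__version__)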


--
Francesco P. Lovergine



Bug#983170: s3ql: High load causes "Transport endpoint is not connected"

2021-02-20 Thread Graham Cobb
Package: s3ql
Version: 3.7.0+dfsg-2
Severity: important

Dear Maintainer,

After upgrading s3ql from 3.3.2+dfsg-1 I suffered from bug #982381 (trio), so
I tried manually installing trio 0.15 (as mentioned in that thread).
Although that allowed some files to be created, any long or complex operation
(such as a backup, or even an `rm -rf` of a large directory) causes a
'Software caused connection abort' error followed by
enormous numbers of 'Transport endpoint is not connected' errors.

In case it was some transient network problem, I used fsck.s3ql to fix the
filesystem and retried: the same errors. The same errors also occur if I (after
another fsck) just try an `rm -rf` on a large directory.

I then tried installing trio 0.18. Same problem.

Mounting is working. fsck is working. Simple file operations are working.
But heavy load causes 'Software caused connection abort'. Completely repeatable.

The following commands reproduce the problem for me:

 cd /mnt/mountpoint
 count=100
 mkdir testdir ; for f in `seq 1 $count` ; do mkdir testdir/$f ; dd if=/dev/urandom bs=1000 count=1 of=testdir/$f/test status=none ; done
 rm -rf testdir
 umount /mnt/mountpoint

With the count at 100 the problem occurs when the unmount happens. If the count
is increased to 2000 the problem occurs during the run.

This is using the S3 backend.

By the way, this workload has been working for many years with no problems,
and was working with 3.3.2+dfsg-1 before I decided to try testing 3.7.0+dfsg-2.

-- System Information:
Debian Release: bullseye/sid
  APT prefers testing
  APT policy: (990, 'testing')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.10.0-3-amd64 (SMP w/12 CPU threads)
Locale: LANG=en_IE.utf8, LC_CTYPE=en_IE.utf8 (charmap=UTF-8) (ignored: LC_ALL set to en_IE.utf8), LANGUAGE=en_GB
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages s3ql depends on:
ii  fuse3 [fuse]  3.10.2-1
ii  libc6 2.31-9
ii  libjs-sphinxdoc   3.4.3-1
ii  libsqlite3-0  3.34.1-1
ii  procps2:3.3.16-5
ii  psmisc23.3-1
ii  python3   3.9.1-1
ii  python3-apsw  3.32.2-r1-1+b2
ii  python3-cryptography  3.3.1-1
ii  python3-defusedxml0.6.0-2
ii  python3-dugong3.8.1+dfsg-1
ii  python3-google-auth   1.5.1-3
ii  python3-google-auth-oauthlib  0.4.2-1
ii  python3-pkg-resources 52.0.0-1
ii  python3-pyfuse3   3.2.0-2
ii  python3-requests  2.25.1+dfsg-2
ii  python3-systemd   234-3+b4
ii  python3-trio  0.13.0-2

s3ql recommends no packages.

s3ql suggests no packages.

-- no debconf information