Re: filesystem slowdown with backports kernel
Hi Jens,

On Wed, Oct 17, 2018 at 01:41:56PM +0200, Jens Holzkämper wrote:
> We get the following results (with a variance within a few seconds)
>
> 4.9 ext4:
> real 2m13.303s
> […]
> 4.18 ext4:
> real 4m3.276s

Absent anyone being able to make a suggestion of exactly what broke
here, perhaps you could build your own kernel packages and "git bisect"
until you find the culprit?

https://wiki.debian.org/DebianKernel/GitBisect

When doing this I also find that using ccache avoids having to
recompile absolutely everything all the time.

Cheers,
Andy

-- 
https://bitfolk.com/ -- No-nonsense VPS hosting
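For reference, the bisect suggested above can be automated with "git bisect run" and a small driver script. The sketch below is illustrative only: the iteration counts are scaled down from the original benchmark, and the 60-second threshold is a made-up value that would need calibrating against a known-good kernel first.

```shell
#!/bin/bash
# Hypothetical driver script for "git bisect run".
# Exit 0 tells git bisect the revision is good (fast),
# a non-zero exit (other than 125) marks it bad (slow).
set -e

TESTDIR=$(mktemp -d)
cd "$TESTDIR"

START=$SECONDS
for i in {0..9}; do
    for j in {0..99}; do      # scaled down from the original benchmark
        mkdir -p "$i/$j"
        touch "$i/$j/1"
    done
done
ELAPSED=$((SECONDS - START))

cd /
rm -rf "$TESTDIR"

echo "elapsed: ${ELAPSED}s"
# Illustrative threshold; calibrate against a known-good kernel.
[ "$ELAPSED" -lt 60 ]
```

After booting each candidate kernel this could be invoked via `git bisect run ./bench.sh` (script name assumed), having first marked the 4.9 tag good and the 4.18 tag bad.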
Re: filesystem slowdown with backports kernel
Hi.

On Wed, Oct 17, 2018 at 04:44:25PM +0200, Support (Jens) wrote:
> Hi,
>
> >> we have a NAS system acting as a place to store our server's backups
> >> (via rsync with link-dest). On that NAS we switched from the stable
> >> kernel (4.9) to the one provided by backports (4.18) because of an
> >> unrelated problem. When we do that, we see a slowdown of our backup
> >> process, from the backup via rsync itself to deleting old backup
> >> directories. The slowdown seems to be connected to the number of
> >> files/directories as backups of systems with less files seem less
> >> affected than the ones with many files.
> >
> > I'd complete your tests with an invocation of 'perf record/perf top'
> > on NFS server side.
> > The reason being - you'll be able to point out at particular
> > kernel/userspace functions that are responsible for this slowdown.
>
> there is no NFS in play, everything was tested locally or did I
> misinterpret your suggestion.

No, it was bad wording on my part. Old habit - it's not a NAS unless it
has NFS, and all that. Disregard the 'NFS' part.

You have a host that serves the role of a fileserver and that
experiences a slowdown. You run tests on this host with assorted kernel
versions, trying to locate the problematic ones. Before running each of
these tests once more, you start 'perf record' in a separate shell, and
terminate 'perf record' once the test is done. Next you copy the
resulting 'perf.data' file somewhere for safekeeping. Rinse and repeat
for each kernel/filesystem tested.

Once all the needed kernel/filesystem combinations have been tested
once more, you use 'perf report' and 'perf annotate' to see the actual
userspace/kernel functions that were in play at the time of the tests.

Reco
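Condensed into commands, the workflow described above might look like the following sketch. The file names and the 300-second window are illustrative, and perf needs the linux-perf package matching the running kernel:

```shell
# In a separate shell, record system-wide samples with call graphs
# for the duration of one benchmark run (here bounded by a sleep):
perf record -a -g -o perf-4.18-ext4.data -- sleep 300

# Keep one data file per kernel/filesystem combination for safekeeping
# (destination directory is an example):
cp perf-4.18-ext4.data /root/perf-results/

# After all runs, inspect the hottest kernel/userspace functions:
perf report -i perf-4.18-ext4.data
perf annotate -i perf-4.18-ext4.data
```

Comparing the top entries of `perf report` between the 4.9 and 4.18 data files should point at the functions responsible for the extra user and sys time.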
Re: filesystem slowdown with backports kernel
Hi,

>> we have a NAS system acting as a place to store our server's backups
>> (via rsync with link-dest). On that NAS we switched from the stable
>> kernel (4.9) to the one provided by backports (4.18) because of an
>> unrelated problem. When we do that, we see a slowdown of our backup
>> process, from the backup via rsync itself to deleting old backup
>> directories. The slowdown seems to be connected to the number of
>> files/directories as backups of systems with less files seem less
>> affected than the ones with many files.
>
> I'd complete your tests with an invocation of 'perf record/perf top'
> on NFS server side.
> The reason being - you'll be able to point out at particular
> kernel/userspace functions that are responsible for this slowdown.

there is no NFS in play, everything was tested locally or did I
misinterpret your suggestion.

Regards,
Jens
Re: filesystem slowdown with backports kernel
Hi.

On Wed, Oct 17, 2018 at 01:41:56PM +0200, Jens Holzkämper wrote:
> Hi,
>
> we have a NAS system acting as a place to store our server's backups
> (via rsync with link-dest). On that NAS we switched from the stable
> kernel (4.9) to the one provided by backports (4.18) because of an
> unrelated problem. When we do that, we see a slowdown of our backup
> process, from the backup via rsync itself to deleting old backup
> directories. The slowdown seems to be connected to the number of
> files/directories as backups of systems with less files seem less
> affected than the ones with many files.

I'd complete your tests with an invocation of 'perf record/perf top'
on NFS server side.
The reason being - you'll be able to point out at particular
kernel/userspace functions that are responsible for this slowdown.

Reco
filesystem slowdown with backports kernel
Hi,

we have a NAS system acting as a place to store our server's backups
(via rsync with link-dest). On that NAS we switched from the stable
kernel (4.9) to the one provided by backports (4.18) because of an
unrelated problem. When we do that, we see a slowdown of our backup
process, from the backup via rsync itself to deleting old backup
directories. The slowdown seems to be connected to the number of
files/directories as backups of systems with less files seem less
affected than the ones with many files.

So we started benchmarking, and the following seems to do the trick in
showing our problem by creating about 100k directories and files (10
top-level dirs each containing 10000 directories and files, for easier
deleting between tries):

#!/bin/bash
time (
    for i in {0..9}; do
        for j in {0..9999}; do
            mkdir -p $i/$j
            touch $i/$j/1
        done
    done
)

We get the following results (with a variance within a few seconds):

4.9 ext4:
real    2m13.303s
user    0m4.976s
sys     0m20.424s

4.9 xfs:
real    2m7.416s
user    0m5.076s
sys     0m20.960s

4.18 ext4:
real    4m3.276s
user    2m46.401s
sys     1m12.546s

4.18 xfs:
real    3m53.430s
user    2m46.841s
sys     1m12.716s

About a 50% slowdown in time elapsed and quite an increase in user and
sys. To rule out something like the spectre/meltdown mitigations, we
tried the oldest kernel package with a version number higher than the
one in stable that we could find on http://snapshot.debian.org, from
July 2017:

4.11 ext4:
real    3m28.443s
user    2m29.551s
sys     1m0.924s

4.11 xfs:
real    3m32.438s
user    2m31.349s
sys     1m3.333s

It's a little faster than 4.18, but the problem still persists.

The NAS uses a software RAID 6 via MD, so we also tested with the same
script on a desktop system to rule out the RAID as a problem source,
and we see the same thing:

4.9 ext4 desktop:
real    2m22.525s
user    0m6.176s
sys     0m20.872s

4.18 ext4 desktop:
real    4m16.412s
user    3m2.282s
sys     1m19.308s

So to us it looks like something is seriously wrong somewhere, but we
have no clue where exactly to look anymore.
Is the test flawed, did we miss something about an expected slowdown in
the news, or is it really a bug - and if so, where can we look to
locate it more precisely?

Thanks in advance,
Jens Holzkämper