Re: filesystem slowdown with backports kernel

2018-10-17 Thread Andy Smith
Hi Jens,

On Wed, Oct 17, 2018 at 01:41:56PM +0200, Jens Holzkämper wrote:
> We get the following results (the variance between runs is within a few seconds):
> 
> 4.9 ext4:
> real  2m13.303s

[…]

> 4.18 ext4:
> real  4m3.276s

Absent anyone being able to make a suggestion of exactly what broke
here, perhaps you could build your own kernel packages and "git
bisect" until you find the culprit?

https://wiki.debian.org/DebianKernel/GitBisect

When doing this I also find that using ccache avoids having to
recompile absolutely everything all the time.
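
The rough shape of that workflow, assuming you bisect against the
upstream tags the Debian kernels are based on (the Debian packages
carry extra patches, so treat this only as an approximation), would
be something like:

  git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  cd linux
  git bisect start
  git bisect good v4.9     # last known fast kernel
  git bisect bad v4.18     # first known slow kernel
  # at each step: build a package, install it, reboot, run the benchmark
  make olddefconfig
  make -j"$(nproc)" CC="ccache gcc" bindeb-pkg
  git bisect good          # or: git bisect bad, depending on the result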

Cheers,
Andy

-- 
https://bitfolk.com/ -- No-nonsense VPS hosting



Re: filesystem slowdown with backports kernel

2018-10-17 Thread Reco
Hi.

On Wed, Oct 17, 2018 at 04:44:25PM +0200, Support (Jens) wrote:
> Hi,
> 
> >> we have a NAS system acting as a place to store our server's backups
> >> (via rsync with --link-dest). On that NAS we switched from the stable
> >> kernel (4.9) to the one provided by backports (4.18) because of an
> >> unrelated problem. Since then we have seen a slowdown of our backup
> >> process, from the backup via rsync itself to deleting old backup
> >> directories. The slowdown seems to be connected to the number of
> >> files/directories, as backups of systems with fewer files seem less
> >> affected than those with many files.
> >
> > I'd complement your tests with an invocation of 'perf record'/'perf top'
> > on the NFS server side.
> > The reason being that you'll be able to pinpoint the particular
> > kernel/userspace functions that are responsible for this slowdown.
> 
> there is no NFS in play; everything was tested locally. Or did I
> misinterpret your suggestion?

No, that was bad wording on my part. Old habit - it's not a NAS unless
it serves NFS, and all that. Disregard the 'NFS' part.

You have a host that serves the role of a fileserver and is
experiencing a slowdown. You run tests on this host with assorted
kernel versions, trying to locate the problematic ones.

Before running each of these tests once more, you start 'perf record'
in a separate shell and terminate it once the test is done. Next you
copy the resulting 'perf.data' file somewhere for safekeeping.
Rinse and repeat for each kernel/filesystem tested.
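
As a rough sketch (the '-a -g' options for system-wide call-graph
sampling and the file name are just examples, adjust to taste):

  # in a separate shell, before starting the test:
  perf record -a -g
  # ... run the test, then stop perf record with Ctrl-C ...
  cp perf.data perf-4.18-ext4.data    # one copy per kernel/filesystem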

Once all the needed kernel/filesystem combinations have been tested
once more, you use 'perf report' and 'perf annotate' to see the actual
userspace/kernel functions that were in play at the time of the tests.
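
For example (file names as in the sketch above):

  perf report -i perf-4.18-ext4.data     # ranked list of the hottest functions
  perf annotate -i perf-4.18-ext4.data   # per-instruction view of a chosen function

Comparing the top entries of the 4.9 and 4.18 captures should show
where the extra time is being spent.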

Reco



Re: filesystem slowdown with backports kernel

2018-10-17 Thread Support (Jens)
Hi,

>> we have a NAS system acting as a place to store our server's backups
>> (via rsync with --link-dest). On that NAS we switched from the stable
>> kernel (4.9) to the one provided by backports (4.18) because of an
>> unrelated problem. Since then we have seen a slowdown of our backup
>> process, from the backup via rsync itself to deleting old backup
>> directories. The slowdown seems to be connected to the number of
>> files/directories, as backups of systems with fewer files seem less
>> affected than those with many files.
>
> I'd complement your tests with an invocation of 'perf record'/'perf top'
> on the NFS server side.
> The reason being that you'll be able to pinpoint the particular
> kernel/userspace functions that are responsible for this slowdown.

there is no NFS in play; everything was tested locally. Or did I
misinterpret your suggestion?

Regards,
Jens



Re: filesystem slowdown with backports kernel

2018-10-17 Thread Reco
Hi.

On Wed, Oct 17, 2018 at 01:41:56PM +0200, Jens Holzkämper wrote:
> Hi,
> 
> we have a NAS system acting as a place to store our server's backups
> (via rsync with --link-dest). On that NAS we switched from the stable
> kernel (4.9) to the one provided by backports (4.18) because of an
> unrelated problem. Since then we have seen a slowdown of our backup
> process, from the backup via rsync itself to deleting old backup
> directories. The slowdown seems to be connected to the number of
> files/directories, as backups of systems with fewer files seem less
> affected than those with many files.

I'd complement your tests with an invocation of 'perf record'/'perf top'
on the NFS server side.
The reason being that you'll be able to pinpoint the particular
kernel/userspace functions that are responsible for this slowdown.
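
For example, something as simple as running this on the server while
the backup or a test is in progress ('-g' adds call-graph sampling):

  perf top -g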

Reco



filesystem slowdown with backports kernel

2018-10-17 Thread Jens Holzkämper
Hi,

we have a NAS system acting as a place to store our server's backups
(via rsync with --link-dest). On that NAS we switched from the stable
kernel (4.9) to the one provided by backports (4.18) because of an
unrelated problem. Since then we have seen a slowdown of our backup
process, from the backup via rsync itself to deleting old backup
directories. The slowdown seems to be connected to the number of
files/directories, as backups of systems with fewer files seem less
affected than those with many files.
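
For reference, the backup invocations are roughly of this shape (host
names and paths here are placeholders, not our real configuration):

  rsync -a \
      --link-dest=/backup/server1/2018-10-16/ \
      server1:/ /backup/server1/2018-10-17/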


So we started benchmarking, and the following seems to do the trick in
showing our problem by creating about 100k directories and files (10
top-level dirs, each holding 10,000 directories with a single file, so
they can be deleted in chunks between tries):

#!/bin/bash
time (
for i in {0..9}; do
    for j in {0..9999}; do
        mkdir -p "$i/$j"
        touch "$i/$j/1"
    done
done
)
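
Between tries we delete the tree again; the ten top-level directories
keep that step simple, e.g.:

  rm -rf {0..9}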


We get the following results (the variance between runs is within a few seconds):

4.9 ext4:
real    2m13.303s
user    0m4.976s
sys     0m20.424s

4.9 xfs:
real    2m7.416s
user    0m5.076s
sys     0m20.960s

4.18 ext4:
real    4m3.276s
user    2m46.401s
sys     1m12.546s

4.18 xfs:
real    3m53.430s
user    2m46.841s
sys     1m12.716s

That is roughly a doubling of the elapsed time, and quite an increase in user and sys time.


To rule out something like the Spectre/Meltdown mitigations, we tried
the oldest kernel package with a higher version than the one in stable
that we could find on http://snapshot.debian.org, from July 2017.

4.11 ext4:
real    3m28.443s
user    2m29.551s
sys     1m0.924s

4.11 xfs:
real    3m32.438s
user    2m31.349s
sys     1m3.333s

It's a little faster than 4.18, but the problem persists.


The NAS is using a software RAID 6 via MD, so we ran the same script
on a desktop system to rule out the RAID as the source of the problem,
and we see the same thing:

4.9 ext4 desktop:
real    2m22.525s
user    0m6.176s
sys     0m20.872s

4.18 ext4 desktop:
real    4m16.412s
user    3m2.282s
sys     1m19.308s


So to us it looks like something is seriously wrong somewhere, but we
no longer have a clue where exactly to look. Is the test flawed? Did
we miss something in the news about an expected slowdown? Or is it
really a bug, and if so, where can we look to locate it more precisely?

Thanks in advance,
Jens Holzkämper