Re: What belongs in the Debian cloud kernel?

2020-04-04 Thread Laurence Parry
> For buster, we generate a cloud kernel for amd64. For sid/bullseye,
> we'll also support a cloud kernel for arm64. At the moment, the cloud
> kernel is the only one used in the images we generate for Microsoft Azure
> and Amazon EC2. It's used in the GCE images we generate as well, but
> I'm not sure anybody actually uses those.

I use those, though I'm unsure if it's a level of usage that you'd consider
significant (I also use x32 to a similar extent :-).

I run a Munin master for 17 nodes on an f1-micro with buster-backports
cloud-amd64 - proxied via App Engine to get 1GB/day out for viewing graphs,
rather than 1GB/month/region. Works well enough that I didn't immediately
feel the need to pare it down to a bare minimum and roll my own like I do
with the regular kernels. (I may try to e.g. avoid SMP overhead or cut fs
features to increase inode/dentry slab density; but I'm not sure I can compile
it locally on 20% of a Skylake core with 1GB RAM, especially when it's already
using over half its available resources to generate/store graphs.)

My initrd (dep, xz) seems to have gone up from 4.66 MB on disk in
5.4.0-0.bpo.3-cloud-amd64 to 5.12 MB for bpo.4, but the kernel memory line
suggests the impact on RAM was minimal (about 4KB).
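
For context, the "dep, xz" above refers to settings along these lines in
/etc/initramfs-tools/initramfs.conf - my exact file may differ slightly:

  MODULES=dep    # include only the modules this machine needs
  COMPRESS=xz    # smaller initrd, at the cost of build time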

As for the more general question: to me, a 'cloud' machine is just a VM
offered via a cloud provider, supporting the management interfaces of such
providers out of the box. It may be a relatively 'large' machine, hosting
guests of its own. I wouldn't expect to see drivers for old
directly-attached hardware, but legacy or research filesystems, protocols, and
drivers used for access to enterprise external storage seem reasonable, as
cloud may be used to migrate older workloads or handle research projects.
"Bare metal cloud" may be outside its remit - though such machines will tend
to be newer hardware anyway.

Best regards,
-- 
Laurence "GreenReaper" Parry - Inkbunny administrator
https://www.greenreaper.co.uk/


Bug#953680: linux-image-5.5.0-rc5-amd64: When will the stable version of 5.5 reach the repos?

2020-03-30 Thread Laurence Parry
> Kernel 5.6 was released yesterday
> from upstream, so isn't it a bit late
> now for 5.5?

From what I've seen, it's not unusual for Debian's kernel team to wait
several minor point releases until there is a kernel they're happy with -
indeed, I wouldn't be surprised if the policy is to wait until the initial
version of the *next* major release is out.

Early kernel revisions tend to have a fair share of issues - often not
limited to the new features added in that major release. They are typically
also backported to otherwise stable systems, so picking a good revision has
some importance. Meanwhile, most fixes do make it into earlier stable
kernels.

To take one example, 5.5.9 had a fix for btrfs' new checksum feature, which
didn't work properly with direct I/O. So if you updated to 5.5 on its
initial release and made a new FS to use it, you might have had an
unpleasant surprise and spent a lot of time debugging a problem.
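
Selecting that feature happens at mkfs time and needs a recent btrfs-progs
as well; a minimal sketch, with xxhash as an example algorithm:

  mkfs.btrfs --csum xxhash /dev/sdX   # needs btrfs-progs >= 5.5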

There was also a nasty bug in early 5.2 releases which led to delayed
writes not being flushed to disk, and ultimately data loss. Unfortunately
in that case it was fixed too late and made it into Debian backports. But
usually, the delay helps to avoid that kind of thing.

Of course, if you really want a new feature, you can download and compile a
kernel yourself, either from kernel.org - which is what I ended up doing to
get those checksums - or using Debian's git repo. But that has the risk of
not-fully-working code, so I'll probably be sticking with the 5.5.13 I have
rather than going to 5.6 right away.
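
For anyone wanting to try that, a rough sketch of the kernel.org route - exact
version and config choices will vary:

  # build deps: build-essential libssl-dev libelf-dev flex bison bc
  wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.5.13.tar.xz
  tar xf linux-5.5.13.tar.xz && cd linux-5.5.13
  cp /boot/config-$(uname -r) .config   # start from the running Debian config
  make olddefconfig                     # accept defaults for any new options
  make -j"$(nproc)" bindeb-pkg          # builds installable .deb packages
  sudo dpkg -i ../linux-image-5.5.13*.deb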

As for a later release of 5.4, one of the Debian kernel team members
indicated on the list a few weeks ago that they preferred to go to 5.5,
which is what has ended up happening:
https://lists.debian.org/debian-kernel/2020/03/msg00086.html

Best regards,
--
Laurence "GreenReaper" Parry


Bug#940105: linux: serious corruption issue with btrfs

2019-09-16 Thread Laurence Parry
I had a look, and the fix appears to have gone into both 5.3 (final) and
5.2.15.

For what it's worth, it took only a day or so to exhibit the issue on our
(admittedly active) nginx/postgres/PHP server; we weren't doing any unusual
work during that time. If you're using btrfs, and you can't apply a patch
to the backports kernel, it'd be a good idea to revert to a 4.19 kernel for
the time being.
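
On buster, reverting is just a matter of installing the stock kernel and
booting into it - roughly, assuming the usual amd64 metapackage:

  apt install linux-image-amd64   # stock buster kernel, a 4.19.x release
  # then reboot and pick the 4.19 entry in GRUB if it is not the default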

-- 
Laurence "GreenReaper" Parry


Bug#940105: linux: serious corruption issue with btrfs

2019-09-16 Thread Laurence Parry
We seem to have run into this yesterday on a production server using a
custom compile of the 5.2.9 buster-backports kernel. nginx was hung in D
status, sync hung as well, no obvious reason for it; I ended up having to
reset the machine.

On boot I found we had lost several hours of logs and worse, several user
data files supposedly saved during that time. A small but noticeable
increase in iowait accompanied the start of the lost logs and continued
until the hang.

I guess I was lucky we did not lose more - at least, I *hope* not, because
it is hard to be sure of that without a byte-by-byte verification...

I don't know if it is related, but this kernel has also tended to
significantly increase the size of the dentry/inode slab caches
during rsync, resulting in fragmentation and a massive increase in the
amount of free memory (despite setting vm.vfs_cache_pressure=1 in an
attempt to remedy it).
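
For reference, that sysctl is set in the usual way - the file name here is
just an example:

  # /etc/sysctl.d/90-vfs-cache.conf
  vm.vfs_cache_pressure = 1   # prefer retaining dentry/inode caches
  # apply without a reboot: sysctl --system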

I can of course compile my own kernel, but I concur with Christoph's
assessment of this bug's severity and would encourage a new backports
release ASAP.
-- 
Laurence "GreenReaper" Parry - administrator, Inkbunny.net


Re: Persons involved in kernel development of debian os

2017-07-06 Thread Laurence Parry
> Can you please share with me the names of original team
> who developed debian os and its kernel.

There are several hundred Debian Developers and Maintainers:
https://www.debian.org/devel/

Since Debian was founded in the 1990s many have joined and left the project, 
but a few key names are here:
https://en.wikipedia.org/wiki/Debian#History

Debian is mostly put together from pieces developed by others.
Many developers only have responsibility for a small part of the system -
in the case of Maintainers, perhaps because they are involved
with an upstream project which actually develops the software.
What Debian brings to the table is organization into a product.

As for the kernel, it is Linux, with some Debian patches.
A very long list of Linux kernel maintainers is available at
https://github.com/torvalds/linux/blob/master/MAINTAINERS
It may be useful if you have questions about particular subsystems.

> I had some querry regarding BOSS os.

BOSS appears to be a derived distribution of Debian
https://en.wikipedia.org/wiki/Bharat_Operating_System_Solutions

BOSS's creators take parts of Debian and modify them to their needs, selecting
packages, adjusting software configuration and the user interface. They may
know lots about the software they use, or only about their modifications.

In some cases, it *may* help to talk to the people working on Debian or the 
Linux kernel itself. This mailing list is a good place to discuss issues 
with the releases of the Linux kernel provided by Debian. If BOSS directly 
uses these releases, it might be a good place to ask for help. But Debian 
developers do not know and are not responsible for anything BOSS does with 
Debian after it has left their hands, including specific kernel 
configurations or its interaction with BOSS software which might be causing 
any problems you have.

Similarly, Linux maintainers would not know about anything Debian or BOSS 
did. So it's a good idea to go up the tree slowly - ask BOSS, then Debian, 
then specific maintainers, rather than try to jump to the top. If you've 
explained a problem and the first person says "I don't know", it may be time 
to talk to someone higher.

Best regards,
-- 
Laurence "GreenReaper" Parry
greenreaper.co.uk - wikifur.com - flayrah.com - inkbunny.net
"Eternity lies ahead of us, and behind. Have you drunk your fill?" 



Re: works with amd ryzen in "stretch" ?

2017-06-24 Thread Laurence Parry
> Does the linux kernel work with amd ryzen processor
> series in Debian 9.0 "stretch" ?

It seems to test out fine:
http://www.phoronix.com/scan.php?page=article=amd-ryzen-6linux

A fix for Zen CPU multiprocessing topology was added in 4.10:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=08b259631b5a1d912af4832847b5642f377d9101

This was backported to the 4.9.x series stretch uses in 4.9.10.
Stretch has 4.9.30; so it - and jessie-backports - should work.

That doesn't mean that *everything* will work on your system. For example, 
some motherboards come with the Realtek ALC1220 audio chip. Apparently 
support for that was only added in 4.11, so you may need stretch-backports 
to get it to work (once that exists). Multi-IOMMU support went into the 
'tip' branch at the end of March, so . . . 4.13? 4.14? But I imagine few 
people actually need that feature.

It also doesn't mean that there won't be bugs. But before you blame the 
kernel or GCC for anything, be sure you're not overclocking the CPU or RAM. 
Just because you can, and it 'works', does not mean that it will work 
consistently under stress.

You'll probably want a BIOS update containing AGESA code from your 
motherboard manufacturer, too - I'd suggest the non-free amd64-microcode 
package, but it doesn't seem to contain anything for Zen yet. And 
firmware-amd-graphics if you're using an AMD card with it, etc.
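
Assuming non-free is already enabled in sources.list, that amounts to:

  apt install amd64-microcode firmware-amd-graphics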
-- 
Laurence "GreenReaper" Parry
greenreaper.co.uk - wikifur.com - flayrah.com - inkbunny.net
"Eternity lies ahead of us, and behind. Have you drunk your fill?" 



Bug#861964: Increased ext4_inode_cache size wastes RAM under default SLAB allocator

2017-05-06 Thread Laurence Parry
Package: linux-image-amd64
Version: 4.9+80

Debian's use of the SLAB allocator, combined with ongoing kernel changes, means
the ext4 inode cache wastes ~21% of the space allocated to it on recent amd64
kernels, a regression from the ~2% waste in jessie.

SLAB enforces a lowest-order allocation (i.e. a single 4KB page on x86[-64])
for slabs containing VFS-reclaimable objects such as ext4_inode_info:
http://elixir.free-electrons.com/linux/v4.9.25/source/mm/slab.c#L1827

In jessie's Linux 3.16 kernel, an ext4_inode_cache entry is ~1000 bytes, so 
four fit nicely in a slab. Additions to this structure and its members have 
increased it to ~1072 bytes in 4.9.25 (on a machine with 32 logical cores):

  # grep ext4_inode_cache /proc/slabinfo
  # name             <active_objs> <num_objs> <objsize> <objperslab> …
  ext4_inode_cache        956         987       1072         3        …

…leaving 880 bytes wasted per slab in Debian stretch (and jessie-backports).
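
For clarity, the waste works out as:

  stretch: 4096 - (3 x 1072)  = 880 bytes wasted per 4KB slab (880/4096 ~ 21%)
  jessie:  4096 - (4 x ~1000) = ~96 bytes wasted per 4KB slab ( 96/4096 ~  2%)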

Having 3 objects vs. 4 per slab may reduce internal fragmentation, but 
inodes can't linger for as long, and creating them evicts data, leading to 
increased disk activity. Slab cache allocation takes time; and if the slabs 
were denser, more inodes (or other content) could fit in CPU cache.

By comparison, mainline's default SLUB allocator (used by Ubuntu) seems to 
use a 4 page/16KB or 8 page/32 KB slab size, which fits 15/30 
ext4_inode_cache objects. This has also decreased since 3.16, but it is not 
as wasteful.

Inode cache size is initially small, but may grow to ~50% of RAM under heavy 
workloads, e.g. fileserver rsync.

== Possible workarounds/resolutions ==

A custom-compiled kernel with the right options reduces ext4_inode_cache 
object size below 1000 bytes - for me, it cut ~160MB from slab_cache on an 
active 32GB web app/file server with nightly rsync. (It may reduce CPU and 
disk utilization, but the load in question is not constant enough to 
benchmark.)

Some flags have a big impact on ext4_inode_info
(and subsidiary structs such as rw_semaphore):
http://elixir.free-electrons.com/linux/v4.9.25/source/fs/ext4/ext4.h#L937

The precise sizes change with kernel version and CPU configuration. For 
jessie-backports' Linux 4.7.8, disabling both
* EXT4 encryption (CONFIG_EXT4_FS_ENCRYPTION) _and_ either:
  a) VFS quota (CONFIG_QUOTA; OCFS2 must be disabled first), or
  b) Optimistic rw_semaphore spinning (CONFIG_RWSEM_SPIN_ON_OWNER)
reduced ext4_inode_cache objects to 1008-1016 bytes; sufficient to fit four 
inodes in a slab. It worked on 4.8.7 as well, reducing size to exactly 1024.
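
As a rough sketch of toggling those from a Debian config (the semaphore option
is the awkward one, as noted below, since it has no prompt):

  cp /boot/config-$(uname -r) .config
  scripts/config --disable EXT4_FS_ENCRYPTION
  scripts/config --disable OCFS2_FS --disable QUOTA   # OCFS2 selects QUOTA
  make olddefconfig   # re-resolve dependencies after the changes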

But custom compilation is time-consuming and workload-dependent. Tossing 
ext4 encryption and quota is fine for our purposes, but Debian may not want 
to.

Disabling optimistic semaphore owner spinning - perhaps under a certain 
number of cores? - may be part of a general solution; there's no menu option 
for CONFIG_RWSEM_SPIN_ON_OWNER, so it has to be set in the build config, or 
possibly on the command line.

https://lkml.org/lkml/2014/8/3/120 suggests optimistic spinning improves some
contention-heavy workloads - or at least benchmarks thereof - but it may not
be worth the trade-off by default. Incidentally, I found zero documentation 
that this may negatively impact memory usage.

Getting into more significant code changes: Ted Ts'o shrunk ext4_inode_info 
by 8% six years ago:
http://linux-ext4.vger.kernel.narkive.com/D3sK9Flg/patch-0-6-shrinking-the-size-of-ext4-inode-info

…but it has since grown ~22%, due to features such as ext4 encryption, 
project-based quota, and the aforementioned optimistic spinning on the three 
read-write semaphores in the struct:
https://github.com/torvalds/linux/commit/4fc828e24cd9c385d3a44e1b499ec7fc70239d8a
https://github.com/torvalds/linux/commit/ce069fc920e5734558b3d9cbef1ab06cf01ee793
https://lwn.net/Articles/697603/

Ted mentioned that "it would be possible to further slim down the 
ext4_inode_cache by another 100 bytes or so, by breaking the ext4_inode_info 
into the portion of the inode required [when] a file is opened for writing, 
and everything else."

This might be worth it, given that we're on the borderline, and particularly 
if rw_semaphore is included; there are attempts to make those even bigger:
http://lists-archives.com/linux-kernel/28643980-locking-rwsem-enable-count-based-spinning-on-reader.html

Adding a define to configure out project quota (kprojid_t i_projid) may cut
a few bytes - or maybe more given alignment? I don't know if this would have 
a negative impact on filesystems which used them, other than the feature not 
working. At least it would give another knob to tweak.

Adjusting struct alignment may also be beneficial, either in all cases or 
based on the presence/absence of flags, as in 
https://patchwork.ozlabs.org/patch/62051/

ext4_inode_info appears to contain a copy of the 256-byte on-disk format. 
Maybe it's feasible to use some of this in-place rather than duplicating it 
and writing it back later? Or it could be separated into its own object; 
it's a nice round size. (In-place use may …

Bug#851119: memcg causing freezes on backports kernel

2017-04-15 Thread Laurence Parry
memory cgroups decreased the stability of 4.8, regardless of whether they 
were actively being used; we were not actively using them, but still 
experienced lockups. Disabling them on the kernel command line fixed it.

I'm not sure if it's been fixed in 4.9, because I removed memory cgroups
from our kernel compilation entirely. It'd certainly be worth trying 4.9.18
if you haven't already, as I know they were working on the problem.
-- 
Laurence "GreenReaper" Parry - Inkbunny administrator
greenreaper.co.uk - wikifur.com - flayrah.com - inkbunny.net
"Eternity lies ahead of us, and behind. Have you drunk your fill?" 



Re: Bug#851119: 4.8.11 bpo8: "list_del corruption" and "CPU#x got stuck for 22s"

2017-01-12 Thread Laurence Parry
> I have the impression that there is an increased probability
> for linux-image-4.8.0-0.bpo.2-amd64-unsigned (4.8.11-1~bpo8+1)
> to run into "CPU#x got stuck for 22s" messages.

This appears to be due to the cgroups memory controller. I've had good
experience with adding cgroup_disable=memory on the kernel boot line.
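
On a stock Debian install that means appending it to the kernel command line
in /etc/default/grub, something like:

  GRUB_CMDLINE_LINUX_DEFAULT="quiet cgroup_disable=memory"
  # then run update-grub and reboot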

You can also compile your own kernel without the memory controller,
which I recommend if you have the capability, as it cuts out a non-trivial
amount of kernel code.

Removing cgroups *entirely* may cause systemd to fall over, but AFAIK you
only need CONFIG_CGROUPS itself and not all the secondary controllers. Or just
uninstall systemd and install sysvinit-core again.

[If recompiling, you might want to disable disk quota and EXT4 encryption if
you don't use them, and reduce the number of supported CPUs to the number
you have, including hyperthreads. This brings the ext4 inode size back below
1024 bytes, so four fit into a 4KB slab again - they no longer do by default,
which is a regression since jessie.]

-- 
Laurence "GreenReaper" Parry - Inkbunny administrator
greenreaper.co.uk - wikifur.com - flayrah.com - inkbunny.net
"Eternity lies ahead of us, and behind. Have you drunk your fill?" 



jessie-backports: linux 4.7 686-pae triggering eth errors, nginx OOM

2016-09-19 Thread Laurence Parry

Whoever handles this (Zumbi?) might want to hold off on making
linux-image-686-pae in jessie-backports point to
linux-image-4.7.0-0.bpo.1-686-pae-unsigned (or similar).

I run nginx caches on Debian Jessie which I keep up to date
with backports kernels. I saw the 4.7 above, so I tried it, but
I've started seeing a bunch of eth0 errors since then on x86:
https://dl.dropboxusercontent.com/u/14570328/if_err_eth0-4.7.png

I have not seen this on our 64-bit systems running 4.7, but
these have far more RAM and perhaps less pressure on it.

I'm also getting OOMs which did not previously
happen AFAIK, e.g. http://pastebin.com/SFErSKGS

These are small VMs under heavy memory pressure.
I believe RAM is often fragmented, which makes me think 
the memory compaction issue mentioned here may be it:

https://lkml.org/lkml/2016/8/22/184

I cannot say for sure that this is the problem. Still,
eth0 errors seem to have vanished once I reverted to 4.6.

I have vm.min_free_kbytes = 8192 and I would test
to see if removing that helps, but I have a flight to catch. :-)

nginx is also set up to use aio writes and threads rather than sendfile
(it's serving over HTTPS), plus tcp_nopush/tcp_nodelay.
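
The relevant part of the nginx config looks roughly like this (the rest of
the server block omitted):

  sendfile    off;       # responses go out over HTTPS
  aio         threads;   # hand blocking I/O to a thread pool
  aio_write   on;        # use AIO for writing, too
  tcp_nopush  on;
  tcp_nodelay on;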

Best regards,
--
Laurence "GreenReaper" Parry - Inkbunny administrator
greenreaper.co.uk - wikifur.com - flayrah.com - inkbunny.net
"Eternity lies ahead of us, and behind. Have you drunk your fill?"