Bug#892105: Cherry-pick "i40e: Be much more verbose about what we can and cannot offload"

2021-06-22 Thread Philipp Hahn
that cherry-picking f114dca2533ca770aebebffb5ed56e5e7d1fb3fb
on top of v4.9.273 fixes the problem, and reverting it makes the
problem reappear.


Philipp
--
Philipp Hahn
Open Source Software Engineer

Univention GmbH
be open.
Mary-Somerville-Str. 1
D-28359 Bremen

📞 +49-421-22232-57
🖶 +49-421-22232-99

✉️ h...@univention.de
🌐 https://www.univention.de/

Geschäftsführer: Peter H. Ganten
HRB 20755 Amtsgericht Bremen
Steuer-Nr.: 71-597-02876



Bug#931111: linux-image-4.9.0-9: Memory "leak" caused by CGroup as used by pam_systemd

2019-07-30 Thread Philipp Hahn
Hello,

On 24.07.19 at 16:41, Roman Gushchin wrote:
> On Wed, Jul 24, 2019 at 09:12:50AM +0200, Philipp Hahn wrote:
>> On 24.07.19 at 00:03, Ben Hutchings wrote:
...
>>> I would say this is a kernel bug.  I think it's the same problem that
>>> this patch series is trying to solve:
>>> https://lwn.net/ml/linux-kernel/20190611231813.3148843-1-guro@fb.com/
>>>
>>> Does the description there seem to match what you're seeing?
>>
>> Yes, Roman Gushchin replied to me by private mail, which I will quote
>> here to get his response archived in Debian's BTS as well:
...
>>> I've spent lot of time working on this problem, and the final patchset
>>> has been merged into 5.3. It implements reparenting of the slab memory
>>> on cgroup deletion. 5.3 should be much better in reclaiming dying cgroups.
>>>
>>> Unfortunately, the patchset is quite invasive and is based on some
>>> vmstats changes from 5.2, so it's not trivial to backport it to
>>> older kernels.
>>>
>>> Also, there is no good workaround, only manually dropping kernel
>>> caches or disable the kernel memory accounting as a whole.
...
>> So should someone™ bite the bullet and try to backport Roman's changes
>> to 4.19 (and 4.9)? (Those are the kernel versions used by Debian.)
>> I'm not a kernel expert myself, and certainly no mm/cg expert. I have
>> done some kernel work in the past, but I would happily pass the
>> chalice on to someone more experienced.
> 
> It's doable from the technical point of view, but I really doubt it's suitable
> for the official stable. The backport will consist of at least 20+ core
> mm/memcontrol patches, so it really feels excessive.
> 
> If you still want to try, you need to backport 205b20cc5a99 first (and the
> rest of the patchset), but it may also depend on some other vmstat changes.

I haven't yet started on trying the backport, but is there some process
to force free those dying cgroups manually?

I have found yet another report of this issue at
<https://github.com/moby/moby/issues/29638#issuecomment-514287415> and
there a cron-job

> 6 */12 * * * root echo 3 > /proc/sys/vm/drop_caches

is recommended. I tried that manually on one of our affected systems, and
the number of memory cgroups dropped only marginally, from 211_620 to
210_396, even after running `drop_caches` several times and waiting for
10 minutes. On that idle system a lot of RAM is gone:
> # free -h
>               total        used        free      shared  buff/cache   available
> Mem:           141G         60G         80G         15M        755M         80G
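
To watch whether such a cron job has any effect over time, the count can be
sampled from /proc/cgroups. The following is only a minimal Python sketch and
not something from this thread: it assumes the usual four-column layout
(subsys_name, hierarchy, num_cgroups, enabled) and that the numbers quoted
above were taken from that file, which the mail does not state. Both function
names are made up for illustration.

```python
def memory_cgroup_count(proc_cgroups_text):
    """Return num_cgroups for the 'memory' controller, or None if absent.

    Assumes the usual /proc/cgroups columns:
    subsys_name hierarchy num_cgroups enabled
    """
    for line in proc_cgroups_text.splitlines():
        fields = line.split()
        # The header line starts with '#subsys_name' and never matches.
        if len(fields) >= 3 and fields[0] == "memory":
            return int(fields[2])
    return None


def relative_drop(before, after):
    """Fraction by which the count shrank between two samples."""
    return (before - after) / before
```

Feeding it the contents of /proc/cgroups before and after writing to
drop_caches gives two samples; for the values quoted above,
relative_drop(211_620, 210_396) comes out at well under one percent.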

Thanks again for all your help.

Philipp



Bug#931111: linux-image-4.9.0-9: Memory "leak" caused by CGroup as used by pam_systemd

2019-07-24 Thread Philipp Hahn
Hello Ben,

On 24.07.19 at 00:03, Ben Hutchings wrote:
> On Tue, 2019-07-23 at 15:56 +0200, Philipp Hahn wrote:
> [...]
>> - when the job / session terminates, the directory is deleted by
>> pam_systemd.
>>
>> - but the Linux kernel still uses the CGroup to track kernel internal
>> memory (SLAB objects, pending cache pages, ...?)
>>
>> - inside the kernel the CGroup is marked as "dying", but it is only
>> garbage collected much later
> [...]
>> I do not know who is at fault here, whether it is
>> - the Linux kernel for not freeing those resources earlier
>> - systemd for using CGs in a broken way
>> - someone else's fault.
> [...]
> 
> I would say this is a kernel bug.  I think it's the same problem that
> this patch series is trying to solve:
> https://lwn.net/ml/linux-kernel/20190611231813.3148843-1-g...@fb.com/
> 
> Does the description there seem to match what you're seeing?

Yes, Roman Gushchin replied to me by private mail, which I will quote
here to get his response archived in Debian's BTS as well:

> Hi Philipp!
> 
> Thank you for the report!
> 
> I've spent lot of time working on this problem, and the final patchset
> has been merged into 5.3. It implements reparenting of the slab memory
> on cgroup deletion. 5.3 should be much better in reclaiming dying cgroups.
> 
> Unfortunately, the patchset is quite invasive and is based on some
> vmstats changes from 5.2, so it's not trivial to backport it to
> older kernels.
> 
> Also, there is no good workaround, only manually dropping kernel
> caches or disable the kernel memory accounting as a whole.
> 
> Thanks!


段熊春 also replied and pointed out his
patch-set <https://patchwork.kernel.org/cover/10772277/>, which solved
the problem for them. It looks more like a "hack" and was never applied
upstream, as Roman's work solved the underlying problem.


So should someone™ bite the bullet and try to backport Roman's changes
to 4.19 (and 4.9)? (Those are the kernel versions used by Debian.)
I'm not a kernel expert myself, and certainly no mm/cg expert. I have
done some kernel work in the past, but I would happily pass the
chalice on to someone more experienced.

Thanks for all your replies - I really appreciate your help.
Philipp



Bug#931111: linux-image-4.9.0-9: Memory "leak" caused by CGroup as used by pam_systemd

2019-07-23 Thread Philipp Hahn
Hi,

I analyzed the issue and the problem seems to be CGroup related:

- we're using 'pam_systemd' in "/etc/pam.d/common-session"

- each cron-job / login then creates a new CGroup below
"/sys/fs/cgroup/systemd/user.slice/" while that job / session is running

- when the job / session terminates, the directory is deleted by
pam_systemd.

- but the Linux kernel still uses the CGroup to track kernel internal
memory (SLAB objects, pending cache pages, ...?)

- inside the kernel the CGroup is marked as "dying", but it is only
garbage collected much later

- until then it adds to memory pressure and very slowly pushes the
system into swap.


I back-ported the patch
<https://www.spinics.net/lists/cgroups/msg20611.html> from Roman
Gushchin to add some extra debugging, which indeed shows a large number
of "dying" cgroups:

> # find /sys/fs/cgroup/memory -name cgroup.stat -exec grep '^nr_dying_descendants [^0]' {} +
>   /sys/fs/cgroup/memory/cgroup.stat:nr_dying_descendants 360
>   /sys/fs/cgroup/memory/user.slice/cgroup.stat:nr_dying_descendants 320
>   /sys/fs/cgroup/memory/user.slice/user-0.slice/cgroup.stat:nr_dying_descendants 303
>   /sys/fs/cgroup/memory/system.slice/cgroup.stat:nr_dying_descendants 40
> # grep ^memory /proc/cgroups 
>   memory  10  452 1
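
The same check can be scripted for periodic monitoring. This is a rough
Python sketch under the assumption that the kernel exposes cgroup.stat with
"key value" lines (on the 4.9 kernel discussed here that requires the
back-ported patch); the helper names are hypothetical and not part of any
existing tool.

```python
import os


def parse_cgroup_stat(text):
    """Parse the 'key value' lines of a cgroup.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def dying_descendants(root="/sys/fs/cgroup/memory"):
    """Yield (path, count) for every cgroup.stat below root that reports dying cgroups."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if "cgroup.stat" not in filenames:
            continue
        path = os.path.join(dirpath, "cgroup.stat")
        try:
            with open(path) as handle:
                stats = parse_cgroup_stat(handle.read())
        except OSError:
            continue  # the cgroup may disappear while the tree is walked
        count = stats.get("nr_dying_descendants", 0)
        if count:
            yield path, count
```

Printing the pairs from dying_descendants() once a minute should show the
same growth as the find/grep invocation above, without spawning a process
per file.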

Removing "pam_systemd" from PAM makes the problem go away.

Later Debian kernels are compiled with "CONFIG_MEMCG_KMEM=y", which
prompted me to add "cgroup.memory=nokmem" to the kernel command line.
This also seems to reduce the problem, but I'm not 100% convinced that
it really improves the situation.


I do not have a very good reproducer, but creating a cron-job with just
> * * *  * *  root  dd if=/dev/urandom of=/var/tmp/test-$$ count=1 >/dev/null

will most often increase the number of dying CGs every minute by one.


I do not know who is at fault here, whether it is
- the Linux kernel for not freeing those resources earlier
- systemd for using CGs in a broken way
- someone else's fault.

Clearly this is not good, and I would like to receive some feedback on
what could be done to solve this issue, as running cron jobs is user
exploitable and can be used to DoS the system.
While looking for existing bug reports I stumbled over 912411 in Debian,
which also claims that there is a CG related leak - with Linux 4.19.x.

Should "pam_systemd" maybe do something like this before deleting the CG
directory:
> echo 0 >/sys/fs/cgroup/memory/.../memory.force_empty


Some more details are available at our bug-tracker at
<https://forge.univention.org/bugzilla/show_bug.cgi?id=49614#c5>.

Debian-Bugs:
* <https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=93>
* <https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=912411>

Sincerely
Philipp
From 0679dee03c6d706d57145ea92c23d08fa10a1999 Mon Sep 17 00:00:00 2001
Message-Id: <0679dee03c6d706d57145ea92c23d08fa10a1999.1562083574.git.h...@univention.de>
From: Roman Gushchin 
Date: Wed, 2 Aug 2017 17:55:29 +0100
Subject: [PATCH] cgroup: keep track of number of descent cgroups

Keep track of the number of online and dying descent cgroups.

This data will be used later to add an ability to control cgroup
hierarchy (limit the depth and the number of descent cgroups)
and display hierarchy stats.

Signed-off-by: Roman Gushchin 
Suggested-by: Tejun Heo 
Signed-off-by: Tejun Heo 
Cc: Zefan Li 
Cc: Waiman Long 
Cc: Johannes Weiner 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Philipp Hahn 
Url: https://www.spinics.net/lists/cgroups/msg20611.html
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4922,6 +4922,18 @@ static struct cftype cgroup_dfl_base_fil
 	{ }	/* terminate */
 };
 
+static int cgroup_stat_show(struct seq_file *seq, void *v)
+{
+	struct cgroup *cgroup = seq_css(seq)->cgroup;
+
+	seq_printf(seq, "nr_descendants %d\n",
+		   cgroup->nr_descendants);
+	seq_printf(seq, "nr_dying_descendants %d\n",
+		   cgroup->nr_dying_descendants);
+
+	return 0;
+}
+
 /* cgroup core interface files for the legacy hierarchies */
 static struct cftype cgroup_legacy_base_files[] = {
 	{
@@ -4964,6 +4976,10 @@ static struct cftype cgroup_legacy_base_
 		.write = cgroup_release_agent_write,
 		.max_write_len = PATH_MAX - 1,
 	},
+	{
+		.name = "cgroup.stat",
+		.seq_show = cgroup_stat_show,
+	},
 	{ }	/* terminate */
 };
 
@@ -5063,9 +5079,15 @@ static void css_release_work_fn(struct w
 		if (ss->css_released)
 			ss->css_r

Bug#931111: linux-image-4.9.0-9-amd64: Memory leak - fixed with 4.9.174 (or earlier)

2019-06-26 Thread Philipp Hahn
Package: linux-image-4.9.0-9-amd64
Version: 4.9.168-1+deb9u3
Severity: important

Dear fellow DDs,

we (Univention GmbH) have several reports of a memory leak with 4.9.168
as shipped by Debian (and used by us):

https://help.univention.com/t/memoryleak-auf-slave-contoller/11892/8
https://forge.univention.org/bugzilla/show_bug.cgi?id=49614

I was not yet able to pin-point the area where the leak occurs, but the
bug seemed to be fixed after switching to 4.9.174. (I hand-applied the
incremental patches on top of Debian's 4.9.168, fixing the rejects, and
enabled KMEMLEAK.) The fix may be earlier than 174; I chose that version
based on a hint from one of our customers, who reported 174 to be fixed.

4.9.144 was also fine (at least no leak was observed).

Stopping all processes did NOT free the memory again.
My interpretation of this is, that the leak is not in user-space, but in
kernel-land.

 does not
show any leaks so far.

One of the affected systems is one of our own, where I can do some
limited testing, e.g. install a self-compiled kernel version.
As I don't have a (simple) reproducer, running `git bisect` is rather
inefficient.
So if you have any idea what to test or where I can help, just ask.


-- System Information:
Debian Release: 9.9
  APT prefers 
  APT policy: (500, 'stretch')
Architecture: amd64 (x86_64)

Kernel: Linux 4.9.0-9-amd64 (SMP w/2 CPU cores)
Locale: LANG=de_DE.UTF-8, LC_CTYPE=de_DE.UTF-8 (charmap=UTF-8), 
LANGUAGE=de_DE.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages linux-image-4.9.0-9-amd64 depends on:
ii  initramfs-tools [linux-initramfs-tool]  0.130
ii  kmod                                    23-2
ii  linux-base                              4.5

Versions of packages linux-image-4.9.0-9-amd64 recommends:
ii  firmware-linux-free  3.4
ii  irqbalance   1.1.0-2.3

Versions of packages linux-image-4.9.0-9-amd64 suggests:
pn  debian-kernel-handbook  <none>
ii  grub-pc                 2.02~beta3-5+deb9u1A~4.3.3.201812061306
pn  linux-doc-4.9           <none>

Versions of packages linux-image-4.9.0-9-amd64 is related to:
ii  firmware-amd-graphics     20161130-5
ii  firmware-atheros          20161130-5
ii  firmware-bnx2             20161130-5
ii  firmware-bnx2x            20161130-5
ii  firmware-brcm80211        20161130-5
ii  firmware-cavium           20161130-5
pn  firmware-intel-sound      <none>
ii  firmware-intelwimax       20161130-5
pn  firmware-ipw2x00          <none>
pn  firmware-ivtv             <none>
ii  firmware-iwlwifi          20161130-5
ii  firmware-libertas         20161130-5
ii  firmware-linux-nonfree    20161130-5
ii  firmware-misc-nonfree     20161130-5
ii  firmware-myricom          20161130-5
ii  firmware-netxen           20161130-5
ii  firmware-qlogic           20161130-5
ii  firmware-realtek          20161130-5
pn  firmware-samsung          <none>
pn  firmware-siano            <none>
ii  firmware-ti-connectivity  2016113



Bug#892654: nfs-kernel-server: Mismatching [RPC]SVCGSSDOPTS defaults

2018-03-11 Thread Philipp Hahn
Package: nfs-kernel-server
Version: 1:1.3.4-2.1
Severity: normal

Dear Maintainer,

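The options for rpc.svcgssd are not used, because the variable names do not
match: /etc/default/nfs-kernel-server sets RPCSVCGSSDOPTS, nfs-utils_env.sh
writes it out as RPCSVCGSSDARGS, but rpc-svcgssd.service expands $SVCGSSDARGS: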

debian/nfs-kernel-server.default:
> 18 # Options for rpc.svcgssd.
> 19 RPCSVCGSSDOPTS=""
 ^^^
systemd/nfs-config.service:
> 13 ExecStart=/usr/lib/systemd/scripts/nfs-utils_env.sh
debian/nfs-utils_env.sh:
>  7 [ -r /etc/default/nfs-kernel-server ] && . /etc/default/nfs-kernel-server
> 15 echo RPCSVCGSSDARGS=\"$RPCSVCGSSDOPTS\"
  ^^^   ^^^
> 16 } > /run/sysconfig/nfs-utils
systemd/rpc-svcgssd.service:
> 18 EnvironmentFile=-/run/sysconfig/nfs-utils
> 20 ExecStart=/usr/sbin/rpc.svcgssd $SVCGSSDARGS
  ^^^
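
The effect of the name mismatch can be modelled in a few lines. The Python
sketch below is a simplification written for this report, not code from
nfs-utils or systemd: it assumes that an unset variable in ExecStart= expands
to an empty string, the "-vvv" value is a hypothetical admin setting, and the
"consistent" variant is only one conceivable way to line the names up.

```python
import re


def write_env_file(defaults):
    """Model what nfs-utils_env.sh writes to /run/sysconfig/nfs-utils."""
    return {"RPCSVCGSSDARGS": defaults.get("RPCSVCGSSDOPTS", "")}


def expand_exec_start(command, env):
    """Simplified $VAR expansion: variables missing from env become empty."""
    return re.sub(r"\$(\w+)", lambda m: env.get(m.group(1), ""), command)


defaults = {"RPCSVCGSSDOPTS": "-vvv"}  # hypothetical setting in /etc/default
env = write_env_file(defaults)

# Unit file as quoted above: the variable name lacks the RPC prefix.
as_shipped = expand_exec_start("/usr/sbin/rpc.svcgssd $SVCGSSDARGS", env)

# One possible consistent naming (an assumption, not the maintainer's fix).
consistent = expand_exec_start("/usr/sbin/rpc.svcgssd $RPCSVCGSSDARGS", env)
```

In this model as_shipped ends up without any option, while consistent
carries the configured value through to the command line.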

-- System Information:
Debian Release: 9.3
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 4.9.0-5-amd64 (SMP w/4 CPU cores)
Locale: LANG=de_DE.UTF-8, LC_CTYPE=de_DE.UTF-8 (charmap=UTF-8), 
LANGUAGE=de:en_US (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)



Re: UPDATE: Re: Proposal: UEFI secure boot implementation sprint, 5-8 April 2018

2018-01-28 Thread Philipp Hahn
Hello,

On 28.01.2018 at 12:46, Steve McIntyre wrote:
> HOWEVER... those dates clash with another group already booked in at
> the LinuxHotel in Essen. So we've started talking to people in Fulda
> as a fallback option. We have a tentative space for a venue and now we
> need to book hotel rooms. So... If you're *definitely* planning on
> coming for the sprint (Fulda, Germany, 5-8 April) please reply to this
> mail to confirm in the next few days and we will organise. Now is a
> good time to book travel.

Yes, I will definitely attend.

Philipp



State of SecureBoot for Debian GNU/Linux?

2017-12-11 Thread Philipp Hahn
Hello fellow DDs,

I'm employed by Univention and we ship our Debian based distribution
"Univention Corporate Server" with UEFI support. Currently we're
building our own Linux kernel and have our own signed version of shim
and grub.

For our next release, which will be based on Debian Stretch, we're
considering switching to the Debian-provided kernel.

I've read <https://wiki.debian.org/SecureBoot> and
<https://bugs.debian.org/820036>, but what's the current state of Secure
Boot in Debian?

We have successfully gone through the Microsoft SHIM review process once
and are currently waiting for the review of our SHIM-13 by the
shim-review list, so we have an EV certificate and I also have some
knowledge of the process myself.

Can I (we) help with improving the situation of Secure-Boot in Debian?

Sincerely
Philipp



Bug#822575: linux-4.1: UEFI root-fb vs. cirrusfb

2016-11-15 Thread Philipp Hahn
Hello David,

On 15.11.2016 at 15:57, David Herrmann wrote:
> On Fri, Oct 28, 2016 at 2:24 PM, Philipp Hahn wrote:
>> while experimenting with UEFI and secure-boot I stumbled into the issue
>> "cirrusdrmfb broken with simplefb" associated with your name:
>> <https://groups.google.com/forum/#!msg/linux.kernel/tD2UEqra-wU/u6NkZY8o5YEJ>
>>
>> I'm using a Debian based linux-4.1.38 kernel, which has
>>> # zgrep -E 'CONFIG_X86_SYSFB|CONFIG_FB_SIMPLE' /boot/config-`uname -r`
>>> CONFIG_X86_SYSFB=y
>>> CONFIG_FB_SIMPLE=y
>>
>> I found that SUSE bug
>> <https://bugzilla.novell.com/show_bug.cgi?id=855821>, where Takashi Iwai
>> finally disabled those options.
>>
>> I also checked Debians latest linux-4.7 kernel in Debian-sid, which
>> still has this setting. So my questions are:
>> 1. Should Debian disable those options for x86?
>> 2. What would Debian lose?
>> 3. Or is that issue fixed otherwise in newer kernels?
> 
> Right now CONFIG_X86_SYSFB should remain disabled. Once the SimpleDRM
> driver is upstream, there will be infrastructure to do the hw
> handover. Right now, it breaks if you hand over hw from one driver to
> another.

@David: Thank you for your feedback.

@Debian: Please disable CONFIG_X86_SYSFB in Debian for all future builds -
maybe except arch=arm.

Philipp Hahn



Bug#822575: fb: switching to cirrusdrmfb from simple

2016-11-15 Thread Philipp Hahn
Source: linux
Version: 4.9~rc5-1~exp1
Followup-For: Bug #822575

Dear Maintainer,

I have the same problem with cirrusfb: After that last message the
screen does not update anymore (but I can login through ssh).

After lots of research I think the problem is related to
CONFIG_X86_SYSFB being enabled in Debian by default:

$ git rev-parse HEAD # git://anonscm.debian.org/kernel/linux.git
6c0c9bcf78dfc886907d006b8cb6c2ea0f075a62
$ git grep -n -F -e CONFIG_X86_SYSFB -e CONFIG_FB_SIMPLE
config/armhf/config:1187:CONFIG_FB_SIMPLE=y
config/kernelarch-x86/config:73:CONFIG_X86_SYSFB=y
config/kernelarch-x86/config:1776:CONFIG_FB_SIMPLE=y
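
For checking installed kernels rather than the packaging repository, the
same lookup can be done on a /boot/config-* file. This is a minimal Python
sketch with hypothetical helper names; it only reads "NAME=value" lines and
makes no claim about how the Debian tooling evaluates its config fragments.

```python
def enabled_options(config_text, names):
    """Return {name: value} for the given CONFIG_ options set in a kernel config."""
    found = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # also skips '# CONFIG_FOO is not set'
        key, sep, value = line.partition("=")
        if sep and key in names:
            found[key] = value
    return found


def sysfb_enabled(config_text):
    """True if CONFIG_X86_SYSFB=y, the setting this report asks to disable."""
    opts = enabled_options(config_text, {"CONFIG_X86_SYSFB"})
    return opts.get("CONFIG_X86_SYSFB") == "y"
```

Passing the text of the running kernel's config file to sysfb_enabled()
tells whether that kernel is built with the option discussed here.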

By default the video RAM is claimed by the boot framebuffer:
> # cat /proc/fb
> 0 simple
> # grep 8000 /proc/iomem
> 8000-febf : PCI Bus :00
>   8000-81ff : :00:02.0
> 8000-801d4fff : BOOTFB

When cirrusdrmfb loads, it disables the bootfb and tries to claim that
region. As it is still held by bootfb, this fails:
> # modprobe cirrus
> # dmesg
> [  263.171744] checking generic (8000 1d5000) vs hw (8000 200)
> [  263.171753] fb: switching to cirrusdrmfb from simple
> [  263.171838] Console: switching to colour dummy device 80x25
> [  263.177052] [drm:cirrus_device_init [cirrus]] *ERROR* can't reserve VRAM
> [  263.177066] cirrus :00:02.0: Fatal error during GPU init: -6
> [  263.177072] Trying to free nonexistent resource <82029000-82029fff>
> [  263.177080] Trying to free nonexistent resource <8000-81ff>

/proc/fb is empty afterwards, as bootfb remains disabled.

My solution was to remove cirrus.ko and cirrusfb.ko from
/lib/modules/`uname -r`/kernel/ and to re-build the initramfs. That
prevented cirrus from being loaded, leaving the simple-frame-buffer intact.
Loading the module by hand breaks the console again.

I found that SUSE bug
<https://bugzilla.novell.com/show_bug.cgi?id=855821>, where Takashi Iwai
finally disabled those options for OpenSUSE.

Ubuntu first disabled the Cirrus DRM driver, but later changed to only
disabling X86_SYSFB.

So probably it's best to disable X86_SYSFB:
> config X86_SYSFB
>   bool "Mark VGA/VBE/EFI FB as generic system framebuffer"
>   help
> Firmwares often provide initial graphics framebuffers so the BIOS,
> bootloader or kernel can show basic video-output during boot for
> user-guidance and debugging. Historically, x86 used the VESA BIOS
> Extensions and EFI-framebuffers for this, which are mostly limited
> to x86.
> This option, if enabled, marks VGA/VBE/EFI framebuffers as generic
> framebuffers so the new generic system-framebuffer drivers can be
> used on x86. If the framebuffer is not compatible with the generic
> modes, it is adverticed as fallback platform framebuffer so legacy
> drivers like efifb, vesafb and uvesafb can pick it up.
> If this option is not selected, all system framebuffers are always
> marked as fallback platform framebuffers as usual.
> 
> Note: Legacy fbdev drivers, including vesafb, efifb, uvesafb, will
> not be able to pick up generic system framebuffers if this option
> is selected. You are highly encouraged to enable simplefb as
> replacement if you select this option. simplefb can correctly deal
> with generic system framebuffers. But you should still keep vesafb
> and others enabled as fallback if a system framebuffer is
> incompatible with simplefb.
> 
> If unsure, say Y.

If you read  there was a
patch proposed to change it to 'N' and/or to make it depend on FB_SIMPLE,
but that patch never made it into Linux.

Nor did  or
.

I tried to contact David Herrmann himself, but never got a reply.

So I think Debian should disable X86_SYSFB for now, too.

Thank you for your work and help
Philipp "also a Debian maintainer" Hahn
-- System Information:
Debian Release: 8.6
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable'), (90, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 3.16.0-4-amd64 (SMP w/4 CPU cores)
Locale: LANG=de_DE.UTF-8, LC_CTYPE=de_DE.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)



Bug#669335: [2.6.32.y][PATCH] fix pgd_lock deadlock

2012-04-23 Thread Philipp Hahn
Hello,

On Wednesday 16 February 2011 15:49:47 Andrea Arcangeli wrote:
> Subject: fix pgd_lock deadlock
>
> From: Andrea Arcangeli 
>
> It's forbidden to take the page_table_lock with the irq disabled or if
> there's contention the IPIs (for tlb flushes) sent with the page_table_lock
> held will never run leading to a deadlock.
>
> Apparently nobody takes the pgd_lock from irq so the _irqsave can be
> removed.
>
> Signed-off-by: Andrea Arcangeli 

This patch (original commit ID for 2.6.38:
a79e53d85683c6dd9f99c90511028adc2043031f) needs to be back-ported to 2.6.32.x
as well.
I observed a deadlock when running a PAE-enabled Debian 2.6.32.46+
kernel with 6 VCPUs as a KVM guest on a (2.6.32, 3.2, 3.3) host kernel, which
showed the following behaviour:

1 VCPU is stuck in
  pgd_alloc() → pgd_prepopulate_pmd() → ... → flush_tlb_others_ipi()
    while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
        cpu_relax();
  (gdb) print f->flush_cpumask
  $5 = {1}

while all other VCPUs are stuck in
  pgd_alloc() → spin_lock_irqsave(pgd_lock)
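
Why this never resolves can be shown with a deliberately tiny model. The
Python sketch below is only an illustration of the reasoning, not kernel
code: it assumes CPU 0 holds pgd_lock while waiting for TLB-flush
acknowledgements, and that the other CPUs can only take the IPI if they spin
for the lock with interrupts enabled. The function name and parameters are
invented for this example.

```python
def flush_completes(n_cpus, waiters_mask_irqs, max_rounds=100):
    """Toy model of the hang described above.

    CPU 0 holds pgd_lock and waits until every other CPU has acknowledged
    a TLB-flush IPI. The other CPUs spin on pgd_lock and can only service
    the IPI if they spin with interrupts enabled.
    """
    pending = set(range(1, n_cpus))  # stands in for f->flush_cpumask
    for _ in range(max_rounds):
        if not pending:
            return True  # CPU 0 can continue and release pgd_lock
        if not waiters_mask_irqs:
            pending.clear()  # spinning CPUs take the IPI and acknowledge
    return not pending
```

With waiters_mask_irqs=True (the spin_lock_irqsave() case) the pending set
never empties, which matches the observed hang; with plain spin_lock() the
flush completes and the lock is eventually released.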

I tracked it down to the commit
 2.6.39-rc1: 4981d01eada5354d81c8929d5b2836829ba3df7b
 2.6.32.34: ba456fd7ec1bdc31a4ad4a6bd02802dcaa730a33
 x86: Flush TLB if PGD entry is changed in i386 PAE mode
which when reverted made the bug disappear.

Comparing 3.2 to 2.6.32.34 showed that the 'pgd-deadlock' patch went into
2.6.38, that is before the 'PAE correctness' patch, so the problem was
probably never observed in the main development branch.
But for 2.6.32 the 'pgd-deadlock' patch is still missing, so the 'PAE
correctness' patch made the problem worse with 2.6.32.

The patch was also back-ported to the OpenSUSE kernel
<http://kernel.opensuse.org/cgit/kernel-source/commit/?id=ac27c01aa880c65d17043ab87249c613ac4c3635>.
Since the patch didn't apply cleanly to the current Debian kernel, I had to
backport it for us and Debian. The patch is also available from our (German)
Bugzilla <https://forge.univention.org/bugzilla/show_bug.cgi?id=26661> or
from the Debian BTS at
<http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=669335>.

I have no easy test case, but running multiple parallel builds inside the VM 
normally triggers the bug within seconds to minutes. With the patch applied 
the VM survived a night building packages without any problem.

Signed-off-by: Philipp Hahn 

Sincerely
Philipp
It's forbidden to take the page_table_lock with the irq disabled
or if there's contention the IPIs (for tlb flushes) sent with
the page_table_lock held will never run leading to a deadlock.

Nobody takes the pgd_lock from irq context so the _irqsave can be
removed.

Signed-off-by: Andrea Arcangeli 
Acked-by: Rik van Riel 
Tested-by: Konrad Rzeszutek Wilk 
Signed-off-by: Andrew Morton 
Cc: Peter Zijlstra 
Cc: Linus Torvalds 
Cc: 
LKML-Reference: <201102162345.p1gnjmjm021...@imap1.linux-foundation.org>
Signed-off-by: Ingo Molnar 
Git-commit: a79e53d85683c6dd9f99c90511028adc2043031f
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -223,15 +223,14 @@ void vmalloc_sync_all(void)
 	 address >= TASK_SIZE && address < FIXADDR_TOP;
 	 address += PMD_SIZE) {
 
-		unsigned long flags;
 		struct page *page;
 
-		spin_lock_irqsave(&pgd_lock, flags);
+		spin_lock(&pgd_lock);
 		list_for_each_entry(page, &pgd_list, lru) {
 			if (!vmalloc_sync_one(page_address(page), address))
 break;
 		}
-		spin_unlock_irqrestore(&pgd_lock, flags);
+		spin_unlock(&pgd_lock);
 	}
 }
 
@@ -331,13 +330,12 @@ void vmalloc_sync_all(void)
 	 address += PGDIR_SIZE) {
 
 		const pgd_t *pgd_ref = pgd_offset_k(address);
-		unsigned long flags;
 		struct page *page;
 
 		if (pgd_none(*pgd_ref))
 			continue;
 
-		spin_lock_irqsave(&pgd_lock, flags);
+		spin_lock(&pgd_lock);
 		list_for_each_entry(page, &pgd_list, lru) {
 			pgd_t *pgd;
 			pgd = (pgd_t *)page_address(page) + pgd_index(address);
@@ -346,7 +344,7 @@ void vmalloc_sync_all(void)
 			else
 BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
 		}
-		spin_unlock_irqrestore(&pgd_lock, flags);
+		spin_unlock(&pgd_lock);
 	}
 }
 
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -56,12 +56,10 @@ static unsigned long direct_pages_count[
 
 void update_page_count(int level, unsigned long pages)
 {
-	unsigned long flags;
-
 	/* Protect against CPA */
-	spin_lock_irqsave(&pgd_lock, flags);
+	spin_lock(&pgd_lock);
 	direct_pages_count[level] += pages;
-	spin_unlock_irqrestore(&pgd_lock, flags);
+	spin_unlock(&pgd_lock);
 }
 
 static void split_page_cou

Bug#669335: linux-image-2.6.32-5-686-bigmem: 86-mm-Fix-pgd_lock-deadlock.patch

2012-04-18 Thread Philipp Hahn
Package: linux-image-2.6.32-5-686-bigmem
Version: 2.6.32-41squeeze2
Severity: important

We received several problem reports from our customers and also
experienced the following bug when running said kernel with >=2 CPUs in
a virtualization environment (KVM and VMware ESX). After some time the
VM just stops (no ping, no console, no activity).

Using gdb and KVM's gdbserver capability I was able to track it down to
the following symptom: One thread would be stuck in
flush_tlb_others_ipi() waiting for all other CPUs using a specific mm to
signal they have flushed their TLB, while all other CPU threads would be
waiting for the pgd_lock to be freed.

I have no easy test to reproduce the bug, but it usually happens to me
when I run several pbuilder-builds in parallel. The more virtual CPUs
the VM has, the easier to trigger: With 2 VCPUs it's hours, with 6 VCPUs
it's usually less than 5 minutes.

This got more prominent with git-commit
831d52bc153971b70e64eccfbed2b232394f22f8: x86, mm: avoid possible bogus tlb entries by clearing prev mm_cpumask after switching mm
(Linus tree), and even worse with
4981d01eada5354d81c8929d5b2836829ba3df7b: x86: Flush TLB if PGD entry is changed in i386 PAE mode

I found the issue to be fixed by
a79e53d85683c6dd9f99c90511028adc2043031f: x86/mm: Fix pgd_lock deadlock

That patch is already in the Debian patch set, but only applied for the
xen flavour: features/all/xen/x86-mm-Fix-pgd_lock-deadlock.patch

I think this patch should be applied to all flavours. It doesn't apply
to the non-xen flavours as is, because it depends on some other
xen-related patch.
The patch was also back-ported to the OpenSUSE kernel
<http://kernel.opensuse.org/cgit/kernel-source/commit/?id=ac27c01aa880c65d17043ab87249c613ac4c3635>,
but since the patch is trivial to backport, I'll attach my version as
well.

The patch should be forwarded upstream to be included in the
upstream 2.6.32 longterm stable kernel as well.

The full issue is tracked in our (German) Bugzilla:
<https://forge.univention.org/bugzilla/show_bug.cgi?id=26661>

Sincerely
Philipp
-- System Information:
Debian Release: 5.0.1
Architecture: i386 (i686)
Shell:  /bin/sh linked to /bin/bash
Kernel: Linux 2.6.32-ucs57-686-bigmem
Locale: LANG=de_DE.UTF-8, LC_CTYPE=de_DE.UTF-8 (charmap=UTF-8)
Bug #26661: 686-bigmem VM deadlock
--- /dev/null
+++ linux-2.6.32-2.6.32/debian/patches/bugfix/x86/x86-mm-Fix-pgd_lock-deadlock.patch
@@ -0,0 +1,217 @@
+It's forbidden to take the page_table_lock with the irq disabled
+or if there's contention the IPIs (for tlb flushes) sent with
+the page_table_lock held will never run leading to a deadlock.
+
+Nobody takes the pgd_lock from irq context so the _irqsave can be
+removed.
+
+Signed-off-by: Andrea Arcangeli 
+Acked-by: Rik van Riel 
+Tested-by: Konrad Rzeszutek Wilk 
+Signed-off-by: Andrew Morton 
+Cc: Peter Zijlstra 
+Cc: Linus Torvalds 
+Cc: 
+LKML-Reference: <201102162345.p1gnjmjm021...@imap1.linux-foundation.org>
+Signed-off-by: Ingo Molnar 
+Git-commit: a79e53d85683c6dd9f99c90511028adc2043031f
+--- a/arch/x86/mm/fault.c
++++ b/arch/x86/mm/fault.c
+@@ -223,15 +223,14 @@ void vmalloc_sync_all(void)
+address >= TASK_SIZE && address < FIXADDR_TOP;
+address += PMD_SIZE) {
+ 
+-  unsigned long flags;
+   struct page *page;
+ 
+-  spin_lock_irqsave(&pgd_lock, flags);
++  spin_lock(&pgd_lock);
+   list_for_each_entry(page, &pgd_list, lru) {
+   if (!vmalloc_sync_one(page_address(page), address))
+   break;
+   }
+-  spin_unlock_irqrestore(&pgd_lock, flags);
++  spin_unlock(&pgd_lock);
+   }
+ }
+ 
+@@ -331,13 +330,12 @@ void vmalloc_sync_all(void)
+address += PGDIR_SIZE) {
+ 
+   const pgd_t *pgd_ref = pgd_offset_k(address);
+-  unsigned long flags;
+   struct page *page;
+ 
+   if (pgd_none(*pgd_ref))
+   continue;
+ 
+-  spin_lock_irqsave(&pgd_lock, flags);
++  spin_lock(&pgd_lock);
+   list_for_each_entry(page, &pgd_list, lru) {
+   pgd_t *pgd;
+   pgd = (pgd_t *)page_address(page) + pgd_index(address);
+@@ -346,7 +344,7 @@ void vmalloc_sync_all(void)
+   else
+   BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
+   }
+-  spin_unlock_irqrestore(&pgd_lock, flags);
++  spin_unlock(&pgd_lock);
+   }
+ }
+ 
+--- a/arch/x86/mm/pageattr.c
++++ b/arch/x86/mm/pageattr.c
+@@ -56,12 +56,10 @@ static unsigned long direct_pages_count[
+ 
+ void update_page_count(int level, unsigned long pages)
+ {
+-  unsigned long flags;
+-
+   /* Protect against CPA */
+-  spin_lock_irqsave(&pgd_lock, flags);
++  spin_lock(&pgd_lock);
+   direct_pages_count[level] += pages;
+-  spin_unlock_i

Bug#599507: KVM: SVM: Fix wrong intercept masks on 32 bit

2010-10-08 Thread Philipp Hahn
Package: linux-2.6.32
Severity: normal

When trying to reboot an ia32 guest, an ia32 KVM running on an AMD64
CPU reports the following error:
 kvm: unhandled exit 
 kvm_run returned -22

This bug was fixed for linux-2.6.34 but is still present in 2.6.32.
<http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=061e2fd16863009c8005b4b5fdfb75c7215c0b99>
> KVM: SVM: Fix wrong intercept masks on 32 bit
> 
> This patch makes KVM on 32 bit SVM working again by
> correcting the masks used for iret interception. With the
> wrong masks the upper 32 bits of the intercepts are masked
> out which leaves vmrun unintercepted. This is not legal on
> svm and the vmrun fails.
> Bug was introduced by commits 95ba827313 and 3cfc3092.

It only happens on AMD CPUs; Intel CPUs are unaffected.
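
The masking problem described in the commit message can be reproduced with
plain integer arithmetic. The Python sketch below only emulates the C
behaviour and is not taken from the kernel: the bit positions used for
INTERCEPT_IRET and INTERCEPT_VMRUN are assumptions for illustration, and the
helper name is invented.

```python
U32 = 0xFFFFFFFF
U64 = 0xFFFFFFFFFFFFFFFF

# Assumed bit positions, chosen for illustration only.
INTERCEPT_IRET = 20
INTERCEPT_VMRUN = 32


def clear_iret(intercept, long_bits):
    """Model `intercept &= ~(1UL << INTERCEPT_IRET)` for a given width of long."""
    width_mask = U32 if long_bits == 32 else U64
    # The complement is computed in the narrow type and then zero-extended
    # when it is combined with the 64-bit intercept field.
    mask = ~(1 << INTERCEPT_IRET) & width_mask
    return intercept & mask & U64


intercept = (1 << INTERCEPT_VMRUN) | (1 << INTERCEPT_IRET)
buggy = clear_iret(intercept, 32)  # 1UL on a 32-bit kernel
fixed = clear_iret(intercept, 64)  # 1ULL, or 1UL on a 64-bit kernel
```

In the 32-bit case the zero-extended mask also wipes every bit from 32
upwards, so the vmrun intercept is lost together with the iret one; with a
64-bit constant only the intended bit is cleared.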

Please consider applying this patch to the 2.6.32 stable branch as well.

Sincerely
Philipp Hahn
-- System Information:
Debian Release: 5.0.1
Architecture: amd64 (x86_64)
Shell:  /bin/sh linked to /bin/bash
Kernel: Linux 2.6.32-ucs11-amd64
Locale: LANG=de_DE.UTF-8, LC_CTYPE=de_DE.UTF-8 (charmap=UTF-8)
commit 061e2fd16863009c8005b4b5fdfb75c7215c0b99
Author: Joerg Roedel 
Date:   Wed May 5 16:04:43 2010 +0200

KVM: SVM: Fix wrong intercept masks on 32 bit

This patch makes KVM on 32 bit SVM working again by
correcting the masks used for iret interception. With the
wrong masks the upper 32 bits of the intercepts are masked
out which leaves vmrun unintercepted. This is not legal on
svm and the vmrun fails.
Bug was introduced by commits 95ba827313 and 3cfc3092.

Cc: Jan Kiszka 
Cc: Gleb Natapov 
Cc: sta...@kernel.org
Signed-off-by: Joerg Roedel 
Signed-off-by: Avi Kivity 

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 2ba5820..737361f 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2067,7 +2067,7 @@ static int cpuid_interception(struct vcpu_svm *svm)
 static int iret_interception(struct vcpu_svm *svm)
 {
 	++svm->vcpu.stat.nmi_window_exits;
-	svm->vmcb->control.intercept &= ~(1UL << INTERCEPT_IRET);
+	svm->vmcb->control.intercept &= ~(1ULL << INTERCEPT_IRET);
 	svm->vcpu.arch.hflags |= HF_IRET_MASK;
 	return 1;
 }
@@ -2479,7 +2479,7 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu)
 
 	svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;
 	vcpu->arch.hflags |= HF_NMI_MASK;
-	svm->vmcb->control.intercept |= (1UL << INTERCEPT_IRET);
+	svm->vmcb->control.intercept |= (1ULL << INTERCEPT_IRET);
 	++vcpu->stat.nmi_injections;
 }
 
@@ -2539,10 +2539,10 @@ static void svm_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked)
 
 	if (masked) {
 		svm->vcpu.arch.hflags |= HF_NMI_MASK;
-		svm->vmcb->control.intercept |= (1UL << INTERCEPT_IRET);
+		svm->vmcb->control.intercept |= (1ULL << INTERCEPT_IRET);
 	} else {
 		svm->vcpu.arch.hflags &= ~HF_NMI_MASK;
-		svm->vmcb->control.intercept &= ~(1UL << INTERCEPT_IRET);
+		svm->vmcb->control.intercept &= ~(1ULL << INTERCEPT_IRET);
 	}
 }
 

