RAID5->RAID6 reshape remains stuck at 0% (does nothing, not even start)
Dear list,

I'm trying to reshape a 3-disk RAID5 array to a 4-disk RAID6 array (of the same total size and per-device size) using Linux kernel 4.9.237 on x86_64. I understand that this reshaping operation is supposed to be supported. But it appears perpetually stuck at 0% with no operation taking place whatsoever (the slices are unchanged apart from their metadata, the backup file contains only zeroes, and nothing happens). I wonder if this is a known kernel bug, or what else could explain it, and I have no idea how to debug this sort of thing.

Here are some details on exactly what I've been doing. I'll be using loopbacks to illustrate, but I've done this on real partitions and there was no difference.

## Create some empty loop devices:
for i in 0 1 2 3 ; do dd if=/dev/zero of=test-${i} bs=1024k count=16 ; done
for i in 0 1 2 3 ; do losetup /dev/loop${i} test-${i} ; done

## Make a RAID array out of the first three:
mdadm --create /dev/md/test --level=raid5 --chunk=256 --name=test \
  --metadata=1.0 --raid-devices=3 /dev/loop{0,1,2}

## Populate it with some content, just to see what's going on:
for i in $(seq 0 63) ; do printf "This is chunk %d (0x%x).\n" $i $i \
  | dd of=/dev/md/test bs=256k seek=$i ; done

## Now try to reshape the array from 3-way RAID5 to 4-way RAID6:
mdadm --manage /dev/md/test --add-spare /dev/loop3
mdadm --grow /dev/md/test --level=6 --raid-devices=4 \
  --backup-file=test-reshape.backup

...and then nothing happens. /proc/mdstat reports no progress whatsoever:

md112 : active raid6 loop3[4] loop2[3] loop1[1] loop0[0]
      32256 blocks super 1.0 level 6, 256k chunk, algorithm 18 [4/3] [UUU_]
      [>....................]  reshape =  0.0% (1/16128) finish=1.0min speed=244K/sec

The loop file contents are unchanged except for the metadata superblock, the backup file is entirely empty, and no activity whatsoever is happening.
Actually, further investigation shows that the array is in fact operational as a RAID6 array, but one where the Q-syndrome is stuck in the last device: writing data to the md device (e.g., by repopulating it with the same command as above) does cause loop3 to be updated as expected for such a layout. It's just the reshaping which doesn't take place (or indeed begin).

For completeness, here's what mdadm --detail /dev/md/test looks like before the reshape, in my example:

/dev/md/test:
        Version : 1.0
  Creation Time : Wed Sep 30 02:42:30 2020
     Raid Level : raid5
     Array Size : 32256 (31.50 MiB 33.03 MB)
  Used Dev Size : 16128 (15.75 MiB 16.52 MB)
   Raid Devices : 3
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Wed Sep 30 02:44:21 2020
          State : clean
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 256K

           Name : vega.stars:test  (local to host vega.stars)
           UUID : 30f40e34:b9a52ff0:75c8b063:77234832
         Events : 20

    Number   Major   Minor   RaidDevice State
       0       7        0        0      active sync   /dev/loop0
       1       7        1        1      active sync   /dev/loop1
       3       7        2        2      active sync   /dev/loop2

       4       7        3        -      spare   /dev/loop3

and here's what it looks like after the attempted reshape has started (or rather, refused to start):

/dev/md/test:
        Version : 1.0
  Creation Time : Wed Sep 30 02:42:30 2020
     Raid Level : raid6
     Array Size : 32256 (31.50 MiB 33.03 MB)
  Used Dev Size : 16128 (15.75 MiB 16.52 MB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Wed Sep 30 02:44:54 2020
          State : clean, degraded, reshaping
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric-6
     Chunk Size : 256K

 Reshape Status : 0% complete
     New Layout : left-symmetric

           Name : vega.stars:test  (local to host vega.stars)
           UUID : 30f40e34:b9a52ff0:75c8b063:77234832
         Events : 22

    Number   Major   Minor   RaidDevice State
       0       7        0        0      active sync   /dev/loop0
       1       7        1        1      active sync   /dev/loop1
       3       7        2        2      active sync   /dev/loop2
       4       7        3        3      spare rebuilding   /dev/loop3

I also tried writing "frozen" and then "resync" to the /sys/block/md112/md/sync_action file with no further results.

I welcome any suggestions on how to investigate, work around, or fix this problem.

Happy hacking,

-- 
David A. Madore
( http://www.madore.org/~david/ )
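
One way to gather more data on a stall like this is to dump the sync-related attributes that md exposes next to sync_action. What follows is a minimal sketch, not a diagnosis: the function name md_reshape_state is made up for illustration, md112 is simply the array name from the mdstat output above, and attributes that a given kernel does not expose are skipped.

```shell
#!/bin/bash
# Print the sync/reshape attributes of one md array.
# $1: the array's md sysfs directory, e.g. /sys/block/md112/md
md_reshape_state() {
    local dir=$1 attr
    for attr in array_state level raid_disks degraded sync_action \
                sync_completed sync_min sync_max reshape_position; do
        # Skip attributes this kernel does not expose.
        [ -r "$dir/$attr" ] || continue
        printf '%-18s %s\n' "$attr" "$(cat "$dir/$attr")"
    done
}

# Typical use (as root, on the machine with the stuck array):
#   md_reshape_state /sys/block/md112/md
```

Comparing sync_completed with sync_max seems the most telling part: as far as I understand, when a backup file is in use the mdadm process itself is expected to raise sync_max as it saves each section, so a sync_max that never moves would suggest looking at whether that mdadm process is still running. This is an assumption about where to look, not something the output above confirms.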
Re: iTCO_wdt watchdog on Asus P10S-WS motherboard FREEZES MOTHERBOARD COMPLETELY
On Tue, Sep 20, 2016 at 03:50:09PM +0300, Mika Westerberg wrote:
> Does the machine have WDAT ACPI table (see /sys/firmware/acpi/tables/*)?
> If it does, you can try the new WDAT watchdog driver instead [1]. It
> still uses the same hardware, though but via set of instructions
> provided by the BIOS that should work (given the vendor has tested
> it on Windows).
>
> [1] http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1230607.html

Thanks for pointing this out. My motherboard's BIOS does not have this ACPI table, unfortunately, but it's at least good to know that some do, and take the hardware watchdog seriously.

-- 
David A. Madore
( http://www.madore.org/~david/ )
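
For anyone wanting to run the check suggested in the quoted message, here is a minimal sketch. It only looks for a file whose name starts with WDAT in the ACPI tables directory; the function name has_wdat is made up for this example, and the optional directory argument exists only so that the check can be tried on a copy of the tables.

```shell
#!/bin/bash
# Succeed if an ACPI table named WDAT* exists in the given directory
# (default: the directory mentioned in the quoted message).
has_wdat() {
    local dir=${1:-/sys/firmware/acpi/tables} f
    for f in "$dir"/WDAT*; do
        # If the glob matched nothing, $f is the literal pattern and -e fails.
        [ -e "$f" ] && return 0
    done
    return 1
}

if has_wdat; then echo "WDAT table present"; else echo "no WDAT table"; fi
```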
motherboards with recent Intel chipsets: please test this (iTCO-wdt)
TL;DR: On some motherboards with an Intel chipset, at least from Asus and Asrock, the hardware watchdog (Linux driver iTCO-wdt) fails to reboot the system correctly (POST fails and leaves the system unusable). Looking for people willing to test, in order to pinpoint the problem.

Background: I am looking for users of a desktop with a fairly recent Intel chipset, especially if one or several of the following conditions are satisfied: (1) the BIOS is written by AMI (American Megatrends), (2) the chipset is of the Intel 100 series or C230 series (a.k.a. "Sunrise Point", used for "Skylake" processors with an LGA1151 socket), and (3) the system is booting under UEFI (as opposed to legacy BIOS).

The point of this test is to check whether the hardware watchdog included in these chipsets (known in Intel parlance as the "TCO watchdog", where "TCO" stands for "Total Cost of Ownership") reboots the system properly or, as on my motherboard, places it in a broken state (POST fails, even when the reset button is later pressed, or even if the power button is pressed twice; the power supply needs to be disconnected for a few minutes to restore the system to a working state). This is a very serious bug, which could be due to the BIOS, the hardware, or Linux (I suspect the first, but it is conceivable that Linux could work around it).

Do not perform this test unless you can disconnect the power supply!

How to test: Boot a recent Linux kernel. Load the i2c-i801 and i2c-smbus modules. Then load the iTCO-wdt module. This should cause lines such as the following to appear in the kernel log (dmesg), indicating that Linux has detected the device:

iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
iTCO_wdt: Found a Intel PCH TCO device (Version=4, TCOBASE=0x0400)
iTCO_wdt: initialized. heartbeat=120 sec (nowayout=0)

Make sure all your filesystems are unmounted or mounted read-only (on systemd, e.g.: systemctl isolate emergency.target ; sync ; echo u >> /proc/sysrq-trigger ; sync (and make sure "Emergency Remount complete" appears at the end of dmesg)). A /dev/watchdog device should have appeared. Then run

cat >> /dev/watchdog

and press enter twice. Do not interrupt (do not press control-C or control-D), just wait for a few minutes. After a certain time (twice the "heartbeat" value indicated by the kernel), the system will try to reboot. What interests me is whether the reboot succeeds (POST proceeds as normal, and the OS restarts) or whether the system locks up (in which case you will need to power cycle it at the power supply unit level in order to restore it to normal).

Please report (to me, to avoid spamming this list - I will post a summary) results along with information as to the hardware used: motherboard brand and model, BIOS vendor and date (dmidecode should give this information), UEFI or legacy boot, and any extension cards that might be used on the system (in particular, whether the system uses an integrated GPU or a separate graphics card). I am interested in both positive and negative results.

Thanks in advance to all who are willing to test this!

Xref:
https://lkml.org/lkml/2016/9/8/641
https://www.reddit.com/r/linuxquestions/comments/51xad5/users_of_a_desktop_with_an_intel_chipset_could/

-- 
David A. Madore
( http://www.madore.org/~david/ )
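
To make reports easier to compare, the requested details could be gathered with something like the following. This is only a sketch under stated assumptions: the function names boot_mode, watchdog_reboot_delay and report are invented for this example; UEFI boot is inferred from the existence of /sys/firmware/efi; and the dmidecode string keywords used here need root to return anything.

```shell
#!/bin/bash
# Print "UEFI" if an efi directory exists under the firmware sysfs dir,
# "legacy" otherwise. $1 overrides the directory (for trying it out).
boot_mode() {
    local fw=${1:-/sys/firmware}
    if [ -d "$fw/efi" ]; then echo UEFI; else echo legacy; fi
}

# Expected wait before the reboot attempt: twice the heartbeat (in seconds),
# as described in the test procedure above.
watchdog_reboot_delay() {
    echo $(( 2 * $1 ))
}

# Collect the information asked for in the report request.
report() {
    local key
    for key in baseboard-manufacturer baseboard-product-name \
               bios-vendor bios-version bios-release-date; do
        printf '%s: %s\n' "$key" "$(dmidecode -s "$key" 2>/dev/null)"
    done
    echo "boot mode: $(boot_mode)"
    echo "expected delay with heartbeat=120: $(watchdog_reboot_delay 120) s"
}
```

Running report as root before starting the test gives a block of text that can be pasted into the reply; extension cards and the GPU still have to be described by hand.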
iTCO_wdt watchdog on Asus P10S-WS motherboard FREEZES MOTHERBOARD COMPLETELY
TL;DR: the iTCO_wdt watchdog on the Asus P10S-WS motherboard, instead of rebooting the machine, places the motherboard in a completely nonfunctional state, from which it can be revived only by a hard power cycle. I suspect this is a BIOS bug: seeking advice on how/where to report this, and what to do generally. Maybe Linux can work around it?

Dear list,

I have an Asus P10S-WS motherboard (Intel C236 chipset). I have been trying to get the iTCO_wdt hardware watchdog to work (I have been successfully using this driver with similar Intel chipset based Asus motherboards before, and I know it to work reliably). I am using Linux 4.7.3. I trigger a reboot by killing (with kill -9) the wd_keepalive daemon once it has opened the watchdog device.

Sadly, it appears that on this motherboard, the watchdog does not reboot the machine (or at least, does not successfully reboot it). Instead, the machine enters a "frozen" state (fans spinning, screen black, all peripherals unresponsive) from which it cannot be woken up by pressing the reset button, or even the power button twice (the first press does turn the machine off, but it returns to the same nonfunctional state after power on). Instead, power has to be cut completely, at the power supply level. In this nonfunctional state, the Asus POST status display shows the number "62", which according to the motherboard manual is the code for "installation of the PCH runtime services" (I have no idea what that means).

I suspect that this is a BIOS ^W UEFI bug and in no way Linux's fault. It could also be a hardware problem, a chipset bug, or something else. And even if it is a firmware bug, it is conceivable that there is a way to work around the problem from Linux. So I ask for guidance from the wisdom of this list:

* Is there something Linux can do about the problem?

* Is there a chance some kernel developer knows someone at Asus and can bring this problem to their attention?

* Can someone report success using the iTCO_wdt watchdog with other motherboards having the same Intel C236 chipset? (Note: for it to work, the i2c_smbus module needs to be loaded: it took me a long time to figure out.)

* Is all hope lost for my motherboard? (I badly need a hardware watchdog: if there is no way to get it to work on this motherboard, I will need to buy a new one.)

Any suggestions are welcome (or even words of comfort :-).

-- 
David A. Madore
( http://www.madore.org/~david/ )
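
Since a missing i2c_smbus module was what kept the watchdog from appearing here, a quick check that the relevant modules are loaded may save others some time. A minimal sketch: module_loaded is a made-up helper that looks a name up in the first field of /proc/modules; the second argument is only there so that it can be pointed at a saved copy. Drivers built into the kernel do not appear in /proc/modules, so they will be reported as not loaded.

```shell
#!/bin/bash
# Succeed if module $1 is listed in /proc/modules (or in the file given as $2).
# Hyphens are turned into underscores, since that is how names are listed.
module_loaded() {
    local name=${1//-/_} file=${2:-/proc/modules}
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$file"
}

for m in i2c_i801 i2c_smbus iTCO_wdt; do
    if module_loaded "$m"; then echo "$m: loaded"; else echo "$m: not loaded"; fi
done
```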
Many unexplainable OOMs after upgrading to 4.7.x kernel
TL;DR: Why is Firefox getting OOM-killed while I have 24GB free swap?

Dear list,

A few days ago I upgraded the kernel on my desktop PC from 4.5.5 to 4.7.2 and, since then, I've witnessed a huge number of cases where various processes (typically Firefox) got OOM-killed by the kernel. Before this kernel upgrade, I had never seen a single OOM event in normal use; now I've had dozens in a couple of days. Nothing has changed in my config apart from the kernel. Clearly, something has changed for the worse!

In fact, this morning, the problem was so bad that I was simply unable to start Firefox (any attempt to do so would result in it getting killed immediately), even though I had about 24GB of free swap available (and the system was idle). I had to kill a few unrelated processes to be able to launch Firefox; and even then, at some point, running "sync" in a different window caused Firefox to be OOM-killed.

(Note: the problem IS NOT the choice of the process being killed: Firefox is a reasonable target. The problem is that the OOM-killer is being invoked at all, whereas there is plenty of free swap, and, before this kernel upgrade, all seemed to work perfectly.)

Unfortunately, all of this is very unreproducible: now it seems to have disappeared completely, so I can't really test anything any more. All I can offer is some sample output (below) from the kernel's logs. Any suggestions as to what I should do if and when the problem returns are welcome, either to debug or to work around these OOMs.

The computer in question is an x86_64 (Intel Core 2 Quad Q6600) box with 8GB RAM and 24GB swap (4*6GB swap across four different disks): I can of course offer any additional details as to its hardware, kernel or userland config.
Sample log output from OOM-killer:

### cut after ###
Aug 25 11:46:12 vega kernel: [115461.357412] firefox invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
Aug 25 11:46:12 vega kernel: [115461.357421] CPU: 2 PID: 11342 Comm: firefox Not tainted 4.7.2-vega #1
Aug 25 11:46:12 vega kernel: [115461.357424] Hardware name: System manufacturer System Product Name/P5W64 WS Pro, BIOS 0903 07/31/2007
Aug 25 11:46:12 vega kernel: [115461.357427] 880068897bc8 812e4a7f
Aug 25 11:46:12 vega kernel: [115461.357433] 880068897c28 8117cd40 818a9840
Aug 25 11:46:12 vega kernel: [115461.357438] 0002810e98a5 88023fd196a8 0002 0206
Aug 25 11:46:12 vega kernel: [115461.357442] Call Trace:
Aug 25 11:46:12 vega kernel: [115461.357451] [] dump_stack+0x4d/0x6e
Aug 25 11:46:12 vega kernel: [115461.357456] [] dump_header.isra.16+0x51/0x191
Aug 25 11:46:12 vega kernel: [115461.357462] [] oom_kill_process+0x33b/0x430
Aug 25 11:46:12 vega kernel: [115461.357465] [] out_of_memory+0x258/0x2a0
Aug 25 11:46:12 vega kernel: [115461.357470] [] __alloc_pages_nodemask+0xaf9/0xc90
Aug 25 11:46:12 vega kernel: [115461.357474] [] alloc_kmem_pages_node+0x16/0x20
Aug 25 11:46:12 vega kernel: [115461.357479] [] copy_process.part.42+0xfe/0x17a0
Aug 25 11:46:12 vega kernel: [115461.357483] [] ? lru_cache_add_active_or_unevictable+0x30/0xa0
Aug 25 11:46:12 vega kernel: [115461.357487] [] ? handle_mm_fault+0x172a/0x1910
Aug 25 11:46:12 vega kernel: [115461.357491] [] _do_fork+0xdc/0x350
Aug 25 11:46:12 vega kernel: [115461.357496] [] ? __do_page_fault+0x19d/0x4a0
Aug 25 11:46:12 vega kernel: [115461.357499] [] SyS_clone+0x14/0x20
Aug 25 11:46:12 vega kernel: [115461.357503] [] do_syscall_64+0x4b/0xa0
Aug 25 11:46:12 vega kernel: [115461.357507] [] entry_SYSCALL64_slow_path+0x25/0x25
Aug 25 11:46:12 vega kernel: [115461.357510] Mem-Info:
Aug 25 11:46:12 vega kernel: [115461.357516] active_anon:44076 inactive_anon:99299 isolated_anon:0
Aug 25 11:46:12 vega kernel: [115461.357516] active_file:392644 inactive_file:215736 isolated_file:0
Aug 25 11:46:12 vega kernel: [115461.357516] unevictable:560 dirty:85 writeback:0 unstable:0
Aug 25 11:46:12 vega kernel: [115461.357516] slab_reclaimable:1033111 slab_unreclaimable:31255
Aug 25 11:46:12 vega kernel: [115461.357516] mapped:34613 shmem:10947 pagetables:4477 bounce:0
Aug 25 11:46:12 vega kernel: [115461.357516] free:77874 free_pcp:0 free_cma:0
Aug 25 11:46:12 vega kernel: [115461.357529] DMA free:15028kB min:260kB low:324kB high:388kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Aug 25 11:46:12 vega kernel: [115461.357531] lowmem_reserve[]: 0 2960 7954 7954
Aug 25 11:46:12 vega kernel: [115461.357544] DMA32 free:187232kB
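
The failing allocation in the log above is order=2, i.e. a request for 2^2 = 4 physically contiguous pages (16 KiB with 4 KiB pages), so free swap by itself may not help if free memory is fragmented. If the problem returns, one thing that could be captured is /proc/buddyinfo, which lists the number of free blocks per zone and per order. A minimal sketch, with an invented function name (free_blocks_at_least), that sums for each zone the free blocks of at least a given order:

```shell
#!/bin/bash
# For each zone in /proc/buddyinfo (or the file given as $2), print the zone
# name and the number of free blocks of order >= $1.
# Line format: "Node 0, zone DMA n0 n1 n2 ...", where field 5 is order 0.
free_blocks_at_least() {
    local order=$1 file=${2:-/proc/buddyinfo}
    awk -v o="$order" '$1 == "Node" {
        n = 0
        for (i = 5 + o; i <= NF; i++) n += $i
        print $4, n
    }' "$file"
}

# Typical use when an OOM has just happened:
#   free_blocks_at_least 2
```

A zone that shows plenty of free memory in the Mem-Info dump but a count near zero here would point towards fragmentation rather than a genuine shortage; that is a hypothesis to test, not a conclusion from the log above.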
"hw csum failure" error on skge driver with 4.3 kernel upon receiving ICMPv6 multicast listener discovery packets
The skge driver in the 4.3 kernel reports hardware checksum errors upon receiving (certain?) IPv6 multicast packets containing ICMPv6 multicast listener discovery messages. This is a regression since 4.1 (I believe between 4.1 and 4.2).

The e1000e driver on a different Ethernet port of the same machine is not affected. Disabling offload rx checksumming suppresses the errors. Nor are all IPv6 multicast packets affected: for some reason, it seems only those containing ICMPv6 multicast listener discovery messages trigger the problem.

In case it also matters, the skge interface in question (eth1 in what follows) is part of a bridge that contains another Ethernet interface and a Wifi card.

Here is a frame, with its link-level headers, that caused an error when received by skge:

0000  33 33 ff 62 30 d8 60 fb 42 f1 b1 36 86 dd 60 00   33.b0.`.B..6..`.
0010  00 00 00 20 00 01 fe 80 00 00 00 00 00 00 62 fb   ... ..........b.
0020  42 ff fe f1 b1 36 ff 02 00 00 00 00 00 00 00 00   B....6..........
0030  00 01 ff 62 30 d8 3a 00 01 00 05 02 00 00 83 00   ...b0.:.........
0040  c9 8a 00 00 00 00 ff 02 00 00 00 00 00 00 00 00   ................
0050  00 01 ff 62 30 d8                                 ...b0.

(Network dumps performed on another network device suggest that the checksum is, indeed, correct.)
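
The claim that the checksum is correct can also be checked directly from the dump. Here is a minimal sketch in bash, with the offsets hard-coded for this particular frame (14-byte Ethernet header, 40-byte IPv6 header, one 8-byte hop-by-hop extension header, then 24 bytes of ICMPv6); the function name icmp6_folded_sum is made up for this example. It adds up the IPv6 pseudo-header and the ICMPv6 message as 16-bit words with end-around carry; for a valid checksum the result is 0xffff.

```shell
#!/bin/bash
# The frame from the dump above, as space-separated hex bytes.
frame="33 33 ff 62 30 d8 60 fb 42 f1 b1 36 86 dd 60 00
00 00 00 20 00 01 fe 80 00 00 00 00 00 00 62 fb
42 ff fe f1 b1 36 ff 02 00 00 00 00 00 00 00 00
00 01 ff 62 30 d8 3a 00 01 00 05 02 00 00 83 00
c9 8a 00 00 00 00 ff 02 00 00 00 00 00 00 00 00
00 01 ff 62 30 d8"

icmp6_folded_sum() {
    local -a b=($1)     # word-split the hex bytes into an array
    local sum=0 i
    # Pseudo-header: source and destination addresses (frame bytes 22..53).
    for (( i = 22; i < 54; i += 2 )); do
        sum=$(( sum + 16#${b[i]}${b[i+1]} ))
    done
    # Pseudo-header: upper-layer length (24) and next header (58 = ICMPv6).
    sum=$(( sum + 24 + 58 ))
    # The ICMPv6 message itself, checksum field included (frame bytes 62..85).
    for (( i = 62; i < 86; i += 2 )); do
        sum=$(( sum + 16#${b[i]}${b[i+1]} ))
    done
    # Fold the carries back in (one's-complement addition).
    while (( sum >> 16 )); do
        sum=$(( (sum & 0xffff) + (sum >> 16) ))
    done
    printf '0x%04x\n' "$sum"
}

icmp6_folded_sum "$frame"   # prints 0xffff
```

The result 0xffff is consistent with the packet being fine on the wire, which leaves the value the hardware hands to the stack, or how it is adjusted before validation, as the places to look.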
And here is the syslog produced upon receiving the above packet:

Nov 15 17:52:13 pleiades kernel: [ 661.393163] eth1: hw csum failure
Nov 15 17:52:13 pleiades kernel: [ 661.394203] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W 4.3.0-pleiades #1
Nov 15 17:52:13 pleiades kernel: [ 661.395192] Hardware name: System manufacturer System Product Name/P5WD2-Premium, BIOS 0709 03/31/2006
Nov 15 17:52:13 pleiades kernel: [ 661.395192] 88013a9d5d00 88013fc03aa8 8129a186 88013afe
Nov 15 17:52:13 pleiades kernel: [ 661.395192] 88013fc03ac0 81436425 88013fc03af0
Nov 15 17:52:13 pleiades kernel: [ 661.395192] 8142b87a 1027316b3fc03b30 88013a9d5d00 0030
Nov 15 17:52:13 pleiades kernel: [ 661.395192] Call Trace:
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] dump_stack+0x44/0x5e
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] netdev_rx_csum_fault+0x35/0x40
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] __skb_checksum_complete+0xca/0xd0
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] ipv6_mc_validate_checksum+0xab/0x140
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] skb_checksum_trimmed+0x8f/0x180
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] ipv6_mc_check_mld+0x105/0x330
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] br_multicast_rcv+0x8c/0xce0 [bridge]
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] ? __netif_receive_skb+0x13/0x60
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] ? netif_receive_skb_internal+0x2e/0x90
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] br_handle_frame_finish+0x28c/0x5b0 [bridge]
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] ? usb_hcd_submit_urb+0xa4/0x960 [usbcore]
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] br_handle_frame+0x151/0x270 [bridge]
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] ? usb_submit_urb+0x2d2/0x510 [usbcore]
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] __netif_receive_skb_core+0x1c2/0x990
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] ? __usb_hcd_giveback_urb+0x82/0xe0 [usbcore]
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] __netif_receive_skb+0x13/0x60
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] netif_receive_skb_internal+0x2e/0x90
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] napi_gro_receive+0xa0/0xd0
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] skge_poll+0x380/0x7a0 [skge]
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] ? lapic_next_event+0x18/0x20
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] net_rx_action+0x13c/0x300
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] __do_softirq+0xc7/0x240
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] irq_exit+0x70/0x90
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] do_IRQ+0x51/0xd0
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] common_interrupt+0x7c/0x7c
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] ? mwait_idle+0x87/0x140
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] arch_cpu_idle+0xa/0x10
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] default_idle_call+0x25/0x30
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] cpu_startup_entry+0x29c/0x310
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] rest_init+0x72/0x80
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] start_kernel+0x471/0x47e
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] ? set_init_arg+0x55/0x55
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] x86_64_start_reservations+0x2a/0x2c
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] x86_64_start_kernel+0xe5/0xe8

I can
"hw csum failure" error on skge driver with 4.3 kernel upon receiving ICMPv6 multicast listener discovery packets
The skge driver in the 4.3 kernel reports hardware checksum errors upon receiving (certain?) IPv6 multicast packets containing ICMPv6 multicast listener discovery messages. This is a regression since 4.1 (I believe between 4.1 and 4.2). The e1000e driver on a different Ethernet port of the same machine is not affected. Disabling offload rx checksumming suppresses the errors. Nor are all IPv6 multicast packets affected: for some reason, it seems only those containing ICMPv6 multicast listener discovery messages trigger the problem.

In case it also matters, the skge interface in question (eth1 in what follows) is part of a bridge that contains another Ethernet interface and a Wifi card.

Here is a frame, with its link-level headers, that caused an error when received by skge:

0000  33 33 ff 62 30 d8 60 fb 42 f1 b1 36 86 dd 60 00
0010  00 00 00 20 00 01 fe 80 00 00 00 00 00 00 62 fb
0020  42 ff fe f1 b1 36 ff 02 00 00 00 00 00 00 00 00
0030  00 01 ff 62 30 d8 3a 00 01 00 05 02 00 00 83 00
0040  c9 8a 00 00 00 00 ff 02 00 00 00 00 00 00 00 00
0050  00 01 ff 62 30 d8

(Network dumps performed on another network device suggest that the checksum is, indeed, correct.)
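The remark that the checksum is correct can also be checked offline. What follows is a minimal sketch in Python, assuming the frame bytes above are transcribed correctly and that the only extension header present is the hop-by-hop one; the function names are illustrative and do not come from the kernel. It recomputes the ICMPv6 checksum over the IPv6 pseudo-header and compares it with the value stored in the packet:

```python
import struct

# Frame bytes copied from the hex dump above: Ethernet + IPv6 + hop-by-hop + ICMPv6.
FRAME = bytes.fromhex(
    "33 33 ff 62 30 d8 60 fb 42 f1 b1 36 86 dd 60 00 "
    "00 00 00 20 00 01 fe 80 00 00 00 00 00 00 62 fb "
    "42 ff fe f1 b1 36 ff 02 00 00 00 00 00 00 00 00 "
    "00 01 ff 62 30 d8 3a 00 01 00 05 02 00 00 83 00 "
    "c9 8a 00 00 00 00 ff 02 00 00 00 00 00 00 00 00 "
    "00 01 ff 62 30 d8"
)

def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum, as used by the Internet checksum."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

def icmpv6_checksums(frame: bytes):
    """Return (computed, stored) ICMPv6 checksums for an Ethernet frame."""
    ipv6 = frame[14:]                        # skip the 14-byte Ethernet header
    payload_len = struct.unpack("!H", ipv6[4:6])[0]
    next_header = ipv6[6]
    src, dst = ipv6[8:24], ipv6[24:40]
    payload = ipv6[40:40 + payload_len]
    while next_header == 0:                  # hop-by-hop options header
        ext_len = (payload[1] + 1) * 8       # length field counts 8-byte units minus one
        next_header = payload[0]
        payload = payload[ext_len:]
    if next_header != 58:
        raise ValueError("not an ICMPv6 packet (or unsupported extension header)")
    # Pseudo-header: source, destination, upper-layer length, 3 zero bytes, next header.
    pseudo = src + dst + struct.pack("!I3xB", len(payload), next_header)
    zeroed = payload[:2] + b"\x00\x00" + payload[4:]
    computed = ~ones_complement_sum(pseudo + zeroed) & 0xFFFF
    stored = struct.unpack("!H", payload[2:4])[0]
    return computed, stored

if __name__ == "__main__":
    computed, stored = icmpv6_checksums(FRAME)
    print(f"computed=0x{computed:04x} stored=0x{stored:04x}")
```

For the frame above, both values come out as 0xc98a, which agrees with the other network dumps and suggests the error comes from the hardware-supplied checksum rather than from the packet itself.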
And here is the syslog produced upon receiving the above packet:

Nov 15 17:52:13 pleiades kernel: [ 661.393163] eth1: hw csum failure
Nov 15 17:52:13 pleiades kernel: [ 661.394203] CPU: 0 PID: 0 Comm: swapper/0 Tainted: GW 4.3.0-pleiades #1
Nov 15 17:52:13 pleiades kernel: [ 661.395192] Hardware name: System manufacturer System Product Name/P5WD2-Premium, BIOS 0709 03/31/2006
Nov 15 17:52:13 pleiades kernel: [ 661.395192] 88013a9d5d00 88013fc03aa8 8129a186 88013afe
Nov 15 17:52:13 pleiades kernel: [ 661.395192] 88013fc03ac0 81436425 88013fc03af0
Nov 15 17:52:13 pleiades kernel: [ 661.395192] 8142b87a 1027316b3fc03b30 88013a9d5d00 0030
Nov 15 17:52:13 pleiades kernel: [ 661.395192] Call Trace:
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] dump_stack+0x44/0x5e
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] netdev_rx_csum_fault+0x35/0x40
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] __skb_checksum_complete+0xca/0xd0
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] ipv6_mc_validate_checksum+0xab/0x140
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] skb_checksum_trimmed+0x8f/0x180
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] ipv6_mc_check_mld+0x105/0x330
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] br_multicast_rcv+0x8c/0xce0 [bridge]
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] ? __netif_receive_skb+0x13/0x60
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] ? netif_receive_skb_internal+0x2e/0x90
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] br_handle_frame_finish+0x28c/0x5b0 [bridge]
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] ? usb_hcd_submit_urb+0xa4/0x960 [usbcore]
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] br_handle_frame+0x151/0x270 [bridge]
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] ? usb_submit_urb+0x2d2/0x510 [usbcore]
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] __netif_receive_skb_core+0x1c2/0x990
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] ? __usb_hcd_giveback_urb+0x82/0xe0 [usbcore]
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] __netif_receive_skb+0x13/0x60
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] netif_receive_skb_internal+0x2e/0x90
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] napi_gro_receive+0xa0/0xd0
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] skge_poll+0x380/0x7a0 [skge]
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] ? lapic_next_event+0x18/0x20
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] net_rx_action+0x13c/0x300
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] __do_softirq+0xc7/0x240
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] irq_exit+0x70/0x90
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] do_IRQ+0x51/0xd0
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] common_interrupt+0x7c/0x7c
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] ? mwait_idle+0x87/0x140
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] arch_cpu_idle+0xa/0x10
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] default_idle_call+0x25/0x30
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] cpu_startup_entry+0x29c/0x310
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] rest_init+0x72/0x80
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] start_kernel+0x471/0x47e
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] ? set_init_arg+0x55/0x55
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] x86_64_start_reservations+0x2a/0x2c
Nov 15 17:52:13 pleiades kernel: [ 661.395192] [] x86_64_start_kernel+0xe5/0xe8

I can
Re: XFS freeze (xfsaild blocked) with 4.2.5 (also 4.3)
On Sun, Nov 08, 2015 at 08:49:41AM -0800, Christoph Hellwig wrote:
> can you try the patch at:
>
> http://article.gmane.org/gmane.comp.file-systems.xfs.general/70984
>
> The symptoms sound surprisingly similar.

Thanks for the pointer. I'm running with the patch now, and will follow up if the problem reoccurs.

--
David A. Madore
( http://www.madore.org/~david/ )
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
XFS freeze (xfsaild blocked) with 4.2.5 (also 4.3)
Compiled a 4.2.5 kernel a few days ago. This morning, my machine was essentially unresponsive (couldn't log on) so I used alt-sysrq-{s,t,s,u,b} to reboot it, and after reboot it appears that the first suspicious message concerns xfsaild blocked for more than 300 seconds.

The disks and filesystems are good as far as I know, and I never had any problem with my previous (4.1.4) kernel, so I guess this is a regression in 4.2. (I also had a similar issue with 4.3, which is probably due to the same cause and also XFS-related: <https://lkml.org/lkml/2015/11/4/104>.)

Please contact me for further information, or just to make me feel less lonely and helpless when facing this kind of bug. :-(

Here are the first few and last lines of log, with links to the full logs and full config below:

Nov 8 08:08:14 vega kernel: [313800.046048] INFO: task xfsaild/md117:14100 blocked for more than 300 seconds.
Nov 8 08:08:14 vega kernel: [313800.046054] Not tainted 4.2.5-vega #1
Nov 8 08:08:14 vega kernel: [313800.046057] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 8 08:08:14 vega kernel: [313800.046060] xfsaild/md117 D 88023fc14a00 0 14100 2 0x
Nov 8 08:08:14 vega kernel: [313800.046066] 8800bbb07d18 0046 817244c0 8802345126c0
Nov 8 08:08:14 vega kernel: [313800.046071] 8800bbb07d08 810a35ba 8800bbb07d58 8800bbb08000
Nov 8 08:08:14 vega kernel: [313800.046075] 880231235528 880231235400 880234460e00
Nov 8 08:08:14 vega kernel: [313800.046080] Call Trace:
Nov 8 08:08:14 vega kernel: [313800.046090] [] ? try_to_del_timer_sync+0x4a/0x60
Nov 8 08:08:14 vega kernel: [313800.046095] [] schedule+0x32/0x80
Nov 8 08:08:14 vega kernel: [313800.046123] [] _xfs_log_force+0x154/0x240 [xfs]
Nov 8 08:08:14 vega kernel: [313800.046128] [] ? wake_up_q+0x70/0x70
Nov 8 08:08:14 vega kernel: [313800.046143] [] xfs_log_force+0x25/0x80 [xfs]
Nov 8 08:08:14 vega kernel: [313800.046159] [] ? xfs_trans_ail_cursor_done+0x11/0x30 [xfs]
Nov 8 08:08:14 vega kernel: [313800.046173] [] xfsaild+0x138/0x560 [xfs]
Nov 8 08:08:14 vega kernel: [313800.046188] [] ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
Nov 8 08:08:14 vega kernel: [313800.046203] [] ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
Nov 8 08:08:14 vega kernel: [313800.046207] [] kthread+0xc4/0xe0
Nov 8 08:08:14 vega kernel: [313800.046211] [] ? kthread_create_on_node+0x180/0x180
Nov 8 08:08:14 vega kernel: [313800.046215] [] ret_from_fork+0x3f/0x70
Nov 8 08:08:14 vega kernel: [313800.046219] [] ? kthread_create_on_node+0x180/0x180

Nov 8 11:18:01 vega kernel: [325187.487839] Showing busy workqueues and worker pools:
Nov 8 11:18:01 vega kernel: [325187.487839] workqueue xfs-log/md117: flags=0x14
Nov 8 11:18:01 vega kernel: [325187.487839] pwq 3: cpus=1 node=0 flags=0x0 nice=-20 active=2/256
Nov 8 11:18:01 vega kernel: [325187.487839] pending: xfs_buf_ioend_work [xfs], xfs_log_worker [xfs]
Nov 8 11:18:01 vega kernel: [325187.487839] workqueue xfs-log/md118: flags=0x14
Nov 8 11:18:01 vega kernel: [325187.487839] pwq 3: cpus=1 node=0 flags=0x0 nice=-20 active=1/256
Nov 8 11:18:01 vega kernel: [325187.487839] pending: xfs_log_worker [xfs]
Nov 8 11:18:01 vega kernel: [325187.487839] workqueue xfs-log/md115: flags=0x14
Nov 8 11:18:01 vega kernel: [325187.487839] pwq 3: cpus=1 node=0 flags=0x0 nice=-20 active=1/256
Nov 8 11:18:01 vega kernel: [325187.487839] pending: xfs_log_worker [xfs]
Nov 8 11:18:01 vega kernel: [325187.487839] workqueue xfs-log/md120: flags=0x14
Nov 8 11:18:01 vega kernel: [325187.487839] pwq 3: cpus=1 node=0 flags=0x0 nice=-20 active=1/256
Nov 8 11:18:01 vega kernel: [325187.487839] pending: xfs_log_worker [xfs]
Nov 8 11:18:01 vega kernel: [325187.487839] workqueue xfs-log/md113: flags=0x14
Nov 8 11:18:01 vega kernel: [325187.487839] pwq 3: cpus=1 node=0 flags=0x0 nice=-20 active=2/256
Nov 8 11:18:01 vega kernel: [325187.487839] in-flight: 31546:xfs_log_worker [xfs]
Nov 8 11:18:01 vega kernel: [325187.487839] pending: xfs_buf_ioend_work [xfs]
Nov 8 11:18:01 vega kernel: [325187.487839] workqueue xfs-log/md110: flags=0x14
Nov 8 11:18:01 vega kernel: [325187.487839] pwq 3: cpus=1 node=0 flags=0x0 nice=-20 active=1/256
Nov 8 11:18:01 vega kernel: [325187.487839] pending: xfs_log_worker [xfs]
Nov 8 11:18:01 vega kernel: [325187.487839] pool 0: cpus=0 node=0 flags=0x0 nice=0 workers=2 manager: 31448 idle: 31569
Nov 8 11:18:01 vega kernel: [325187.487839] pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=2 manager: 31616 idle: 31510
Nov 8 11:18:01 vega kernel: [325187.487839] pool 3: cpus=1 node=0 flags=0x0 nice=-20 workers=2 manager: 31499
Nov 8 11:18:01 vega kernel: [325187.487839] pool 4: cpus=2 node=0 flags=0x0 nice=0 workers=2 manager: 31503 idle: 31571
Nov 8 11:18:01 vega kernel: [325187.487839] pool 8: cpus=0-3
tasks hung forever (khugepaged blocked) with 4.3 kernel
With a 4.3 kernel I compiled two days ago, I had various processes stuck in 'D' state this morning, tried to unmount filesystems, which made things worse and froze everything. In case this means anything, reading from the hard drives with dd (e.g., dd if=/dev/sda bs=4096k count=1) worked (data was not in cache), but not for large amounts of data (count=64 hung forever).

If this is of any use, I did an alt-sysrq-t after an emergency sync, and I have the logs: full kernel log is at link below, here are the beginning and end of it:

Nov 4 08:10:10 vega kernel: [141900.078018] INFO: task khugepaged:173 blocked for more than 300 seconds.
Nov 4 08:10:10 vega kernel: [141900.078024] Not tainted 4.3.0-vega #1
Nov 4 08:10:10 vega kernel: [141900.078026] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 4 08:10:10 vega kernel: [141900.078029] khugepaged D 88023fc94f00 0 173 2 0x
Nov 4 08:10:10 vega kernel: [141900.078035] 88023eb97748 0046 88023e9a6d00 880236e38d40
Nov 4 08:10:10 vega kernel: [141900.078040] 88023fc1ec88 88023eb97728 88023eb98000 8800bba5e800
Nov 4 08:10:10 vega kernel: [141900.078044] 880234607518 880236e38d40 88023eb97760
Nov 4 08:10:10 vega kernel: [141900.078048] Call Trace:
Nov 4 08:10:10 vega kernel: [141900.078057] [] schedule+0x2e/0x70
Nov 4 08:10:10 vega kernel: [141900.078086] [] _xfs_log_force_lsn+0x155/0x2b0 [xfs]
Nov 4 08:10:10 vega kernel: [141900.078091] [] ? wake_up_q+0x70/0x70
Nov 4 08:10:10 vega kernel: [141900.078106] [] xfs_log_force_lsn+0x29/0x80 [xfs]
Nov 4 08:10:10 vega kernel: [141900.078123] [] ? xfs_iunpin_wait+0x14/0x20 [xfs]
Nov 4 08:10:10 vega kernel: [141900.078140] [] __xfs_iunpin_wait+0x88/0x120 [xfs]
Nov 4 08:10:10 vega kernel: [141900.078145] [] ? autoremove_wake_function+0x30/0x30
Nov 4 08:10:10 vega kernel: [141900.078161] [] xfs_iunpin_wait+0x14/0x20 [xfs]
Nov 4 08:10:10 vega kernel: [141900.078178] [] xfs_reclaim_inode+0x5d/0x320 [xfs]
Nov 4 08:10:10 vega kernel: [141900.078196] [] xfs_reclaim_inodes_ag+0x243/0x360 [xfs]
Nov 4 08:10:10 vega kernel: [141900.078214] [] xfs_reclaim_inodes_nr+0x2e/0x40 [xfs]
Nov 4 08:10:10 vega kernel: [141900.078229] [] xfs_fs_free_cached_objects+0x14/0x20 [xfs]
Nov 4 08:10:10 vega kernel: [141900.078233] [] super_cache_scan+0x179/0x180
Nov 4 08:10:10 vega kernel: [141900.078239] [] shrink_slab.part.62.constprop.72+0x1c5/0x340
Nov 4 08:10:10 vega kernel: [141900.078243] [] shrink_zone+0x166/0x170
Nov 4 08:10:10 vega kernel: [141900.078246] [] do_try_to_free_pages+0x173/0x350
Nov 4 08:10:10 vega kernel: [141900.078249] [] try_to_free_pages+0xa0/0x130
Nov 4 08:10:10 vega kernel: [141900.078253] [] __alloc_pages_nodemask+0x40e/0x790
Nov 4 08:10:10 vega kernel: [141900.078258] [] khugepaged+0x14a/0x11c0
Nov 4 08:10:10 vega kernel: [141900.078261] [] ? wait_woken+0x80/0x80
Nov 4 08:10:10 vega kernel: [141900.078265] [] ? use_zero_page_show+0x30/0x30
Nov 4 08:10:10 vega kernel: [141900.078269] [] kthread+0xc4/0xe0
Nov 4 08:10:10 vega kernel: [141900.078273] [] ? kthread_create_on_node+0x170/0x170
Nov 4 08:10:10 vega kernel: [141900.078277] [] ret_from_fork+0x3f/0x70
Nov 4 08:10:10 vega kernel: [141900.078280] [] ? kthread_create_on_node+0x170/0x170
Nov 4 08:10:10 vega kernel: [141900.078285] INFO: task kswapd0:443 blocked for more than 300 seconds.
Nov 4 08:10:10 vega kernel: [141900.078287] Not tainted 4.3.0-vega #1

Nov 4 10:10:41 vega kernel: [149114.106817]
Nov 4 10:10:41 vega kernel: [149114.106817] Showing busy workqueues and worker pools:
Nov 4 10:10:41 vega kernel: [149114.106817] workqueue events: flags=0x0
Nov 4 10:10:41 vega kernel: [149114.106817] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=9/256
Nov 4 10:10:41 vega kernel: [149114.106817] in-flight: 19153:do_sync_work
Nov 4 10:10:41 vega kernel: [149114.106817] pending: console_callback, dbs_timer, sysrq_reinject_alt_sysrq, cache_reap, vmstat_update, flush_to_ldisc, push_to_pool, kernfs_notify_workfn
Nov 4 10:10:41 vega kernel: [149114.106817] workqueue events_power_efficient: flags=0x80
Nov 4 10:10:41 vega kernel: [149114.106817] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
Nov 4 10:10:41 vega kernel: [149114.106817] pending: neigh_periodic_work
Nov 4 10:10:41 vega kernel: [149114.106817] workqueue writeback: flags=0x4e
Nov 4 10:10:41 vega kernel: [149114.106817] pwq 8: cpus=0-3 flags=0x4 nice=0 active=2/256
Nov 4 10:10:41 vega kernel: [149114.106817] in-flight: 14424:wb_workfn, 169(RESCUER):wb_workfn
Nov 4 10:10:41 vega kernel: [149114.106817] workqueue xfs-reclaim/md115: flags=0x4
Nov 4 10:10:41 vega kernel: [149114.106817] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
Nov 4 10:10:41 vega kernel: [149114.106817] pending: xfs_reclaim_worker [xfs]
Nov 4 10:10:41 vega kernel:
xhci-hcd does not detect USB3 host controller: how to debug/diagnose?
Dear list,

I have a USB 3.0 controller card connected to a PCI-E bus that appears on lspci as

06:00.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0 Host Controller [1033:0194] (rev 03) (prog-if 30 [XHCI])
        Subsystem: ASUSTeK Computer Inc. P8P67 Deluxe Motherboard [1043:8413]
        Flags: fast devsel, IRQ 10
        Memory at efdfe000 (64-bit, non-prefetchable) [size=8K]

My problem is that a 4.3 kernel that I compiled [see link below for config] does not see this peripheral at all: loading the xhci-hcd module does not provide any sort of diagnostic, nothing appears in dmesg (even when I insert something in a USB port), the module is not marked as in use and does not appear in lspci -v (there is no link /sys/bus/pci/devices/:06:00.0/driver). No error, no warning, no message, nothing: the module simply behaves as a no-op.

(I made sure to load xhci-hcd before ehci-hcd and uhci-hcd, in case this is important. I also tried loading xhci-hcd last, but none of this made any difference.)

This is almost certainly a config problem, because booting a live Ubuntu 14.04 LTS, the controller is recognized by xhci-hcd, as it should be:

[4.949337] xhci_hcd :06:00.0: xHCI Host Controller
[4.949344] xhci_hcd :06:00.0: new USB bus registered, assigned bus number 1
[4.949653] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002
[4.949655] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[4.949658] usb usb1: Product: xHCI Host Controller
[4.949660] usb usb1: Manufacturer: Linux 3.19.0-25-generic xhci-hcd
[4.949662] usb usb1: SerialNumber: :06:00.0
[4.949790] hub 1-0:1.0: USB hub found
[4.949801] hub 1-0:1.0: 2 ports detected

My question is: how can I diagnose this problem? What can cause the driver to fail to see the peripheral and ignore it without any kind of diagnostic? Is there any way I can debug this, or ask the xhci-hcd module "why are you ignoring 06:00.0?"?
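The check for a missing driver link can be automated for every USB controller at once. The following is only a sketch in Python, assuming the usual sysfs layout (a class file and, when a driver is bound, a driver symlink under each PCI device directory); the function name and the sysfs_root parameter are illustrative. It reports which USB host controllers have no driver bound, but it cannot say why a driver declined a device:

```python
import os
from pathlib import Path

USB_CLASS_PREFIX = "0x0c03"  # PCI class 0c03 = USB controller, as in the lspci output
PROG_IF = {"00": "UHCI", "10": "OHCI", "20": "EHCI", "30": "xHCI"}

def usb_controllers(sysfs_root="/sys"):
    """Map each PCI USB controller address to (interface, bound driver or None)."""
    result = {}
    devices = Path(sysfs_root) / "bus" / "pci" / "devices"
    if not devices.is_dir():
        return result
    for dev in sorted(devices.iterdir()):
        pci_class = (dev / "class").read_text().strip()
        if not pci_class.startswith(USB_CLASS_PREFIX):
            continue
        # The "driver" symlink only exists while a driver is bound to the device.
        link = dev / "driver"
        driver = os.path.basename(os.readlink(link)) if link.is_symlink() else None
        result[dev.name] = (PROG_IF.get(pci_class[-2:], "unknown"), driver)
    return result

if __name__ == "__main__":
    for address, (interface, driver) in usb_controllers().items():
        print(f"{address}: {interface} controller, driver: {driver or 'none bound'}")
```

Running it under both the custom kernel and the live system gives a quick side-by-side view of which controllers were claimed, and by which driver name, in each case.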
Links to more information:

Full config of custom 4.3 kernel:
http://www.madore.org/~david/.tmp/config.20151102.broken

dmesg output of said custom kernel:
http://www.madore.org/~david/.tmp/dmesg.20151102.broken

lspci -vvv output under custom kernel:
http://www.madore.org/~david/.tmp/lspci.20151102.broken

dmesg output of live Ubuntu where the module works:
http://www.madore.org/~david/.tmp/dmesg.20151102.ubuntu

lspci -vvv output under live Ubuntu:
http://www.madore.org/~david/.tmp/lspci.20151102.ubuntu

--
David A. Madore
( http://www.madore.org/~david/ )
Re: feature suggestion: implement SO_PEERCRED on local AF_INET/AF_INET6 sockets (allow uid-based identification on localhost)
On Wed, Oct 15, 2014 at 03:54:08PM -0700, Andy Lutomirski wrote:
> On Wed, Oct 15, 2014 at 3:30 PM, David Madore wrote:
> > Note that since the possibility of using SO_PEERCRED on AF_INET
> > sockets does not hitherto exist on Linux, we can be sure that nobody
> > uses it, so it's not like it might open vulnerabilities in existing
> > code. If you think it's insecure, it can be documented as such (by
> > comparing it with identd): I still think it's better than having no
> > control at all when binding to localhost, which is the present
> > situation (causing, e.g., CVE-2014-2914).
>
> This doesn't follow. *Everybody* uses connect on AF_INET.
>
> IMO anything that sends a caller's credentials needs to be explicit and
> opt-in.

I'm confused as to whether you mean "opt-in" on the side of the caller
(=process requesting the endpoint's credentials), or on that of the
endpoint (=authenticated process). On the one hand, I don't understand
what it could mean on the caller side; on the other hand, you mention
explicit support in OpenSSH, which would be the caller in my scenario.

So, in case I haven't been clear enough, the situation I have in mind
is: on "thishost", I run "ssh -L 14321:remotehost:4321 somehost" to
forward connections from local port 14321 of thishost (where ssh
listens on the loopback) to port 4321 of remotehost. Unfortunately,
everyone with an account on thishost can now connect to port 14321 and
effectively open a connection from somehost to remotehost on my behalf.

I think everyone agrees that this is a huge problem. But I don't
understand how you propose to remedy this. Patching ssh is an option,
but I don't see how to do it (ssh needs to make sure that the
connections it receives on 14321 are from the same uid, and this seems
impossible without the feature I'm discussing). Patching the kernel is
an option. Patching clients that connect to 14321, on the other hand,
is not, because there are many different ones, and their protocol is
defined by immutable Internet standards, so we have no latitude there
(for example, we can't ask a Web browser to connect to Unix domain
sockets: there simply isn't a URL scheme to refer to them). Adding
iptables rules is not an option if I'm not the system administrator on
thishost.

So, how can we solve this problem securely?

> I believe that there is no secure way to authenticate clients that
> currently don't authenticate themselves without changing the clients.
> That's the whole point: currently-secure are written under the
> assumption that they are not exercising their credentials. You can't
> safely change that without making it opt-in.

Then what are we to do, given that modifying the clients is impossible?

What about my proposal that user credentials would be returned only if
they refer to the same user as the caller and the caller is permitted
to ptrace the endpoint? This answers your objection about leaking
credentials: the caller could do anything at all with the other side
since it could ptrace it - we're just permitting a user to
authenticate their own sockets. A further sysctl could enable the use
of the call in more general cases, for those administrators who think
it should be allowed.

--
     David A. Madore
   ( http://www.madore.org/~david/ )
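The restricted rule proposed above can be written down as a small decision function. What follows is only a sketch of the proposed semantics in Python, not kernel code: the names PeerCred, NO_CRED and visible_peercred are hypothetical, and the boolean caller_may_ptrace merely stands in for whatever ptrace permission check the kernel would perform.

```python
from typing import NamedTuple


class PeerCred(NamedTuple):
    """The three fields SO_PEERCRED reports about the other endpoint."""
    pid: int
    uid: int
    gid: int


# What SO_PEERCRED currently reports on AF_INET/AF_INET6 sockets ("no data").
NO_CRED = PeerCred(pid=0, uid=-1, gid=-1)


def visible_peercred(caller_uid: int, peer: PeerCred,
                     caller_may_ptrace: bool) -> PeerCred:
    """Apply the restricted rule (hypothetical policy, for illustration).

    The peer's credentials are revealed only if the peer runs under the
    caller's own uid AND the caller would be allowed to ptrace it.
    In every other case the existing "no data" value is returned.
    """
    if peer.uid == caller_uid and caller_may_ptrace:
        return peer
    return NO_CRED


if __name__ == "__main__":
    mine = PeerCred(pid=4242, uid=1000, gid=1000)
    other = PeerCred(pid=4343, uid=1001, gid=1001)
    print(visible_peercred(1000, mine, True))    # own process: revealed
    print(visible_peercred(1000, other, True))   # other user: no data
    print(visible_peercred(1000, mine, False))   # not ptraceable: no data
```

With such a rule, a port forwarder would simply refuse any connection for which the returned uid differs from its own, which includes every connection reported as "no data".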
Re: feature suggestion: implement SO_PEERCRED on local AF_INET/AF_INET6 sockets (allow uid-based identification on localhost)
On Wed, Oct 15, 2014 at 07:41:48AM -0700, Andy Lutomirski wrote:
> On 10/15/2014 06:35 AM, David Madore wrote:
> > Given an AF_UNIX socket, the getsockopt(, SOL_SOCKET, SO_PEERCRED,,)
> > call allows one endpoint to authenticate the other endpoint's pid, uid
> > and gid.
> >
> > The call is valid on AF_INET and AF_INET6 sockets but returns no data
> > (pid=0, uid=-1, gid=-1). Obviously it is meaningless to try to get
> > such credentials from an INET/INET6 socket in general, but there is one
> > case where it would make sense: namely, when the endpoint is local
> > (i.e., when the socket is a connection to the same machine, e.g., when
> > connecting to 127.0.0.0/8 or ::1/128).
>
> I will object to adding it as described, for the same reason that I
> object to anything that extends the current model of socket-based
> credential passing. Ideally, credentials would *never* be implicitly
> captured by socket syscalls. We live in the real world, and SO_CRED
> exists, so I think the best we can do is to try to minimize its use.
>
> I can elaborate further, or you can IIRC search the archives for
> SCM_IDENTITY, and you can also look at CVE-2013-1979 for a nasty example
> of why this model is broken.

From what I understand, what was broken is mainly that the credentials
were evaluated when the write() system call took place rather than at
socket() or bind() time: this violates the Unix security model
(privilege control occurs when the file descriptor is created, not
when it is used). By contrast, it is consistent with Unix security
principles for credentials to be checked implicitly when binding a
socket (this happens when permissions are checked on the path when
binding or connecting on a Unix domain socket; and to allow binding to
secure ports in the INET domain; and so on). It seems to me that a
suid program that is willing to create or bind a socket on behalf of
its caller, without knowing exactly what it will be connecting to,
should intrinsically be treated as a security vulnerability, even when
it is not obviously exploitable.

Also, to continue with real-world examples, identd exists and is used
for identification on local networks (e.g. localhost), so the capture
of credentials already takes place. Unix programmers are aware of
this, and know that a privileged program should not bind a socket if
it does not want to leak privileges. (Another example is the use of
-m owner in iptables.)

And, of course, if Solaris already has this feature, there is some
experience with it. Has there been any documented vulnerability
relating to the fact that Solaris allows getpeerucred() to
authenticate locally connected AF_INET sockets?

Note that since the possibility of using SO_PEERCRED on AF_INET
sockets does not hitherto exist on Linux, we can be sure that nobody
uses it, so it's not like it might open vulnerabilities in existing
code. If you think it's insecure, it can be documented as such (by
comparing it with identd): I still think it's better than having no
control at all when binding to localhost, which is the present
situation (causing, e.g., CVE-2014-2914).

Because SO_PEERCRED currently returns {pid=0,uid=-1,gid=-1} on
AF_INET, we might still return this value if there is any risk that
the endpoint would be unwilling to share its credentials: for example,
this value might be returned if the other endpoint is not ptraceable
by the caller - this would still cover the essential use case, which
is for unprivileged users to authenticate the connections from their
own processes. Would this limitation assuage your worries about the
proposed feature?

The thing is, I don't see any other way the ssh port forwarding mess
can ever be improved. Do you have another solution in mind? Any
attempt to have some kind of authentication of local sockets that
requires participation on the client (authenticatee)'s part is doomed:
if modifying the protocol and/or client code is an option, we might as
well use some form of crypto / TLS. Or Unix-domain sockets. But what
are we supposed to do when modifying the client (to make it send
credentials, use crypto or connect on AF_UNIX) is not an option?

--
     David A. Madore
   ( http://www.madore.org/~david/ )
feature suggestion: implement SO_PEERCRED on local AF_INET/AF_INET6 sockets (allow uid-based identification on localhost)
Given an AF_UNIX socket, the getsockopt(, SOL_SOCKET, SO_PEERCRED,,)
call allows one endpoint to authenticate the other endpoint's pid, uid
and gid.

The call is valid on AF_INET and AF_INET6 sockets but returns no data
(pid=0, uid=-1, gid=-1). Obviously it is meaningless to try to get
such credentials from an INET/INET6 socket in general, but there is one
case where it would make sense: namely, when the endpoint is local
(i.e., when the socket is a connection to the same machine, e.g., when
connecting to 127.0.0.0/8 or ::1/128).

Being able to authenticate local INET/INET6 sockets would be immensely
valuable for a number of programs, to provide some kind of access
control to local sockets. For example, ssh allows port forwarding
using the -L and -D options: by default or by option (cf. the
GatewayPorts option of ssh), these port-forwarding sockets can be
restricted to localhost, but of course they cannot be restricted to
the user running ssh, which makes them a huge security problem. Many
programs suffer from the same problem (they restrict some kind of
connection to localhost, but they cannot restrict which user will be
able to connect). One cannot simply retort "these programs should be
using Unix-domain sockets instead": I don't think many browsers
support using a SOCKS proxy or an HTTP proxy over a Unix-domain
socket, and in the latter case I'm not even sure it would make sense
(protocol-wise).

If I am to believe
<URL: http://www.lehman.cuny.edu/cgi-bin/man-cgi?getpeerucred+3 >
("The system currently supports both sides of connection end-points
for local AF_UNIX, AF_INET, and AF_INET6 sockets"), Solaris, or at
least some version thereof, supports authentication of local AF_INET
and AF_INET6 sockets.

I think it would be wonderful if Linux had this. I'm willing to work
on the implementation if it is considered *a priori* acceptable for
inclusion.
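For reference, the existing AF_UNIX behaviour described at the start of this message can be observed from userland. The snippet below is a minimal sketch in Python for Linux only: it assumes that Python's socket module exposes SO_PEERCRED and that struct ucred is three 32-bit integers (pid, uid, gid); the helper name peer_credentials is made up for the example.

```python
import os
import socket
import struct

# struct ucred on Linux: pid_t pid; uid_t uid; gid_t gid (32 bits each).
# The fields are read as signed ints so that "no data" shows up as -1.
UCRED = struct.Struct("3i")


def peer_credentials(sock):
    """Return (pid, uid, gid) of the peer as reported by SO_PEERCRED."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, UCRED.size)
    return UCRED.unpack(raw)


if __name__ == "__main__":
    # Both ends of a socketpair belong to this process, so the peer
    # credentials reported on either end should be our own.
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with a, b:
        pid, uid, gid = peer_credentials(a)
        print("own  pid=%d uid=%d gid=%d"
              % (os.getpid(), os.geteuid(), os.getegid()))
        print("peer pid=%d uid=%d gid=%d" % (pid, uid, gid))
```

The suggestion in this message amounts to making the same call return meaningful values on a TCP socket whose other endpoint is on the same machine.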
The data seems to be available, since it is exposed in /proc/net/tcp,
/proc/net/tcp6 and the like (implementation details left aside, it is
merely a question of matching a line with opposite endpoints to the
current socket and returning it).

[In principle, a userland program can parse /proc/net/tcp so it does
not need the feature I am suggesting, but in practice parsing a text
file to communicate with the kernel is yucky at best, and probably not
very robust (e.g., /proc might not be mounted), and it would be very
difficult to convince, say, the OpenSSH authors to include code that
parses the Linux /proc/net/tcp format (or even link with a library
which does this) in order to add access control on ssh port-forwards:
having this under a more standard getsockopt() interface is cleaner
and opens at least some kind of hope that programs would agree to use
it.]

Question number 1: If this feature were implemented, would it be
considered acceptable for inclusion in the kernel? (If there is some
reason why it can't be accepted, I'd like to know in advance, to avoid
working in vain.)

Question number 2: A priori, how difficult would it be to implement
this? (As mentioned above, it seems trivial in principle to merely go
through the local endpoints to find a matching connection, but maybe
there are locking issues that I don't understand that make it much
more difficult than it would seem.) Any guidelines on implementation?
(I imagine one should try to fill sk->sk_peer_cred at connect time,
but I don't really know how difficult this might turn out.)

Any comments on the matter are welcome.

Happy hacking,

--
     David A. Madore
   ( http://www.madore.org/~david/ )
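The userland workaround mentioned in brackets above (matching a line of /proc/net/tcp whose endpoints are the reverse of the current socket's) might look roughly like the following Python sketch. It handles IPv4 only, the function names and the sample table are invented for illustration, and it assumes the usual column layout in which the uid is the eighth whitespace-separated field and addresses are printed as a native-endian 32-bit hex word. A lookup of this kind can also race with connections being opened and closed, which is part of why a getsockopt() interface would be cleaner.

```python
import socket
import sys


def encode_endpoint(ip, port, byteorder=sys.byteorder):
    """Format an IPv4 endpoint the way /proc/net/tcp prints it."""
    packed = socket.inet_aton(ip)  # four bytes, network byte order
    return "%08X:%04X" % (int.from_bytes(packed, byteorder), port)


def find_peer_uid(table, local, remote, byteorder=sys.byteorder):
    """Return the uid owning the other end of a local TCP connection.

    `local` and `remote` are (ip, port) pairs as seen from our socket
    (getsockname() and getpeername()); the peer's own entry lists them
    the other way round. Returns None when no such line exists.
    """
    want_local = encode_endpoint(remote[0], remote[1], byteorder)
    want_remote = encode_endpoint(local[0], local[1], byteorder)
    for line in table.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 8:
            continue
        if fields[1] == want_local and fields[2] == want_remote:
            return int(fields[7])  # the uid column
    return None


# Hypothetical table contents (little-endian host): a listener on
# 127.0.0.1:14321 owned by uid 1000, and one connection to it made
# from 127.0.0.1:50000 by a process of uid 1001.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:37F1 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 11111
   1: 0100007F:37F1 0100007F:C350 01 00000000:00000000 00:00000000 00000000  1000        0 22222
   2: 0100007F:C350 0100007F:37F1 01 00000000:00000000 00:00000000 00000000  1001        0 33333
"""

if __name__ == "__main__":
    uid = find_peer_uid(SAMPLE, ("127.0.0.1", 14321), ("127.0.0.1", 50000),
                        byteorder="little")
    print("peer uid:", uid)
```

On a live system the table would be read from /proc/net/tcp and the two pairs taken from the accepted socket.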
Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
Since I had a rare occasion to physically access the machine, I did
the following experiment: connect another machine to the serial
console, run

  while true ; do date ; cat /proc/slabinfo ; echo '***' ; sleep 3 ; done

and generate lots of IPv6 traffic through the box (as I mentioned, for
some reason, a Firefox compilation through ssh seems particularly
effective).

So I now have lots of slabinfo data and, beyond the initial WARNING, I
also got messages along the lines of "swapper: page allocation
failure: order:10, mode:0x4020". I put the full log in

<URL: http://www.madore.org/~david/.tmp/pollux-dump.0 >

(unfortunately a bit garbled, because sometimes the cat slabinfo was
interleaved with printk output, but there are still plenty of usable
lines of each sort).

For completeness, here's a sample message from a page allocation
failure, and a copy of /proc/slabinfo from just about that time (I
have no idea how to read this, but one thing I can say is that there
is no extraordinarily large number in this):

[ 567.757489] swapper: page allocation failure: order:10, mode:0x4020
[ 567.763815] [<c000d728>] (unwind_backtrace+0x0/0xf0) from [<c009a87c>] (warn_alloc_failed+0xcc/0x10c)
[ 567.773119] [<c009a87c>] (warn_alloc_failed+0xcc/0x10c) from [<c009ce48>] (__alloc_pages_nodemask+0x530/0x68c)
[ 567.783184] [<c009ce48>] (__alloc_pages_nodemask+0x530/0x68c) from [<c009cfb4>] (__get_free_pages+0x10/0x3c)
[ 567.793084] [<c009cfb4>] (__get_free_pages+0x10/0x3c) from [<c00c9fd0>] (kmalloc_order_trace+0x24/0xdc)
[ 567.802547] [<c00c9fd0>] (kmalloc_order_trace+0x24/0xdc) from [<c038d638>] (pskb_expand_head+0x68/0x298)
[ 567.812317] [<c038d638>] (pskb_expand_head+0x68/0x298) from [<bf0e93ec>] (ip6_forward+0x4d4/0x7bc [ipv6])
[ 567.822056] [<bf0e93ec>] (ip6_forward+0x4d4/0x7bc [ipv6]) from [<bf0ebebc>] (ipv6_rcv+0x2bc/0x3dc [ipv6])
[ 567.831751] [<bf0ebebc>] (ipv6_rcv+0x2bc/0x3dc [ipv6]) from [<c0394870>] (__netif_receive_skb+0x544/0x66c)
[ 567.841521] [<c0394870>] (__netif_receive_skb+0x544/0x66c) from [<bf1d9054>] (br_nf_pre_routing_finish_ipv6+0x10c/0x160 [bridge])
[ 567.853306] [<bf1d9054>] (br_nf_pre_routing_finish_ipv6+0x10c/0x160 [bridge]) from [<bf1d9ae8>] (br_nf_pre_routing+0x59c/0x67c [bridge])
[ 567.865673] [<bf1d9ae8>] (br_nf_pre_routing+0x59c/0x67c [bridge]) from [<c03bd2a4>] (nf_iterate+0x8c/0xb4)
[ 567.875387] [<c03bd2a4>] (nf_iterate+0x8c/0xb4) from [<c03bd328>] (nf_hook_slow+0x5c/0x118)
[ 567.883800] [<c03bd328>] (nf_hook_slow+0x5c/0x118) from [<bf1d3fa4>] (br_handle_frame+0x1b8/0x290 [bridge])
[ 567.893624] [<bf1d3fa4>] (br_handle_frame+0x1b8/0x290 [bridge]) from [<c03946f8>] (__netif_receive_skb+0x3cc/0x66c)
[ 567.904137] [<c03946f8>] (__netif_receive_skb+0x3cc/0x66c) from [<c031e254>] (mv643xx_eth_poll+0x540/0x734)
[ 567.913928] [<c031e254>] (mv643xx_eth_poll+0x540/0x734) from [<c0397390>] (net_rx_action+0x118/0x314)
[ 567.923215] [<c0397390>] (net_rx_action+0x118/0x314) from [<c0029924>] (__do_softirq+0xac/0x234)
[ 567.932058] [<c0029924>] (__do_softirq+0xac/0x234) from [<c0029f00>] (irq_exit+0x94/0x9c)
[ 567.940421] [<c0029f00>] (irq_exit+0x94/0x9c) from [<c00094b0>] (handle_IRQ+0x34/0x84)
[ 567.948392] [<c00094b0>] (handle_IRQ+0x34/0x84) from [<c04398d4>] (__irq_svc+0x34/0x98)
[ 567.956454] [<c04398d4>] (__irq_svc+0x34/0x98) from [<c0011d6c>] (kirkwood_enter_idle+0x4c/0x94)
[ 567.965299] [<c0011d6c>] (kirkwood_enter_idle+0x4c/0x94) from [<c0357a00>] (cpuidle_idle_call+0xc8/0x35c)
[ 567.974925] [<c0357a00>] (cpuidle_idle_call+0xc8/0x35c) from [<c0009764>] (cpu_idle+0x88/0xdc)
[ 567.983581] [<c0009764>] (cpu_idle+0x88/0xdc) from [<c05d8720>] (start_kernel+0x2a0/0x2f0)
[ 567.991893] Mem-info:
[ 567.994185] Normal per-cpu:
[ 567.996995] CPU0: hi: 186, btch: 31 usd: 84
[ 568.001815] active_anon:5592 inactive_anon:34 isolated_anon:0
[ 568.001820] active_file:2845 inactive_file:6118 isolated_file:0
[ 568.001825] unevictable:418 dirty:13 writeback:0 unstable:0
[ 568.001829] free:12507 slab_reclaimable:632 slab_unreclaimable:1124
[ 568.001835] mapped:2546 shmem:47 pagetables:152 bounce:0
[ 568.031126] Normal free:50028kB min:2884kB low:3604kB high:4324kB active_anon:22368kB inactive_anon:136kB active_file:11380kB inactive_file:24472kB unevictable:1672kB isolated(anon):0kB isolated(file):0kB present:520192kB mlocked:1672kB dirty:52kB writeback:0kB mapped:10184kB shmem:188kB slab_reclaimable:2528kB slab_unreclaimable:4496kB kernel_stack:584kB pagetables:608kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 568.071329] lowmem_reserve[]: 0 0
[ 568.074696] Normal: 1*4kB 1*8kB 8*16kB 7*32kB 0*64kB 2*128kB 1*256kB 4*512kB 20*1024kB 11*2048kB 1*4096kB = 50028kB
[ 568.085282] 9350 total pagecache pages
[ 568.089039] 0 pages in swap cache
[ 568.092363] Swap cache stats: add 0, delete 0, find 0/0
[ 568.097621] Free swap = 0kB
[ 568.100506] Total swap = 0kB
[ 568.117927] 131072 pages of RAM
[ 568.121087] 12771 free pages
[ 568.123972] 2839 reserved pages
[ 568.127140] 1361 slab pages
[ 568.129945] 8003 pages shared
[ 568.132919] 0 pages swap cached

slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
nf_conntrack_expect 0 0 176 23 1 : tunables 0 0 0 : slabdata 0 0 0
nf_conntrack_c06d1258 128 128 248 16 1 : tunables 0 0 0 : slabdata 8 8 0
ip6_dst_cache 72
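As a reading aid for the buddy-allocator line in the dump above ("Normal: 1*4kB 1*8kB ..."), the short Python sketch below re-adds the listed blocks and counts those big enough for an order:10 request. It assumes 4 kB pages, which is what 131072 pages of RAM on a 512MB machine implies; the function names are made up for the example. Whether such a block can actually be handed out to an atomic allocation also depends on the zone watermarks, which the sketch does not model.

```python
import re

PAGE_KB = 4  # assumption: 131072 pages of RAM for 512 MB means 4 kB pages


def parse_buddy_line(line):
    """Map block size in kB -> number of free blocks of that size."""
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}


def free_blocks_at_least(blocks, order):
    """Count free blocks large enough for an allocation of this order."""
    need_kb = PAGE_KB << order  # 2**order contiguous pages
    return sum(n for size, n in blocks.items() if size >= need_kb)


if __name__ == "__main__":
    line = ("Normal: 1*4kB 1*8kB 8*16kB 7*32kB 0*64kB 2*128kB 1*256kB "
            "4*512kB 20*1024kB 11*2048kB 1*4096kB = 50028kB")
    blocks = parse_buddy_line(line)
    total_kb = sum(size * n for size, n in blocks.items())
    print("sum of listed blocks: %d kB" % total_kb)
    print("an order:10 request needs %d kB contiguous" % (PAGE_KB << 10))
    print("free blocks of at least that size: %d"
          % free_blocks_at_least(blocks, 10))
```

The listed blocks do add up to the reported 50028kB, and an order:10 request means 4096 kB of physically contiguous memory, of which the dump lists a single free block.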
Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
On Fri, Aug 31, 2012 at 12:59:36PM +0200, David Madore wrote:
> On Wed, Aug 29, 2012 at 09:32:20PM +0200, Francois Romieu wrote:
> > David Madore <david...@madore.org> :
> > [...]
> > > I imagine it being somehow related to the fact that it operates a
> > > network bridge (I imagine this because I have another identical
> > > machine with exactly the same kernel and a very similar config but not
> > > running a bridge, and the warning never pops up).
> >
> > Could it not be a genuine allocation failure ?
>
> I have no idea.  How can I tell?  In any case, if having 512MB RAM
> isn't enough for the kernel in the router of a small home's network,
> that's a bug somewhere, isn't it?

PS: I'm also getting the following kind of messages from a wlan
interface that's on the bridge:

[  268.976317] ieee80211 phy0: failed to reallocate TX buffer
[  716.880515] ieee80211 phy0: failed to reallocate TX buffer
[ 1160.877677] ieee80211 phy0: failed to reallocate TX buffer

Could they be related?

--
     David A. Madore
   ( http://www.madore.org/~david/ )
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
On Wed, Aug 29, 2012 at 09:32:20PM +0200, Francois Romieu wrote:
> David Madore <david...@madore.org> :
> [...]
> > I imagine it being somehow related to the fact that it operates a
> > network bridge (I imagine this because I have another identical
> > machine with exactly the same kernel and a very similar config but not
> > running a bridge, and the warning never pops up).
>
> Could it not be a genuine allocation failure ?

I have no idea.  How can I tell?  In any case, if having 512MB RAM
isn't enough for the kernel in the router of a small home's network,
that's a bug somewhere, isn't it?

Also:

On Wed, Aug 29, 2012 at 02:25:48AM +0200, David Madore wrote:
> Is this worth investigating?  (I will, of course, provide the config
> file and any other relevant data if the answer is "yes".)  Is this
> potentially serious?  (I'm getting hard lockups on this machine which
> I suspect are due to hardware and unrelated to this, but if someone
> tells me it could be the cause, I'd be more than happy to believe it.)

I'm now inclined to believe the hard lockups are indeed related to
this (I can semi-reproducibly make them happen with only network
traffic - actually, with the messages of a compilation taking place on
another machine being routed through this box (over IPv6)).

So how can I help debug this?  (One difficulty is that I have only
remote access to this box, and it's not meant for experimenting with.)

--
     David A. Madore
   ( http://www.madore.org/~david/ )
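One rough way to approach the "How can I tell?" question is to read the
per-order free list that the kernel prints in its Mem-info dump (the
"Normal: 1*4kB 1*8kB ..." line quoted earlier in this thread).  The sketch
below is only an illustration: the helper name buddy_total_kb is made up
here, and it assumes the line has exactly the "count*sizekB" shape shown in
that dump.  It sums the free blocks and reports the largest block size that
still has free entries.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not from the thread): sum a buddy-allocator line
   such as "1*4kB 1*8kB 8*16kB" and return the total in kB.  On return,
   *largest_kb holds the biggest block size with a nonzero count. */
long buddy_total_kb(const char *line, long *largest_kb)
{
    assert(line != NULL && largest_kb != NULL);

    long total = 0;
    const char *p = line;
    *largest_kb = 0;

    for (;;) {
        long count, size;
        int used = 0;

        /* %n is only stored if the trailing "kB" literal also matched. */
        if (sscanf(p, " %ld*%ldkB%n", &count, &size, &used) != 2 || used == 0)
            break;

        total += count * size;
        if (count > 0 && size > *largest_kb)
            *largest_kb = size;
        p += used;
    }
    return total;
}
```

Fed the terms of the "Normal:" line from that dump, this adds up to the
50028kB the kernel itself prints, with 4096kB blocks still free.  That
would point away from plain memory exhaustion at that moment, though it
does not by itself say what the allocation was asking for.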
kernel 3.2.27 on arm: WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c()
Dear all, I hope this is the right place to send this sort of backtrace dump. I'm getting the following sort of dumps (below) on a 3.2.27 kernel on an arm/kirkwood (actually DreamPlug) machine that's used as a router. I imagine it being somehow related to the fact that it operates a network bridge (I imagine this because I have another identical machine with exactly the same kernel and a very similar config but not running a bridge, and the warning never pops up). Is this worth investigating? (I will, of course, provide the config file and any other relevant data if the answer is yes.) Is this potentially serious? (I'm getting hard lockups on this machine which I suspect are due to hardware and unrelated to this, but if someone tells me it could be the cause, I'd be more than happy to believe it.) [24711.204492] [ cut here ] [24711.209151] WARNING: at mm/page_alloc.c:2109 __alloc_pages_nodemask+0x1d4/0x68c() [24711.216667] Modules linked in: 8021q ath9k_htc mac80211 ath9k_common ath9k_hw ath cfg80211 bnep rfcomm sit tunnel4 sch_ingress cls_fw cls_u32 sch_sfq sch_htb pppoe pppox ppp_generic slhc bridge stp llc ip6t_REJECT ip6table_filter ip6table_mangle xt_NOTRACK ip6table_raw ip6_tables nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ftp nf_conntrack_ftp ipt_REJECT xt_conntrack iptable_filter ipt_MASQUERADE iptable_nat nf_nat xt_TCPMSS xt_tcpudp xt_mark iptable_mangle ip_tables x_tables nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 orion_wdt ipv6 snd_usb_audio snd_pcm snd_page_alloc snd_hwdep snd_usbmidi_lib snd_seq_midi snd_seq_midi_event snd_rawmidi btmrvl_sdio btmrvl snd_seq snd_timer snd_seq_device snd bluetooth soundcore [24711.280663] [c000d728] (unwind_backtrace+0x0/0xf0) from [c0022f74] (warn_slowpath_common+0x50/0x68) [24711.290124] [c0022f74] (warn_slowpath_common+0x50/0x68) from [c0022fa8] (warn_slowpath_null+0x1c/0x24) [24711.299845] [c0022fa8] (warn_slowpath_null+0x1c/0x24) from [c009caec] (__alloc_pages_nodemask+0x1d4/0x68c) [24711.309914] [c009caec] 
(__alloc_pages_nodemask+0x1d4/0x68c) from [c009cfb4] (__get_free_pages+0x10/0x3c) [24711.319805] [c009cfb4] (__get_free_pages+0x10/0x3c) from [c00c9fd0] (kmalloc_order_trace+0x24/0xdc) [24711.329269] [c00c9fd0] (kmalloc_order_trace+0x24/0xdc) from [c038d638] (pskb_expand_head+0x68/0x298) [24711.338901] [c038d638] (pskb_expand_head+0x68/0x298) from [bf0dd3ec] (ip6_forward+0x4d4/0x7bc [ipv6]) [24711.348638] [bf0dd3ec] (ip6_forward+0x4d4/0x7bc [ipv6]) from [bf0dfebc] (ipv6_rcv+0x2bc/0x3dc [ipv6]) [24711.358333] [bf0dfebc] (ipv6_rcv+0x2bc/0x3dc [ipv6]) from [c0394870] (__netif_receive_skb+0x544/0x66c) [24711.368106] [c0394870] (__netif_receive_skb+0x544/0x66c) from [bf1cd054] (br_nf_pre_routing_finish_ipv6+0x10c/0x160 [bridge]) [24711.379899] [bf1cd054] (br_nf_pre_routing_finish_ipv6+0x10c/0x160 [bridge]) from [bf1cdae8] (br_nf_pre_routing+0x59c/0x67c [bridge]) [24711.392271] [bf1cdae8] (br_nf_pre_routing+0x59c/0x67c [bridge]) from [c03bd2a4] (nf_iterate+0x8c/0xb4) [24711.401988] [c03bd2a4] (nf_iterate+0x8c/0xb4) from [c03bd328] (nf_hook_slow+0x5c/0x118) [24711.410540] [c03bd328] (nf_hook_slow+0x5c/0x118) from [bf1c7fa4] (br_handle_frame+0x1b8/0x290 [bridge]) [24711.420367] [bf1c7fa4] (br_handle_frame+0x1b8/0x290 [bridge]) from [c03946f8] (__netif_receive_skb+0x3cc/0x66c) [24711.430872] [c03946f8] (__netif_receive_skb+0x3cc/0x66c) from [c031e254] (mv643xx_eth_poll+0x540/0x734) [24711.440680] [c031e254] (mv643xx_eth_poll+0x540/0x734) from [c0397390] (net_rx_action+0x118/0x314) [24711.449970] [c0397390] (net_rx_action+0x118/0x314) from [c0029924] (__do_softirq+0xac/0x234) [24711.458817] [c0029924] (__do_softirq+0xac/0x234) from [c0029f00] (irq_exit+0x94/0x9c) [24711.467046] [c0029f00] (irq_exit+0x94/0x9c) from [c00094b0] (handle_IRQ+0x34/0x84) [24711.475007] [c00094b0] (handle_IRQ+0x34/0x84) from [c04398d4] (__irq_svc+0x34/0x98) [24711.483068] [c04398d4] (__irq_svc+0x34/0x98) from [c0011d6c] (kirkwood_enter_idle+0x4c/0x94) [24711.491908] [c0011d6c] 
(kirkwood_enter_idle+0x4c/0x94) from [c0357a00] (cpuidle_idle_call+0xc8/0x35c) [24711.501532] [c0357a00] (cpuidle_idle_call+0xc8/0x35c) from [c0009764] (cpu_idle+0x88/0xdc) [24711.510201] [c0009764] (cpu_idle+0x88/0xdc) from [c05d8720] (start_kernel+0x2a0/0x2f0) [24711.518512] ---[ end trace e1776fbe32468909 ]--- -- David A. Madore ( http://www.madore.org/~david/ ) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
setting the init process's personality?
Hi,

Is there a simple way (via a kernel boot option or config setting or -
if really necessary - a patch or something like that) to set the
personality for the init process?

I'm running an x86_64 kernel on a system whose userland is almost
entirely 32-bit (but needs an occasional 64-bit process to be run,
hence the choice of kernel), and I'd like `uname -m` to be i686 unless
I take special action.  So I think that means letting init (which is
indeed a 32-bit process) have the PER_LINUX32 personality (in case I'm
wrong about this, the output of uname -m is essentially what matters
to me).  So, where does the default come from?

--
     David A. Madore
    ([EMAIL PROTECTED],
     http://www.madore.org/~david/ )
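For reference, a process can switch its own personality with the
personality(2) system call, and the setting is inherited by its children;
this is the mechanism the setarch/linux32 wrappers rely on.  The sketch
below is not a kernel-side answer to the question: it only shows the
userland call and its effect on uname(2).  The function name
machine_under_persona is invented for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/personality.h>
#include <sys/utsname.h>

/* Hypothetical helper: switch the calling process to `persona`, then copy
   the machine string that uname(2) reports afterwards into buf.
   Returns 0 on success, -1 if either system call fails. */
int machine_under_persona(unsigned long persona, char *buf, size_t len)
{
    struct utsname u;

    assert(buf != NULL && len > 0);

    if (personality(persona) == -1)
        return -1;
    if (uname(&u) == -1)
        return -1;

    snprintf(buf, len, "%s", u.machine);
    return 0;
}
```

A tiny wrapper installed in place of init could, under the same
assumptions, call personality(PER_LINUX32) and then execv() the real init,
so that everything started from it inherits the 32-bit personality.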
any help on RTC-related config? (and "rtc_cmos: probe of 00:03 failed with error -16" error message)
Hi all,

I'm extremely confused as to what all the RTC-related config variables
in the kernel mean and what I'm supposed to do with them, and I wonder
if someone can help me or point me to some doc besides rtc.txt (which
I've read, of course).

I understand (from reading Documentation/rtc.txt) that there are two
different RTC driver systems for Linux: an "old" one, supporting only
one PC-AT-compatible RTC source, which drives /dev/rtc, and a "new"
one, supporting different sources, which drives /dev/rtc[0123...].
But I'm not sure which configuration variables enable which, whether
they should be enabled together or whether I should choose between the
twain, and what I should be using on my system anyway.

I sort of gathered (I hope not too incorrectly) that the "genrtc"
module is brought by the CONFIG_GEN_RTC configuration choice and that
it contains the "old" driver, whereas the "new" driver is split
between modules such as "rtc", "rtc_lib", "rtc_core" and actual
drivers like "rtc_cmos" - right? - and configured by such switches as
CONFIG_RTC_CLASS, CONFIG_RTC_LIB and CONFIG_RTC_DRV_CMOS.  There might
also be a CONFIG_RTC variable, about which I'm not sure.  I'm also
very confused about how HPETs tie into this, and what
CONFIG_HPET_EMULATE_RTC does, for example.

Now how do I know what's on my system?  (It's an ASUS P5W64 WS Pro
based x86_64.)  I certainly have some kind of CMOS clock that I can
configure in my BIOS, but I don't know about HPETs or other kinds of
RTC sources.

I tried using the following config (this is all with 2.6.22.10):

CONFIG_RTC=m
CONFIG_GEN_RTC=m
CONFIG_GEN_RTC_X=y
CONFIG_HPET=y
# CONFIG_HPET_RTC_IRQ is not set
CONFIG_RTC_LIB=m
CONFIG_RTC_CLASS=m
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
CONFIG_RTC_INTF_DEV_UIE_EMUL=y
CONFIG_RTC_DRV_TEST=m
CONFIG_RTC_DRV_CMOS=m
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_V3020 is not set

Now if I load the genrtc module (to use the "old" driver?), I get a
/dev/rtc which may or may not be satisfactory but the
dev.rtc.max-user-freq sysctl does not exist and ALSA does not use
snd_rtctimer.

If I try unloading genrtc and instead loading the rtc, rtc_lib,
rtc_core and rtc_cmos modules (to use the "new" driver?), I get the
following error in dmesg:

Real Time Clock Driver v1.12ac
rtc_cmos 00:03: rtc core: registered rtc_cmos as rtc0
rtc_cmos: probe of 00:03 failed with error -16

After which, attempts to, e.g., play a MIDI file with ALSA fail (only
a single note is played) and the following error occurs in dmesg:

rtc: lost some interrupts at 1024Hz.

So, why does rtc_cmos fail that way?  And how am I supposed to
configure RTC as a whole?

(I will, of course, gladly provide more information if requested.)

Thanks for any help!

--
     David A. Madore
    ([EMAIL PROTECTED],
     http://www.madore.org/~david/ )
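As a small aid for reading the "failed with error -16" message: driver
probe functions return negative errno values, and 16 is EBUSY on Linux.
The sketch below only decodes the number; the helper names are invented
for this note.  That the busy resource is one already claimed by the
legacy "rtc" module loaded alongside rtc_cmos is an assumption, not
something the log itself proves.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical helper: true if a probe return value means "resource busy". */
int probe_failed_busy(int probe_ret)
{
    return probe_ret == -EBUSY;
}

/* Hypothetical helper: human-readable text for a negative-errno probe result. */
const char *probe_error_text(int probe_ret)
{
    assert(probe_ret < 0);
    return strerror(-probe_ret);
}
```

With probe_ret set to -16, probe_failed_busy() is true, which would be
consistent with two drivers competing for the same RTC ports or IRQ.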
Re: patch/option to wipe memory at boot?
On Mon, Sep 17, 2007 at 11:11:52AM -0700, Jeremy Fitzhardinge wrote:
> Boot memtest86 for a little while before booting the kernel?  And if you
> haven't already run it for a while, then that would be your first step
> anyway.

Indeed, that does the trick, thanks for the suggestion.  So I can be
quite confident, now, that my RAM is sane and it's just that the BIOS
doesn't initialize it properly.

But I'd still like some way of filling the RAM when Linux starts (or
perhaps in the bootloader), because letting memtest86 run after every
cold reboot isn't a very satisfactory solution.

Happy hacking,

--
     David A. Madore
    ([EMAIL PROTECTED],
     http://www.madore.org/~david/ )
patch/option to wipe memory at boot?
Hi,

Is there a patch or a boot option or something which wipes all
available (physical) RAM at boot (or better, fills it with a fixed
signature like 0xdeadbeef)?  I'm getting phony ECC errors and I'd like
to test whether they go away when the RAM is properly initialized.
Also, I'd like to know exactly which parts of RAM are being used and
which are untouched since boot (hence the 0xdeadbeef signature).

If this patch/option doesn't exist, can anyone give me a hint as to
where and how it would be best to add this?  (I'm afraid I'm very
ignorant as to how Linux sets up its RAM mapping.)  I'm concerned
about x86 and x86_64.

PS: I'm not finicky: it's all right if a couple of megabytes at the
bottom of RAM are not scrubbed (I'm more interested in the top
gigabyte-or-so), especially if they're guaranteed to be used by the
kernel.

Happy hacking,

--
     David A. Madore
    ([EMAIL PROTECTED],
     http://www.madore.org/~david/ )
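To make the signature idea concrete, here is a userland illustration
only; it is not a kernel patch and says nothing about where in the boot
path such a fill would belong.  It fills a region with 0xdeadbeef and
later counts how many words still carry the signature, which is the
"untouched since boot" test described above.  The function names are
invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIGNATURE 0xdeadbeefu

/* Hypothetical helper: stamp every 32-bit word of the region with the signature. */
void fill_signature(uint32_t *buf, size_t nwords)
{
    assert(buf != NULL);
    for (size_t i = 0; i < nwords; i++)
        buf[i] = SIGNATURE;
}

/* Hypothetical helper: count the words that still hold the signature,
   i.e. that nobody has written to since the fill. */
size_t count_untouched(const uint32_t *buf, size_t nwords)
{
    size_t untouched = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < nwords; i++)
        if (buf[i] == SIGNATURE)
            untouched++;
    return untouched;
}
```

A real implementation would have to run before the page allocator hands
out memory and would have to skip whatever the kernel image and boot data
already occupy; both points are left out of this sketch.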
setsystz utility: set the kernel's system time zone
// Hi.  I felt the need to write the following little utility (which
// is mostly comments, really), to prevent my digital camera's image
// files from having incorrect modification times when I mount them
// under Linux.  Comments are welcome.
// Enjoy!

/// cut after ///

/* setsystz: set the Linux kernel's idea of the time zone */
/* David A. Madore <[EMAIL PROTECTED]>, 2007-02-19.  Public Domain */

/* Rationale: the Linux kernel needs to have some idea of time zone,
   notably because some filesystems (e.g. FAT) store file
   modification/access times in local time rather than UTC(=GMT)
   (which Unix uses internally for all timestamps).  This kernel
   (system) time zone is set through the settimeofday() system call;
   unfortunately, there does not seem to be a practical way to do it,
   and some (all?) Linux distributions get it wrong: e.g., simply
   because my CMOS clock is set to GMT (as recommended), my Debian
   init scripts apparently assume that any FAT filesystems I'll be
   mounting will have GMT timestamps (uh?).

   Note: IMHO, the whole idea of having a per-system global time zone
   is probably wrong, and FAT mounts should probably better use an
   ad hoc option to specify GMT offset (defaulting to the libc time
   zone for the mount process), and CMOS clock thingies should be kept
   separate. */

/* What this does: called without arguments, setsystz sets the
   kernel's time zone to the userland's time zone (typically from the
   /etc/localtime file, overridden by the TZ environment variable if
   it exists).  With an explicit argument, setsystz sets the kernel's
   time zone to that many minutes west of GMT (see settimeofday(2) man
   page for explanations).

   This program takes care _not_ to change/warp the system clock while
   changing the time zone: see comments on avoid_linux_braindeadness()
   below. */

/* How to use: probably just call "setsystz" (as root) before mounting
   a FAT filesystem, if the files it contains are in your usual system
   time zone.  If they are, e.g., from the Shanghai time zone, then
   use "TZ=Asia/Shanghai setsystz" before mounting.

   Note: it's probably wiser not to do this while there are existing
   mounted FAT filesystems. */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>

int
auto_minutes (void)
     /* Determine localtime GMT offset and return it in minutes west
        of GMT (as expected by a struct timezone).  This will
        typically use the TZ environment variable if it is defined or,
        as a fallback, the contents of /etc/localtime (see libc
        documentation for more details). */
{
  time_t now = time(NULL);
  struct tm *lt = localtime (&now);
  long int gmtoff = lt->tm_gmtoff;

  fprintf (stderr, "GMT offset=%lds\n", gmtoff);
  if ( gmtoff%60 )
    fprintf (stderr, "warning: GMT offset %lds "
             "is not an integer number of minutes\n", gmtoff);
  gmtoff /= 60;
  return -gmtoff;
}

void
avoid_linux_braindeadness (void)
     /* We ___DO NOT___ want to change the system time, only the
        system time zone!  Since Linux does something special
        (warp_clock() semantics) the very first time settimeofday() is
        called with tz!=NULL, we call it once with tz pointing to a
        GMT-filled structure, i.e., tz->tz_minuteswest==0 (so the
        clock won't be warped).  The settimeofday(2) man page claims
        that tz->tz_minuteswest==0 will not count toward cancelling
        the warp_clock() semantics, i.e., that our trick does not
        work: fortunately, it is wrong (at least under 2.6.19 and
        thereabouts) and our trick works.  Note however that this
        still resets the time interpolator the first time:
        unfortunately there does not seem to be a way around this
        problem.  See /usr/src/linux/kernel/time.c for details about
        the whole mess.  -- David A. Madore 2007-02-19 */
{
  struct timezone tz;

  memset (&tz, 0, sizeof(struct timezone));
  tz.tz_minuteswest = 0;
  tz.tz_dsttime = 0;
  settimeofday (NULL, &tz);
}

int
main (int argc, char *argv[])
{
  int minuteswest;

  if ( argc == 1 )
    minuteswest = auto_minutes();
  else if ( argc == 2 )
    {
      if ( sscanf (argv[1], "%d", &minuteswest) != 1 )
        {
          fprintf (stderr, "invalid argument: %s\n", argv[1]);
          exit (2);
        }
    }
  else
    {
      fprintf (stderr, "wrong number of arguments\n");
      exit (2);
    }

  struct timezone tz;
  memset (&tz, 0, sizeof(struct timezone));
  tz.tz_minuteswest = minuteswest;
  tz.tz_dsttime = 0;
  fprintf (stderr, "setting system time zone to tz_minuteswest=%d\n",
           minuteswest);
#if 1
  avoid_linux_braindeadness ();
  if ( settimeofday (NULL, &tz) == -1 )
    {
      perror ("settimeofday()");
      exit (EXIT_FAILURE);
    }
#endif
  return 0;
}

/// cut before ///
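A note on the sign convention, since it is easy to get backwards: the
libc tm_gmtoff field counts seconds east of GMT, while the kernel's
tz_minuteswest counts minutes west.  The helper below (its name is made
up for this note) isolates just that conversion, under the assumption
that the offset is a whole number of minutes.

```c
#include <assert.h>

/* Hypothetical helper: convert a libc-style offset (seconds east of GMT)
   into the kernel's convention (minutes west of GMT). */
int minutes_west_from_gmtoff(long gmtoff_seconds)
{
    assert(gmtoff_seconds % 60 == 0);
    return (int) -(gmtoff_seconds / 60);
}
```

For the Shanghai example mentioned above, the zone is 8 hours east of
GMT, i.e. gmtoff is 28800 seconds, so the value handed to settimeofday()
is -480; a zone 5 hours west of GMT (gmtoff of -18000 seconds) gives 300.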
Re: [patch] netfilter: implement TCPMSS target for IPv6
On Sun, Jan 14, 2007 at 09:10:45PM +0100, Jan Engelhardt wrote:
> On Jan 14 2007 20:20, David Madore wrote:
> >Implement TCPMSS target for IPv6 by shamelessly copying from
> >Marc Boucher's IPv4 implementation.
>
> Would not it be worthwhile to merge ipt_TCPMSS and
> ip6t_TCPMSS to xt_TCPMSS instead?

It may be, but I'm afraid that's outside my competence.  I happened to
need ip6t_TCPMSS badly and soon, so I went for the quickest solution.
Of course, I'd appreciate it if someone were to do it in a better way.

Happy hacking,

-- 
     David A. Madore
    ([EMAIL PROTECTED],
     http://www.madore.org/~david/ )
[patch] netfilter: implement TCPMSS target for IPv6
Implement TCPMSS target for IPv6 by shamelessly copying from
Marc Boucher's IPv4 implementation.

Signed-off-by: David A. Madore <[EMAIL PROTECTED]>

---

Note: The patch for ip6tables to make use of this module can be
obtained from ftp://quatramaran.ens.fr/pub/madore/misc/ip6t-TCPMSS/
(also contains a version of this same patch for 2.6.19.2).

 include/linux/netfilter_ipv6/ip6t_TCPMSS.h |   10 ++
 net/ipv6/netfilter/Kconfig                 |   26
 net/ipv6/netfilter/Makefile                |    1 +
 net/ipv6/netfilter/ip6t_TCPMSS.c           |  225
 4 files changed, 262 insertions(+), 0 deletions(-)

diff --git a/include/linux/netfilter_ipv6/ip6t_TCPMSS.h b/include/linux/netfilter_ipv6/ip6t_TCPMSS.h
new file mode 100644
index 000..412d1cb
--- /dev/null
+++ b/include/linux/netfilter_ipv6/ip6t_TCPMSS.h
@@ -0,0 +1,10 @@
+#ifndef _IP6T_TCPMSS_H
+#define _IP6T_TCPMSS_H
+
+struct ip6t_tcpmss_info {
+	u_int16_t mss;
+};
+
+#define IP6T_TCPMSS_CLAMP_PMTU 0x
+
+#endif /*_IP6T_TCPMSS_H*/
diff --git a/net/ipv6/netfilter/Kconfig b/net/ipv6/netfilter/Kconfig
index adcd613..3890a59 100644
--- a/net/ipv6/netfilter/Kconfig
+++ b/net/ipv6/netfilter/Kconfig
@@ -154,6 +154,32 @@ config IP6_NF_TARGET_REJECT
 	  To compile it as a module, choose M here.  If unsure, say N.
 
+config IP6_NF_TARGET_TCPMSS
+	tristate "TCPMSS target support"
+	depends on IP6_NF_IPTABLES
+	---help---
+	  This option adds a `TCPMSS' target, which allows you to alter the
+	  MSS value of TCP SYN packets, to control the maximum size for that
+	  connection (usually limiting it to your outgoing interface's MTU
+	  minus 60).
+
+	  This is used to overcome criminally braindead ISPs or servers which
+	  block ICMPv6 Packet Too Big packets.  The symptoms of this
+	  problem are that everything works fine from your Linux
+	  firewall/router, but machines behind it can never exchange large
+	  packets:
+	  1) Web browsers connect, then hang with no data received.
+	  2) Small mail works fine, but large emails hang.
+	  3) ssh works fine, but scp hangs after initial handshaking.
+
+	  Workaround: activate this option and add a rule to your firewall
+	  configuration like:
+
+	  ip6tables -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
+		-j TCPMSS --clamp-mss-to-pmtu
+
+	  To compile it as a module, choose M here.  If unsure, say N.
+
 config IP6_NF_MANGLE
 	tristate "Packet mangling"
 	depends on IP6_NF_IPTABLES
diff --git a/net/ipv6/netfilter/Makefile b/net/ipv6/netfilter/Makefile
index ac1dfeb..616a006 100644
--- a/net/ipv6/netfilter/Makefile
+++ b/net/ipv6/netfilter/Makefile
@@ -19,6 +19,7 @@ obj-$(CONFIG_IP6_NF_TARGET_LOG) += ip6t_LOG.o
 obj-$(CONFIG_IP6_NF_RAW) += ip6table_raw.o
 obj-$(CONFIG_IP6_NF_MATCH_HL) += ip6t_hl.o
 obj-$(CONFIG_IP6_NF_TARGET_REJECT) += ip6t_REJECT.o
+obj-$(CONFIG_IP6_NF_TARGET_TCPMSS) += ip6t_TCPMSS.o
 
 # objects for l3 independent conntrack
 nf_conntrack_ipv6-objs  :=  nf_conntrack_l3proto_ipv6.o nf_conntrack_proto_icmpv6.o nf_conntrack_reasm.o
diff --git a/net/ipv6/netfilter/ip6t_TCPMSS.c b/net/ipv6/netfilter/ip6t_TCPMSS.c
new file mode 100644
index 000..ab492c3
--- /dev/null
+++ b/net/ipv6/netfilter/ip6t_TCPMSS.c
@@ -0,0 +1,225 @@
+/*
+ * This is a module which is used for setting the MSS option in TCP packets.
+ *
+ * Copyright (C) 2007 David Madore <[EMAIL PROTECTED]>
+ *
+ * Shamelessly based on net/ipv4/netfilter/ipt_TCPMSS.c
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+#include <linux/skbuff.h>
+
+#include <net/ipv6.h>
+#include <net/tcp.h>
+
+#include <linux/netfilter_ipv6/ip6_tables.h>
+#include <linux/netfilter_ipv6/ip6t_TCPMSS.h>
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("David Madore <[EMAIL PROTECTED]>");
+MODULE_DESCRIPTION("ip6tables TCP MSS modification module");
+
+static inline unsigned int
+optlen(const u_int8_t *opt, unsigned int offset)
+{
+	/* Beware zero-length options: make finite progress */
+	if (opt[offset] <= TCPOPT_NOP || opt[offset+1] == 0)
+		return 1;
+	else
+		return opt[offset+1];
+}
+
+static unsigned int
+ip6t_tcpmss_target(struct sk_buff **pskb,
+		   const struct net_device *in,
+		   const struct net_device *out,
+		   unsigned int hooknum,
+		   const struct xt_target *target,
+		   const void *targinfo)
+{
+	const struct ip6t_tcpmss_info *tcpmssinfo = targinfo;
+	struct tcphdr *tcph;
+	struct ipv6hdr *ipv6h;
+	u_int8_t nexthdr;
+	int tcphoff;
+	u_int16_t tcplen, newmss;
+	__be16 newiplen, oldval;
+	unsigned int i;
+	u_int8_t *opt;
+
+	if (!skb_make_writable(psk
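The archived patch is cut off before the part that walks the TCP options, so the following is only a userspace sketch of the general idea, not the code from the patch. It assumes the "MTU minus 60" figure from the Kconfig help (40 bytes of fixed IPv6 header plus 20 bytes of TCP header without options), reuses the "make finite progress" rule from optlen(), and leaves out the TCP checksum update that a real target has to perform. The names mss_for_mtu, opt_step and clamp_mss_option are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* TCP option kinds and lengths used below (standard values). */
enum { SK_TCPOPT_NOP = 1, SK_TCPOPT_MSS = 2, SK_TCPOLEN_MSS = 4 };

/* MSS that fits a given path MTU over IPv6: MTU minus 40 bytes of IPv6
   header minus 20 bytes of TCP header.  Returns 0 for an absurdly
   small MTU.  */
uint16_t mss_for_mtu(uint16_t mtu)
{
	return mtu > 60 ? (uint16_t)(mtu - 60) : 0;
}

/* Length of the option at offset off; never returns 0, so a malformed
   zero-length option cannot stall the loop (same idea as optlen()).
   The caller guarantees that opt[off + 1] is readable.  */
static size_t opt_step(const uint8_t *opt, size_t off)
{
	if (opt[off] <= SK_TCPOPT_NOP || opt[off + 1] == 0)
		return 1;
	return opt[off + 1];
}

/* Scan the len bytes of TCP options at opt.  If an MSS option is found,
   lower it to newmss when it is larger, and return 1; return 0 when
   there is no MSS option.  The checksum is NOT adjusted here.  */
int clamp_mss_option(uint8_t *opt, size_t len, uint16_t newmss)
{
	size_t i;

	for (i = 0; i + SK_TCPOLEN_MSS <= len; i += opt_step(opt, i)) {
		if (opt[i] == SK_TCPOPT_MSS && opt[i + 1] == SK_TCPOLEN_MSS) {
			uint16_t oldmss =
				(uint16_t)((opt[i + 2] << 8) | opt[i + 3]);

			if (oldmss > newmss) {
				opt[i + 2] = (uint8_t)(newmss >> 8);
				opt[i + 3] = (uint8_t)(newmss & 0xff);
			}
			return 1;
		}
	}
	return 0;
}
```

With the minimum IPv6 MTU of 1280, mss_for_mtu() gives 1220, so a SYN advertising an MSS of 1460 would be rewritten to 1220 while a SYN already advertising less would be left alone.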
Re: [Patch] Support UTF-8 scripts
On Sun, Aug 14, 2005 at 08:00:31PM +, Lee Revell wrote:
> We write code in ASCII, dammit.

http://www.madore.org/~david/weblog/2004-12.html#d.2004-12-03.0813

:-)

-- 
     David A. Madore
    ([EMAIL PROTECTED],
     http://www.madore.org/~david/ )
[slightly OT] what's in RAM at 0x3ffe5000 ?
Hi.

I have ECC RAM on my system and I wanted to check it, so (because
there doesn't seem to be any Linux ECC support for my P5WD2
motherboard) I wrote my own kernel module[#] to interrogate the
northbridge.  I was a little annoyed to find that the northbridge had
reported an ECC error, and a multi-bit uncorrectable error at that!,
at memory location 0x3ffe5000.  I cleared the error flag and ran
multiple checks and couldn't find any other error, so I started
thinking about this address and realized that it was very near the
top of memory (I have 1GB RAM).  In fact, it is reported as
"reserved" by Linux:

BIOS-provided physical RAM map:
 BIOS-e820: - 0009fc00 (usable)
 BIOS-e820: 0009fc00 - 000a (reserved)
 BIOS-e820: 000e4000 - 0010 (reserved)
 BIOS-e820: 0010 - 3ff8 (usable)
 BIOS-e820: 3ff8 - 3ff8e000 (ACPI data)
 BIOS-e820: 3ff8e000 - 3ffe (ACPI NVS)
 BIOS-e820: 3ffe - 4000 (reserved)
 BIOS-e820: ffb0 - 0001 (reserved)

Now /dev/mem won't work that far so I can't read what's there, but I
suspect there's something very strange in that place and the ECC
error reported by the northbridge is not really an error.
Interestingly enough, I always get an error at 0x3ffe5000 when I
boot, and then later on I get an error at 0x3fff0580.  This is
consistent: I always get those "errors" at the same memory locations,
and they're always multiple-bit errors.

So here are my questions:

* What does "reserved" mean in the BIOS physical RAM table?  Reserved
  by whom?  Who owns my memory?  Do all my base are belong to him?

* What's the simplest way, under Linux (whether in userspace or in
  kernel), to read the contents of a _physical_ memory location,
  given that /dev/mem won't do it:

vega david ~ $ sudo dd if=/dev/mem bs=4096 count=1 skip=262117 of=/tmp/page
dd: reading `/dev/mem': Bad address
0+0 records in
0+0 records out
0 bytes transferred in 0.000118 seconds (0 bytes/sec)

* Why am I getting ECC errors in this strange place, and only there?
  Do I need to worry about them?  (I mean, if it's something strange
  like memory-mapped I/O I would expect the northbridge to know about
  it and not report an error!)

-- 
     David A. Madore
    ([EMAIL PROTECTED],
     http://www.madore.org/~david/ )

[#] Source available on demand - it's pretty damn ugly, I wouldn't
want Mr. Torvalds to see it!
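For readers wondering where skip=262117 in the dd command comes from: with bs=4096, dd's skip= counts 4096-byte blocks, so it is simply the physical address divided by the page size. A small sketch of that arithmetic (the helper names page_index and page_offset are made up for illustration):

```c
#include <stdint.h>

/* Index of the page that contains physical address addr; this is the
   value to give dd as skip= when bs= equals page_size.  */
uint64_t page_index(uint64_t addr, uint64_t page_size)
{
	return addr / page_size;
}

/* Byte offset of addr inside its page.  */
uint64_t page_offset(uint64_t addr, uint64_t page_size)
{
	return addr % page_size;
}
```

For 0x3ffe5000 and 4096-byte pages this gives page 262117 at offset 0; the second address mentioned, 0x3fff0580, falls in page 262128 at offset 0x580.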
how do I read CPU temperature in ACPI? (w/ P5WD2 motherboard)
Hi.

I apologize for what is surely a stupid question: I understand that
ACPI should be able to tell me what my CPU's temperature is (I have a
severe overheating problem and I am trying to solve it by
underclocking somewhat, but I need to be able to read the temperature
to do anything worthwhile), but no matter what ACPI modules I load, I
can't find any hint of a CPU temperature reading anywhere below
/proc/acpi (the /proc/acpi/thermal_zone/ directory, for example,
remains empty).  That's with the "thermal", "processor" and "fan"
modules loaded (and a few others; full listing follows signature).

I tried to load the asus_acpi module also, since I have an ASUS
motherboard (a P5WD2 Premium - precise details are given below
signature), but I got a "No such device" error.  Does that mean my
motherboard is unsupported and I cannot read my CPU temperature at
all?  (But I thought the whole _point_ of ACPI was that it was an
abstraction away from the hardware: so why is there such a thing as
"Asus" ACPI?)  Or else, what am I doing wrong?

-- 
     David A. Madore
    ([EMAIL PROTECTED],
     http://www.madore.org/~david/ )

Full details:

hardware config:
Asus P5WD2 Premium motherboard
Intel 955X chipset
Intel Pentium 4 550 @3.4GHz processor

lsmod output:
Module                  Size  Used by
ac                      5892  0
container               5504  0
fan                     5636  0
video                  17156  0
thermal                14600  0
processor              24328  1 thermal
nvidia               3714948  12
agpgart                37580  1 nvidia
ide_cd                 44804  0
cdrom                  42528  1 ide_cd
af_packet              25480  4
ip6table_filter         3840  1
ip6_tables             21120  1 ip6table_filter
ipt_REJECT              6656  8
ipt_TOS                 3584  1
reiserfs              281716  7
snd_emu10k1_synth       8960  0
snd_emu10k1           120068  1 snd_emu10k1_synth
snd_ac97_codec         84216  1 snd_emu10k1
snd_pcm_oss            54688  0
snd_mixer_oss          20736  1 snd_pcm_oss
snd_pcm                96900  3 snd_emu10k1,snd_ac97_codec,snd_pcm_oss
snd_page_alloc         11140  2 snd_emu10k1,snd_pcm
snd_emux_synth         39680  1 snd_emu10k1_synth
snd_seq_virmidi         9088  1 snd_emux_synth
snd_seq_midi_emul       8576  1 snd_emux_synth
snd_seq_dummy           4740  0
snd_seq_oss            36992  0
snd_seq_midi           10144  0
snd_rawmidi            27040  3 snd_emu10k1,snd_seq_virmidi,snd_seq_midi
snd_seq_midi_event      8960  3 snd_seq_virmidi,snd_seq_oss,snd_seq_midi
snd_seq                57360  9 snd_emux_synth,snd_seq_virmidi,snd_seq_midi_emul,snd_seq_dummy,snd_seq_oss,snd_seq_midi,snd_seq_midi_event
snd_timer              27012  3 snd_emu10k1,snd_pcm,snd_seq
snd_seq_device          9740  8 snd_emu10k1_synth,snd_emu10k1,snd_emux_synth,snd_seq_dummy,snd_seq_oss,snd_seq_midi,snd_rawmidi,snd_seq
snd_hwdep              10400  2 snd_emu10k1,snd_emux_synth
snd                    57956  13 snd_emu10k1,snd_ac97_codec,snd_pcm_oss,snd_mixer_oss,snd_pcm,snd_emux_synth,snd_seq_virmidi,snd_seq_oss,snd_rawmidi,snd_seq,snd_timer,snd_seq_device,snd_hwdep
soundcore              11232  1 snd
snd_util_mem            5632  2 snd_emu10k1,snd_emux_synth
ipv6                  272544  24
mousedev               13220  2
iptable_mangle          3968  1
iptable_nat            25268  0
ip_conntrack           47208  1 iptable_nat
iptable_filter          4096  1
ip_tables              23296  5 ipt_REJECT,ipt_TOS,iptable_mangle,iptable_nat,iptable_filter
capability              5896  0
commoncap               8064  1 capability
ext2                   72328  6
ext3                  146952  0
jbd                    64920  1 ext3
mbcache                11268  2 ext2,ext3
ppp_deflate             7424  0
zlib_deflate           23704  1 ppp_deflate
bsd_comp                7168  0
tun                    13056  1
ppp_async              13312  1
ppp_generic            31892  7 ppp_deflate,bsd_comp,ppp_async
slhc                    8064  1 ppp_generic
crc_ccitt               3072  1 ppp_async
dummy                   4100  0
dm_mod                 62368  0
ohci1394               37300  0
ieee1394              107704  1 ohci1394
usbhid                 36832  0
ohci_hcd               23172  0
uhci_hcd               34960  0
usbcore               125820  4 usbhid,ohci_hcd,uhci_hcd
e1000                 108724  0
rtc                    14664  0
unix                   31248  364

lspci output:
:00:00.0 Host bridge: Intel Corp.: Unknown device 2774 (rev 81)
:00:01.0 PCI bridge: Intel Corp.: Unknown device 2775 (rev 81)
:00:1b.0 0403: Intel Corp.: Unknown device 27d8 (rev 01)
:00:1c.0 PCI bridge: Intel Corp.: Unknown device 27d0 (rev 01)
:00:1c.1 PCI bridge: Intel Corp.: Unknown device 27d2 (rev 01)
:00:1c.2 PCI bridge: Intel Corp.: Unknown device 27d4 (rev 01)
:00:1c.3 PCI bridge: Intel Corp.: Unknown device 27d6 (rev 01)
Re: capabilities patch (v 0.1)
On Tue, Aug 09, 2005 at 11:36:00PM +0200, Bodo Eggert wrote:
> 1) I wouldn't want an exploited service to gain any privileges, even by
>    chaining userspace exploits (e.g. exec sendmail < exploitstring).  For
>    most services, I'd like CAP_EXEC being unset (but it doesn't exist).

I intend to add a couple of capabilities which are normally available
to all user processes, including capability to exec(), capability to
fork() and a couple of others (maybe a capability to perform any kind
of write operation, but that seems a bit more difficult to implement).

So keep an eye open[#] for future versions of my patch.

-- 
     David A. Madore
    ([EMAIL PROTECTED],
     http://www.madore.org/~david/ )

[#] On the other hand, I have a strong tendency not to finish anything
I start :-( so maybe this is all just vaporware.
Re: capabilities patch (v 0.1)
On Tue, Aug 09, 2005 at 01:52:06PM -0700, Chris Wright wrote:
> * Bodo Eggert ([EMAIL PROTECTED]) wrote:
> > How are you going to tell processes that may exec suid (or set-capability-)
> > programs from those that aren't supposed to gain certain capabilities?
>
> typically you'd expect exec suid will reset to full caps.

suid exec _must_ reset to full caps or we have the sendmail disaster
again.  However, that is _if_ execve() succeeds.  It is quite possible
that execve() should fail, and that is precisely what my patch does:
if a process has bounded capabilities, it _may not_ exec suid.

-- 
     David A. Madore
    ([EMAIL PROTECTED],
     http://www.madore.org/~david/ )
Re: capabilities patch (v 0.1)
On Tue, Aug 09, 2005 at 04:28:31PM -0400, [EMAIL PROTECTED] wrote:
> On Tue, 09 Aug 2005 07:26:21 +0200, David Madore said:
> > * Second, a much more extensive change, the patch introduces a third
> > set of capabilities for every process, the "bounding" set. Normally
> > the bounding set has every capability in it
>
> How is this different in semantics from the existing 'permitted' capset?

The permitted set is the set of capabilities really available to the process (though they may be temporarily dropped by removing them from the effective set, they are still available to take back). In contrast, the bounding set capabilities are not readily available to the process; membership just means that the capabilities in question *might* be acquired by running a suid program (or setcap program if filesystem support for capabilities ever comes to Linux).

Currently this is more or less an all-or-nothing process: since capabilities can only be acquired by running a suid program, removing any capability from the bounding set means the program will never be permitted to execute a suid program any more (execve() will fail with EPERM). But maybe I'll reinstate the CAP_SETPCAP thing in some future version of the patch (I'm still waiting for someone to tell me what was wrong with CAP_SETPCAP and why it was removed), and then the bounding set should also prohibit capabilities being given through that interface.

The bottom line is: if you have some untrusted process, it might be wise to empty its bounding set, making it incapable of executing a suid root program and thus acquiring new capabilities.

(I also plan to add some normally-available-to-all capabilities such as "permission to fork()", "permission to exec()" and so on, and then it will also be useful to remove these from a process's permitted set.)

> include/linux/capabilities.h:
>
> typedef struct __user_cap_data_struct {
> __u32 effective;
> __u32 permitted;
> __u32 inheritable;
> } __user *cap_user_data_t;

And my patch adds a __u32 bounding to that structure.

-- 
David A. Madore
([EMAIL PROTECTED], http://www.madore.org/~david/ )
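The distinction drawn above between the effective, permitted and bounding sets can be sketched as a small user-space model. This is only an illustration, not code from the patch: the `demo_` names are hypothetical, and the two capability numbers are the usual values from <linux/capability.h>, copied here under local names.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Capability numbers as in <linux/capability.h>, under local (hypothetical) names. */
enum { DEMO_CAP_SETUID = 7, DEMO_CAP_NET_BIND_SERVICE = 10 };

#define DEMO_MASK(cap) ((uint32_t)1 << (cap))

/* Hypothetical model of three of a process's capability sets. */
struct demo_caps {
    uint32_t effective;  /* capabilities currently in force */
    uint32_t permitted;  /* capabilities the process may take back */
    uint32_t bounding;   /* capabilities that might be gained on exec */
};

/* Temporarily give up a capability: only the effective bit is cleared. */
void demo_lower(struct demo_caps *c, int cap)
{
    assert(cap >= 0 && cap < 32);
    c->effective &= ~DEMO_MASK(cap);
}

/*
 * Take a capability back. This works only if it is in the permitted set;
 * having it in the bounding set alone does not make it available.
 */
int demo_raise(struct demo_caps *c, int cap)
{
    assert(cap >= 0 && cap < 32);
    if (!(c->permitted & DEMO_MASK(cap)))
        return -EPERM;
    c->effective |= DEMO_MASK(cap);
    return 0;
}
```

In this model a process whose bounding set is full but whose permitted set holds only CAP_SETUID can lower and re-raise CAP_SETUID at will, while an attempt to raise CAP_NET_BIND_SERVICE fails with EPERM.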
Re: capabilities patch (v 0.1)
On Tue, Aug 09, 2005 at 05:37:56AM +0000, Chris Wright wrote:
> * David Madore ([EMAIL PROTECTED]) wrote:
> > * Second, a much more extensive change, the patch introduces a third
> > set of capabilities for every process, the "bounding" set. Normally
>
> this is not a good idea. don't add more sets.

Could you elaborate? Why is adding sets bad?

From what I read of the June 2000 discussions on the linux-privs-discuss mailing-list (http://sourceforge.net/mailarchive/forum.php?forum_id=25120&max_rows=25&style=flat&viewmonth=26), a rather large consensus had formed around the idea that some kind of bounding set was a useful idea (as a matter of fact, the sendmail problem came essentially from the fact that some people wanted an inheritable set and other people wanted a bounding set, and the code was some mixture of the two); and it had been argued convincingly that it could be made POSIX compliant if that is the issue. Plus, Solaris privileges also come in four sets.

If it's compatibility you're worried about, it seems to me that the user interface can be made so that it will still work with the old libcap and merely ignore the bounding set. So full binary compatibility will be achieved, at least on the user level.

Finally, if it's a matter of kernel policy, I seem to understand that my patch has a snowball's chance in hell of ever being accepted in the mainstream kernel (I mean, it's not as though this were new: patches to make capabilities work have been available ever since the sendmail exploit, and in five years they haven't ever been accepted, so I suppose there's a reason for this), so adding a fourth set of capabilities on my own initiative isn't going to change a thing there.

So what's the problem?

> if you really want to
> work on this i'll give you all the patches that have been done thus far,
> plus a set of tests that look at all the execve, ptrace, setuid type of
> corner cases.

Yes, I'm very interested in the test suite.

-- 
David A. Madore
([EMAIL PROTECTED], http://www.madore.org/~david/ )
Re: capabilities patch (v 0.1)
On Tue, Aug 09, 2005 at 11:36:00PM +0200, Bodo Eggert wrote:
> 1) I wouldn't want an exploited service to gain any privileges, even by
> chaining userspace exploits (e.g. exec sendmail exploitstring). For most
> services, I'd like CAP_EXEC being unset (but it doesn't exist).

I intend to add a couple of capabilities which are normally available to all user processes, including capability to exec(), capability to fork() and a couple of others (maybe a capability to perform any kind of write operation, but that seems a bit more difficult to implement). So keep an eye open[#] for future versions of my patch.

-- 
David A. Madore
([EMAIL PROTECTED], http://www.madore.org/~david/ )

[#] On the other hand, I have a strong tendency not to finish anything I start :-( so maybe this is all just vaporware.
capabilities patch (v 0.1)
Well, I wasn't sleepy tonight, so I produced the following patch for Linux capabilities, which attempts to make them useful. It is supposed to do the following (which may or may not conform with the POSIX semantics, I don't think it matters much):

* First, and most importantly, capabilities are carried across execve(). More precisely, on execve(), the inheritable set of the process (which is always a subset of the permitted set) is copied to the permitted set, and ANDed on the effective set; except when a suid root binary is executed, in which case the permitted, inheritable and effective sets are fully set.

* Second, a much more extensive change, the patch introduces a third set of capabilities for every process, the "bounding" set. Normally the bounding set has every capability in it. If a capability is removed from it, that means the process is never allowed to gain it on exec. In the current state of affairs, since the only way of gaining capabilities is through suid root programs, the bounding set is essentially an all-or-nothing affair: if you do not have every capability in your bounding set, you may not run a suid root program (execve() will fail with EPERM). This can still be very useful on untrusted programs.

This patch hasn't been tested very much; in fact, it has been hardly tested at all (I just ran the kernel in a qemu and made a few basic checks). Since it adds a whole new set of capabilities to every process, it also requires a specially modified version of libcap (the one I have right now is pretty buggy, so I'm not posting the patch here).

Consider this more a "proof of concept" than a serious patch, but I'm interested in any comments.
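The execve() rules described in the two bullet points above can be restated as a short user-space model. This is a sketch written from the prose description only, not an excerpt from the patch: `demo_task_caps`, `demo_execve` and `DEMO_CAPS_FULL` are hypothetical names, and "fully set" is taken to mean all 32 bits.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_CAPS_FULL (~(uint32_t)0)

/* Hypothetical model of the four per-process sets described in the text. */
struct demo_task_caps {
    uint32_t effective;
    uint32_t permitted;
    uint32_t inheritable;
    uint32_t bounding;
};

/*
 * Model of the described execve() behaviour.
 * Returns 0 on success, -EPERM if the exec is refused.
 */
int demo_execve(struct demo_task_caps *t, int is_suid_root)
{
    /* The text states that inheritable is always a subset of permitted. */
    assert((t->inheritable & ~t->permitted) == 0);

    if (is_suid_root) {
        /* All-or-nothing: a reduced bounding set forbids suid root exec. */
        if (t->bounding != DEMO_CAPS_FULL)
            return -EPERM;
        t->permitted = DEMO_CAPS_FULL;
        t->inheritable = DEMO_CAPS_FULL;
        t->effective = DEMO_CAPS_FULL;
        return 0;
    }

    /* Ordinary binary: inheritable is copied to permitted, ANDed on effective. */
    t->permitted = t->inheritable;
    t->effective &= t->inheritable;
    return 0;
}
```

The point of the model is that a suid root exec never proceeds with a partial set: it either gets everything or is refused before the new program runs.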
### cut after ###
--- linux-2.6.12.4/fs/proc/array.c 2005-08-05 09:04:37.0 +0200
+++ linux-2.6.12.4.caps/fs/proc/array.c 2005-08-09 01:32:07.0 +0200
@@ -281,10 +281,12 @@ static inline char *task_cap(struct task
 {
     return buffer + sprintf(buffer, "CapInh:\t%016x\n"
                             "CapPrm:\t%016x\n"
-                            "CapEff:\t%016x\n",
+                            "CapEff:\t%016x\n"
+                            "CapBnd:\t%016x\n",
                             cap_t(p->cap_inheritable),
                             cap_t(p->cap_permitted),
-                            cap_t(p->cap_effective));
+                            cap_t(p->cap_effective),
+                            cap_t(p->cap_bounding));
 }
 
 int proc_pid_status(struct task_struct *task, char * buffer)
--- linux-2.6.12.4/include/linux/binfmts.h 2005-08-05 09:04:37.0 +0200
+++ linux-2.6.12.4.caps/include/linux/binfmts.h 2005-08-09 01:41:03.0 +0200
@@ -28,7 +28,7 @@ struct linux_binprm{
         int sh_bang;
         struct file * file;
         int e_uid, e_gid;
-        kernel_cap_t cap_inheritable, cap_permitted, cap_effective;
+        kernel_cap_t cap_inheritable, cap_permitted, cap_effective, cap_bounding;
         void *security;
         int argc, envc;
         char * filename;        /* Name of binary as seen by procps */
--- linux-2.6.12.4/include/linux/capability.h 2005-08-05 09:04:37.0 +0200
+++ linux-2.6.12.4.caps/include/linux/capability.h 2005-08-09 03:14:36.0 +0200
@@ -27,7 +27,7 @@
    library since the draft standard requires the use of malloc/free
    etc.. */
 
-#define _LINUX_CAPABILITY_VERSION  0x19980330
+#define _LINUX_CAPABILITY_VERSION  0x20050809
 
 typedef struct __user_cap_header_struct {
         __u32 version;
@@ -38,6 +38,7 @@ typedef struct __user_cap_data_struct {
         __u32 effective;
         __u32 permitted;
         __u32 inheritable;
+        __u32 bounding;
 } __user *cap_user_data_t;
 
 #ifdef __KERNEL__
@@ -311,7 +312,7 @@ extern kernel_cap_t cap_bset;
 #define CAP_EMPTY_SET       to_cap_t(0)
 #define CAP_FULL_SET        to_cap_t(~0)
 #define CAP_INIT_EFF_SET    to_cap_t(~0 & ~CAP_TO_MASK(CAP_SETPCAP))
-#define CAP_INIT_INH_SET    to_cap_t(0)
+#define CAP_INIT_INH_SET    to_cap_t(~0 & ~CAP_TO_MASK(CAP_SETPCAP))
 
 #define CAP_TO_MASK(x) (1 << (x))
 #define cap_raise(c, flag)   (cap_t(c) |= CAP_TO_MASK(flag))
--- linux-2.6.12.4/include/linux/init_task.h 2005-08-05 09:04:37.0 +0200
+++ linux-2.6.12.4.caps/include/linux/init_task.h 2005-08-09 05:19:32.0 +0200
@@ -94,6 +94,7 @@ extern struct group_info init_groups;
         .cap_effective   = CAP_INIT_EFF_SET, \
         .cap_inheritable = CAP_INIT_INH_SET, \
         .cap_permitted   = CAP_FULL_SET, \
+        .cap_bounding    = CAP_FULL_SET, \
         .keep_capabilities = 0, \
         .user            = INIT_USER, \
         .comm            = "swapper", \
--- linux-2.6.12.4/include/linux/sched.h 2005-08-05 09:04:37.0 +0200
+++
Re: understanding Linux capabilities brokenness
On Tue, Aug 09, 2005 at 01:53:50AM +0000, Theodore Ts'o wrote:
> The POSIX specification for capabilities requires filesystem support,
> so that each executables can be marked with three capability sets ---
> which indicate which capabilities are asserted when the executable
> starts, which capabilities the executable is allowed to request, and
> which capabilities the executable is allowed to inherit from its
> parent process. This effectively takes a single setuid bit and splits
> it into a hundred-odd bits.

You point out various reasons why the POSIX (draft-)specification is problematic. But nobody says Linux has to abide by it, especially as it is a mere withdrawn draft. Solaris 10 has capabilities (except that they're called "privileges") which are similar, but not identical, to the POSIX ones.

And even capabilities with no filesystem support can be useful. In fact, as far as I see it, the main interest in capabilities lies in the "process management" part. For example, I might like to run this or that binary, which claims it needs to be run as root, with a limited set of capabilities: the current Linux kernels make this quite impossible. Conversely, I might wish to give a particular capability to a given user; in association with sudo, this might be quite useful: instead of telling sudo to let the user run a given command as root, just let him run a capability-aware wrapper which drops every capability except the required ones and then calls the actual program - so even if the latter is not secure, damage is more limited. I can think of thousands of other uses not requiring any kind of filesystem support.

> Note that many some setuid
> programs don't necessarily check error returns, and sometimes turning
> off permissions can sometimes open up vulnerabilities.

Yes, the sendmail vulnerability proved this quite clearly. So certainly a luser should not be permitted to run a suid root program with anything in between the empty set and the full set of capabilities.

> Another problem with the POSIX capabilities is that most of the
> programs that system administrators run to look for setuid programs
> will miss programs that have capabilities encoded in extended
> attributes. This problem could be fixed by requiring the setuid bit
> to be set before paying attention to the capability EA's; but this
> could lead to surprising results if the filesystem is mounted on a
> system that doesn't use filesystem capabilities at all.

I might suggest encoding the presence of capabilities by a sgid bit for a specific group (say, wheel) on top of the extended attributes. So the careful sysadmin will notice the programs (because sgid wheel is significant enough to be noted) but it will not cause total disaster if mounted on a non-capability-capable ;-) filesystem.

> Yet another issue is that the POSIX capabilities model means that a
> default executable, such as gcc for example, is not allowed to inherit
> _any_ capabilities, even if it is run from a setuid root shell. This
> is good from a security point of view, since it means that people
> can't get in trouble by doing silly things like typing
> "./configure;make" as root and expect any of the build tools to have
> override arbitrary file controls. The bad news is that system
> administrators aren't particularly happy when their own private tools
> have to especially marked to allow them to run with elevated
> privileges.

Yes, this seems like a reason to deviate from the POSIX model under Linux.

-- 
David A. Madore
([EMAIL PROTECTED], http://www.madore.org/~david/ )
Re: understanding Linux capabilities brokenness
Sorry for replying to myself...

On Mon, Aug 08, 2005 at 09:13:06PM +0000, David Madore wrote:
> However, what I do not understand is precisely _how_ one gets a
> sendmail process without CAP_SETUID: for that is the heart of the
> problem, and that is where the bug really was. But [#3] and [#4] are
> very obscure (and I found nothing conclusive in lkml archives). I
> understand that the problem lies in some combination of the
> inheritable capability set and the CAP_SETPCAP capability, but I don't
> see what that combination is. Certainly removing capabilities from
> the inheritable set should not prevent suid root programs from having
> them reinstated (in the language of [#6], the suid root bit should
> correspond to a full forced set of capabilities), so I don't see what
> that has to do with it, and CAP_SETPCAP indeed allows to remove
> capabilities from a given process but I don't see how the user could
> gain that capability (and indeed if he can then we can expect him to
> gain all capabilities very rapidly).

After some more intensive Googling, I found the answer in the archives of the linux-privs-discuss mailing-list (whose existence I did not know of):

http://sourceforge.net/mailarchive/forum.php?thread_id=1588083&forum_id=25120

The explanation from the sendmail team was incorrect: CAP_SETPCAP is a red herring, and the problem is only about CAP_SETUID. The implementation of the inheritable set was broken in that it controlled not only the capabilities automatically passed across execve() but also those _gained_ by suid root programs (contrary to the claim in the sendmail analysis). Worse, instead of failing on execve() when the program could not gain privileges, it proceeded with the capabilities missing. Hence the catastrophic failure.

This does not tell me, then, why CAP_SETPCAP was globally disabled by default, nor why passing of capabilities across execve() was entirely removed instead of being fixed.

-- 
David A. Madore
([EMAIL PROTECTED], http://www.madore.org/~david/ )
understanding Linux capabilities brokenness
Hi.

Like many people[#1][#2], I have found out that the Linux capability handling utilities are non-functional, and cannot be repaired because the kernel deliberately cripples capabilities (they are reset on every call to execve()). I have found that various people[#1][#2] have proposed patches to restore working capabilities. However, the matter seems rather complicated and I would like to understand the full story. Hours of Google-grepping through the lkml archives have not helped me very much, so I hope someone can get the history straight.

I understand that Linux capabilities first appeared seven or eight years ago, and in 2000-06 there was a serious fault discovered which caused a local root exploit through the use of the sendmail program. Reading [#3] and [#4], I understand that the problem was this:

When sendmail is invoked by a non-root user, it attempts to drop its root privileges (which it has because the binary is installed suid) by calling setuid(getuid()), which, due to the stupidity of traditional Unix semantics enshrined in the POSIX/SUS standards, operates differently according to whether the process has "appropriate privileges" (in which case it sets all its UIDs to its real UID) or not (in which case it preserves the saved UID); now under Linux, "appropriate privileges" is defined[#5] as possessing the CAP_SETUID capability. So if a non-root user manages to execute sendmail without the CAP_SETUID capability, the setuid(getuid()) call will fail (or rather, not perform as expected), and the genie is out of the bottle.

However, what I do not understand is precisely _how_ one gets a sendmail process without CAP_SETUID: for that is the heart of the problem, and that is where the bug really was. But [#3] and [#4] are very obscure (and I found nothing conclusive in lkml archives). I understand that the problem lies in some combination of the inheritable capability set and the CAP_SETPCAP capability, but I don't see what that combination is. Certainly removing capabilities from the inheritable set should not prevent suid root programs from having them reinstated (in the language of [#6], the suid root bit should correspond to a full forced set of capabilities), so I don't see what that has to do with it, and CAP_SETPCAP indeed allows to remove capabilities from a given process but I don't see how the user could gain that capability (and indeed if he can then we can expect him to gain all capabilities very rapidly).

Can someone describe very accurately what the problem was? And why was it "fixed"[#7] by completely disabling capability inheritance and also by disabling the CAP_SETPCAP capability? In other words, suppose I restore CAP_SETPCAP on my system and/or make capabilities fully inheritable on execve() (that is, just take the logical AND of the permitted set with the inheritable set, except if the executed program is suid root, in which case all three sets - permitted, effective and inheritable - are set to full): what is the security problem in this?

Assuming I want to make capabilities inheritable, is there a recommended patch for doing so? Alexander Nyberg's patch in [#1] looks good to me (at least, it seems to do exactly what I want), but how well has it been tested? Is this something that might eventually make its way into the official kernel, or is this a no-goer? Also, if the author happens to read this, I'd like an explanation on the "Is this a root task that did seteuid before execve? if so it wanted its effective permissions dropped" comment in cap_bprm_apply_creds().

Thanks!

-- 
David A. Madore
([EMAIL PROTECTED], http://www.madore.org/~david/ )

[#1] http://groups-beta.google.com/group/fa.linux.kernel/browse_thread/thread/f76dcb9447a77c34

[#2] http://groups-beta.google.com/group/fa.linux.kernel/browse_thread/thread/4366e557a75a933d

[#3] http://www.cs.berkeley.edu/~daw/papers/setuid-usenix02.pdf

[#4] http://www.sendmail.org/sendmail.8.10.1.LINUX-SECURITY.txt

[#5] I tend to think that the behavior of setuid() is wrong in the first place, that is, setuid(getuid()) should also change the saved UID as soon as the effective UID is zero, even if CAP_SETUID is not set, to make sure that traditional Unix semantics are observed. (More recent, capability-aware, programs will use setresuid() anyway.) But that is rather beside the point.

[#6] http://ftp.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.4/capfaq-0.2.txt

[#7] I wanted to find exactly on which kernel version the changes took place. Unfortunately, http://lxr.linux.no/ only has major versions, the 2.2.15->2.2.16 patch is very hard to read, and I have neither the patience nor the bandwidth to unpack entire kernel trees on my PC to unravel the full history...
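The setuid(getuid()) behaviour described in the message above can be illustrated with a small user-space model of the three UIDs. This is a sketch of the semantics as the message describes them, not kernel code; `demo_uids`, `demo_setuid` and `demo_seteuid` are hypothetical names, and the `has_cap_setuid` flag stands in for "appropriate privileges".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of a process's real, effective and saved UIDs. */
struct demo_uids {
    unsigned real;
    unsigned effective;
    unsigned saved;
};

/*
 * Model of setuid(): with CAP_SETUID all three UIDs are set; without it,
 * only the effective UID changes and the saved UID is preserved.
 */
int demo_setuid(struct demo_uids *u, unsigned uid, int has_cap_setuid)
{
    assert(u != NULL);
    if (has_cap_setuid) {
        u->real = u->effective = u->saved = uid;
        return 0;
    }
    if (uid == u->real || uid == u->saved) {
        u->effective = uid;
        return 0;
    }
    return -EPERM;
}

/*
 * Model of seteuid(): an unprivileged process may switch its effective
 * UID to its real, effective or saved UID.
 */
int demo_seteuid(struct demo_uids *u, unsigned uid, int has_cap_setuid)
{
    assert(u != NULL);
    if (has_cap_setuid || uid == u->real || uid == u->effective ||
        uid == u->saved) {
        u->effective = uid;
        return 0;
    }
    return -EPERM;
}
```

Starting from real=1000, effective=0, saved=0 (a suid root binary run by user 1000), demo_setuid(&u, 1000, 1) leaves no way back to UID 0, whereas demo_setuid(&u, 1000, 0) reports success but keeps saved=0, so a later demo_seteuid(&u, 0, 0) succeeds. That silent difference is the hazard the message describes.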
understanding Linux capabilities brokenness
Hi. Like many people[#1][#2], I have found out that the Linux capability handling utilities are non-functional, and cannot be repaired because the kernel deliberately cripples capabilities (they are reset on every call to execve()). I have found that various people[#1][#2] have proposed patches to restore working capabilities. However, the matter seems rather complicted and I would like to understand the full story. Hours of Google-grepping through the lkml archives has not helped me very much, so I hope someone can get the history straight. I understand that Linux capabilities first appared seven or eight years ago, and in 2000-06 there was a serious fault discovered which caused a local root exploit through the use of the sendmail problem. Rading [#3] and [#4], I understand that the problem was this: When sendmail is invoked by a non-root user, it attempts to drop its root privileges (which it has because the binary is installed suid) by calling setuid(getuid()), which, due to the stupidity of traditional Unix semantics enshrined in the POSIX/SUS standards, operates differently according to whether the process has appropriate privileges (in which case it sets all its UIDs to its real UID) or not (in which case it preserves the saved UID); now under Linux, appropriate privileges is defined[#5] as possessing the CAP_SETUID capability. So if a non-root user manages to execute sendmail without the CAP_SETUID capability, the setuid(getuid()) call will fail (or rather, not perform as expected), and the genie is out of the bottle. However, what I do not understand is precisely _how_ one gets a sendmail process without CAP_SETUID: for that is the heart of the problem, and that is where the bug really was. But [#3] and [#4] are very obscure (and I found nothing conclusive in lkml archives). I understand that the problem lies in some combination of the inheritable capability set and the CAP_SETPCAP capability, but I don't see what that combination is. 
Certainly removing capabilities from the inheritable set should not prevent suid root programs from having them reinstated (in the language of [#6], the suid root bit should correspond to a full forced set of capabilities), so I don't see what that has to do with it, and CAP_SETPCAP indeed allows to remove capabilities from a given process but I don't see how the user could gain that capability (and indeed if he can then we can expect him to gain all capabilities very rapidly). Can someone describe very accurately what the problem was? And why was it fixed[#7] by completely disabling capability inheritance and also by disabling the CAP_SETPCAP capability? In other words, suppose I restore CAP_SETPCAP on my system and/or make capabilities fully inheritable on execve() (that is, just take the logical AND of the permitted set with the inheritable set, except if the executed program is suid root, in which case all three sets - permitted, effective and inheritable - are set to full): what is the security problem in this? Assuming I want to make capabilities inheritable, is there a recommended patch for doing so? Alexander Nyberg's patch in [#1] looks good to me (at least, it seems to do exactly what I want), but how well has it been tested? Is this something that might eventually make its way into the official kernel, or is this a no-goer? Also, if the author happens to read this, I'd like an explanation on the Is this a root task that did seteuid before execve? if so it wanted its effective permissions dropped comment in cap_bprm_apply_creds(). Thanks! -- David A. 
Madore ([EMAIL PROTECTED], http://www.madore.org/~david/ ) [#1] URL: http://groups-beta.google.com/group/fa.linux.kernel/browse_thread/thread/f76dcb9447a77c34 [#2] URL: http://groups-beta.google.com/group/fa.linux.kernel/browse_thread/thread/4366e557a75a933d [#3] URL: http://www.cs.berkeley.edu/~daw/papers/setuid-usenix02.pdf [#4] URL: http://www.sendmail.org/sendmail.8.10.1.LINUX-SECURITY.txt [#5] I tend to think that the behavior of setuid() is wrong in the first place, that is, setuid(getuid()) should also change the saved UID as soon as the effective UID is zero, even if CAP_SETUID is not set, to make sure that traditional Unix semantics are observed. (More recent, capability-aware, programs will use setresuid() anyway.) But that is rather beside the point. [#6] URL: http://ftp.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.4/capfaq-0.2.txt [#7] I wanted to find exactly on which kernel version the changes took place. Unfortunately, URL: http://lxr.linux.no/ only has major versions, the 2.2.15-2.2.16 patch is very hard to read, and I have neither the patience nor the bandwidth to unpack entire kernel trees on my PC to unravel the full history... - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at
Re: understanding Linux capabilities brokenness
Sorry for replying to myself... On Mon, Aug 08, 2005 at 09:13:06PM +, David Madore wrote: However, what I do not understand is precisely _how_ one gets a sendmail process without CAP_SETUID: for that is the heart of the problem, and that is where the bug really was. But [#3] and [#4] are very obscure (and I found nothing conclusive in lkml archives). I understand that the problem lies in some combination of the inheritable capability set and the CAP_SETPCAP capability, but I don't see what that combination is. Certainly removing capabilities from the inheritable set should not prevent suid root programs from having them reinstated (in the language of [#6], the suid root bit should correspond to a full forced set of capabilities), so I don't see what that has to do with it, and CAP_SETPCAP indeed allows to remove capabilities from a given process but I don't see how the user could gain that capability (and indeed if he can then we can expect him to gain all capabilities very rapidly). After some more intensive Googling, I found the answer in the archives of the linux-privs-discuss mailing-list (whose existence I did not know of): URL: http://sourceforge.net/mailarchive/forum.php?thread_id=1588083forum_id=25120 The explanation from the sendmail team was incorrect: CAP_SETPCAP is a red herring, it's only about CAP_SETUID, the implementation of the inheritable set was broken in that it controlled not only capabilities automatically passed across execve() but also those _gained_ by suid root programs (contrary to the claim in the sendmail analysis) and, worse, instead of failing on execve() when the program could not gain privileges, it proceeded with the capabilities missing. Hence the catastrophic failure. This does not tell me, then, why CAP_SETPCAP was globally disabled by default, nor why passing of capabilities across execve() was entirely removed instead of being fixed. -- David A. 
Madore ([EMAIL PROTECTED], http://www.madore.org/~david/ ) - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: understanding Linux capabilities brokenness
On Tue, Aug 09, 2005 at 01:53:50AM +, Theodore Ts'o wrote: The POSIX specification for capabilities requires filesystem support, so that each executables can be marked with three capability sets --- which indicate which capabilities are asserted when the executable starts, which capabilities the executable is allowed to request, and which capabilities the executable is allowed to inherit from its parent process. This effectively takes a single setuid bit and splits it into a hundred-odd bits. You point out various reasons why the POSIX (draft-)specification is problematic. But nobody says Linux has to abide it, especially as it is a mere withdrawn draft. Solaris 10 has capabilities (except that they're called privileges) which are similar, but not identical, to the POSIX ones. And even capabilities with no filesystem support can be useful. In fact, as far as I see it, the main interest in capabilities lies in the process management part. For example, I might like to run this or that binary, which claims it needs to be run as root, with a limited set of capabilities: the current Linux kernels make this quite impossible. Conversely, I might wish to give a particular capability to a given user; in association with sudo, this might be quite useful: instead of telling sudo to let the user run a given command as root, just let him run a capability-aware wrapper which drops every capability except the required ones and then calls the actual program - so even if the latter is not secure, damage is more limited. I can think of thousands of other uses not requiring any kind of filesystem support. Note that many some setuid programs don't necessarily check error returns, and sometimes turning off permissions can sometimes open up vulnerabilities. Yes, the sendmail vulnerability proved this quite clearly. So certainly a luser should not be permitted to run a suid root program with anything in between the empty set and the full set of capabilities. 

> Another problem with the POSIX capabilities is that most of the programs that system administrators run to look for setuid programs will miss programs that have capabilities encoded in extended attributes. This problem could be fixed by requiring the setuid bit to be set before paying attention to the capability EAs; but this could lead to surprising results if the filesystem is mounted on a system that doesn't use filesystem capabilities at all.

I might suggest encoding the presence of capabilities by a sgid bit for a specific group (say, wheel) on top of the extended attributes. So the careful sysadmin will notice the programs (because sgid wheel is significant enough to be noted), but it will not cause total disaster if mounted on a non-capability-capable ;-) filesystem.

> Yet another issue is that the POSIX capabilities model means that a default executable, such as gcc for example, is not allowed to inherit _any_ capabilities, even if it is run from a setuid root shell. This is good from a security point of view, since it means that people can't get in trouble by doing silly things like typing ./configure;make as root and expecting any of the build tools to override arbitrary file controls. The bad news is that system administrators aren't particularly happy when their own private tools have to be specially marked to allow them to run with elevated privileges.

Yes, this seems like a reason to deviate from the POSIX model under Linux.

-- 
David A. Madore
([EMAIL PROTECTED], http://www.madore.org/~david/ )
capabilities patch (v 0.1)
Well, I wasn't sleepy tonight, so I produced the following patch for Linux capabilities, which attempts to make them useful. It is supposed to do the following (which may or may not conform with the POSIX semantics; I don't think it matters much):

* First, and most importantly, capabilities are carried across execve(). More precisely, on execve(), the inheritable set of the process (which is always a subset of the permitted set) is copied to the permitted set, and ANDed into the effective set; except when a suid root binary is executed, in which case the permitted, inheritable and effective sets are fully set.

* Second, a much more extensive change: the patch introduces a fourth set of capabilities for every process, the bounding set. Normally the bounding set has every capability in it. If a capability is removed from it, that means the process is never allowed to gain it on exec. In the current state of affairs, since the only way of gaining capabilities is through suid root programs, the bounding set is essentially an all-or-nothing affair: if you do not have every capability in your bounding set, you may not run a suid root program (execve() will fail with EPERM). This can still be very useful on untrusted programs.

This patch hasn't been tested very much; in fact, it has hardly been tested at all (I just ran the kernel under qemu and made a few basic checks). Since it adds a whole new set of capabilities to every process, it also requires a specially modified version of libcap (the one I have right now is pretty buggy, so I'm not posting the patch here).

Consider this more a proof of concept than a serious patch, but I'm interested in any comments.
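
To make the two rules above concrete, here is a minimal user-space sketch of the execve() transformation as described in this message. It is a model only, not code from the patch: the names `capset_t`, `struct task_caps` and `model_execve` are hypothetical, the sets are plain 32-bit masks, and it assumes the bounding set itself is left untouched by exec.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model types; these are NOT the kernel's identifiers. */
typedef uint32_t capset_t;

#define CAPSET_FULL ((capset_t)~0u)

struct task_caps {
    capset_t inheritable;
    capset_t permitted;
    capset_t effective;
    capset_t bounding;   /* assumed unchanged across exec */
};

/*
 * Apply the exec rules described above to *t.
 * Returns 0 on success, or -EPERM when a suid root binary is refused
 * because the bounding set is not full.
 */
int model_execve(struct task_caps *t, int is_suid_root)
{
    /* Invariant from the description: inheritable is a subset of permitted. */
    assert((t->inheritable & ~t->permitted) == 0);

    if (is_suid_root) {
        /* All-or-nothing: any capability missing from the bounding set
         * forbids running a suid root program at all. */
        if (t->bounding != CAPSET_FULL)
            return -EPERM;
        t->permitted   = CAPSET_FULL;
        t->inheritable = CAPSET_FULL;
        t->effective   = CAPSET_FULL;
        return 0;
    }

    /* Ordinary binary: inheritable is copied to permitted and ANDed
     * into effective; inheritable itself is kept. */
    t->permitted  = t->inheritable;
    t->effective &= t->inheritable;
    return 0;
}
```

In this model, a process that has removed even one bit from its bounding set gets -EPERM for every suid root exec, which is the all-or-nothing behaviour mentioned above, while ordinary binaries can only ever narrow the permitted and effective sets.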
### cut after ###
--- linux-2.6.12.4/fs/proc/array.c	2005-08-05 09:04:37.0 +0200
+++ linux-2.6.12.4.caps/fs/proc/array.c	2005-08-09 01:32:07.0 +0200
@@ -281,10 +281,12 @@ static inline char *task_cap(struct task
 {
     return buffer + sprintf(buffer, "CapInh:\t%016x\n"
 			    "CapPrm:\t%016x\n"
-			    "CapEff:\t%016x\n",
+			    "CapEff:\t%016x\n"
+			    "CapBnd:\t%016x\n",
 			    cap_t(p->cap_inheritable),
 			    cap_t(p->cap_permitted),
-			    cap_t(p->cap_effective));
+			    cap_t(p->cap_effective),
+			    cap_t(p->cap_bounding));
 }
 
 int proc_pid_status(struct task_struct *task, char * buffer)
--- linux-2.6.12.4/include/linux/binfmts.h	2005-08-05 09:04:37.0 +0200
+++ linux-2.6.12.4.caps/include/linux/binfmts.h	2005-08-09 01:41:03.0 +0200
@@ -28,7 +28,7 @@ struct linux_binprm{
 	int sh_bang;
 	struct file * file;
 	int e_uid, e_gid;
-	kernel_cap_t cap_inheritable, cap_permitted, cap_effective;
+	kernel_cap_t cap_inheritable, cap_permitted, cap_effective, cap_bounding;
 	void *security;
 	int argc, envc;
 	char * filename;	/* Name of binary as seen by procps */
--- linux-2.6.12.4/include/linux/capability.h	2005-08-05 09:04:37.0 +0200
+++ linux-2.6.12.4.caps/include/linux/capability.h	2005-08-09 03:14:36.0 +0200
@@ -27,7 +27,7 @@
    library since the draft standard requires the use of malloc/free
    etc.. */
 
-#define _LINUX_CAPABILITY_VERSION  0x19980330
+#define _LINUX_CAPABILITY_VERSION  0x20050809
 
 typedef struct __user_cap_header_struct {
 	__u32 version;
@@ -38,6 +38,7 @@ typedef struct __user_cap_data_struct {
 	__u32 effective;
 	__u32 permitted;
 	__u32 inheritable;
+	__u32 bounding;
 } __user *cap_user_data_t;
 
 #ifdef __KERNEL__
@@ -311,7 +312,7 @@ extern kernel_cap_t cap_bset;
 #define CAP_EMPTY_SET       to_cap_t(0)
 #define CAP_FULL_SET        to_cap_t(~0)
 #define CAP_INIT_EFF_SET    to_cap_t(~0 & ~CAP_TO_MASK(CAP_SETPCAP))
-#define CAP_INIT_INH_SET    to_cap_t(0)
+#define CAP_INIT_INH_SET    to_cap_t(~0 & ~CAP_TO_MASK(CAP_SETPCAP))
 
 #define CAP_TO_MASK(x) (1 << (x))
 #define cap_raise(c, flag)   (cap_t(c) |=  CAP_TO_MASK(flag))
--- linux-2.6.12.4/include/linux/init_task.h	2005-08-05 09:04:37.0 +0200
+++ linux-2.6.12.4.caps/include/linux/init_task.h	2005-08-09 05:19:32.0 +0200
@@ -94,6 +94,7 @@ extern struct group_info init_groups;
 	.cap_effective	= CAP_INIT_EFF_SET,				\
 	.cap_inheritable = CAP_INIT_INH_SET,				\
 	.cap_permitted	= CAP_FULL_SET,					\
+	.cap_bounding	= CAP_FULL_SET,					\
 	.keep_capabilities = 0,						\
 	.user		= INIT_USER,					\
 	.comm		= "swapper",					\
--- linux-2.6.12.4/include/linux/sched.h	2005-08-05 09:04:37.0 +0200
+++ linux-2.6.12.4.caps/include/linux/sched.h