[PATCH 3/3] remove not-used poison pointer macros
Signed-off-by: Vasily Kulikov --- include/linux/poison.h | 7 --- 1 file changed, 7 deletions(-) diff --git a/include/linux/poison.h b/include/linux/poison.h index a721bcd..4a27153 100644 --- a/include/linux/poison.h +++ b/include/linux/poison.h @@ -73,10 +73,6 @@ #define ATM_POISON_FREE0x12 #define ATM_POISON 0xdeadbeef -/** net/ **/ -#define NEIGHBOR_DEAD 0xdeadbeef -#define NETFILTER_LINK_POISON 0xdead57ac - /** kernel/mutexes **/ #define MUTEX_DEBUG_INIT 0x11 #define MUTEX_DEBUG_FREE 0x22 @@ -87,7 +83,4 @@ /** security/ **/ #define KEY_DESTROY0xbd -/** sound/oss/ **/ -#define OSS_POISON_FREE0xAB - #endif -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/3] use POISON_POINTER_DELTA for poison pointers
TIMER_ENTRY_STATIC and TAIL_MAPPING are defined as poison pointers which should point to nowhere. Redefine them using POISON_POINTER_DELTA arithmetics to make sure they really point to non-mappable area declared by the target architecture. Signed-off-by: Vasily Kulikov --- include/linux/poison.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/poison.h b/include/linux/poison.h index ee697b9..a721bcd 100644 --- a/include/linux/poison.h +++ b/include/linux/poison.h @@ -27,14 +27,14 @@ * Magic number "tsta" to indicate a static timer initializer * for the object debugging code. */ -#define TIMER_ENTRY_STATIC ((void *) 0x74737461) +#define TIMER_ENTRY_STATIC ((void *) 0x300 + POISON_POINTER_DELTA) /** mm/debug-pagealloc.c **/ #define PAGE_POISON 0xaa /** mm/page_alloc.c / -#define TAIL_MAPPING ((void *) 0x01014A11 + POISON_POINTER_DELTA) +#define TAIL_MAPPING ((void *) 0x400 + POISON_POINTER_DELTA) /** mm/slab.c **/ /* -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/3] fix LIST_POISON{1,2} offset
Poison pointer values should be small enough to find a room in non-mmap'able/hardly-mmap'able space. E.g. on x86 "poison pointer space" is located starting from 0x0. Given unprivileged users cannot mmap anything below mmap_min_addr, it should be safe to use poison pointers lower than mmap_min_addr. The current poison pointer values of LIST_POISON{1,2} might be too big for mmap_min_addr values equal or less than 1 MB (common case, e.g. Ubuntu uses only 0x1). There is little point to use such a big value given the "poison pointer space" below 1 MB is not yet exhausted. Changing it to a smaller value solves the problem for small mmap_min_addr setups. The values are suggested by Solar Designer: http://www.openwall.com/lists/oss-security/2015/05/02/6 Signed-off-by: Vasily Kulikov --- include/linux/poison.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/poison.h b/include/linux/poison.h index 7b2a7fc..ee697b9 100644 --- a/include/linux/poison.h +++ b/include/linux/poison.h @@ -19,8 +19,8 @@ * under normal circumstances, used to verify that nobody uses * non-initialized list entries. */ -#define LIST_POISON1 ((void *) 0x00100100 + POISON_POINTER_DELTA) -#define LIST_POISON2 ((void *) 0x00200200 + POISON_POINTER_DELTA) +#define LIST_POISON1 ((void *) 0x100 + POISON_POINTER_DELTA) +#define LIST_POISON2 ((void *) 0x200 + POISON_POINTER_DELTA) /** include/linux/timer.h **/ /* -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/3] fix LIST_POISON{1,2} offset
Poison pointer values should be small enough to find a room in non-mmap'able/hardly-mmap'able space. E.g. on x86 poison pointer space is located starting from 0x0. Given unprivileged users cannot mmap anything below mmap_min_addr, it should be safe to use poison pointers lower than mmap_min_addr. The current poison pointer values of LIST_POISON{1,2} might be too big for mmap_min_addr values equal or less than 1 MB (common case, e.g. Ubuntu uses only 0x1). There is little point to use such a big value given the poison pointer space below 1 MB is not yet exhausted. Changing it to a smaller value solves the problem for small mmap_min_addr setups. The values are suggested by Solar Designer: http://www.openwall.com/lists/oss-security/2015/05/02/6 Signed-off-by: Vasily Kulikov seg...@openwall.com --- include/linux/poison.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/poison.h b/include/linux/poison.h index 7b2a7fc..ee697b9 100644 --- a/include/linux/poison.h +++ b/include/linux/poison.h @@ -19,8 +19,8 @@ * under normal circumstances, used to verify that nobody uses * non-initialized list entries. */ -#define LIST_POISON1 ((void *) 0x00100100 + POISON_POINTER_DELTA) -#define LIST_POISON2 ((void *) 0x00200200 + POISON_POINTER_DELTA) +#define LIST_POISON1 ((void *) 0x100 + POISON_POINTER_DELTA) +#define LIST_POISON2 ((void *) 0x200 + POISON_POINTER_DELTA) /** include/linux/timer.h **/ /* -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/3] use POISON_POINTER_DELTA for poison pointers
TIMER_ENTRY_STATIC and TAIL_MAPPING are defined as poison pointers which should point to nowhere. Redefine them using POISON_POINTER_DELTA arithmetics to make sure they really point to non-mappable area declared by the target architecture. Signed-off-by: Vasily Kulikov seg...@openwall.com --- include/linux/poison.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/poison.h b/include/linux/poison.h index ee697b9..a721bcd 100644 --- a/include/linux/poison.h +++ b/include/linux/poison.h @@ -27,14 +27,14 @@ * Magic number tsta to indicate a static timer initializer * for the object debugging code. */ -#define TIMER_ENTRY_STATIC ((void *) 0x74737461) +#define TIMER_ENTRY_STATIC ((void *) 0x300 + POISON_POINTER_DELTA) /** mm/debug-pagealloc.c **/ #define PAGE_POISON 0xaa /** mm/page_alloc.c / -#define TAIL_MAPPING ((void *) 0x01014A11 + POISON_POINTER_DELTA) +#define TAIL_MAPPING ((void *) 0x400 + POISON_POINTER_DELTA) /** mm/slab.c **/ /* -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/3] remove not-used poison pointer macros
Signed-off-by: Vasily Kulikov seg...@openwall.com --- include/linux/poison.h | 7 --- 1 file changed, 7 deletions(-) diff --git a/include/linux/poison.h b/include/linux/poison.h index a721bcd..4a27153 100644 --- a/include/linux/poison.h +++ b/include/linux/poison.h @@ -73,10 +73,6 @@ #define ATM_POISON_FREE0x12 #define ATM_POISON 0xdeadbeef -/** net/ **/ -#define NEIGHBOR_DEAD 0xdeadbeef -#define NETFILTER_LINK_POISON 0xdead57ac - /** kernel/mutexes **/ #define MUTEX_DEBUG_INIT 0x11 #define MUTEX_DEBUG_FREE 0x22 @@ -87,7 +83,4 @@ /** security/ **/ #define KEY_DESTROY0xbd -/** sound/oss/ **/ -#define OSS_POISON_FREE0xAB - #endif -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] pointer poisoning macro
Hi, Any feedback? On Thu, May 14, 2015 at 14:07 +0300, Vasily Kulikov wrote: > This is a raw version of the patch inspired by the discussion: > http://www.openwall.com/lists/oss-security/2015/05/02/6. > > The patch tries to achieve two goals: > 1) Move "overflowed" poison pointers out of the mmap'able memory zone > 2) Simplify addition of new poison pointers > > The current 0x00200200 poison pointer value of LIST_POISON2 might be > too big for mmap_min_addr values equal or less than 2 MB (common case, > e.g. Ubuntu uses only 0x1). There is little point to use such a big > value given the "poison pointer space" below 2 MB is not yet exhausted. > Changing it to a smaller value solves the problem for small > mmap_min_addr setups. > > In general, poisoned pointers should meet the following criteria: > > 1) poisoned pointer base must be non-mmap'able (already done via > CONFIG_ILLEGAL_POINTER_VALUE) > 2) all poisoned pointers (i.e. base+offset) must be non-mmap'able > 3) a small offset relative to poisoned pointers must be non-mmap'able > 4) poisoned pointers from different subsystems should be different > > (2) can be solved at the compile time by BUILD_BUG_ON(). > (3) and (4) should be solved by a creator of the new poisoned pointer. > > At least (2) can be done automatically. I propose a new macro for this > purpose, POISON_POINTER(). It should check whether the poison pointer > offset is not too big. E.g. in case of too big offset the compilation > fails with the following message: > > mm/page_alloc.c: In function 'free_pages_prepare': > mm/page_alloc.c:840:23: error: call to '__compiletime_assert_840' > declared with attribute error: BUILD_BUG_ON failed: 0x0111400 >= > POISON_AREA_SIZE > > There is still an unsolved issue with the macro related to > static variables initialization. If you uncomment the line just after > "FIXME" line, you'll see: > > kernel/irq/spurious.c:23:8: error: braced-group within expression allowed > only inside a function > > I'll be happy with the comments to the idea and the implementation. > > diff --git a/include/linux/poison.h b/include/linux/poison.h > index 7b2a7fc..47b43ed 100644 > --- a/include/linux/poison.h > +++ b/include/linux/poison.h > @@ -1,40 +1,54 @@ > #ifndef _LINUX_POISON_H > #define _LINUX_POISON_H > +#include > > /** include/linux/list.h **/ > > /* > * Architectures might want to move the poison pointer offset > * into some well-recognized area such as 0xdead, > - * that is also not mappable by user-space exploits: > + * that is also not mappable by user-space exploits, > + * by adjusting CONFIG_ILLEGAL_POINTER_VALUE: > */ > #ifdef CONFIG_ILLEGAL_POINTER_VALUE > # define POISON_POINTER_DELTA _AC(CONFIG_ILLEGAL_POINTER_VALUE, UL) > #else > # define POISON_POINTER_DELTA 0 > #endif > +/* > + * Poisoned pointers of different subsystems should be different > + * but must not move far away from POISON_POINTER_DELTA. > + * Otherwise poisoned pointer might be mmap'able on some architectures. > + */ > +#define POISON_AREA_SIZE 0x1000 > +#define POISON_POINTER(x) \ > + ({ \ > + BUILD_BUG_ON(x >= POISON_AREA_SIZE); \ > + ((void *)(x) + POISON_POINTER_DELTA);}) > > /* > * These are non-NULL pointers that will result in page faults > * under normal circumstances, used to verify that nobody uses > * non-initialized list entries. > */ > -#define LIST_POISON1 ((void *) 0x00100100 + POISON_POINTER_DELTA) > -#define LIST_POISON2 ((void *) 0x00200200 + POISON_POINTER_DELTA) > +#define LIST_POISON1 POISON_POINTER(0x0100) > +#define LIST_POISON2 POISON_POINTER(0x0200) > > /** include/linux/timer.h **/ > /* > * Magic number "tsta" to indicate a static timer initializer > * for the object debugging code. > */ > -#define TIMER_ENTRY_STATIC ((void *) 0x74737461) > +#define TIMER_ENTRY_STATIC ((void*)0x0300 + POISON_POINTER_DELTA) > +// FIXME > +//#define TIMER_ENTRY_STATIC POISON_POINTER(0x0300) > > /** mm/debug-pagealloc.c **/ > #define PAGE_POISON 0xaa > > /** mm/page_alloc.c / > > -#define TAIL_MAPPING ((void *) 0x01014A11 + POISON_POINTER_DELTA) > +#define TAIL_MAPPING POISON_POINTER(0x400) > > /** mm/slab.c **/ > /* > > -- > Vasily -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] pointer poisoning macro
Hi, Any feedback? On Thu, May 14, 2015 at 14:07 +0300, Vasily Kulikov wrote: This is a raw version of the patch inspired by the discussion: http://www.openwall.com/lists/oss-security/2015/05/02/6. The patch tries to achieve two goals: 1) Move overflowed poison pointers out of the mmap'able memory zone 2) Simplify addition of new poison pointers The current 0x00200200 poison pointer value of LIST_POISON2 might be too big for mmap_min_addr values equal or less than 2 MB (common case, e.g. Ubuntu uses only 0x1). There is little point to use such a big value given the poison pointer space below 2 MB is not yet exhausted. Changing it to a smaller value solves the problem for small mmap_min_addr setups. In general, poisoned pointers should meet the following criteria: 1) poisoned pointer base must be non-mmap'able (already done via CONFIG_ILLEGAL_POINTER_VALUE) 2) all poisoned pointers (i.e. base+offset) must be non-mmap'able 3) a small offset relative to poisoned pointers must be non-mmap'able 4) poisoned pointers from different subsystems should be different (2) can be solved at the compile time by BUILD_BUG_ON(). (3) and (4) should be solved by a creator of the new poisoned pointer. At least (2) can be done automatically. I propose a new macro for this purpose, POISON_POINTER(). It should check whether the poison pointer offset is not too big. E.g. in case of too big offset the compilation fails with the following message: mm/page_alloc.c: In function 'free_pages_prepare': mm/page_alloc.c:840:23: error: call to '__compiletime_assert_840' declared with attribute error: BUILD_BUG_ON failed: 0x0111400 = POISON_AREA_SIZE There is still an unsolved issue with the macro related to static variables initialization. If you uncomment the line just after FIXME line, you'll see: kernel/irq/spurious.c:23:8: error: braced-group within expression allowed only inside a function I'll be happy with the comments to the idea and the implementation. diff --git a/include/linux/poison.h b/include/linux/poison.h index 7b2a7fc..47b43ed 100644 --- a/include/linux/poison.h +++ b/include/linux/poison.h @@ -1,40 +1,54 @@ #ifndef _LINUX_POISON_H #define _LINUX_POISON_H +#include linux/bug.h /** include/linux/list.h **/ /* * Architectures might want to move the poison pointer offset * into some well-recognized area such as 0xdead, - * that is also not mappable by user-space exploits: + * that is also not mappable by user-space exploits, + * by adjusting CONFIG_ILLEGAL_POINTER_VALUE: */ #ifdef CONFIG_ILLEGAL_POINTER_VALUE # define POISON_POINTER_DELTA _AC(CONFIG_ILLEGAL_POINTER_VALUE, UL) #else # define POISON_POINTER_DELTA 0 #endif +/* + * Poisoned pointers of different subsystems should be different + * but must not move far away from POISON_POINTER_DELTA. + * Otherwise poisoned pointer might be mmap'able on some architectures. + */ +#define POISON_AREA_SIZE 0x1000 +#define POISON_POINTER(x) \ + ({ \ + BUILD_BUG_ON(x = POISON_AREA_SIZE); \ + ((void *)(x) + POISON_POINTER_DELTA);}) /* * These are non-NULL pointers that will result in page faults * under normal circumstances, used to verify that nobody uses * non-initialized list entries. */ -#define LIST_POISON1 ((void *) 0x00100100 + POISON_POINTER_DELTA) -#define LIST_POISON2 ((void *) 0x00200200 + POISON_POINTER_DELTA) +#define LIST_POISON1 POISON_POINTER(0x0100) +#define LIST_POISON2 POISON_POINTER(0x0200) /** include/linux/timer.h **/ /* * Magic number tsta to indicate a static timer initializer * for the object debugging code. */ -#define TIMER_ENTRY_STATIC ((void *) 0x74737461) +#define TIMER_ENTRY_STATIC ((void*)0x0300 + POISON_POINTER_DELTA) +// FIXME +//#define TIMER_ENTRY_STATIC POISON_POINTER(0x0300) /** mm/debug-pagealloc.c **/ #define PAGE_POISON 0xaa /** mm/page_alloc.c / -#define TAIL_MAPPING ((void *) 0x01014A11 + POISON_POINTER_DELTA) +#define TAIL_MAPPING POISON_POINTER(0x400) /** mm/slab.c **/ /* -- Vasily -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC] pointer poisoning macro
This is a raw version of the patch inspired by the discussion: http://www.openwall.com/lists/oss-security/2015/05/02/6. The patch tries to achieve two goals: 1) Move "overflowed" poison pointers out of the mmap'able memory zone 2) Simplify addition of new poison pointers The current 0x00200200 poison pointer value of LIST_POISON2 might be too big for mmap_min_addr values equal or less than 2 MB (common case, e.g. Ubuntu uses only 0x1). There is little point to use such a big value given the "poison pointer space" below 2 MB is not yet exhausted. Changing it to a smaller value solves the problem for small mmap_min_addr setups. In general, poisoned pointers should meet the following criteria: 1) poisoned pointer base must be non-mmap'able (already done via CONFIG_ILLEGAL_POINTER_VALUE) 2) all poisoned pointers (i.e. base+offset) must be non-mmap'able 3) a small offset relative to poisoned pointers must be non-mmap'able 4) poisoned pointers from different subsystems should be different (2) can be solved at the compile time by BUILD_BUG_ON(). (3) and (4) should be solved by a creator of the new poisoned pointer. At least (2) can be done automatically. I propose a new macro for this purpose, POISON_POINTER(). It should check whether the poison pointer offset is not too big. E.g. in case of too big offset the compilation fails with the following message: mm/page_alloc.c: In function 'free_pages_prepare': mm/page_alloc.c:840:23: error: call to '__compiletime_assert_840' declared with attribute error: BUILD_BUG_ON failed: 0x0111400 >= POISON_AREA_SIZE There is still an unsolved issue with the macro related to static variables initialization. If you uncomment the line just after "FIXME" line, you'll see: kernel/irq/spurious.c:23:8: error: braced-group within expression allowed only inside a function I'll be happy with the comments to the idea and the implementation. diff --git a/include/linux/poison.h b/include/linux/poison.h index 7b2a7fc..47b43ed 100644 --- a/include/linux/poison.h +++ b/include/linux/poison.h @@ -1,40 +1,54 @@ #ifndef _LINUX_POISON_H #define _LINUX_POISON_H +#include /** include/linux/list.h **/ /* * Architectures might want to move the poison pointer offset * into some well-recognized area such as 0xdead, - * that is also not mappable by user-space exploits: + * that is also not mappable by user-space exploits, + * by adjusting CONFIG_ILLEGAL_POINTER_VALUE: */ #ifdef CONFIG_ILLEGAL_POINTER_VALUE # define POISON_POINTER_DELTA _AC(CONFIG_ILLEGAL_POINTER_VALUE, UL) #else # define POISON_POINTER_DELTA 0 #endif +/* + * Poisoned pointers of different subsystems should be different + * but must not move far away from POISON_POINTER_DELTA. + * Otherwise poisoned pointer might be mmap'able on some architectures. + */ +#define POISON_AREA_SIZE 0x1000 +#define POISON_POINTER(x) \ + ({ \ + BUILD_BUG_ON(x >= POISON_AREA_SIZE); \ + ((void *)(x) + POISON_POINTER_DELTA);}) /* * These are non-NULL pointers that will result in page faults * under normal circumstances, used to verify that nobody uses * non-initialized list entries. */ -#define LIST_POISON1 ((void *) 0x00100100 + POISON_POINTER_DELTA) -#define LIST_POISON2 ((void *) 0x00200200 + POISON_POINTER_DELTA) +#define LIST_POISON1 POISON_POINTER(0x0100) +#define LIST_POISON2 POISON_POINTER(0x0200) /** include/linux/timer.h **/ /* * Magic number "tsta" to indicate a static timer initializer * for the object debugging code. */ -#define TIMER_ENTRY_STATIC ((void *) 0x74737461) +#define TIMER_ENTRY_STATIC ((void*)0x0300 + POISON_POINTER_DELTA) +// FIXME +//#define TIMER_ENTRY_STATIC POISON_POINTER(0x0300) /** mm/debug-pagealloc.c **/ #define PAGE_POISON 0xaa /** mm/page_alloc.c / -#define TAIL_MAPPING ((void *) 0x01014A11 + POISON_POINTER_DELTA) +#define TAIL_MAPPING POISON_POINTER(0x400) /** mm/slab.c **/ /* -- Vasily -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC] pointer poisoning macro
This is a raw version of the patch inspired by the discussion: http://www.openwall.com/lists/oss-security/2015/05/02/6. The patch tries to achieve two goals: 1) Move overflowed poison pointers out of the mmap'able memory zone 2) Simplify addition of new poison pointers The current 0x00200200 poison pointer value of LIST_POISON2 might be too big for mmap_min_addr values equal or less than 2 MB (common case, e.g. Ubuntu uses only 0x1). There is little point to use such a big value given the poison pointer space below 2 MB is not yet exhausted. Changing it to a smaller value solves the problem for small mmap_min_addr setups. In general, poisoned pointers should meet the following criteria: 1) poisoned pointer base must be non-mmap'able (already done via CONFIG_ILLEGAL_POINTER_VALUE) 2) all poisoned pointers (i.e. base+offset) must be non-mmap'able 3) a small offset relative to poisoned pointers must be non-mmap'able 4) poisoned pointers from different subsystems should be different (2) can be solved at the compile time by BUILD_BUG_ON(). (3) and (4) should be solved by a creator of the new poisoned pointer. At least (2) can be done automatically. I propose a new macro for this purpose, POISON_POINTER(). It should check whether the poison pointer offset is not too big. E.g. in case of too big offset the compilation fails with the following message: mm/page_alloc.c: In function 'free_pages_prepare': mm/page_alloc.c:840:23: error: call to '__compiletime_assert_840' declared with attribute error: BUILD_BUG_ON failed: 0x0111400 = POISON_AREA_SIZE There is still an unsolved issue with the macro related to static variables initialization. If you uncomment the line just after FIXME line, you'll see: kernel/irq/spurious.c:23:8: error: braced-group within expression allowed only inside a function I'll be happy with the comments to the idea and the implementation. diff --git a/include/linux/poison.h b/include/linux/poison.h index 7b2a7fc..47b43ed 100644 --- a/include/linux/poison.h +++ b/include/linux/poison.h @@ -1,40 +1,54 @@ #ifndef _LINUX_POISON_H #define _LINUX_POISON_H +#include linux/bug.h /** include/linux/list.h **/ /* * Architectures might want to move the poison pointer offset * into some well-recognized area such as 0xdead, - * that is also not mappable by user-space exploits: + * that is also not mappable by user-space exploits, + * by adjusting CONFIG_ILLEGAL_POINTER_VALUE: */ #ifdef CONFIG_ILLEGAL_POINTER_VALUE # define POISON_POINTER_DELTA _AC(CONFIG_ILLEGAL_POINTER_VALUE, UL) #else # define POISON_POINTER_DELTA 0 #endif +/* + * Poisoned pointers of different subsystems should be different + * but must not move far away from POISON_POINTER_DELTA. + * Otherwise poisoned pointer might be mmap'able on some architectures. + */ +#define POISON_AREA_SIZE 0x1000 +#define POISON_POINTER(x) \ + ({ \ + BUILD_BUG_ON(x = POISON_AREA_SIZE); \ + ((void *)(x) + POISON_POINTER_DELTA);}) /* * These are non-NULL pointers that will result in page faults * under normal circumstances, used to verify that nobody uses * non-initialized list entries. */ -#define LIST_POISON1 ((void *) 0x00100100 + POISON_POINTER_DELTA) -#define LIST_POISON2 ((void *) 0x00200200 + POISON_POINTER_DELTA) +#define LIST_POISON1 POISON_POINTER(0x0100) +#define LIST_POISON2 POISON_POINTER(0x0200) /** include/linux/timer.h **/ /* * Magic number tsta to indicate a static timer initializer * for the object debugging code. */ -#define TIMER_ENTRY_STATIC ((void *) 0x74737461) +#define TIMER_ENTRY_STATIC ((void*)0x0300 + POISON_POINTER_DELTA) +// FIXME +//#define TIMER_ENTRY_STATIC POISON_POINTER(0x0300) /** mm/debug-pagealloc.c **/ #define PAGE_POISON 0xaa /** mm/page_alloc.c / -#define TAIL_MAPPING ((void *) 0x01014A11 + POISON_POINTER_DELTA) +#define TAIL_MAPPING POISON_POINTER(0x400) /** mm/slab.c **/ /* -- Vasily -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] /proc/pid/status: show all sets of pid according to ns
On Thu, May 29, 2014 at 16:53 +0400, Pavel Emelyanov wrote: > On 05/29/2014 03:59 PM, Vasily Kulikov wrote: > > On Thu, May 29, 2014 at 15:31 +0400, Pavel Emelyanov wrote: > >> On 05/29/2014 03:12 PM, Vasily Kulikov wrote: > >>> On Thu, May 29, 2014 at 13:07 +0400, Pavel Emelyanov wrote: > >>>> On 05/29/2014 09:59 AM, Vasily Kulikov wrote: > >>>>> On Wed, May 28, 2014 at 23:27 +0400, Pavel Emelyanov wrote: > >>>>> ] We need a direct method of getting the pid inside containers. > >>>>> ] If some issues occurred inside container guest, host user > >>>>> ] could not know which process is in trouble just by guest pid: > >>>>> ] the users of container guest only knew the pid inside containers. > >>>>> ] This will bring obstacle for trouble shooting. > >>>>> > >>>>> A new syscall might complicate trouble shooting by admin. > >>>> > >>>> Pure syscall -- yes. What if we teach the ps and top utilities to show > >>>> additional > >>>> info? I think that would help. > >>> > >>> I like the idea with low level non-shell API which can be used by > >>> utility like ps (or implementation of a new tool to work with complex > >>> namespace hierarchies). It should fit for troublesooting. Then there > >>> should be no reason to implement two different APIs for observation from > >>> shell via FS and from applications. > >> > >> Maybe we can reuse the existing kcmp() system call? We would have to store > >> the collected pid values in some hash/tree anyway, and kcmp() provides us > >> good comparing function for doing this. > >> > >> Like we can call kcmp(pid1, pid2, KCMP_PID, nsfd1, nsfd2) which will mean > >> "Are tasks with pid1 in namespace pointed by nsfd1 and with pid2 in > >> namespace > >> nsfd2 the same?" > >> > >> What do you think? > > > > kcmp() is not needed, just compare inode numbers: > > > > # ls -il /proc/{43,self}/ns/mnt > > 208182 lrwxrwxrwx 1 root root 0 мая 29 15:52 /proc/43/ns/mnt -> > > mnt:[4026531856] > > 216556 lrwxrwxrwx 1 root root 0 мая 29 15:57 /proc/self/ns/mnt -> > > mnt:[4026531840] > > But that's for comparing the namespaces, while I'm proposing the kcmp to > check for PIDs. Hm, right. What about the following solution: export global process ID (PID in init ns) which is visible inside of any namespace. Then you can compare numbers regardless in what namespace you are. -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] /proc/pid/status: show all sets of pid according to ns
On Thu, May 29, 2014 at 16:53 +0400, Pavel Emelyanov wrote: On 05/29/2014 03:59 PM, Vasily Kulikov wrote: On Thu, May 29, 2014 at 15:31 +0400, Pavel Emelyanov wrote: On 05/29/2014 03:12 PM, Vasily Kulikov wrote: On Thu, May 29, 2014 at 13:07 +0400, Pavel Emelyanov wrote: On 05/29/2014 09:59 AM, Vasily Kulikov wrote: On Wed, May 28, 2014 at 23:27 +0400, Pavel Emelyanov wrote: ] We need a direct method of getting the pid inside containers. ] If some issues occurred inside container guest, host user ] could not know which process is in trouble just by guest pid: ] the users of container guest only knew the pid inside containers. ] This will bring obstacle for trouble shooting. A new syscall might complicate trouble shooting by admin. Pure syscall -- yes. What if we teach the ps and top utilities to show additional info? I think that would help. I like the idea with low level non-shell API which can be used by utility like ps (or implementation of a new tool to work with complex namespace hierarchies). It should fit for troublesooting. Then there should be no reason to implement two different APIs for observation from shell via FS and from applications. Maybe we can reuse the existing kcmp() system call? We would have to store the collected pid values in some hash/tree anyway, and kcmp() provides us good comparing function for doing this. Like we can call kcmp(pid1, pid2, KCMP_PID, nsfd1, nsfd2) which will mean Are tasks with pid1 in namespace pointed by nsfd1 and with pid2 in namespace nsfd2 the same? What do you think? kcmp() is not needed, just compare inode numbers: # ls -il /proc/{43,self}/ns/mnt 208182 lrwxrwxrwx 1 root root 0 мая 29 15:52 /proc/43/ns/mnt - mnt:[4026531856] 216556 lrwxrwxrwx 1 root root 0 мая 29 15:57 /proc/self/ns/mnt - mnt:[4026531840] But that's for comparing the namespaces, while I'm proposing the kcmp to check for PIDs. Hm, right. What about the following solution: export global process ID (PID in init ns) which is visible inside of any namespace. Then you can compare numbers regardless in what namespace you are. -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] /proc/pid/status: show all sets of pid according to ns
On Thu, May 29, 2014 at 15:31 +0400, Pavel Emelyanov wrote: > On 05/29/2014 03:12 PM, Vasily Kulikov wrote: > > On Thu, May 29, 2014 at 13:07 +0400, Pavel Emelyanov wrote: > >> On 05/29/2014 09:59 AM, Vasily Kulikov wrote: > >>> On Wed, May 28, 2014 at 23:27 +0400, Pavel Emelyanov wrote: > >>> ] We need a direct method of getting the pid inside containers. > >>> ] If some issues occurred inside container guest, host user > >>> ] could not know which process is in trouble just by guest pid: > >>> ] the users of container guest only knew the pid inside containers. > >>> ] This will bring obstacle for trouble shooting. > >>> > >>> A new syscall might complicate trouble shooting by admin. > >> > >> Pure syscall -- yes. What if we teach the ps and top utilities to show > >> additional > >> info? I think that would help. > > > > I like the idea with low level non-shell API which can be used by > > utility like ps (or implementation of a new tool to work with complex > > namespace hierarchies). It should fit for troublesooting. Then there > > should be no reason to implement two different APIs for observation from > > shell via FS and from applications. > > Maybe we can reuse the existing kcmp() system call? We would have to store > the collected pid values in some hash/tree anyway, and kcmp() provides us > good comparing function for doing this. > > Like we can call kcmp(pid1, pid2, KCMP_PID, nsfd1, nsfd2) which will mean > "Are tasks with pid1 in namespace pointed by nsfd1 and with pid2 in namespace > nsfd2 the same?" > > What do you think? kcmp() is not needed, just compare inode numbers: # ls -il /proc/{43,self}/ns/mnt 208182 lrwxrwxrwx 1 root root 0 мая 29 15:52 /proc/43/ns/mnt -> mnt:[4026531856] 216556 lrwxrwxrwx 1 root root 0 мая 29 15:57 /proc/self/ns/mnt -> mnt:[4026531840] > > However, maybe it is possible to implement not via new syscall but > > by implementation of new symlink in sysfs? Then both ps-like tool and > > CRIU-like tool is able to obtain the ns information by the same means. > > Maybe sort of a symlink to a parent namespace or a process which is > > inside of the parent namespace? Then a process may identify IDs using > > following steps: > > > > 1) identify target NS by walking current procfs > > 2) do setns(2)/chroot(2) > > 3) look at procfs to identify target IDs in the target NS > > Can you elaborate on this? Let's imagine we have picked two tasks with > init_pid_ns' PIDs being 11 and 12 and we've found out using /proc/pid/ns/pid > links that they both live in some non-init pid namespace. > > Then we have to look at this ns' proc. It says that there are also two > tasks -- 2 and 3. How can we determine which pid is which? Oh, right. My idea is broken. -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] /proc/pid/status: show all sets of pid according to ns
On Thu, May 29, 2014 at 13:07 +0400, Pavel Emelyanov wrote: > On 05/29/2014 09:59 AM, Vasily Kulikov wrote: > > On Wed, May 28, 2014 at 23:27 +0400, Pavel Emelyanov wrote: > > ] We need a direct method of getting the pid inside containers. > > ] If some issues occurred inside container guest, host user > > ] could not know which process is in trouble just by guest pid: > > ] the users of container guest only knew the pid inside containers. > > ] This will bring obstacle for trouble shooting. > > > > A new syscall might complicate trouble shooting by admin. > > Pure syscall -- yes. What if we teach the ps and top utilities to show > additional > info? I think that would help. I like the idea with low level non-shell API which can be used by utility like ps (or implementation of a new tool to work with complex namespace hierarchies). It should fit for troublesooting. Then there should be no reason to implement two different APIs for observation from shell via FS and from applications. However, maybe it is possible to implement not via new syscall but by implementation of new symlink in sysfs? Then both ps-like tool and CRIU-like tool is able to obtain the ns information by the same means. Maybe sort of a symlink to a parent namespace or a process which is inside of the parent namespace? Then a process may identify IDs using following steps: 1) identify target NS by walking current procfs 2) do setns(2)/chroot(2) 3) look at procfs to identify target IDs in the target NS It would be impossible to identify foreign IDs for unprivileged processes, however. Thanks, -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] /proc/pid/status: show all sets of pid according to ns
On Wed, May 28, 2014 at 23:27 +0400, Pavel Emelyanov wrote: > On 05/28/2014 10:28 PM, Vasily Kulikov wrote: > > On Wed, May 28, 2014 at 16:44 +0400, Pavel Emelyanov wrote: > > It will be simplier > > to parse the file -- if 'ns_ids' file contains some ID then this ID for > > every ns can be obtained regardless of the specific ID name (SID, PID, > > PGID, etc.). > > True, but given a task PID how to determine which pid namespaces it lives in > to get the idea of how PIDs map to each other? Maybe we need some explicit > API for converting (ID, NS1, NS2) into (ID)? AFAIU the idea of the patch is to add a new debugging information which can be trivially obtained via 'cat /proc/...': ] We need a direct method of getting the pid inside containers. ] If some issues occurred inside container guest, host user ] could not know which process is in trouble just by guest pid: ] the users of container guest only knew the pid inside containers. ] This will bring obstacle for trouble shooting. A new syscall might complicate trouble shooting by admin. -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] /proc/pid/status: show all sets of pid according to ns
On Wed, May 28, 2014 at 23:27 +0400, Pavel Emelyanov wrote: On 05/28/2014 10:28 PM, Vasily Kulikov wrote: On Wed, May 28, 2014 at 16:44 +0400, Pavel Emelyanov wrote: It will be simplier to parse the file -- if 'ns_ids' file contains some ID then this ID for every ns can be obtained regardless of the specific ID name (SID, PID, PGID, etc.). True, but given a task PID how to determine which pid namespaces it lives in to get the idea of how PIDs map to each other? Maybe we need some explicit API for converting (ID, NS1, NS2) into (ID)? AFAIU the idea of the patch is to add a new debugging information which can be trivially obtained via 'cat /proc/...': ] We need a direct method of getting the pid inside containers. ] If some issues occurred inside container guest, host user ] could not know which process is in trouble just by guest pid: ] the users of container guest only knew the pid inside containers. ] This will bring obstacle for trouble shooting. A new syscall might complicate trouble shooting by admin. -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] /proc/pid/status: show all sets of pid according to ns
On Thu, May 29, 2014 at 13:07 +0400, Pavel Emelyanov wrote: On 05/29/2014 09:59 AM, Vasily Kulikov wrote: On Wed, May 28, 2014 at 23:27 +0400, Pavel Emelyanov wrote: ] We need a direct method of getting the pid inside containers. ] If some issues occurred inside container guest, host user ] could not know which process is in trouble just by guest pid: ] the users of container guest only knew the pid inside containers. ] This will bring obstacle for trouble shooting. A new syscall might complicate trouble shooting by admin. Pure syscall -- yes. What if we teach the ps and top utilities to show additional info? I think that would help. I like the idea with low level non-shell API which can be used by utility like ps (or implementation of a new tool to work with complex namespace hierarchies). It should fit for troublesooting. Then there should be no reason to implement two different APIs for observation from shell via FS and from applications. However, maybe it is possible to implement not via new syscall but by implementation of new symlink in sysfs? Then both ps-like tool and CRIU-like tool is able to obtain the ns information by the same means. Maybe sort of a symlink to a parent namespace or a process which is inside of the parent namespace? Then a process may identify IDs using following steps: 1) identify target NS by walking current procfs 2) do setns(2)/chroot(2) 3) look at procfs to identify target IDs in the target NS It would be impossible to identify foreign IDs for unprivileged processes, however. Thanks, -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] /proc/pid/status: show all sets of pid according to ns
On Thu, May 29, 2014 at 15:31 +0400, Pavel Emelyanov wrote: On 05/29/2014 03:12 PM, Vasily Kulikov wrote: On Thu, May 29, 2014 at 13:07 +0400, Pavel Emelyanov wrote: On 05/29/2014 09:59 AM, Vasily Kulikov wrote: On Wed, May 28, 2014 at 23:27 +0400, Pavel Emelyanov wrote: ] We need a direct method of getting the pid inside containers. ] If some issues occurred inside container guest, host user ] could not know which process is in trouble just by guest pid: ] the users of container guest only knew the pid inside containers. ] This will bring obstacle for trouble shooting. A new syscall might complicate trouble shooting by admin. Pure syscall -- yes. What if we teach the ps and top utilities to show additional info? I think that would help. I like the idea with low level non-shell API which can be used by utility like ps (or implementation of a new tool to work with complex namespace hierarchies). It should fit for troublesooting. Then there should be no reason to implement two different APIs for observation from shell via FS and from applications. Maybe we can reuse the existing kcmp() system call? We would have to store the collected pid values in some hash/tree anyway, and kcmp() provides us good comparing function for doing this. Like we can call kcmp(pid1, pid2, KCMP_PID, nsfd1, nsfd2) which will mean Are tasks with pid1 in namespace pointed by nsfd1 and with pid2 in namespace nsfd2 the same? What do you think? kcmp() is not needed, just compare inode numbers: # ls -il /proc/{43,self}/ns/mnt 208182 lrwxrwxrwx 1 root root 0 мая 29 15:52 /proc/43/ns/mnt - mnt:[4026531856] 216556 lrwxrwxrwx 1 root root 0 мая 29 15:57 /proc/self/ns/mnt - mnt:[4026531840] However, maybe it is possible to implement not via new syscall but by implementation of new symlink in sysfs? Then both ps-like tool and CRIU-like tool is able to obtain the ns information by the same means. Maybe sort of a symlink to a parent namespace or a process which is inside of the parent namespace? Then a process may identify IDs using following steps: 1) identify target NS by walking current procfs 2) do setns(2)/chroot(2) 3) look at procfs to identify target IDs in the target NS Can you elaborate on this? Let's imagine we have picked two tasks with init_pid_ns' PIDs being 11 and 12 and we've found out using /proc/pid/ns/pid links that they both live in some non-init pid namespace. Then we have to look at this ns' proc. It says that there are also two tasks -- 2 and 3. How can we determine which pid is which? Oh, right. My idea is broken. -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] /proc/pid/status: show all sets of pid according to ns
On Wed, May 28, 2014 at 16:44 +0400, Pavel Emelyanov wrote: > On 05/28/2014 02:24 PM, Chen Hanxiao wrote: > > We need a direct method of getting the pid inside containers. > > But there's more generic issue -- some day we'll need to know not only > PIDs as seen from different namespaces, but also SIDs and PGIDs. Maybe include all per-ns ID in a separate file? Then the old 'status' file includes IDs from the current namespace only, the new file (e.g. 'ids' or 'ns_ids') contains only hierarchical IDs which differ from namespace to namespace for all possible namespaces. It will be simplier to parse the file -- if 'ns_ids' file contains some ID then this ID for every ns can be obtained regardless of the specific ID name (SID, PID, PGID, etc.). > > > If some issues occurred inside container guest, host user > > could not know which process is in trouble just by guest pid: > > the users of container guest only knew the pid inside containers. > > This will bring obstacle for trouble shooting. > > > > This patch adds two fields: > > > > NStgid and NSpid. > > > > a) In init_pid_ns, nothing changed; > > > > b) In one pidns, will tell the pid inside containers: > > NStgid: 16289 3 > > NSpid: 16289 3 > > ** Process id is 1628 in level 0, 9 in level 1, 3 in level 2. > > > > c) If pidns is nested, it depends on which pidns are you in. > > NStgid: 9 3 > > NSpid: 9 3 > > ** Views from level 1 Thanks, -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] /proc/pid/status: show all sets of pid according to ns
On Wed, May 28, 2014 at 16:44 +0400, Pavel Emelyanov wrote: On 05/28/2014 02:24 PM, Chen Hanxiao wrote: We need a direct method of getting the pid inside containers. But there's more generic issue -- some day we'll need to know not only PIDs as seen from different namespaces, but also SIDs and PGIDs. Maybe include all per-ns ID in a separate file? Then the old 'status' file includes IDs from the current namespace only, the new file (e.g. 'ids' or 'ns_ids') contains only hierarchical IDs which differ from namespace to namespace for all possible namespaces. It will be simplier to parse the file -- if 'ns_ids' file contains some ID then this ID for every ns can be obtained regardless of the specific ID name (SID, PID, PGID, etc.). If some issues occurred inside container guest, host user could not know which process is in trouble just by guest pid: the users of container guest only knew the pid inside containers. This will bring obstacle for trouble shooting. This patch adds two fields: NStgid and NSpid. a) In init_pid_ns, nothing changed; b) In one pidns, will tell the pid inside containers: NStgid: 16289 3 NSpid: 16289 3 ** Process id is 1628 in level 0, 9 in level 1, 3 in level 2. c) If pidns is nested, it depends on which pidns are you in. NStgid: 9 3 NSpid: 9 3 ** Views from level 1 Thanks, -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] /proc/pid/status: show all sets of pid according to ns
Hi Chen, On Mon, May 26, 2014 at 18:05 +0800, Chen Hanxiao wrote: > We need a direct method of getting the pid inside containers. > If some issues occurred inside a container guest, host user > could not know which process is in trouble just by guest pid: > the users of container guest only knew the pid inside containers. > This will bring obstacle for trouble shooting. > > This patch expands fields of Tgid and Pid: > a) In init_pid_ns, nothing changed; > > b) In one pidns, they will tell the pid inside containers: > Tgid: 16289 3 > Pid: 16289 3 > ** process id is 1628 in level 0, 9 in level 1, 3 in level 2. 1. It breaks ABI. Any application which does something like "grep pid: | cut -d: -f2" is now broken by the patch. Maybe add a new field like 'Pid-ns', 'PidNS', or 'Pids' and leave the old one for compatibility? 2. Is it OK to show internal pids to unprivileged processes? I cannot see anything obviously dangerous with it, though. > c) If pidns is nested, it depends on which pidns are you in. > Tgid: 9 3 > Pid: 9 3 > ** Views from level 1 for Pid 1628 in host. -- Vasily -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] /proc/pid/status: show all sets of pid according to ns
Hi Chen, On Mon, May 26, 2014 at 18:05 +0800, Chen Hanxiao wrote: We need a direct method of getting the pid inside containers. If some issues occurred inside a container guest, host user could not know which process is in trouble just by guest pid: the users of container guest only knew the pid inside containers. This will bring obstacle for trouble shooting. This patch expands fields of Tgid and Pid: a) In init_pid_ns, nothing changed; b) In one pidns, they will tell the pid inside containers: Tgid: 16289 3 Pid: 16289 3 ** process id is 1628 in level 0, 9 in level 1, 3 in level 2. 1. It breaks ABI. Any application which does something like grep pid: | cut -d: -f2 is now broken by the patch. Maybe add a new field like 'Pid-ns', 'PidNS', or 'Pids' and leave the old one for compatibility? 2. Is it OK to show internal pids to unprivileged processes? I cannot see anything obviously dangerous with it, though. c) If pidns is nested, it depends on which pidns are you in. Tgid: 9 3 Pid: 9 3 ** Views from level 1 for Pid 1628 in host. -- Vasily -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Devel] call_usermodehelper in containers
Hi Jeff, On Mon, Nov 11, 2013 at 07:18 -0500, Jeff Layton wrote: > What's the correct approach to fix this? One possibility would be to > keep a kernel thread around that sits in the correct namespace(s) and > has the right privileges, and then use that to launch UMH programs. > That thread could be spawned whenever someone runs rpc.nfsd inside a > container. > > Not very elegant, but it seems like something that would work. > > Are there better approaches? What's the reasoning behind this? I mean, it is not very obvious what we should keep here. Compare 2 cases: 1) root process with all caps spawns new ns, then drops some of caps; 2) root process with all caps drops some of his caps and then spawns new ns. >From the programmer's POV both cases are valid and lead to absolutely the same limitations inside of the new namespace. However, from kernel POV they differ -- if save cap set when ns is created then in (1) we'll have cap'ed UMH, in (2) we'll have UMH with only several caps. It might significantly influence on ability of UMH to do its job and ability of this limited ns to escape from the sandbox. So, what semantic should UMH privileges have? Also, an orthogonal addition: you might want to keep only minimum information about capabilities or something -- keep only cap_t field in namespace structure without explicit kernel thread for each ns. When UMH is created, just fill the required caps in it. Thanks, -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Devel] call_usermodehelper in containers
Hi Jeff, On Mon, Nov 11, 2013 at 07:18 -0500, Jeff Layton wrote: What's the correct approach to fix this? One possibility would be to keep a kernel thread around that sits in the correct namespace(s) and has the right privileges, and then use that to launch UMH programs. That thread could be spawned whenever someone runs rpc.nfsd inside a container. Not very elegant, but it seems like something that would work. Are there better approaches? What's the reasoning behind this? I mean, it is not very obvious what we should keep here. Compare 2 cases: 1) root process with all caps spawns new ns, then drops some of caps; 2) root process with all caps drops some of his caps and then spawns new ns. From the programmer's POV both cases are valid and lead to absolutely the same limitations inside of the new namespace. However, from kernel POV they differ -- if save cap set when ns is created then in (1) we'll have cap'ed UMH, in (2) we'll have UMH with only several caps. It might significantly influence on ability of UMH to do its job and ability of this limited ns to escape from the sandbox. So, what semantic should UMH privileges have? Also, an orthogonal addition: you might want to keep only minimum information about capabilities or something -- keep only cap_t field in namespace structure without explicit kernel thread for each ns. When UMH is created, just fill the required caps in it. Thanks, -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] fs: Limit sys_mount to only request filesystem modules.
(cc'ed kernel-hardening) On Sun, Mar 03, 2013 at 23:51 -0800, Eric W. Biederman wrote: > Modify the request_module to prefix the file system type with "fs-" > and add aliases to all of the filesystems that can be built as modules > to match. > > A common practice is to build all of the kernel code and leave code > that is not commonly needed as modules, with the result that many > users are exposed to any bug anywhere in the kernel. > > Looking for filesystems with a fs- prefix limits the pool of possible > modules that can be loaded by mount to just filesystems trivially > making things safer with no real cost. > > Using aliases means user space can control the policy of which > filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf > with blacklist and alias directives. Allowing simple, safe, > well understood work-arounds to known problematic software. > > This also addresses a rare but unfortunate problem where the filesystem > name is not the same as it's module name and module auto-loading > would not work. While writing this patch I saw a handful of such > cases. The most significant being autofs that lives in the module > autofs4. > > This is relevant to user namespaces because we can reach the request > module in get_fs_type() without having any special permissions, and > people get uncomfortable when a user specified string (in this case > the filesystem type) goes all of the way to request_module. > > After having looked at this issue I don't think there is any > particular reason to perform any filtering or permission checks beyond > making it clear in the module request that we want a filesystem > module. The common pattern in the kernel is to call request_module() > without regards to the users permissions. In general all a filesystem > module does once loaded is call register_filesystem() and go to sleep. > Which means there is not much attack surface exposed by loading a > filesytem module unless the filesystem is mounted. In a user > namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT, > which most filesystems do not set today. > > Acked-by: Serge Hallyn > Acked-by: Kees Cook > Reported-by: Kees Cook > Signed-off-by: "Eric W. Biederman" ... > diff --git a/fs/filesystems.c b/fs/filesystems.c > index da165f6..92567d9 100644 > --- a/fs/filesystems.c > +++ b/fs/filesystems.c > @@ -273,7 +273,7 @@ struct file_system_type *get_fs_type(const char *name) > int len = dot ? dot - name : strlen(name); > > fs = __get_fs_type(name, len); > - if (!fs && (request_module("%.*s", len, name) == 0)) > + if (!fs && (request_module("fs-%.*s", len, name) == 0)) > fs = __get_fs_type(name, len); > > if (dot && fs && !(fs->fs_flags & FS_HAS_SUBTYPE)) { Maybe we should divide request_module() into several functions regarding expected caller's privileges? - request_module() for CAP_SYS_MODULE in init_ns - request_module_relaxed() for everybody request_module_relaxed() is used in get_fs_type(), dev_load() and all places where the safety of module loading is manually checked. All old not yet checked users of request_module() will not be triggerable from user_ns. That's the same scheme as with capable() and ns_capable(). Thanks, -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] fs: Limit sys_mount to only request filesystem modules.
(cc'ed kernel-hardening) On Sun, Mar 03, 2013 at 23:51 -0800, Eric W. Biederman wrote: Modify the request_module to prefix the file system type with fs- and add aliases to all of the filesystems that can be built as modules to match. A common practice is to build all of the kernel code and leave code that is not commonly needed as modules, with the result that many users are exposed to any bug anywhere in the kernel. Looking for filesystems with a fs- prefix limits the pool of possible modules that can be loaded by mount to just filesystems trivially making things safer with no real cost. Using aliases means user space can control the policy of which filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf with blacklist and alias directives. Allowing simple, safe, well understood work-arounds to known problematic software. This also addresses a rare but unfortunate problem where the filesystem name is not the same as it's module name and module auto-loading would not work. While writing this patch I saw a handful of such cases. The most significant being autofs that lives in the module autofs4. This is relevant to user namespaces because we can reach the request module in get_fs_type() without having any special permissions, and people get uncomfortable when a user specified string (in this case the filesystem type) goes all of the way to request_module. After having looked at this issue I don't think there is any particular reason to perform any filtering or permission checks beyond making it clear in the module request that we want a filesystem module. The common pattern in the kernel is to call request_module() without regards to the users permissions. In general all a filesystem module does once loaded is call register_filesystem() and go to sleep. Which means there is not much attack surface exposed by loading a filesytem module unless the filesystem is mounted. In a user namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT, which most filesystems do not set today. Acked-by: Serge Hallyn serge.hal...@canonical.com Acked-by: Kees Cook keesc...@chromium.org Reported-by: Kees Cook keesc...@google.com Signed-off-by: Eric W. Biederman ebied...@xmission.com ... diff --git a/fs/filesystems.c b/fs/filesystems.c index da165f6..92567d9 100644 --- a/fs/filesystems.c +++ b/fs/filesystems.c @@ -273,7 +273,7 @@ struct file_system_type *get_fs_type(const char *name) int len = dot ? dot - name : strlen(name); fs = __get_fs_type(name, len); - if (!fs (request_module(%.*s, len, name) == 0)) + if (!fs (request_module(fs-%.*s, len, name) == 0)) fs = __get_fs_type(name, len); if (dot fs !(fs-fs_flags FS_HAS_SUBTYPE)) { Maybe we should divide request_module() into several functions regarding expected caller's privileges? - request_module() for CAP_SYS_MODULE in init_ns - request_module_relaxed() for everybody request_module_relaxed() is used in get_fs_type(), dev_load() and all places where the safety of module loading is manually checked. All old not yet checked users of request_module() will not be triggerable from user_ns. That's the same scheme as with capable() and ns_capable(). Thanks, -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: For review: pid_namespaces(7) man page
Hi Michael, On Thu, Feb 28, 2013 at 12:24 +0100, Michael Kerrisk (man-pages) wrote: >The namespace init process >The first process created in a new namespace (i.e., the process >created using clone(2) with the CLONE_NEWPID flag, or the first >child created by a process after a call to unshare(2) using the >CLONE_NEWPID flag) has the PID 1, and is the "init" process for >the namespace (see init(1)). Children that are orphaned within >the namespace will be reparented to this process rather than >init(1). Probably it worth noting here that this is true unless prctl() with PR_SET_CHILD_SUBREAPER option is called. Thanks, -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: For review: pid_namespaces(7) man page
Hi Michael, On Thu, Feb 28, 2013 at 12:24 +0100, Michael Kerrisk (man-pages) wrote: The namespace init process The first process created in a new namespace (i.e., the process created using clone(2) with the CLONE_NEWPID flag, or the first child created by a process after a call to unshare(2) using the CLONE_NEWPID flag) has the PID 1, and is the init process for the namespace (see init(1)). Children that are orphaned within the namespace will be reparented to this process rather than init(1). Probably it worth noting here that this is true unless prctl() with PR_SET_CHILD_SUBREAPER option is called. Thanks, -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH review 1/6] userns: Avoid recursion in put_user_ns
On Fri, Jan 25, 2013 at 18:19 -0800, Eric W. Biederman wrote: > > When freeing a deeply nested user namespace free_user_ns calls > put_user_ns on it's parent which may in turn call free_user_ns again. > When -fno-optimize-sibling-calls is passed to gcc one stack frame per > user namespace is left on the stack, potentially overflowing the > kernel stack. CONFIG_FRAME_POINTER forces -fno-optimize-sibling-calls > so we can't count on gcc to optimize this code. > > Remove struct kref and use a plain atomic_t. Making the code more > flexible and easier to comprehend. Make the loop in free_user_ns > explict to guarantee that the stack does not overflow with > CONFIG_FRAME_POINTER enabled. > > I have tested this fix with a simple program that uses unshare to > create a deeply nested user namespace structure and then calls exit. > With 1000 nesteuser namespaces before this change running my test > program causes the kernel to die a horrible death. With 10,000,000 > nested user namespaces after this change my test program runs to > completion and causes no harm. > > Pointed-out-by: Vasily Kulikov > Signed-off-by: "Eric W. Biederman" Looks sane, thanks. Acked-by: Vasily Kulikov The second bug I've noted in the same email (OOM) looks like should be "fixed" by using memcg to limit kernel memory. So, I'm fine with this side of user_ns :) -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH review 1/6] userns: Avoid recursion in put_user_ns
On Fri, Jan 25, 2013 at 18:19 -0800, Eric W. Biederman wrote: When freeing a deeply nested user namespace free_user_ns calls put_user_ns on it's parent which may in turn call free_user_ns again. When -fno-optimize-sibling-calls is passed to gcc one stack frame per user namespace is left on the stack, potentially overflowing the kernel stack. CONFIG_FRAME_POINTER forces -fno-optimize-sibling-calls so we can't count on gcc to optimize this code. Remove struct kref and use a plain atomic_t. Making the code more flexible and easier to comprehend. Make the loop in free_user_ns explict to guarantee that the stack does not overflow with CONFIG_FRAME_POINTER enabled. I have tested this fix with a simple program that uses unshare to create a deeply nested user namespace structure and then calls exit. With 1000 nesteuser namespaces before this change running my test program causes the kernel to die a horrible death. With 10,000,000 nested user namespaces after this change my test program runs to completion and causes no harm. Pointed-out-by: Vasily Kulikov seg...@openwall.com Signed-off-by: Eric W. Biederman ebied...@xmission.com Looks sane, thanks. Acked-by: Vasily Kulikov seg...@openwall.com The second bug I've noted in the same email (OOM) looks like should be fixed by using memcg to limit kernel memory. So, I'm fine with this side of user_ns :) -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Friendlier EPERM e-mail on lkml
Hi Eric, On Wed, Jan 09, 2013 at 11:18 -0500, Eric Paris wrote: > http://marc.info/?l=linux-kernel=135774748910987=2 > > Basic idea is a proc file which tells you WHY the last EPERM from the > kernel happened. Was it SELinux? Was it SMACK? Was it capabilities? > Was it file rwx bits? > > I'd like comments. I'll cc LSM on any future replies to that thread... Why does application need it? Why does anybody but admin need this information? Probably this may be implemented as root-only audit messages? Or introduce new NETLINK_EPERMREASON netlink family? -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Friendlier EPERM e-mail on lkml
Hi Eric, On Wed, Jan 09, 2013 at 11:18 -0500, Eric Paris wrote: http://marc.info/?l=linux-kernelm=135774748910987w=2 Basic idea is a proc file which tells you WHY the last EPERM from the kernel happened. Was it SELinux? Was it SMACK? Was it capabilities? Was it file rwx bits? I'd like comments. I'll cc LSM on any future replies to that thread... Why does application need it? Why does anybody but admin need this information? Probably this may be implemented as root-only audit messages? Or introduce new NETLINK_EPERMREASON netlink family? -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v12 0/9] LSM: Multiple concurrent LSMs
On Mon, Jan 07, 2013 at 20:02 -0800, Casey Schaufler wrote: > On 1/7/2013 7:01 PM, Stephen Rothwell wrote: > > Let me ask Andrew's question: Why do you want to do this (what is the > > use case)? What does this gain us? > > There has been an amazing amount of development in system security > over the past three years. Almost none of it has been in the kernel. > One important reason that it is not getting done in the kernel is > that the current single LSM restriction requires an all or nothing > approach to security. Either you address all your needs with a single > LSM or you have to go with a user space solution, in which case you > may as well do everything in user space. [...] You should also update Documentation/security/LSM.txt with new "security=" rules and rules of LSM stacking limitations. Motivation of stacking is probably worth noting in Documentation/ too. Thanks, -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v12 0/9] LSM: Multiple concurrent LSMs
On Mon, Jan 07, 2013 at 20:11 -0800, Casey Schaufler wrote: > On 1/7/2013 7:59 PM, Stephen Rothwell wrote: > > You probably also want to think a bit harder about the order of the > > patches - you should introduce new APIs before you use them and remove > > calls to functions before you remove the functions. > > > The unfortunate reality is that I couldn't find a good way to stage the > changes. It's a wonking big set of infrastructure change. I could introduce > the security blob abstraction separately but that is a fraction of the > change. If it would have gone through mail filters as a single patch I'd > have sent it that way. > > I can spend time on patch presentation, and will if necessary. I guess it can be divided this way: 1) Introduce lsm_get_cred(), etc. which unconditionally return ->security, ->i_security, etc. 2) Move all LSMs, procfs, etc. to the new API, a patch per LSM/subsystem. 3) Change structures along with new API. The pro of the division is that if you have a bug in the series (and you surely have! ;)) it is MUCH more simple to locate this bug (bisect, etc.). Also it is more descriptive as you divide LSM changes and the core security subsystem itself targeted on multiple LSMs which divides LSM implementations (which might be not very important for someone) and the core architecture (which is important for everybody). Thanks, -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v12 0/9] LSM: Multiple concurrent LSMs
On Mon, Jan 07, 2013 at 20:11 -0800, Casey Schaufler wrote: On 1/7/2013 7:59 PM, Stephen Rothwell wrote: You probably also want to think a bit harder about the order of the patches - you should introduce new APIs before you use them and remove calls to functions before you remove the functions. The unfortunate reality is that I couldn't find a good way to stage the changes. It's a wonking big set of infrastructure change. I could introduce the security blob abstraction separately but that is a fraction of the change. If it would have gone through mail filters as a single patch I'd have sent it that way. I can spend time on patch presentation, and will if necessary. I guess it can be divided this way: 1) Introduce lsm_get_cred(), etc. which unconditionally return -security, -i_security, etc. 2) Move all LSMs, procfs, etc. to the new API, a patch per LSM/subsystem. 3) Change structures along with new API. The pro of the division is that if you have a bug in the series (and you surely have! ;)) it is MUCH more simple to locate this bug (bisect, etc.). Also it is more descriptive as you divide LSM changes and the core security subsystem itself targeted on multiple LSMs which divides LSM implementations (which might be not very important for someone) and the core architecture (which is important for everybody). Thanks, -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v12 0/9] LSM: Multiple concurrent LSMs
On Mon, Jan 07, 2013 at 20:02 -0800, Casey Schaufler wrote: On 1/7/2013 7:01 PM, Stephen Rothwell wrote: Let me ask Andrew's question: Why do you want to do this (what is the use case)? What does this gain us? There has been an amazing amount of development in system security over the past three years. Almost none of it has been in the kernel. One important reason that it is not getting done in the kernel is that the current single LSM restriction requires an all or nothing approach to security. Either you address all your needs with a single LSM or you have to go with a user space solution, in which case you may as well do everything in user space. [...] You should also update Documentation/security/LSM.txt with new security= rules and rules of LSM stacking limitations. Motivation of stacking is probably worth noting in Documentation/ too. Thanks, -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] user_ns: fix missing limiting of user_ns counts
On Fri, Dec 28, 2012 at 20:05 -0800, Eric W. Biederman wrote: > > A related issue which is NOT FIXED HERE is limits for all resources > > available for containerized pseudo roots. E.g. I succeeded creating > > thousands of veth network devices without problems by a non-root user, > > there seems no limit in number of network devices. I suspect it is > > possible to setup routing and net_ns'es the way it will be very > > time-consuming for kernel to handle IP packets inside of ksoftirq, which > > is not counted as this user scheduler time. I suppose the issue is not > > veth-specific, almost all newly available for unprivileged users code > > pathes are vulnerable to DoS attacks. > > veth at least should process packets synchronously so I don't see how > you will get softirq action. What do you mean -- synchronously? From my limited understanding of veth job, it is handled like every network packet in system, via: veth_xmit() -> dev_forward_skb() -> netif_rx() -> enqueue_to_backlog() enqueue_to_backlog() adds the packet to softnet_data->input_pkt_queue. Then inside of softirq process_backlog() moves ->input_pkt_queue to ->process_queue and calls __netif_receive_skb(), which does all networking stack magic. AFAICS, one could create user_ns, net_ns inside of it, and setup routing tables and netfilter to infinitely pass few network packets from and to veth, abusing ksoftirq. -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] user_ns: fix missing limiting of user_ns counts
On Fri, Dec 28, 2012 at 20:05 -0800, Eric W. Biederman wrote: A related issue which is NOT FIXED HERE is limits for all resources available for containerized pseudo roots. E.g. I succeeded creating thousands of veth network devices without problems by a non-root user, there seems no limit in number of network devices. I suspect it is possible to setup routing and net_ns'es the way it will be very time-consuming for kernel to handle IP packets inside of ksoftirq, which is not counted as this user scheduler time. I suppose the issue is not veth-specific, almost all newly available for unprivileged users code pathes are vulnerable to DoS attacks. veth at least should process packets synchronously so I don't see how you will get softirq action. What do you mean -- synchronously? From my limited understanding of veth job, it is handled like every network packet in system, via: veth_xmit() - dev_forward_skb() - netif_rx() - enqueue_to_backlog() enqueue_to_backlog() adds the packet to softnet_data-input_pkt_queue. Then inside of softirq process_backlog() moves -input_pkt_queue to -process_queue and calls __netif_receive_skb(), which does all networking stack magic. AFAICS, one could create user_ns, net_ns inside of it, and setup routing tables and netfilter to infinitely pass few network packets from and to veth, abusing ksoftirq. -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] user_ns: fix missing limiting of user_ns counts
On Fri, Dec 28, 2012 at 20:05 -0800, Eric W. Biederman wrote: > Vasily Kulikov writes: > > > Currently there is completely no limiting in number of user namespaces > > created by unprivileged users. One can freely create thousands of > > user_ns'es and exhaust kernel memory without even bumping in > > RLIMIT_NPROC or similar. > > First for a proper sense of scale it will take roughly 14,000 to consume > a megabyte. So it will take hundreds of millions of user namespaces to > eat up all of kernel memory. Yes, but you can freely create *any* number of nested userns by a loop: for() { unshare() write to /proc/self/{u,g}id_map } > > The code needs several checks. First, noone should be able to create > > user_ns of arbitrary depth. Besides kernel stack overflow one could > > create too big depth to DoS processes belonging to other users by > > forcing them to loop a long time in cap_capable called from some > > ns_capable() (e.g. in case one does smth like "ls -R /proc"). > > Where do you get a ns_capable call from "ls -R /proc" ? E.g. if procfs is mounted with hidepid=2 then ls does ptrace_may_access() check. Thanks, -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] user_ns: fix missing limiting of user_ns counts
On Fri, Dec 28, 2012 at 18:43 +, Al Viro wrote: > On Fri, Dec 28, 2012 at 09:56:27PM +0400, Vasily Kulikov wrote: > > The included patch is a basic fix for both or them. Both values are > > hardcoded here to 100 max depth and 1000 max in total. I'm not sure how > > better to make them configurable. Looks like it needs some sysctl value > > like kernel.max_user_ns_per_user, but also something more configurable > > like new rlimit'ish limit may be created for user_ns needs. E.g. in > > case root wants one user to contain hundreds of private containers > > (container owner user), but he doesn't want anybody to fill the kernel > > with hundreds of containers multiplied by number of system users (equals > > to thousands). > > I'm sorry, but this is not a solution. Kernel is not x86-only; there are > architectures with far bigger minimal stack frame size. E.g. on sparc64 > every fucking stack frame is at least 176 bytes. So your 100 calls deep > call chain will happily overflow the damn stack all by itself - kernel > stack on sparc64 is 16Kb total, including struct thread_info living there. Understood. How to properly fix it then? Looks like there are quite many kernel structures which may reference other structures which indirectly reference each other via kref, IOW it is not user_ns specific issue. With unprivileged user_ns the way it should be freed must be somehow changed. Thanks, -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH/RFC] user_ns: fix missing limiting of user_ns counts
Currently there is completely no limiting in number of user namespaces created by unprivileged users. One can freely create thousands of user_ns'es and exhaust kernel memory without even bumping in RLIMIT_NPROC or similar. Even more -- it allows user to overflow kernel stack theoretically allowing user to overwrite some important kernel data. The problem is that free_user_ns() may also free its parent user_namespace recursively calling free_user_ns(). As kernel stack is very limited, it leads to kernel stack overflow. The code needs several checks. First, noone should be able to create user_ns of arbitrary depth. Besides kernel stack overflow one could create too big depth to DoS processes belonging to other users by forcing them to loop a long time in cap_capable called from some ns_capable() (e.g. in case one does smth like "ls -R /proc"). Second, non-privileged users must not be able to overlimit some count of namespaces to not be able to exhaust kernel memory. The included patch is a basic fix for both or them. Both values are hardcoded here to 100 max depth and 1000 max in total. I'm not sure how better to make them configurable. Looks like it needs some sysctl value like kernel.max_user_ns_per_user, but also something more configurable like new rlimit'ish limit may be created for user_ns needs. E.g. in case root wants one user to contain hundreds of private containers (container owner user), but he doesn't want anybody to fill the kernel with hundreds of containers multiplied by number of system users (equals to thousands). I'm not sure how it is an approved way for user_ns. Eric? A related issue which is NOT FIXED HERE is limits for all resources available for containerized pseudo roots. E.g. I succeeded creating thousands of veth network devices without problems by a non-root user, there seems no limit in number of network devices. I suspect it is possible to setup routing and net_ns'es the way it will be very time-consuming for kernel to handle IP packets inside of ksoftirq, which is not counted as this user scheduler time. I suppose the issue is not veth-specific, almost all newly available for unprivileged users code pathes are vulnerable to DoS attacks. Signed-off-by: Vasily Kulikov -- include/linux/sched.h |3 +++ kernel/user_namespace.c | 26 ++ 2 files changed, 29 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index 206bb08..479940e 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -706,6 +706,9 @@ struct user_struct { #ifdef CONFIG_EPOLL atomic_long_t epoll_watches; /* The number of file descriptors currently watched */ #endif +#ifdef CONFIG_USER_NS + atomic_t user_namespaces; /* How many user_ns does this user created? */ +#endif #ifdef CONFIG_POSIX_MQUEUE /* protected by mq_lock */ unsigned long mq_bytes; /* How many bytes can be allocated to mqueue? */ diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 2b042c4..a52c4e8 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -45,6 +45,16 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) cred->user_ns = user_ns; } +static long get_user_ns_depth(struct user_namespace *ns) +{ + long depth; + + for (depth = 1; ns != _user_ns; ns = ns->parent) + depth++; + + return depth; +} + /* * Create a new user namespace, deriving the creator from the user in the * passed credentials, and replacing that user with the new root user for the @@ -56,6 +66,7 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) int create_user_ns(struct cred *new) { struct user_namespace *ns, *parent_ns = new->user_ns; + struct user_struct *user = current->cred->user; kuid_t owner = new->euid; kgid_t group = new->egid; int ret; @@ -68,6 +79,18 @@ int create_user_ns(struct cred *new) !kgid_has_mapping(parent_ns, group)) return -EPERM; + /* Too long user_ns chains, might overflow kernel stack on kref_put() */ + if (get_user_ns_depth(parent_ns) > 100) + return -ENOMEM; + + atomic_inc(>user_namespaces); + /* FIXME: probably it's better to configure the number +*instead of hardcoding 1000 */ + if (atomic_read(>user_namespaces) > 1000) { + atomic_dec(>user_namespaces); + return -ENOMEM; + } + ns = kmem_cache_zalloc(user_ns_cachep, GFP_KERNEL); if (!ns) return -ENOMEM; @@ -108,10 +131,13 @@ void free_user_ns(struct kref *kref) { struct user_namespace *parent, *ns = container_of(kref, struct user_namespace, kref); + struct user_struct *user = find_user(ns->owner); parent = ns->parent; proc_free_inum(ns->
[PATCH/RFC] user_ns: fix missing limiting of user_ns counts
Currently there is completely no limiting in number of user namespaces created by unprivileged users. One can freely create thousands of user_ns'es and exhaust kernel memory without even bumping in RLIMIT_NPROC or similar. Even more -- it allows user to overflow kernel stack theoretically allowing user to overwrite some important kernel data. The problem is that free_user_ns() may also free its parent user_namespace recursively calling free_user_ns(). As kernel stack is very limited, it leads to kernel stack overflow. The code needs several checks. First, noone should be able to create user_ns of arbitrary depth. Besides kernel stack overflow one could create too big depth to DoS processes belonging to other users by forcing them to loop a long time in cap_capable called from some ns_capable() (e.g. in case one does smth like ls -R /proc). Second, non-privileged users must not be able to overlimit some count of namespaces to not be able to exhaust kernel memory. The included patch is a basic fix for both or them. Both values are hardcoded here to 100 max depth and 1000 max in total. I'm not sure how better to make them configurable. Looks like it needs some sysctl value like kernel.max_user_ns_per_user, but also something more configurable like new rlimit'ish limit may be created for user_ns needs. E.g. in case root wants one user to contain hundreds of private containers (container owner user), but he doesn't want anybody to fill the kernel with hundreds of containers multiplied by number of system users (equals to thousands). I'm not sure how it is an approved way for user_ns. Eric? A related issue which is NOT FIXED HERE is limits for all resources available for containerized pseudo roots. E.g. I succeeded creating thousands of veth network devices without problems by a non-root user, there seems no limit in number of network devices. I suspect it is possible to setup routing and net_ns'es the way it will be very time-consuming for kernel to handle IP packets inside of ksoftirq, which is not counted as this user scheduler time. I suppose the issue is not veth-specific, almost all newly available for unprivileged users code pathes are vulnerable to DoS attacks. Signed-off-by: Vasily Kulikov seg...@openwall.com -- include/linux/sched.h |3 +++ kernel/user_namespace.c | 26 ++ 2 files changed, 29 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index 206bb08..479940e 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -706,6 +706,9 @@ struct user_struct { #ifdef CONFIG_EPOLL atomic_long_t epoll_watches; /* The number of file descriptors currently watched */ #endif +#ifdef CONFIG_USER_NS + atomic_t user_namespaces; /* How many user_ns does this user created? */ +#endif #ifdef CONFIG_POSIX_MQUEUE /* protected by mq_lock */ unsigned long mq_bytes; /* How many bytes can be allocated to mqueue? */ diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 2b042c4..a52c4e8 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -45,6 +45,16 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) cred-user_ns = user_ns; } +static long get_user_ns_depth(struct user_namespace *ns) +{ + long depth; + + for (depth = 1; ns != init_user_ns; ns = ns-parent) + depth++; + + return depth; +} + /* * Create a new user namespace, deriving the creator from the user in the * passed credentials, and replacing that user with the new root user for the @@ -56,6 +66,7 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) int create_user_ns(struct cred *new) { struct user_namespace *ns, *parent_ns = new-user_ns; + struct user_struct *user = current-cred-user; kuid_t owner = new-euid; kgid_t group = new-egid; int ret; @@ -68,6 +79,18 @@ int create_user_ns(struct cred *new) !kgid_has_mapping(parent_ns, group)) return -EPERM; + /* Too long user_ns chains, might overflow kernel stack on kref_put() */ + if (get_user_ns_depth(parent_ns) 100) + return -ENOMEM; + + atomic_inc(user-user_namespaces); + /* FIXME: probably it's better to configure the number +*instead of hardcoding 1000 */ + if (atomic_read(user-user_namespaces) 1000) { + atomic_dec(user-user_namespaces); + return -ENOMEM; + } + ns = kmem_cache_zalloc(user_ns_cachep, GFP_KERNEL); if (!ns) return -ENOMEM; @@ -108,10 +131,13 @@ void free_user_ns(struct kref *kref) { struct user_namespace *parent, *ns = container_of(kref, struct user_namespace, kref); + struct user_struct *user = find_user(ns-owner); parent = ns-parent; proc_free_inum(ns-proc_inum); kmem_cache_free
Re: [PATCH/RFC] user_ns: fix missing limiting of user_ns counts
On Fri, Dec 28, 2012 at 18:43 +, Al Viro wrote: On Fri, Dec 28, 2012 at 09:56:27PM +0400, Vasily Kulikov wrote: The included patch is a basic fix for both or them. Both values are hardcoded here to 100 max depth and 1000 max in total. I'm not sure how better to make them configurable. Looks like it needs some sysctl value like kernel.max_user_ns_per_user, but also something more configurable like new rlimit'ish limit may be created for user_ns needs. E.g. in case root wants one user to contain hundreds of private containers (container owner user), but he doesn't want anybody to fill the kernel with hundreds of containers multiplied by number of system users (equals to thousands). I'm sorry, but this is not a solution. Kernel is not x86-only; there are architectures with far bigger minimal stack frame size. E.g. on sparc64 every fucking stack frame is at least 176 bytes. So your 100 calls deep call chain will happily overflow the damn stack all by itself - kernel stack on sparc64 is 16Kb total, including struct thread_info living there. Understood. How to properly fix it then? Looks like there are quite many kernel structures which may reference other structures which indirectly reference each other via kref, IOW it is not user_ns specific issue. With unprivileged user_ns the way it should be freed must be somehow changed. Thanks, -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] user_ns: fix missing limiting of user_ns counts
On Fri, Dec 28, 2012 at 20:05 -0800, Eric W. Biederman wrote: Vasily Kulikov seg...@openwall.com writes: Currently there is completely no limiting in number of user namespaces created by unprivileged users. One can freely create thousands of user_ns'es and exhaust kernel memory without even bumping in RLIMIT_NPROC or similar. First for a proper sense of scale it will take roughly 14,000 to consume a megabyte. So it will take hundreds of millions of user namespaces to eat up all of kernel memory. Yes, but you can freely create *any* number of nested userns by a loop: for() { unshare() write to /proc/self/{u,g}id_map } The code needs several checks. First, noone should be able to create user_ns of arbitrary depth. Besides kernel stack overflow one could create too big depth to DoS processes belonging to other users by forcing them to loop a long time in cap_capable called from some ns_capable() (e.g. in case one does smth like ls -R /proc). Where do you get a ns_capable call from ls -R /proc ? E.g. if procfs is mounted with hidepid=2 then ls does ptrace_may_access() check. Thanks, -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kernel-hardening] [PATCH 1/2] fs: add link restrictions
On Sat, Aug 11, 2012 at 23:34 -0700, Kees Cook wrote: > On Wed, Aug 8, 2012 at 5:19 AM, Vasily Kulikov wrote: > > Hi Kees, > > > > On Wed, Jul 25, 2012 at 17:29 -0700, Kees Cook wrote: > >> +/** > >> + * safe_hardlink_source - Check for safe hardlink conditions > >> + * @inode: the source inode to hardlink from > >> + * > >> + * Return false if at least one of the following conditions: > >> + *- inode is not a regular file > >> + *- inode is setuid > >> + *- inode is setgid and group-exec > >> + *- access failure for read and write > >> + * > >> + * Otherwise returns true. > >> + */ > >> +static bool safe_hardlink_source(struct inode *inode) > >> +{ > >> + umode_t mode = inode->i_mode; > >> + > >> + /* Special files should not get pinned to the filesystem. */ > >> + if (!S_ISREG(mode)) > >> + return false; > >> + > >> + /* Setuid files should not get pinned to the filesystem. */ > >> + if (mode & S_ISUID) > >> + return false; > > > > We don't want to make hardlinks of SUID files, but we still allow to create > > hardlinks to SUID'ish cap'ed files. Probably check whether the inode is > > setcap'ed? > > Excellent idea. It doesn't look like there is anything "simple" to do > this already. It'd be close to get_file_caps() but without the bprm. > Maybe just get_vfs_caps_from_disk() and a walk of the caps? What would > you recommend? Yes, I think get_vfs_caps_from_disk() plus identifying whether any permitted or inheritable capability is set or effective bit is set. IOW, if there is _anything_ related to cababilities in the file, it is protected. > > Probably we can enhance this further and allow LSMs to define whether this > > particular file is special in LSM's point of view (IOW, it can be able to > > move > > a process to another security domain which is served by LSM). > > Yeah. Perhaps implementing the needed check above with a new security > check and have commoncaps do the vfs fetch with LSMs able to override? Yep. Then uid and gid checks from the above code should move to commoncaps function as default candidates for "privileged files" checks. Thanks, -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kernel-hardening] [PATCH 1/2] fs: add link restrictions
On Sat, Aug 11, 2012 at 23:34 -0700, Kees Cook wrote: On Wed, Aug 8, 2012 at 5:19 AM, Vasily Kulikov seg...@openwall.com wrote: Hi Kees, On Wed, Jul 25, 2012 at 17:29 -0700, Kees Cook wrote: +/** + * safe_hardlink_source - Check for safe hardlink conditions + * @inode: the source inode to hardlink from + * + * Return false if at least one of the following conditions: + *- inode is not a regular file + *- inode is setuid + *- inode is setgid and group-exec + *- access failure for read and write + * + * Otherwise returns true. + */ +static bool safe_hardlink_source(struct inode *inode) +{ + umode_t mode = inode-i_mode; + + /* Special files should not get pinned to the filesystem. */ + if (!S_ISREG(mode)) + return false; + + /* Setuid files should not get pinned to the filesystem. */ + if (mode S_ISUID) + return false; We don't want to make hardlinks of SUID files, but we still allow to create hardlinks to SUID'ish cap'ed files. Probably check whether the inode is setcap'ed? Excellent idea. It doesn't look like there is anything simple to do this already. It'd be close to get_file_caps() but without the bprm. Maybe just get_vfs_caps_from_disk() and a walk of the caps? What would you recommend? Yes, I think get_vfs_caps_from_disk() plus identifying whether any permitted or inheritable capability is set or effective bit is set. IOW, if there is _anything_ related to cababilities in the file, it is protected. Probably we can enhance this further and allow LSMs to define whether this particular file is special in LSM's point of view (IOW, it can be able to move a process to another security domain which is served by LSM). Yeah. Perhaps implementing the needed check above with a new security check and have commoncaps do the vfs fetch with LSMs able to override? Yep. Then uid and gid checks from the above code should move to commoncaps function as default candidates for privileged files checks. Thanks, -- Vasily Kulikov http://www.openwall.com - bringing security into open computing environments -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kernel-hardening] [PATCH 1/2] fs: add link restrictions
Hi Kees, On Wed, Jul 25, 2012 at 17:29 -0700, Kees Cook wrote: > +/** > + * safe_hardlink_source - Check for safe hardlink conditions > + * @inode: the source inode to hardlink from > + * > + * Return false if at least one of the following conditions: > + *- inode is not a regular file > + *- inode is setuid > + *- inode is setgid and group-exec > + *- access failure for read and write > + * > + * Otherwise returns true. > + */ > +static bool safe_hardlink_source(struct inode *inode) > +{ > + umode_t mode = inode->i_mode; > + > + /* Special files should not get pinned to the filesystem. */ > + if (!S_ISREG(mode)) > + return false; > + > + /* Setuid files should not get pinned to the filesystem. */ > + if (mode & S_ISUID) > + return false; We don't want to make hardlinks of SUID files, but we still allow to create hardlinks to SUID'ish cap'ed files. Probably check whether the inode is setcap'ed? Probably we can enhance this further and allow LSMs to define whether this particular file is special in LSM's point of view (IOW, it can be able to move a process to another security domain which is served by LSM). > + > + /* Executable setgid files should not get pinned to the filesystem. */ > + if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) > + return false; > + > + /* Hardlinking to unreadable or unwritable sources is dangerous. */ > + if (inode_permission(inode, MAY_READ | MAY_WRITE)) > + return false; > + > + return true; > +} Thanks, -- Vasily -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kernel-hardening] [PATCH 1/2] fs: add link restrictions
Hi Kees, On Wed, Jul 25, 2012 at 17:29 -0700, Kees Cook wrote: +/** + * safe_hardlink_source - Check for safe hardlink conditions + * @inode: the source inode to hardlink from + * + * Return false if at least one of the following conditions: + *- inode is not a regular file + *- inode is setuid + *- inode is setgid and group-exec + *- access failure for read and write + * + * Otherwise returns true. + */ +static bool safe_hardlink_source(struct inode *inode) +{ + umode_t mode = inode-i_mode; + + /* Special files should not get pinned to the filesystem. */ + if (!S_ISREG(mode)) + return false; + + /* Setuid files should not get pinned to the filesystem. */ + if (mode S_ISUID) + return false; We don't want to make hardlinks of SUID files, but we still allow to create hardlinks to SUID'ish cap'ed files. Probably check whether the inode is setcap'ed? Probably we can enhance this further and allow LSMs to define whether this particular file is special in LSM's point of view (IOW, it can be able to move a process to another security domain which is served by LSM). + + /* Executable setgid files should not get pinned to the filesystem. */ + if ((mode (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) + return false; + + /* Hardlinking to unreadable or unwritable sources is dangerous. */ + if (inode_permission(inode, MAY_READ | MAY_WRITE)) + return false; + + return true; +} Thanks, -- Vasily -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] ip: fix error handling in ip_finish_output2()
__neigh_create() returns either a pointer to struct neighbour or PTR_ERR(). But the caller expects it to return either a pointer or NULL. Replace the NULL check with IS_ERR() check. The bug was introduced in a263b3093641fb1ec377582c90986a7fd0625184. Signed-off-by: Vasily Kulikov --- Compile tested only. net/ipv4/ip_output.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index ba39a52..76dde25 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -197,7 +197,7 @@ static inline int ip_finish_output2(struct sk_buff *skb) neigh = __ipv4_neigh_lookup_noref(dev, nexthop); if (unlikely(!neigh)) neigh = __neigh_create(_tbl, , dev, false); - if (neigh) { + if (!IS_ERR(neigh)) { int res = dst_neigh_output(dst, neigh, skb); rcu_read_unlock_bh(); -- 1.7.0.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] ip: fix error handling in ip_finish_output2()
__neigh_create() returns either a pointer to struct neighbour or PTR_ERR(). But the caller expects it to return either a pointer or NULL. Replace the NULL check with IS_ERR() check. The bug was introduced in a263b3093641fb1ec377582c90986a7fd0625184. Signed-off-by: Vasily Kulikov seg...@openwall.com --- Compile tested only. net/ipv4/ip_output.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index ba39a52..76dde25 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -197,7 +197,7 @@ static inline int ip_finish_output2(struct sk_buff *skb) neigh = __ipv4_neigh_lookup_noref(dev, nexthop); if (unlikely(!neigh)) neigh = __neigh_create(arp_tbl, nexthop, dev, false); - if (neigh) { + if (!IS_ERR(neigh)) { int res = dst_neigh_output(dst, neigh, skb); rcu_read_unlock_bh(); -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/