[PATCH 3/3] remove not-used poison pointer macros

2015-07-26 Thread Vasily Kulikov
Signed-off-by: Vasily Kulikov  
---
 include/linux/poison.h | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/include/linux/poison.h b/include/linux/poison.h
index a721bcd..4a27153 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -73,10 +73,6 @@
 #define ATM_POISON_FREE0x12
 #define ATM_POISON 0xdeadbeef
 
-/** net/ **/
-#define NEIGHBOR_DEAD  0xdeadbeef
-#define NETFILTER_LINK_POISON  0xdead57ac
-
 /** kernel/mutexes **/
 #define MUTEX_DEBUG_INIT   0x11
 #define MUTEX_DEBUG_FREE   0x22
@@ -87,7 +83,4 @@
 /** security/ **/
 #define KEY_DESTROY0xbd
 
-/** sound/oss/ **/
-#define OSS_POISON_FREE0xAB
-
 #endif
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/3] use POISON_POINTER_DELTA for poison pointers

2015-07-26 Thread Vasily Kulikov
TIMER_ENTRY_STATIC and TAIL_MAPPING are defined as poison pointers which
should point to nowhere.  Redefine them using POISON_POINTER_DELTA arithmetics
to make sure they really point to non-mappable area declared by the
target architecture.

Signed-off-by: Vasily Kulikov  
---
 include/linux/poison.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/poison.h b/include/linux/poison.h
index ee697b9..a721bcd 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -27,14 +27,14 @@
  * Magic number "tsta" to indicate a static timer initializer
  * for the object debugging code.
  */
-#define TIMER_ENTRY_STATIC ((void *) 0x74737461)
+#define TIMER_ENTRY_STATIC ((void *) 0x300 + POISON_POINTER_DELTA)
 
 /** mm/debug-pagealloc.c **/
 #define PAGE_POISON 0xaa
 
 /** mm/page_alloc.c /
 
-#define TAIL_MAPPING   ((void *) 0x01014A11 + POISON_POINTER_DELTA)
+#define TAIL_MAPPING   ((void *) 0x400 + POISON_POINTER_DELTA)
 
 /** mm/slab.c **/
 /*
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/3] fix LIST_POISON{1,2} offset

2015-07-26 Thread Vasily Kulikov
Poison pointer values should be small enough to find a room in
non-mmap'able/hardly-mmap'able space.  E.g. on x86 "poison pointer
space" is located starting from 0x0.  Given unprivileged users cannot
mmap anything below mmap_min_addr, it should be safe to use poison
pointers lower than mmap_min_addr.

The current poison pointer values of LIST_POISON{1,2} might be
too big for mmap_min_addr values equal or less than 1 MB (common case,
e.g. Ubuntu uses only 0x1).  There is little point to use such a big
value given the "poison pointer space" below 1 MB is not yet exhausted.
Changing it to a smaller value solves the problem for small
mmap_min_addr setups.

The values are suggested by Solar Designer:
http://www.openwall.com/lists/oss-security/2015/05/02/6

Signed-off-by: Vasily Kulikov  
---
 include/linux/poison.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/poison.h b/include/linux/poison.h
index 7b2a7fc..ee697b9 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -19,8 +19,8 @@
  * under normal circumstances, used to verify that nobody uses
  * non-initialized list entries.
  */
-#define LIST_POISON1  ((void *) 0x00100100 + POISON_POINTER_DELTA)
-#define LIST_POISON2  ((void *) 0x00200200 + POISON_POINTER_DELTA)
+#define LIST_POISON1  ((void *) 0x100 + POISON_POINTER_DELTA)
+#define LIST_POISON2  ((void *) 0x200 + POISON_POINTER_DELTA)
 
 /** include/linux/timer.h **/
 /*
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/3] fix LIST_POISON{1,2} offset

2015-07-26 Thread Vasily Kulikov
Poison pointer values should be small enough to find a room in
non-mmap'able/hardly-mmap'able space.  E.g. on x86 poison pointer
space is located starting from 0x0.  Given unprivileged users cannot
mmap anything below mmap_min_addr, it should be safe to use poison
pointers lower than mmap_min_addr.

The current poison pointer values of LIST_POISON{1,2} might be
too big for mmap_min_addr values equal or less than 1 MB (common case,
e.g. Ubuntu uses only 0x1).  There is little point to use such a big
value given the poison pointer space below 1 MB is not yet exhausted.
Changing it to a smaller value solves the problem for small
mmap_min_addr setups.

The values are suggested by Solar Designer:
http://www.openwall.com/lists/oss-security/2015/05/02/6

Signed-off-by: Vasily Kulikov seg...@openwall.com 
---
 include/linux/poison.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/poison.h b/include/linux/poison.h
index 7b2a7fc..ee697b9 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -19,8 +19,8 @@
  * under normal circumstances, used to verify that nobody uses
  * non-initialized list entries.
  */
-#define LIST_POISON1  ((void *) 0x00100100 + POISON_POINTER_DELTA)
-#define LIST_POISON2  ((void *) 0x00200200 + POISON_POINTER_DELTA)
+#define LIST_POISON1  ((void *) 0x100 + POISON_POINTER_DELTA)
+#define LIST_POISON2  ((void *) 0x200 + POISON_POINTER_DELTA)
 
 /** include/linux/timer.h **/
 /*
-- 
2.1.0

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/3] use POISON_POINTER_DELTA for poison pointers

2015-07-26 Thread Vasily Kulikov
TIMER_ENTRY_STATIC and TAIL_MAPPING are defined as poison pointers which
should point to nowhere.  Redefine them using POISON_POINTER_DELTA arithmetics
to make sure they really point to non-mappable area declared by the
target architecture.

Signed-off-by: Vasily Kulikov seg...@openwall.com 
---
 include/linux/poison.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/poison.h b/include/linux/poison.h
index ee697b9..a721bcd 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -27,14 +27,14 @@
  * Magic number tsta to indicate a static timer initializer
  * for the object debugging code.
  */
-#define TIMER_ENTRY_STATIC ((void *) 0x74737461)
+#define TIMER_ENTRY_STATIC ((void *) 0x300 + POISON_POINTER_DELTA)
 
 /** mm/debug-pagealloc.c **/
 #define PAGE_POISON 0xaa
 
 /** mm/page_alloc.c /
 
-#define TAIL_MAPPING   ((void *) 0x01014A11 + POISON_POINTER_DELTA)
+#define TAIL_MAPPING   ((void *) 0x400 + POISON_POINTER_DELTA)
 
 /** mm/slab.c **/
 /*
-- 
2.1.0

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 3/3] remove not-used poison pointer macros

2015-07-26 Thread Vasily Kulikov
Signed-off-by: Vasily Kulikov seg...@openwall.com 
---
 include/linux/poison.h | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/include/linux/poison.h b/include/linux/poison.h
index a721bcd..4a27153 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -73,10 +73,6 @@
 #define ATM_POISON_FREE0x12
 #define ATM_POISON 0xdeadbeef
 
-/** net/ **/
-#define NEIGHBOR_DEAD  0xdeadbeef
-#define NETFILTER_LINK_POISON  0xdead57ac
-
 /** kernel/mutexes **/
 #define MUTEX_DEBUG_INIT   0x11
 #define MUTEX_DEBUG_FREE   0x22
@@ -87,7 +83,4 @@
 /** security/ **/
 #define KEY_DESTROY0xbd
 
-/** sound/oss/ **/
-#define OSS_POISON_FREE0xAB
-
 #endif
-- 
2.1.0

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] pointer poisoning macro

2015-06-06 Thread Vasily Kulikov
Hi,

Any feedback?

On Thu, May 14, 2015 at 14:07 +0300, Vasily Kulikov wrote:
> This is a raw version of the patch inspired by the discussion:
> http://www.openwall.com/lists/oss-security/2015/05/02/6.
> 
> The patch tries to achieve two goals:
> 1) Move "overflowed" poison pointers out of the mmap'able memory zone
> 2) Simplify addition of new poison pointers
> 
> The current 0x00200200 poison pointer value of LIST_POISON2 might be
> too big for mmap_min_addr values equal or less than 2 MB (common case,
> e.g. Ubuntu uses only 0x1).  There is little point to use such a big
> value given the "poison pointer space" below 2 MB is not yet exhausted.
> Changing it to a smaller value solves the problem for small
> mmap_min_addr setups.
> 
> In general, poisoned pointers should meet the following criteria:
> 
> 1) poisoned pointer base must be non-mmap'able (already done via
> CONFIG_ILLEGAL_POINTER_VALUE)
> 2) all poisoned pointers (i.e. base+offset) must be non-mmap'able
> 3) a small offset relative to poisoned pointers must be non-mmap'able
> 4) poisoned pointers from different subsystems should be different
> 
> (2) can be solved at the compile time by BUILD_BUG_ON().
> (3) and (4) should be solved by a creator of the new poisoned pointer.
> 
> At least (2) can be done automatically.  I propose a new macro for this
> purpose, POISON_POINTER().  It should check whether the poison pointer
> offset is not too big.  E.g. in case of too big offset the compilation
> fails with the following message:
> 
> mm/page_alloc.c: In function 'free_pages_prepare':
> mm/page_alloc.c:840:23: error: call to '__compiletime_assert_840' 
> declared with attribute error: BUILD_BUG_ON failed: 0x0111400 >= 
> POISON_AREA_SIZE
> 
> There is still an unsolved issue with the macro related to
> static variables initialization.  If you uncomment the line just after
> "FIXME" line, you'll see:
> 
> kernel/irq/spurious.c:23:8: error: braced-group within expression allowed 
> only inside a function
> 
> I'll be happy with the comments to the idea and the implementation.
> 
> diff --git a/include/linux/poison.h b/include/linux/poison.h
> index 7b2a7fc..47b43ed 100644
> --- a/include/linux/poison.h
> +++ b/include/linux/poison.h
> @@ -1,40 +1,54 @@
>  #ifndef _LINUX_POISON_H
>  #define _LINUX_POISON_H
> +#include 
>  
>  /** include/linux/list.h **/
>  
>  /*
>   * Architectures might want to move the poison pointer offset
>   * into some well-recognized area such as 0xdead,
> - * that is also not mappable by user-space exploits:
> + * that is also not mappable by user-space exploits,
> + * by adjusting CONFIG_ILLEGAL_POINTER_VALUE:
>   */
>  #ifdef CONFIG_ILLEGAL_POINTER_VALUE
>  # define POISON_POINTER_DELTA _AC(CONFIG_ILLEGAL_POINTER_VALUE, UL)
>  #else
>  # define POISON_POINTER_DELTA 0
>  #endif
> +/* 
> + * Poisoned pointers of different subsystems should be different
> + * but must not move far away from POISON_POINTER_DELTA.
> + * Otherwise poisoned pointer might be mmap'able on some architectures.
> + */
> +#define POISON_AREA_SIZE 0x1000
> +#define POISON_POINTER(x) \
> + ({ \
> + BUILD_BUG_ON(x >= POISON_AREA_SIZE); \
> + ((void *)(x) + POISON_POINTER_DELTA);})
>  
>  /*
>   * These are non-NULL pointers that will result in page faults
>   * under normal circumstances, used to verify that nobody uses
>   * non-initialized list entries.
>   */
> -#define LIST_POISON1  ((void *) 0x00100100 + POISON_POINTER_DELTA)
> -#define LIST_POISON2  ((void *) 0x00200200 + POISON_POINTER_DELTA)
> +#define LIST_POISON1  POISON_POINTER(0x0100)
> +#define LIST_POISON2  POISON_POINTER(0x0200)
>  
>  /** include/linux/timer.h **/
>  /*
>   * Magic number "tsta" to indicate a static timer initializer
>   * for the object debugging code.
>   */
> -#define TIMER_ENTRY_STATIC   ((void *) 0x74737461)
> +#define TIMER_ENTRY_STATIC   ((void*)0x0300 + POISON_POINTER_DELTA)
> +// FIXME
> +//#define TIMER_ENTRY_STATIC POISON_POINTER(0x0300)
>  
>  /** mm/debug-pagealloc.c **/
>  #define PAGE_POISON 0xaa
>  
>  /** mm/page_alloc.c /
>  
> -#define TAIL_MAPPING ((void *) 0x01014A11 + POISON_POINTER_DELTA)
> +#define TAIL_MAPPING POISON_POINTER(0x400)
>  
>  /** mm/slab.c **/
>  /*
> 
> -- 
> Vasily

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] pointer poisoning macro

2015-06-06 Thread Vasily Kulikov
Hi,

Any feedback?

On Thu, May 14, 2015 at 14:07 +0300, Vasily Kulikov wrote:
 This is a raw version of the patch inspired by the discussion:
 http://www.openwall.com/lists/oss-security/2015/05/02/6.
 
 The patch tries to achieve two goals:
 1) Move overflowed poison pointers out of the mmap'able memory zone
 2) Simplify addition of new poison pointers
 
 The current 0x00200200 poison pointer value of LIST_POISON2 might be
 too big for mmap_min_addr values equal or less than 2 MB (common case,
 e.g. Ubuntu uses only 0x1).  There is little point to use such a big
 value given the poison pointer space below 2 MB is not yet exhausted.
 Changing it to a smaller value solves the problem for small
 mmap_min_addr setups.
 
 In general, poisoned pointers should meet the following criteria:
 
 1) poisoned pointer base must be non-mmap'able (already done via
 CONFIG_ILLEGAL_POINTER_VALUE)
 2) all poisoned pointers (i.e. base+offset) must be non-mmap'able
 3) a small offset relative to poisoned pointers must be non-mmap'able
 4) poisoned pointers from different subsystems should be different
 
 (2) can be solved at the compile time by BUILD_BUG_ON().
 (3) and (4) should be solved by a creator of the new poisoned pointer.
 
 At least (2) can be done automatically.  I propose a new macro for this
 purpose, POISON_POINTER().  It should check whether the poison pointer
 offset is not too big.  E.g. in case of too big offset the compilation
 fails with the following message:
 
 mm/page_alloc.c: In function 'free_pages_prepare':
 mm/page_alloc.c:840:23: error: call to '__compiletime_assert_840' 
 declared with attribute error: BUILD_BUG_ON failed: 0x0111400 = 
 POISON_AREA_SIZE
 
 There is still an unsolved issue with the macro related to
 static variables initialization.  If you uncomment the line just after
 FIXME line, you'll see:
 
 kernel/irq/spurious.c:23:8: error: braced-group within expression allowed 
 only inside a function
 
 I'll be happy with the comments to the idea and the implementation.
 
 diff --git a/include/linux/poison.h b/include/linux/poison.h
 index 7b2a7fc..47b43ed 100644
 --- a/include/linux/poison.h
 +++ b/include/linux/poison.h
 @@ -1,40 +1,54 @@
  #ifndef _LINUX_POISON_H
  #define _LINUX_POISON_H
 +#include linux/bug.h
  
  /** include/linux/list.h **/
  
  /*
   * Architectures might want to move the poison pointer offset
   * into some well-recognized area such as 0xdead,
 - * that is also not mappable by user-space exploits:
 + * that is also not mappable by user-space exploits,
 + * by adjusting CONFIG_ILLEGAL_POINTER_VALUE:
   */
  #ifdef CONFIG_ILLEGAL_POINTER_VALUE
  # define POISON_POINTER_DELTA _AC(CONFIG_ILLEGAL_POINTER_VALUE, UL)
  #else
  # define POISON_POINTER_DELTA 0
  #endif
 +/* 
 + * Poisoned pointers of different subsystems should be different
 + * but must not move far away from POISON_POINTER_DELTA.
 + * Otherwise poisoned pointer might be mmap'able on some architectures.
 + */
 +#define POISON_AREA_SIZE 0x1000
 +#define POISON_POINTER(x) \
 + ({ \
 + BUILD_BUG_ON(x = POISON_AREA_SIZE); \
 + ((void *)(x) + POISON_POINTER_DELTA);})
  
  /*
   * These are non-NULL pointers that will result in page faults
   * under normal circumstances, used to verify that nobody uses
   * non-initialized list entries.
   */
 -#define LIST_POISON1  ((void *) 0x00100100 + POISON_POINTER_DELTA)
 -#define LIST_POISON2  ((void *) 0x00200200 + POISON_POINTER_DELTA)
 +#define LIST_POISON1  POISON_POINTER(0x0100)
 +#define LIST_POISON2  POISON_POINTER(0x0200)
  
  /** include/linux/timer.h **/
  /*
   * Magic number tsta to indicate a static timer initializer
   * for the object debugging code.
   */
 -#define TIMER_ENTRY_STATIC   ((void *) 0x74737461)
 +#define TIMER_ENTRY_STATIC   ((void*)0x0300 + POISON_POINTER_DELTA)
 +// FIXME
 +//#define TIMER_ENTRY_STATIC POISON_POINTER(0x0300)
  
  /** mm/debug-pagealloc.c **/
  #define PAGE_POISON 0xaa
  
  /** mm/page_alloc.c /
  
 -#define TAIL_MAPPING ((void *) 0x01014A11 + POISON_POINTER_DELTA)
 +#define TAIL_MAPPING POISON_POINTER(0x400)
  
  /** mm/slab.c **/
  /*
 
 -- 
 Vasily

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC] pointer poisoning macro

2015-05-14 Thread Vasily Kulikov
This is a raw version of the patch inspired by the discussion:
http://www.openwall.com/lists/oss-security/2015/05/02/6.

The patch tries to achieve two goals:
1) Move "overflowed" poison pointers out of the mmap'able memory zone
2) Simplify addition of new poison pointers

The current 0x00200200 poison pointer value of LIST_POISON2 might be
too big for mmap_min_addr values equal or less than 2 MB (common case,
e.g. Ubuntu uses only 0x1).  There is little point to use such a big
value given the "poison pointer space" below 2 MB is not yet exhausted.
Changing it to a smaller value solves the problem for small
mmap_min_addr setups.

In general, poisoned pointers should meet the following criteria:

1) poisoned pointer base must be non-mmap'able (already done via
CONFIG_ILLEGAL_POINTER_VALUE)
2) all poisoned pointers (i.e. base+offset) must be non-mmap'able
3) a small offset relative to poisoned pointers must be non-mmap'able
4) poisoned pointers from different subsystems should be different

(2) can be solved at the compile time by BUILD_BUG_ON().
(3) and (4) should be solved by a creator of the new poisoned pointer.

At least (2) can be done automatically.  I propose a new macro for this
purpose, POISON_POINTER().  It should check whether the poison pointer
offset is not too big.  E.g. in case of too big offset the compilation
fails with the following message:

mm/page_alloc.c: In function 'free_pages_prepare':
mm/page_alloc.c:840:23: error: call to '__compiletime_assert_840' declared 
with attribute error: BUILD_BUG_ON failed: 0x0111400 >= POISON_AREA_SIZE

There is still an unsolved issue with the macro related to
static variables initialization.  If you uncomment the line just after
"FIXME" line, you'll see:

kernel/irq/spurious.c:23:8: error: braced-group within expression allowed 
only inside a function

I'll be happy with the comments to the idea and the implementation.

diff --git a/include/linux/poison.h b/include/linux/poison.h
index 7b2a7fc..47b43ed 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -1,40 +1,54 @@
 #ifndef _LINUX_POISON_H
 #define _LINUX_POISON_H
+#include 
 
 /** include/linux/list.h **/
 
 /*
  * Architectures might want to move the poison pointer offset
  * into some well-recognized area such as 0xdead,
- * that is also not mappable by user-space exploits:
+ * that is also not mappable by user-space exploits,
+ * by adjusting CONFIG_ILLEGAL_POINTER_VALUE:
  */
 #ifdef CONFIG_ILLEGAL_POINTER_VALUE
 # define POISON_POINTER_DELTA _AC(CONFIG_ILLEGAL_POINTER_VALUE, UL)
 #else
 # define POISON_POINTER_DELTA 0
 #endif
+/* 
+ * Poisoned pointers of different subsystems should be different
+ * but must not move far away from POISON_POINTER_DELTA.
+ * Otherwise poisoned pointer might be mmap'able on some architectures.
+ */
+#define POISON_AREA_SIZE 0x1000
+#define POISON_POINTER(x) \
+   ({ \
+   BUILD_BUG_ON(x >= POISON_AREA_SIZE); \
+   ((void *)(x) + POISON_POINTER_DELTA);})
 
 /*
  * These are non-NULL pointers that will result in page faults
  * under normal circumstances, used to verify that nobody uses
  * non-initialized list entries.
  */
-#define LIST_POISON1  ((void *) 0x00100100 + POISON_POINTER_DELTA)
-#define LIST_POISON2  ((void *) 0x00200200 + POISON_POINTER_DELTA)
+#define LIST_POISON1  POISON_POINTER(0x0100)
+#define LIST_POISON2  POISON_POINTER(0x0200)
 
 /** include/linux/timer.h **/
 /*
  * Magic number "tsta" to indicate a static timer initializer
  * for the object debugging code.
  */
-#define TIMER_ENTRY_STATIC ((void *) 0x74737461)
+#define TIMER_ENTRY_STATIC ((void*)0x0300 + POISON_POINTER_DELTA)
+// FIXME
+//#define TIMER_ENTRY_STATIC   POISON_POINTER(0x0300)
 
 /** mm/debug-pagealloc.c **/
 #define PAGE_POISON 0xaa
 
 /** mm/page_alloc.c /
 
-#define TAIL_MAPPING   ((void *) 0x01014A11 + POISON_POINTER_DELTA)
+#define TAIL_MAPPING   POISON_POINTER(0x400)
 
 /** mm/slab.c **/
 /*

-- 
Vasily
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC] pointer poisoning macro

2015-05-14 Thread Vasily Kulikov
This is a raw version of the patch inspired by the discussion:
http://www.openwall.com/lists/oss-security/2015/05/02/6.

The patch tries to achieve two goals:
1) Move overflowed poison pointers out of the mmap'able memory zone
2) Simplify addition of new poison pointers

The current 0x00200200 poison pointer value of LIST_POISON2 might be
too big for mmap_min_addr values equal or less than 2 MB (common case,
e.g. Ubuntu uses only 0x1).  There is little point to use such a big
value given the poison pointer space below 2 MB is not yet exhausted.
Changing it to a smaller value solves the problem for small
mmap_min_addr setups.

In general, poisoned pointers should meet the following criteria:

1) poisoned pointer base must be non-mmap'able (already done via
CONFIG_ILLEGAL_POINTER_VALUE)
2) all poisoned pointers (i.e. base+offset) must be non-mmap'able
3) a small offset relative to poisoned pointers must be non-mmap'able
4) poisoned pointers from different subsystems should be different

(2) can be solved at the compile time by BUILD_BUG_ON().
(3) and (4) should be solved by a creator of the new poisoned pointer.

At least (2) can be done automatically.  I propose a new macro for this
purpose, POISON_POINTER().  It should check whether the poison pointer
offset is not too big.  E.g. in case of too big offset the compilation
fails with the following message:

mm/page_alloc.c: In function 'free_pages_prepare':
mm/page_alloc.c:840:23: error: call to '__compiletime_assert_840' declared 
with attribute error: BUILD_BUG_ON failed: 0x0111400 = POISON_AREA_SIZE

There is still an unsolved issue with the macro related to
static variables initialization.  If you uncomment the line just after
FIXME line, you'll see:

kernel/irq/spurious.c:23:8: error: braced-group within expression allowed 
only inside a function

I'll be happy with the comments to the idea and the implementation.

diff --git a/include/linux/poison.h b/include/linux/poison.h
index 7b2a7fc..47b43ed 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -1,40 +1,54 @@
 #ifndef _LINUX_POISON_H
 #define _LINUX_POISON_H
+#include linux/bug.h
 
 /** include/linux/list.h **/
 
 /*
  * Architectures might want to move the poison pointer offset
  * into some well-recognized area such as 0xdead,
- * that is also not mappable by user-space exploits:
+ * that is also not mappable by user-space exploits,
+ * by adjusting CONFIG_ILLEGAL_POINTER_VALUE:
  */
 #ifdef CONFIG_ILLEGAL_POINTER_VALUE
 # define POISON_POINTER_DELTA _AC(CONFIG_ILLEGAL_POINTER_VALUE, UL)
 #else
 # define POISON_POINTER_DELTA 0
 #endif
+/* 
+ * Poisoned pointers of different subsystems should be different
+ * but must not move far away from POISON_POINTER_DELTA.
+ * Otherwise poisoned pointer might be mmap'able on some architectures.
+ */
+#define POISON_AREA_SIZE 0x1000
+#define POISON_POINTER(x) \
+   ({ \
+   BUILD_BUG_ON(x = POISON_AREA_SIZE); \
+   ((void *)(x) + POISON_POINTER_DELTA);})
 
 /*
  * These are non-NULL pointers that will result in page faults
  * under normal circumstances, used to verify that nobody uses
  * non-initialized list entries.
  */
-#define LIST_POISON1  ((void *) 0x00100100 + POISON_POINTER_DELTA)
-#define LIST_POISON2  ((void *) 0x00200200 + POISON_POINTER_DELTA)
+#define LIST_POISON1  POISON_POINTER(0x0100)
+#define LIST_POISON2  POISON_POINTER(0x0200)
 
 /** include/linux/timer.h **/
 /*
  * Magic number tsta to indicate a static timer initializer
  * for the object debugging code.
  */
-#define TIMER_ENTRY_STATIC ((void *) 0x74737461)
+#define TIMER_ENTRY_STATIC ((void*)0x0300 + POISON_POINTER_DELTA)
+// FIXME
+//#define TIMER_ENTRY_STATIC   POISON_POINTER(0x0300)
 
 /** mm/debug-pagealloc.c **/
 #define PAGE_POISON 0xaa
 
 /** mm/page_alloc.c /
 
-#define TAIL_MAPPING   ((void *) 0x01014A11 + POISON_POINTER_DELTA)
+#define TAIL_MAPPING   POISON_POINTER(0x400)
 
 /** mm/slab.c **/
 /*

-- 
Vasily
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] /proc/pid/status: show all sets of pid according to ns

2014-05-31 Thread Vasily Kulikov
On Thu, May 29, 2014 at 16:53 +0400, Pavel Emelyanov wrote:
> On 05/29/2014 03:59 PM, Vasily Kulikov wrote:
> > On Thu, May 29, 2014 at 15:31 +0400, Pavel Emelyanov wrote:
> >> On 05/29/2014 03:12 PM, Vasily Kulikov wrote:
> >>> On Thu, May 29, 2014 at 13:07 +0400, Pavel Emelyanov wrote:
> >>>> On 05/29/2014 09:59 AM, Vasily Kulikov wrote:
> >>>>> On Wed, May 28, 2014 at 23:27 +0400, Pavel Emelyanov wrote:
> >>>>> ] We need a direct method of getting the pid inside containers.
> >>>>> ] If some issues occurred inside container guest, host user
> >>>>> ] could not know which process is in trouble just by guest pid:
> >>>>> ] the users of container guest only knew the pid inside containers.
> >>>>> ] This will bring obstacle for trouble shooting.
> >>>>>
> >>>>> A new syscall might complicate trouble shooting by admin.
> >>>>
> >>>> Pure syscall -- yes. What if we teach the ps and top utilities to show 
> >>>> additional
> >>>> info? I think that would help.
> >>>
> >>> I like the idea with low level non-shell API which can be used by
> >>> utility like ps (or implementation of a new tool to work with complex
> >>> namespace hierarchies).  It should fit for troublesooting.  Then there
> >>> should be no reason to implement two different APIs for observation from
> >>> shell via FS and from applications.
> >>
> >> Maybe we can reuse the existing kcmp() system call? We would have to store
> >> the collected pid values in some hash/tree anyway, and kcmp() provides us
> >> good comparing function for doing this.
> >>
> >> Like we can call kcmp(pid1, pid2, KCMP_PID, nsfd1, nsfd2) which will mean
> >> "Are tasks with pid1 in namespace pointed by nsfd1 and with pid2 in 
> >> namespace
> >> nsfd2 the same?"
> >>
> >> What do you think?
> > 
> > kcmp() is not needed, just compare inode numbers:
> > 
> > # ls -il /proc/{43,self}/ns/mnt
> > 208182 lrwxrwxrwx 1 root root 0 мая   29 15:52 /proc/43/ns/mnt -> 
> > mnt:[4026531856]
> > 216556 lrwxrwxrwx 1 root root 0 мая   29 15:57 /proc/self/ns/mnt -> 
> > mnt:[4026531840]
> 
> But that's for comparing the namespaces, while I'm proposing the kcmp to
> check for PIDs.

Hm, right.

What about the following solution: export global process ID (PID in
init ns) which is visible inside of any namespace.  Then you can compare
numbers regardless in what namespace you are.

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] /proc/pid/status: show all sets of pid according to ns

2014-05-31 Thread Vasily Kulikov
On Thu, May 29, 2014 at 16:53 +0400, Pavel Emelyanov wrote:
 On 05/29/2014 03:59 PM, Vasily Kulikov wrote:
  On Thu, May 29, 2014 at 15:31 +0400, Pavel Emelyanov wrote:
  On 05/29/2014 03:12 PM, Vasily Kulikov wrote:
  On Thu, May 29, 2014 at 13:07 +0400, Pavel Emelyanov wrote:
  On 05/29/2014 09:59 AM, Vasily Kulikov wrote:
  On Wed, May 28, 2014 at 23:27 +0400, Pavel Emelyanov wrote:
  ] We need a direct method of getting the pid inside containers.
  ] If some issues occurred inside container guest, host user
  ] could not know which process is in trouble just by guest pid:
  ] the users of container guest only knew the pid inside containers.
  ] This will bring obstacle for trouble shooting.
 
  A new syscall might complicate trouble shooting by admin.
 
  Pure syscall -- yes. What if we teach the ps and top utilities to show 
  additional
  info? I think that would help.
 
  I like the idea with low level non-shell API which can be used by
  utility like ps (or implementation of a new tool to work with complex
  namespace hierarchies).  It should fit for troublesooting.  Then there
  should be no reason to implement two different APIs for observation from
  shell via FS and from applications.
 
  Maybe we can reuse the existing kcmp() system call? We would have to store
  the collected pid values in some hash/tree anyway, and kcmp() provides us
  good comparing function for doing this.
 
  Like we can call kcmp(pid1, pid2, KCMP_PID, nsfd1, nsfd2) which will mean
  Are tasks with pid1 in namespace pointed by nsfd1 and with pid2 in 
  namespace
  nsfd2 the same?
 
  What do you think?
  
  kcmp() is not needed, just compare inode numbers:
  
  # ls -il /proc/{43,self}/ns/mnt
  208182 lrwxrwxrwx 1 root root 0 мая   29 15:52 /proc/43/ns/mnt - 
  mnt:[4026531856]
  216556 lrwxrwxrwx 1 root root 0 мая   29 15:57 /proc/self/ns/mnt - 
  mnt:[4026531840]
 
 But that's for comparing the namespaces, while I'm proposing the kcmp to
 check for PIDs.

Hm, right.

What about the following solution: export global process ID (PID in
init ns) which is visible inside of any namespace.  Then you can compare
numbers regardless in what namespace you are.

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] /proc/pid/status: show all sets of pid according to ns

2014-05-29 Thread Vasily Kulikov
On Thu, May 29, 2014 at 15:31 +0400, Pavel Emelyanov wrote:
> On 05/29/2014 03:12 PM, Vasily Kulikov wrote:
> > On Thu, May 29, 2014 at 13:07 +0400, Pavel Emelyanov wrote:
> >> On 05/29/2014 09:59 AM, Vasily Kulikov wrote:
> >>> On Wed, May 28, 2014 at 23:27 +0400, Pavel Emelyanov wrote:
> >>> ] We need a direct method of getting the pid inside containers.
> >>> ] If some issues occurred inside container guest, host user
> >>> ] could not know which process is in trouble just by guest pid:
> >>> ] the users of container guest only knew the pid inside containers.
> >>> ] This will bring obstacle for trouble shooting.
> >>>
> >>> A new syscall might complicate trouble shooting by admin.
> >>
> >> Pure syscall -- yes. What if we teach the ps and top utilities to show 
> >> additional
> >> info? I think that would help.
> > 
> > I like the idea with low level non-shell API which can be used by
> > utility like ps (or implementation of a new tool to work with complex
> > namespace hierarchies).  It should fit for troublesooting.  Then there
> > should be no reason to implement two different APIs for observation from
> > shell via FS and from applications.
> 
> Maybe we can reuse the existing kcmp() system call? We would have to store
> the collected pid values in some hash/tree anyway, and kcmp() provides us
> good comparing function for doing this.
> 
> Like we can call kcmp(pid1, pid2, KCMP_PID, nsfd1, nsfd2) which will mean
> "Are tasks with pid1 in namespace pointed by nsfd1 and with pid2 in namespace
> nsfd2 the same?"
> 
> What do you think?

kcmp() is not needed, just compare inode numbers:

# ls -il /proc/{43,self}/ns/mnt
208182 lrwxrwxrwx 1 root root 0 мая   29 15:52 /proc/43/ns/mnt -> 
mnt:[4026531856]
216556 lrwxrwxrwx 1 root root 0 мая   29 15:57 /proc/self/ns/mnt -> 
mnt:[4026531840]

> > However, maybe it is possible to implement not via new syscall but
> > by implementation of new symlink in sysfs?  Then both ps-like tool and
> > CRIU-like tool is able to obtain the ns information by the same means.
> > Maybe sort of a symlink to a parent namespace or a process which is
> > inside of the parent namespace?  Then a process may identify IDs using
> > following steps:
> > 
> > 1) identify target NS by walking current procfs
> > 2) do setns(2)/chroot(2)
> > 3) look at procfs to identify target IDs in the target NS
> 
> Can you elaborate on this? Let's imagine we have picked two tasks with
> init_pid_ns' PIDs being 11 and 12 and we've found out using /proc/pid/ns/pid
> links that they both live in some non-init pid namespace.
> 
> Then we have to look at this ns' proc. It says that there are also two 
> tasks -- 2 and 3. How can we determine which pid is which?

Oh, right.  My idea is broken.

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] /proc/pid/status: show all sets of pid according to ns

2014-05-29 Thread Vasily Kulikov
On Thu, May 29, 2014 at 13:07 +0400, Pavel Emelyanov wrote:
> On 05/29/2014 09:59 AM, Vasily Kulikov wrote:
> > On Wed, May 28, 2014 at 23:27 +0400, Pavel Emelyanov wrote:
> > ] We need a direct method of getting the pid inside containers.
> > ] If some issues occurred inside container guest, host user
> > ] could not know which process is in trouble just by guest pid:
> > ] the users of container guest only knew the pid inside containers.
> > ] This will bring obstacle for trouble shooting.
> > 
> > A new syscall might complicate trouble shooting by admin.
> 
> Pure syscall -- yes. What if we teach the ps and top utilities to show 
> additional
> info? I think that would help.

I like the idea with low level non-shell API which can be used by
utility like ps (or implementation of a new tool to work with complex
namespace hierarchies).  It should fit for troublesooting.  Then there
should be no reason to implement two different APIs for observation from
shell via FS and from applications.

However, maybe it is possible to implement not via new syscall but
by implementation of new symlink in sysfs?  Then both ps-like tool and
CRIU-like tool is able to obtain the ns information by the same means.
Maybe sort of a symlink to a parent namespace or a process which is
inside of the parent namespace?  Then a process may identify IDs using
following steps:

1) identify target NS by walking current procfs
2) do setns(2)/chroot(2)
3) look at procfs to identify target IDs in the target NS

It would be impossible to identify foreign IDs for unprivileged
processes, however.

Thanks,

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] /proc/pid/status: show all sets of pid according to ns

2014-05-29 Thread Vasily Kulikov
On Wed, May 28, 2014 at 23:27 +0400, Pavel Emelyanov wrote:
> On 05/28/2014 10:28 PM, Vasily Kulikov wrote:
> > On Wed, May 28, 2014 at 16:44 +0400, Pavel Emelyanov wrote:
> > It will be simplier
> > to parse the file -- if 'ns_ids' file contains some ID then this ID for
> > every ns can be obtained regardless of the specific ID name (SID, PID,
> > PGID, etc.).
> 
> True, but given a task PID how to determine which pid namespaces it lives in
> to get the idea of how PIDs map to each other? Maybe we need some explicit
> API for converting (ID, NS1, NS2) into (ID)?

AFAIU the idea of the patch is to add a new debugging information which
can be trivially obtained via 'cat /proc/...':

] We need a direct method of getting the pid inside containers.
] If some issues occurred inside container guest, host user
] could not know which process is in trouble just by guest pid:
] the users of container guest only knew the pid inside containers.
] This will bring obstacle for trouble shooting.

A new syscall might complicate trouble shooting by admin.

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] /proc/pid/status: show all sets of pid according to ns

2014-05-29 Thread Vasily Kulikov
On Wed, May 28, 2014 at 23:27 +0400, Pavel Emelyanov wrote:
 On 05/28/2014 10:28 PM, Vasily Kulikov wrote:
  On Wed, May 28, 2014 at 16:44 +0400, Pavel Emelyanov wrote:
  It will be simplier
  to parse the file -- if 'ns_ids' file contains some ID then this ID for
  every ns can be obtained regardless of the specific ID name (SID, PID,
  PGID, etc.).
 
 True, but given a task PID how to determine which pid namespaces it lives in
 to get the idea of how PIDs map to each other? Maybe we need some explicit
 API for converting (ID, NS1, NS2) into (ID)?

AFAIU the idea of the patch is to add a new debugging information which
can be trivially obtained via 'cat /proc/...':

] We need a direct method of getting the pid inside containers.
] If some issues occurred inside container guest, host user
] could not know which process is in trouble just by guest pid:
] the users of container guest only knew the pid inside containers.
] This will bring obstacle for trouble shooting.

A new syscall might complicate trouble shooting by admin.

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] /proc/pid/status: show all sets of pid according to ns

2014-05-29 Thread Vasily Kulikov
On Thu, May 29, 2014 at 13:07 +0400, Pavel Emelyanov wrote:
 On 05/29/2014 09:59 AM, Vasily Kulikov wrote:
  On Wed, May 28, 2014 at 23:27 +0400, Pavel Emelyanov wrote:
  ] We need a direct method of getting the pid inside containers.
  ] If some issues occurred inside container guest, host user
  ] could not know which process is in trouble just by guest pid:
  ] the users of container guest only knew the pid inside containers.
  ] This will bring obstacle for trouble shooting.
  
  A new syscall might complicate trouble shooting by admin.
 
 Pure syscall -- yes. What if we teach the ps and top utilities to show 
 additional
 info? I think that would help.

I like the idea with low level non-shell API which can be used by
utility like ps (or implementation of a new tool to work with complex
namespace hierarchies).  It should fit for troublesooting.  Then there
should be no reason to implement two different APIs for observation from
shell via FS and from applications.

However, maybe it is possible to implement not via new syscall but
by implementation of new symlink in sysfs?  Then both ps-like tool and
CRIU-like tool is able to obtain the ns information by the same means.
Maybe sort of a symlink to a parent namespace or a process which is
inside of the parent namespace?  Then a process may identify IDs using
following steps:

1) identify target NS by walking current procfs
2) do setns(2)/chroot(2)
3) look at procfs to identify target IDs in the target NS

It would be impossible to identify foreign IDs for unprivileged
processes, however.

Thanks,

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] /proc/pid/status: show all sets of pid according to ns

2014-05-29 Thread Vasily Kulikov
On Thu, May 29, 2014 at 15:31 +0400, Pavel Emelyanov wrote:
 On 05/29/2014 03:12 PM, Vasily Kulikov wrote:
  On Thu, May 29, 2014 at 13:07 +0400, Pavel Emelyanov wrote:
  On 05/29/2014 09:59 AM, Vasily Kulikov wrote:
  On Wed, May 28, 2014 at 23:27 +0400, Pavel Emelyanov wrote:
  ] We need a direct method of getting the pid inside containers.
  ] If some issues occurred inside container guest, host user
  ] could not know which process is in trouble just by guest pid:
  ] the users of container guest only knew the pid inside containers.
  ] This will bring obstacle for trouble shooting.
 
  A new syscall might complicate trouble shooting by admin.
 
  Pure syscall -- yes. What if we teach the ps and top utilities to show 
  additional
  info? I think that would help.
  
  I like the idea with low level non-shell API which can be used by
  utility like ps (or implementation of a new tool to work with complex
  namespace hierarchies).  It should fit for troublesooting.  Then there
  should be no reason to implement two different APIs for observation from
  shell via FS and from applications.
 
 Maybe we can reuse the existing kcmp() system call? We would have to store
 the collected pid values in some hash/tree anyway, and kcmp() provides us
 good comparing function for doing this.
 
 Like we can call kcmp(pid1, pid2, KCMP_PID, nsfd1, nsfd2) which will mean
 Are tasks with pid1 in namespace pointed by nsfd1 and with pid2 in namespace
 nsfd2 the same?
 
 What do you think?

kcmp() is not needed, just compare inode numbers:

# ls -il /proc/{43,self}/ns/mnt
208182 lrwxrwxrwx 1 root root 0 мая   29 15:52 /proc/43/ns/mnt - 
mnt:[4026531856]
216556 lrwxrwxrwx 1 root root 0 мая   29 15:57 /proc/self/ns/mnt - 
mnt:[4026531840]

  However, maybe it is possible to implement not via new syscall but
  by implementation of new symlink in sysfs?  Then both ps-like tool and
  CRIU-like tool is able to obtain the ns information by the same means.
  Maybe sort of a symlink to a parent namespace or a process which is
  inside of the parent namespace?  Then a process may identify IDs using
  following steps:
  
  1) identify target NS by walking current procfs
  2) do setns(2)/chroot(2)
  3) look at procfs to identify target IDs in the target NS
 
 Can you elaborate on this? Let's imagine we have picked two tasks with
 init_pid_ns' PIDs being 11 and 12 and we've found out using /proc/pid/ns/pid
 links that they both live in some non-init pid namespace.
 
 Then we have to look at this ns' proc. It says that there are also two 
 tasks -- 2 and 3. How can we determine which pid is which?

Oh, right.  My idea is broken.

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] /proc/pid/status: show all sets of pid according to ns

2014-05-28 Thread Vasily Kulikov
On Wed, May 28, 2014 at 16:44 +0400, Pavel Emelyanov wrote:
> On 05/28/2014 02:24 PM, Chen Hanxiao wrote:
> > We need a direct method of getting the pid inside containers.
> 
> But there's more generic issue -- some day we'll need to know not only
> PIDs as seen from different namespaces, but also SIDs and PGIDs.

Maybe include all per-ns ID in a separate file?  Then the old 'status'
file includes IDs from the current namespace only, the new file (e.g.
'ids' or 'ns_ids') contains only hierarchical IDs which differ from
namespace to namespace for all possible namespaces.  It will be simplier
to parse the file -- if 'ns_ids' file contains some ID then this ID for
every ns can be obtained regardless of the specific ID name (SID, PID,
PGID, etc.).

> 
> > If some issues occurred inside container guest, host user
> > could not know which process is in trouble just by guest pid:
> > the users of container guest only knew the pid inside containers.
> > This will bring obstacle for trouble shooting.
> > 
> > This patch adds two fields:
> > 
> > NStgid and NSpid.
> > 
> > a) In init_pid_ns, nothing changed;
> > 
> > b) In one pidns, will tell the pid inside containers:
> > NStgid: 16289   3
> > NSpid:  16289   3
> > ** Process id is 1628 in level 0, 9 in level 1, 3 in level 2.
> > 
> > c) If pidns is nested, it depends on which pidns are you in.
> > NStgid: 9   3
> > NSpid:  9   3
> > ** Views from level 1

Thanks,

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] /proc/pid/status: show all sets of pid according to ns

2014-05-28 Thread Vasily Kulikov
On Wed, May 28, 2014 at 16:44 +0400, Pavel Emelyanov wrote:
 On 05/28/2014 02:24 PM, Chen Hanxiao wrote:
  We need a direct method of getting the pid inside containers.
 
 But there's more generic issue -- some day we'll need to know not only
 PIDs as seen from different namespaces, but also SIDs and PGIDs.

Maybe include all per-ns ID in a separate file?  Then the old 'status'
file includes IDs from the current namespace only, the new file (e.g.
'ids' or 'ns_ids') contains only hierarchical IDs which differ from
namespace to namespace for all possible namespaces.  It will be simplier
to parse the file -- if 'ns_ids' file contains some ID then this ID for
every ns can be obtained regardless of the specific ID name (SID, PID,
PGID, etc.).

 
  If some issues occurred inside container guest, host user
  could not know which process is in trouble just by guest pid:
  the users of container guest only knew the pid inside containers.
  This will bring obstacle for trouble shooting.
  
  This patch adds two fields:
  
  NStgid and NSpid.
  
  a) In init_pid_ns, nothing changed;
  
  b) In one pidns, will tell the pid inside containers:
  NStgid: 16289   3
  NSpid:  16289   3
  ** Process id is 1628 in level 0, 9 in level 1, 3 in level 2.
  
  c) If pidns is nested, it depends on which pidns are you in.
  NStgid: 9   3
  NSpid:  9   3
  ** Views from level 1

Thanks,

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] /proc/pid/status: show all sets of pid according to ns

2014-05-26 Thread Vasily Kulikov
Hi Chen,

On Mon, May 26, 2014 at 18:05 +0800, Chen Hanxiao wrote:
> We need a direct method of getting the pid inside containers.
> If some issues occurred inside a container guest, host user
> could not know which process is in trouble just by guest pid:
> the users of container guest only knew the pid inside containers.
> This will bring obstacle for trouble shooting.
> 
> This patch expands fields of Tgid and Pid:
> a) In init_pid_ns, nothing changed;
> 
> b) In one pidns, they will tell the pid inside containers:
> Tgid: 16289   3
> Pid:  16289   3
> ** process id is 1628 in level 0, 9 in level 1, 3 in level 2.

1. It breaks ABI.  Any application which does something like "grep pid: | cut 
-d: -f2"
is now broken by the patch.  Maybe add a new field like 'Pid-ns', 'PidNS',
or 'Pids' and leave the old one for compatibility?

2. Is it OK to show internal pids to unprivileged processes?  I cannot
see anything obviously dangerous with it, though.

> c) If pidns is nested, it depends on which pidns are you in.
> Tgid: 9   3
> Pid:  9   3
> ** Views from level 1 for Pid 1628 in host.

-- 
Vasily
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] /proc/pid/status: show all sets of pid according to ns

2014-05-26 Thread Vasily Kulikov
Hi Chen,

On Mon, May 26, 2014 at 18:05 +0800, Chen Hanxiao wrote:
 We need a direct method of getting the pid inside containers.
 If some issues occurred inside a container guest, host user
 could not know which process is in trouble just by guest pid:
 the users of container guest only knew the pid inside containers.
 This will bring obstacle for trouble shooting.
 
 This patch expands fields of Tgid and Pid:
 a) In init_pid_ns, nothing changed;
 
 b) In one pidns, they will tell the pid inside containers:
 Tgid: 16289   3
 Pid:  16289   3
 ** process id is 1628 in level 0, 9 in level 1, 3 in level 2.

1. It breaks ABI.  Any application which does something like grep pid: | cut 
-d: -f2
is now broken by the patch.  Maybe add a new field like 'Pid-ns', 'PidNS',
or 'Pids' and leave the old one for compatibility?

2. Is it OK to show internal pids to unprivileged processes?  I cannot
see anything obviously dangerous with it, though.

 c) If pidns is nested, it depends on which pidns are you in.
 Tgid: 9   3
 Pid:  9   3
 ** Views from level 1 for Pid 1628 in host.

-- 
Vasily
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Devel] call_usermodehelper in containers

2013-11-11 Thread Vasily Kulikov
Hi Jeff,

On Mon, Nov 11, 2013 at 07:18 -0500, Jeff Layton wrote:
> What's the correct approach to fix this? One possibility would be to
> keep a kernel thread around that sits in the correct namespace(s) and
> has the right privileges, and then use that to launch UMH programs.
> That thread could be spawned whenever someone runs rpc.nfsd inside a
> container.
> 
> Not very elegant, but it seems like something that would work.
> 
> Are there better approaches?

What's the reasoning behind this?  I mean, it is not very obvious what
we should keep here.  Compare 2 cases:

1) root process with all caps spawns new ns, then drops some of caps;

2) root process with all caps drops some of his caps and then spawns new ns.

>From the programmer's POV both cases are valid and lead to absolutely
the same limitations inside of the new namespace.  However, from kernel
POV they differ -- if save cap set when ns is created then in (1) we'll
have cap'ed UMH, in (2) we'll have UMH with only several caps.  It might
significantly influence on ability of UMH to do its job and ability of
this limited ns to escape from the sandbox.

So, what semantic should UMH privileges have?


Also, an orthogonal addition: you might want to keep only minimum
information about capabilities or something -- keep only cap_t field in
namespace structure without explicit kernel thread for each ns.  When UMH is
created, just fill the required caps in it.

Thanks,

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Devel] call_usermodehelper in containers

2013-11-11 Thread Vasily Kulikov
Hi Jeff,

On Mon, Nov 11, 2013 at 07:18 -0500, Jeff Layton wrote:
 What's the correct approach to fix this? One possibility would be to
 keep a kernel thread around that sits in the correct namespace(s) and
 has the right privileges, and then use that to launch UMH programs.
 That thread could be spawned whenever someone runs rpc.nfsd inside a
 container.
 
 Not very elegant, but it seems like something that would work.
 
 Are there better approaches?

What's the reasoning behind this?  I mean, it is not very obvious what
we should keep here.  Compare 2 cases:

1) root process with all caps spawns new ns, then drops some of caps;

2) root process with all caps drops some of his caps and then spawns new ns.

From the programmer's POV both cases are valid and lead to absolutely
the same limitations inside of the new namespace.  However, from kernel
POV they differ -- if save cap set when ns is created then in (1) we'll
have cap'ed UMH, in (2) we'll have UMH with only several caps.  It might
significantly influence on ability of UMH to do its job and ability of
this limited ns to escape from the sandbox.

So, what semantic should UMH privileges have?


Also, an orthogonal addition: you might want to keep only minimum
information about capabilities or something -- keep only cap_t field in
namespace structure without explicit kernel thread for each ns.  When UMH is
created, just fill the required caps in it.

Thanks,

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] fs: Limit sys_mount to only request filesystem modules.

2013-03-04 Thread Vasily Kulikov
(cc'ed kernel-hardening)

On Sun, Mar 03, 2013 at 23:51 -0800, Eric W. Biederman wrote:
> Modify the request_module to prefix the file system type with "fs-"
> and add aliases to all of the filesystems that can be built as modules
> to match.
> 
> A common practice is to build all of the kernel code and leave code
> that is not commonly needed as modules, with the result that many
> users are exposed to any bug anywhere in the kernel.
> 
> Looking for filesystems with a fs- prefix limits the pool of possible
> modules that can be loaded by mount to just filesystems trivially
> making things safer with no real cost.
> 
> Using aliases means user space can control the policy of which
> filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
> with blacklist and alias directives.  Allowing simple, safe,
> well understood work-arounds to known problematic software.
> 
> This also addresses a rare but unfortunate problem where the filesystem
> name is not the same as it's module name and module auto-loading
> would not work.  While writing this patch I saw a handful of such
> cases.  The most significant being autofs that lives in the module
> autofs4.
> 
> This is relevant to user namespaces because we can reach the request
> module in get_fs_type() without having any special permissions, and
> people get uncomfortable when a user specified string (in this case
> the filesystem type) goes all of the way to request_module.
> 
> After having looked at this issue I don't think there is any
> particular reason to perform any filtering or permission checks beyond
> making it clear in the module request that we want a filesystem
> module.  The common pattern in the kernel is to call request_module()
> without regards to the users permissions.  In general all a filesystem
> module does once loaded is call register_filesystem() and go to sleep.
> Which means there is not much attack surface exposed by loading a
> filesytem module unless the filesystem is mounted.  In a user
> namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
> which most filesystems do not set today.
> 
> Acked-by: Serge Hallyn 
> Acked-by: Kees Cook 
> Reported-by: Kees Cook 
> Signed-off-by: "Eric W. Biederman" 
...
> diff --git a/fs/filesystems.c b/fs/filesystems.c
> index da165f6..92567d9 100644
> --- a/fs/filesystems.c
> +++ b/fs/filesystems.c
> @@ -273,7 +273,7 @@ struct file_system_type *get_fs_type(const char *name)
>   int len = dot ? dot - name : strlen(name);
>  
>   fs = __get_fs_type(name, len);
> - if (!fs && (request_module("%.*s", len, name) == 0))
> + if (!fs && (request_module("fs-%.*s", len, name) == 0))
>   fs = __get_fs_type(name, len);
>  
>   if (dot && fs && !(fs->fs_flags & FS_HAS_SUBTYPE)) {

Maybe we should divide request_module() into several functions regarding
expected caller's privileges?

- request_module() for CAP_SYS_MODULE in init_ns
- request_module_relaxed() for everybody

request_module_relaxed() is used in get_fs_type(), dev_load() and all
places where the safety of module loading is manually checked.  All old
not yet checked users of request_module() will not be triggerable from user_ns.
That's the same scheme as with capable() and ns_capable().

Thanks,

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] fs: Limit sys_mount to only request filesystem modules.

2013-03-04 Thread Vasily Kulikov
(cc'ed kernel-hardening)

On Sun, Mar 03, 2013 at 23:51 -0800, Eric W. Biederman wrote:
 Modify the request_module to prefix the file system type with fs-
 and add aliases to all of the filesystems that can be built as modules
 to match.
 
 A common practice is to build all of the kernel code and leave code
 that is not commonly needed as modules, with the result that many
 users are exposed to any bug anywhere in the kernel.
 
 Looking for filesystems with a fs- prefix limits the pool of possible
 modules that can be loaded by mount to just filesystems trivially
 making things safer with no real cost.
 
 Using aliases means user space can control the policy of which
 filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
 with blacklist and alias directives.  Allowing simple, safe,
 well understood work-arounds to known problematic software.
 
 This also addresses a rare but unfortunate problem where the filesystem
 name is not the same as it's module name and module auto-loading
 would not work.  While writing this patch I saw a handful of such
 cases.  The most significant being autofs that lives in the module
 autofs4.
 
 This is relevant to user namespaces because we can reach the request
 module in get_fs_type() without having any special permissions, and
 people get uncomfortable when a user specified string (in this case
 the filesystem type) goes all of the way to request_module.
 
 After having looked at this issue I don't think there is any
 particular reason to perform any filtering or permission checks beyond
 making it clear in the module request that we want a filesystem
 module.  The common pattern in the kernel is to call request_module()
 without regards to the users permissions.  In general all a filesystem
 module does once loaded is call register_filesystem() and go to sleep.
 Which means there is not much attack surface exposed by loading a
 filesytem module unless the filesystem is mounted.  In a user
 namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
 which most filesystems do not set today.
 
 Acked-by: Serge Hallyn serge.hal...@canonical.com
 Acked-by: Kees Cook keesc...@chromium.org
 Reported-by: Kees Cook keesc...@google.com
 Signed-off-by: Eric W. Biederman ebied...@xmission.com
...
 diff --git a/fs/filesystems.c b/fs/filesystems.c
 index da165f6..92567d9 100644
 --- a/fs/filesystems.c
 +++ b/fs/filesystems.c
 @@ -273,7 +273,7 @@ struct file_system_type *get_fs_type(const char *name)
   int len = dot ? dot - name : strlen(name);
  
   fs = __get_fs_type(name, len);
 - if (!fs  (request_module(%.*s, len, name) == 0))
 + if (!fs  (request_module(fs-%.*s, len, name) == 0))
   fs = __get_fs_type(name, len);
  
   if (dot  fs  !(fs-fs_flags  FS_HAS_SUBTYPE)) {

Maybe we should divide request_module() into several functions regarding
expected caller's privileges?

- request_module() for CAP_SYS_MODULE in init_ns
- request_module_relaxed() for everybody

request_module_relaxed() is used in get_fs_type(), dev_load() and all
places where the safety of module loading is manually checked.  All old
not yet checked users of request_module() will not be triggerable from user_ns.
That's the same scheme as with capable() and ns_capable().

Thanks,

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: For review: pid_namespaces(7) man page

2013-02-28 Thread Vasily Kulikov
Hi Michael,

On Thu, Feb 28, 2013 at 12:24 +0100, Michael Kerrisk (man-pages) wrote:
>The namespace init process
>The first process created in a new namespace (i.e., the process
>created using clone(2) with the CLONE_NEWPID flag, or the first
>child created by a process after a call to unshare(2) using the
>CLONE_NEWPID flag) has the PID 1, and is the "init" process for
>the namespace (see init(1)).  Children that are orphaned within
>the namespace will be reparented to this  process  rather  than
>init(1).

Probably it worth noting here that this is true unless
prctl() with PR_SET_CHILD_SUBREAPER option is called.

Thanks,

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: For review: pid_namespaces(7) man page

2013-02-28 Thread Vasily Kulikov
Hi Michael,

On Thu, Feb 28, 2013 at 12:24 +0100, Michael Kerrisk (man-pages) wrote:
The namespace init process
The first process created in a new namespace (i.e., the process
created using clone(2) with the CLONE_NEWPID flag, or the first
child created by a process after a call to unshare(2) using the
CLONE_NEWPID flag) has the PID 1, and is the init process for
the namespace (see init(1)).  Children that are orphaned within
the namespace will be reparented to this  process  rather  than
init(1).

Probably it worth noting here that this is true unless
prctl() with PR_SET_CHILD_SUBREAPER option is called.

Thanks,

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH review 1/6] userns: Avoid recursion in put_user_ns

2013-01-28 Thread Vasily Kulikov
On Fri, Jan 25, 2013 at 18:19 -0800, Eric W. Biederman wrote:
> 
> When freeing a deeply nested user namespace free_user_ns calls
> put_user_ns on it's parent which may in turn call free_user_ns again.
> When -fno-optimize-sibling-calls is passed to gcc one stack frame per
> user namespace is left on the stack, potentially overflowing the
> kernel stack.  CONFIG_FRAME_POINTER forces -fno-optimize-sibling-calls
> so we can't count on gcc to optimize this code.
> 
> Remove struct kref and use a plain atomic_t.  Making the code more
> flexible and easier to comprehend.  Make the loop in free_user_ns
> explict to guarantee that the stack does not overflow with
> CONFIG_FRAME_POINTER enabled.
> 
> I have tested this fix with a simple program that uses unshare to
> create a deeply nested user namespace structure and then calls exit.
> With 1000 nesteuser namespaces before this change running my test
> program causes the kernel to die a horrible death.  With 10,000,000
> nested user namespaces after this change my test program runs to
> completion and causes no harm.
> 
> Pointed-out-by: Vasily Kulikov 
> Signed-off-by: "Eric W. Biederman" 

Looks sane, thanks.

Acked-by: Vasily Kulikov 

The second bug I've noted in the same email (OOM) looks like should be
"fixed" by using memcg to limit kernel memory.  So, I'm fine with this
side of user_ns :)

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH review 1/6] userns: Avoid recursion in put_user_ns

2013-01-28 Thread Vasily Kulikov
On Fri, Jan 25, 2013 at 18:19 -0800, Eric W. Biederman wrote:
 
 When freeing a deeply nested user namespace free_user_ns calls
 put_user_ns on it's parent which may in turn call free_user_ns again.
 When -fno-optimize-sibling-calls is passed to gcc one stack frame per
 user namespace is left on the stack, potentially overflowing the
 kernel stack.  CONFIG_FRAME_POINTER forces -fno-optimize-sibling-calls
 so we can't count on gcc to optimize this code.
 
 Remove struct kref and use a plain atomic_t.  Making the code more
 flexible and easier to comprehend.  Make the loop in free_user_ns
 explict to guarantee that the stack does not overflow with
 CONFIG_FRAME_POINTER enabled.
 
 I have tested this fix with a simple program that uses unshare to
 create a deeply nested user namespace structure and then calls exit.
 With 1000 nesteuser namespaces before this change running my test
 program causes the kernel to die a horrible death.  With 10,000,000
 nested user namespaces after this change my test program runs to
 completion and causes no harm.
 
 Pointed-out-by: Vasily Kulikov seg...@openwall.com
 Signed-off-by: Eric W. Biederman ebied...@xmission.com

Looks sane, thanks.

Acked-by: Vasily Kulikov seg...@openwall.com

The second bug I've noted in the same email (OOM) looks like should be
fixed by using memcg to limit kernel memory.  So, I'm fine with this
side of user_ns :)

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Friendlier EPERM e-mail on lkml

2013-01-09 Thread Vasily Kulikov
Hi Eric,

On Wed, Jan 09, 2013 at 11:18 -0500, Eric Paris wrote:
> http://marc.info/?l=linux-kernel=135774748910987=2
> 
> Basic idea is a proc file which tells you WHY the last EPERM from the
> kernel happened.  Was it SELinux?  Was it SMACK?  Was it capabilities?
>  Was it file rwx bits?
> 
> I'd like comments.  I'll cc LSM on any future replies to that thread...

Why does application need it?  Why does anybody but admin need this
information?

Probably this may be implemented as root-only audit messages?  Or
introduce new NETLINK_EPERMREASON netlink family?

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Friendlier EPERM e-mail on lkml

2013-01-09 Thread Vasily Kulikov
Hi Eric,

On Wed, Jan 09, 2013 at 11:18 -0500, Eric Paris wrote:
 http://marc.info/?l=linux-kernelm=135774748910987w=2
 
 Basic idea is a proc file which tells you WHY the last EPERM from the
 kernel happened.  Was it SELinux?  Was it SMACK?  Was it capabilities?
  Was it file rwx bits?
 
 I'd like comments.  I'll cc LSM on any future replies to that thread...

Why does application need it?  Why does anybody but admin need this
information?

Probably this may be implemented as root-only audit messages?  Or
introduce new NETLINK_EPERMREASON netlink family?

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v12 0/9] LSM: Multiple concurrent LSMs

2013-01-07 Thread Vasily Kulikov
On Mon, Jan 07, 2013 at 20:02 -0800, Casey Schaufler wrote:
> On 1/7/2013 7:01 PM, Stephen Rothwell wrote:
> > Let me ask Andrew's question:  Why do you want to do this (what is the
> > use case)?  What does this gain us?
> 
> There has been an amazing amount of development in system security
> over the past three years. Almost none of it has been in the kernel.
> One important reason that it is not getting done in the kernel is
> that the current single LSM restriction requires an all or nothing
> approach to security. Either you address all your needs with a single
> LSM or you have to go with a user space solution, in which case you
> may as well do everything in user space.
[...]

You should also update Documentation/security/LSM.txt with new "security="
rules and rules of LSM stacking limitations.  Motivation of stacking is
probably worth noting in Documentation/ too.

Thanks,

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v12 0/9] LSM: Multiple concurrent LSMs

2013-01-07 Thread Vasily Kulikov
On Mon, Jan 07, 2013 at 20:11 -0800, Casey Schaufler wrote:
> On 1/7/2013 7:59 PM, Stephen Rothwell wrote:
> > You probably also want to think a bit harder about the order of the
> > patches - you should introduce new APIs before you use them and remove
> > calls to functions before you remove the functions.
> >
> The unfortunate reality is that I couldn't find a good way to stage the
> changes. It's a wonking big set of infrastructure change. I could introduce
> the security blob abstraction separately but that is a fraction of the
> change. If it would have gone through mail filters as a single patch I'd
> have sent it that way.
> 
> I can spend time on patch presentation, and will if necessary.

I guess it can be divided this way:

1) Introduce lsm_get_cred(), etc. which unconditionally return
->security, ->i_security, etc.

2) Move all LSMs, procfs, etc. to the new API, a patch per LSM/subsystem.

3) Change structures along with new API.


The pro of the division is that if you have a bug in the series (and you
surely have! ;)) it is MUCH more simple to locate this bug (bisect, etc.).
Also it is more descriptive as you divide LSM changes and the core security
subsystem itself targeted on multiple LSMs which divides LSM
implementations (which might be not very important for someone) and the core
architecture (which is important for everybody).

Thanks,

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v12 0/9] LSM: Multiple concurrent LSMs

2013-01-07 Thread Vasily Kulikov
On Mon, Jan 07, 2013 at 20:11 -0800, Casey Schaufler wrote:
 On 1/7/2013 7:59 PM, Stephen Rothwell wrote:
  You probably also want to think a bit harder about the order of the
  patches - you should introduce new APIs before you use them and remove
  calls to functions before you remove the functions.
 
 The unfortunate reality is that I couldn't find a good way to stage the
 changes. It's a wonking big set of infrastructure change. I could introduce
 the security blob abstraction separately but that is a fraction of the
 change. If it would have gone through mail filters as a single patch I'd
 have sent it that way.
 
 I can spend time on patch presentation, and will if necessary.

I guess it can be divided this way:

1) Introduce lsm_get_cred(), etc. which unconditionally return
-security, -i_security, etc.

2) Move all LSMs, procfs, etc. to the new API, a patch per LSM/subsystem.

3) Change structures along with new API.


The pro of the division is that if you have a bug in the series (and you
surely have! ;)) it is MUCH more simple to locate this bug (bisect, etc.).
Also it is more descriptive as you divide LSM changes and the core security
subsystem itself targeted on multiple LSMs which divides LSM
implementations (which might be not very important for someone) and the core
architecture (which is important for everybody).

Thanks,

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v12 0/9] LSM: Multiple concurrent LSMs

2013-01-07 Thread Vasily Kulikov
On Mon, Jan 07, 2013 at 20:02 -0800, Casey Schaufler wrote:
 On 1/7/2013 7:01 PM, Stephen Rothwell wrote:
  Let me ask Andrew's question:  Why do you want to do this (what is the
  use case)?  What does this gain us?
 
 There has been an amazing amount of development in system security
 over the past three years. Almost none of it has been in the kernel.
 One important reason that it is not getting done in the kernel is
 that the current single LSM restriction requires an all or nothing
 approach to security. Either you address all your needs with a single
 LSM or you have to go with a user space solution, in which case you
 may as well do everything in user space.
[...]

You should also update Documentation/security/LSM.txt with new security=
rules and rules of LSM stacking limitations.  Motivation of stacking is
probably worth noting in Documentation/ too.

Thanks,

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] user_ns: fix missing limiting of user_ns counts

2012-12-30 Thread Vasily Kulikov
On Fri, Dec 28, 2012 at 20:05 -0800, Eric W. Biederman wrote:
> > A related issue which is NOT FIXED HERE is limits for all resources
> > available for containerized pseudo roots.  E.g. I succeeded creating
> > thousands of veth network devices without problems by a non-root user,
> > there seems no limit in number of network devices.  I suspect it is
> > possible to setup routing and net_ns'es the way it will be very
> > time-consuming for kernel to handle IP packets inside of ksoftirq, which
> > is not counted as this user scheduler time.   I suppose the issue is not
> > veth-specific, almost all newly available for unprivileged users code
> > pathes are vulnerable to DoS attacks.
> 
> veth at least should process packets synchronously so I don't see how
> you will get softirq action.

What do you mean -- synchronously?  From my limited understanding of
veth job, it is handled like every network packet in system, via:

veth_xmit() -> dev_forward_skb() -> netif_rx() -> enqueue_to_backlog()

enqueue_to_backlog() adds the packet to softnet_data->input_pkt_queue. 

Then inside of softirq process_backlog() moves ->input_pkt_queue to
->process_queue and calls __netif_receive_skb(), which does all networking
stack magic.

AFAICS, one could create user_ns, net_ns inside of it, and setup routing
tables and netfilter to infinitely pass few network packets from and to
veth, abusing ksoftirq.

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] user_ns: fix missing limiting of user_ns counts

2012-12-30 Thread Vasily Kulikov
On Fri, Dec 28, 2012 at 20:05 -0800, Eric W. Biederman wrote:
  A related issue which is NOT FIXED HERE is limits for all resources
  available for containerized pseudo roots.  E.g. I succeeded creating
  thousands of veth network devices without problems by a non-root user,
  there seems no limit in number of network devices.  I suspect it is
  possible to setup routing and net_ns'es the way it will be very
  time-consuming for kernel to handle IP packets inside of ksoftirq, which
  is not counted as this user scheduler time.   I suppose the issue is not
  veth-specific, almost all newly available for unprivileged users code
  pathes are vulnerable to DoS attacks.
 
 veth at least should process packets synchronously so I don't see how
 you will get softirq action.

What do you mean -- synchronously?  From my limited understanding of
veth job, it is handled like every network packet in system, via:

veth_xmit() - dev_forward_skb() - netif_rx() - enqueue_to_backlog()

enqueue_to_backlog() adds the packet to softnet_data-input_pkt_queue. 

Then inside of softirq process_backlog() moves -input_pkt_queue to
-process_queue and calls __netif_receive_skb(), which does all networking
stack magic.

AFAICS, one could create user_ns, net_ns inside of it, and setup routing
tables and netfilter to infinitely pass few network packets from and to
veth, abusing ksoftirq.

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] user_ns: fix missing limiting of user_ns counts

2012-12-28 Thread Vasily Kulikov
On Fri, Dec 28, 2012 at 20:05 -0800, Eric W. Biederman wrote:
> Vasily Kulikov  writes:
> 
> > Currently there is completely no limiting in number of user namespaces
> > created by unprivileged users.  One can freely create thousands of
> > user_ns'es and exhaust kernel memory without even bumping in
> > RLIMIT_NPROC or similar.
> 
> First for a proper sense of scale it will take roughly 14,000 to consume
> a megabyte.  So it will take hundreds of millions of user namespaces to
> eat up all of kernel memory.

Yes, but you can freely create *any* number of nested userns by a loop:

for() {
unshare()
write to /proc/self/{u,g}id_map
}

> > The code needs several checks.  First, noone should be able to create
> > user_ns of arbitrary depth.  Besides kernel stack overflow one could
> > create too big depth to DoS processes belonging to other users by
> > forcing them to loop a long time in cap_capable called from some
> > ns_capable() (e.g. in case one does smth like "ls -R /proc").
> 
> Where do you get a ns_capable call from "ls -R /proc" ?

E.g. if procfs is mounted with hidepid=2 then ls does
ptrace_may_access() check. 

Thanks,

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] user_ns: fix missing limiting of user_ns counts

2012-12-28 Thread Vasily Kulikov
On Fri, Dec 28, 2012 at 18:43 +, Al Viro wrote:
> On Fri, Dec 28, 2012 at 09:56:27PM +0400, Vasily Kulikov wrote:
> > The included patch is a basic fix for both or them.  Both values are
> > hardcoded here to 100 max depth and 1000 max in total.  I'm not sure how
> > better to make them configurable.  Looks like it needs some sysctl value
> > like kernel.max_user_ns_per_user, but also something more configurable
> > like new rlimit'ish limit may be created for user_ns needs.  E.g. in
> > case root wants one user to contain hundreds of private containers
> > (container owner user), but he doesn't want anybody to fill the kernel
> > with hundreds of containers multiplied by number of system users (equals
> > to thousands).
> 
> I'm sorry, but this is not a solution.  Kernel is not x86-only; there are
> architectures with far bigger minimal stack frame size.  E.g. on sparc64
> every fucking stack frame is at least 176 bytes.  So your 100 calls deep
> call chain will happily overflow the damn stack all by itself - kernel
> stack on sparc64 is 16Kb total, including struct thread_info living there.

Understood.  How to properly fix it then?  Looks like there are quite
many kernel structures which may reference other structures which
indirectly reference each other via kref, IOW it is not user_ns specific
issue.  With unprivileged user_ns the way it should be freed must be
somehow changed.

Thanks,

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH/RFC] user_ns: fix missing limiting of user_ns counts

2012-12-28 Thread Vasily Kulikov
Currently there is completely no limiting in number of user namespaces
created by unprivileged users.  One can freely create thousands of
user_ns'es and exhaust kernel memory without even bumping in
RLIMIT_NPROC or similar.

Even more -- it allows user to overflow kernel stack theoretically
allowing user to overwrite some important kernel data.  The problem is
that free_user_ns() may also free its parent user_namespace recursively
calling free_user_ns().  As kernel stack is very limited, it leads to
kernel stack overflow.

The code needs several checks.  First, noone should be able to create
user_ns of arbitrary depth.  Besides kernel stack overflow one could
create too big depth to DoS processes belonging to other users by
forcing them to loop a long time in cap_capable called from some
ns_capable() (e.g. in case one does smth like "ls -R /proc").  Second,
non-privileged users must not be able to overlimit some count of
namespaces to not be able to exhaust kernel memory.

The included patch is a basic fix for both or them.  Both values are
hardcoded here to 100 max depth and 1000 max in total.  I'm not sure how
better to make them configurable.  Looks like it needs some sysctl value
like kernel.max_user_ns_per_user, but also something more configurable
like new rlimit'ish limit may be created for user_ns needs.  E.g. in
case root wants one user to contain hundreds of private containers
(container owner user), but he doesn't want anybody to fill the kernel
with hundreds of containers multiplied by number of system users (equals
to thousands).

I'm not sure how it is an approved way for user_ns.  Eric?

A related issue which is NOT FIXED HERE is limits for all resources
available for containerized pseudo roots.  E.g. I succeeded creating
thousands of veth network devices without problems by a non-root user,
there seems no limit in number of network devices.  I suspect it is
possible to setup routing and net_ns'es the way it will be very
time-consuming for kernel to handle IP packets inside of ksoftirq, which
is not counted as this user scheduler time.   I suppose the issue is not
veth-specific, almost all newly available for unprivileged users code
pathes are vulnerable to DoS attacks.

Signed-off-by: Vasily Kulikov 
-- 
 include/linux/sched.h   |3 +++
 kernel/user_namespace.c |   26 ++
 2 files changed, 29 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 206bb08..479940e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -706,6 +706,9 @@ struct user_struct {
 #ifdef CONFIG_EPOLL
atomic_long_t epoll_watches; /* The number of file descriptors 
currently watched */
 #endif
+#ifdef CONFIG_USER_NS
+   atomic_t user_namespaces; /* How many user_ns does this user created? */
+#endif
 #ifdef CONFIG_POSIX_MQUEUE
/* protected by mq_lock */
unsigned long mq_bytes; /* How many bytes can be allocated to mqueue? */
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 2b042c4..a52c4e8 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -45,6 +45,16 @@ static void set_cred_user_ns(struct cred *cred, struct 
user_namespace *user_ns)
cred->user_ns = user_ns;
 }
 
+static long get_user_ns_depth(struct user_namespace *ns)
+{
+   long depth;
+
+   for (depth = 1; ns != _user_ns; ns = ns->parent)
+   depth++;
+
+   return depth;
+}
+
 /*
  * Create a new user namespace, deriving the creator from the user in the
  * passed credentials, and replacing that user with the new root user for the
@@ -56,6 +66,7 @@ static void set_cred_user_ns(struct cred *cred, struct 
user_namespace *user_ns)
 int create_user_ns(struct cred *new)
 {
struct user_namespace *ns, *parent_ns = new->user_ns;
+   struct user_struct *user = current->cred->user;
kuid_t owner = new->euid;
kgid_t group = new->egid;
int ret;
@@ -68,6 +79,18 @@ int create_user_ns(struct cred *new)
!kgid_has_mapping(parent_ns, group))
return -EPERM;
 
+   /* Too long user_ns chains, might overflow kernel stack on kref_put() */
+   if (get_user_ns_depth(parent_ns) > 100)
+   return -ENOMEM;
+
+   atomic_inc(>user_namespaces);
+   /* FIXME: probably it's better to configure the number
+*instead of hardcoding 1000 */
+   if (atomic_read(>user_namespaces) > 1000) {
+   atomic_dec(>user_namespaces);
+   return -ENOMEM;
+   }
+
ns = kmem_cache_zalloc(user_ns_cachep, GFP_KERNEL);
if (!ns)
return -ENOMEM;
@@ -108,10 +131,13 @@ void free_user_ns(struct kref *kref)
 {
struct user_namespace *parent, *ns =
container_of(kref, struct user_namespace, kref);
+   struct user_struct *user = find_user(ns->owner);
 
parent = ns->parent;
proc_free_inum(ns->

[PATCH/RFC] user_ns: fix missing limiting of user_ns counts

2012-12-28 Thread Vasily Kulikov
Currently there is completely no limiting in number of user namespaces
created by unprivileged users.  One can freely create thousands of
user_ns'es and exhaust kernel memory without even bumping in
RLIMIT_NPROC or similar.

Even more -- it allows user to overflow kernel stack theoretically
allowing user to overwrite some important kernel data.  The problem is
that free_user_ns() may also free its parent user_namespace recursively
calling free_user_ns().  As kernel stack is very limited, it leads to
kernel stack overflow.

The code needs several checks.  First, noone should be able to create
user_ns of arbitrary depth.  Besides kernel stack overflow one could
create too big depth to DoS processes belonging to other users by
forcing them to loop a long time in cap_capable called from some
ns_capable() (e.g. in case one does smth like ls -R /proc).  Second,
non-privileged users must not be able to overlimit some count of
namespaces to not be able to exhaust kernel memory.

The included patch is a basic fix for both or them.  Both values are
hardcoded here to 100 max depth and 1000 max in total.  I'm not sure how
better to make them configurable.  Looks like it needs some sysctl value
like kernel.max_user_ns_per_user, but also something more configurable
like new rlimit'ish limit may be created for user_ns needs.  E.g. in
case root wants one user to contain hundreds of private containers
(container owner user), but he doesn't want anybody to fill the kernel
with hundreds of containers multiplied by number of system users (equals
to thousands).

I'm not sure how it is an approved way for user_ns.  Eric?

A related issue which is NOT FIXED HERE is limits for all resources
available for containerized pseudo roots.  E.g. I succeeded creating
thousands of veth network devices without problems by a non-root user,
there seems no limit in number of network devices.  I suspect it is
possible to setup routing and net_ns'es the way it will be very
time-consuming for kernel to handle IP packets inside of ksoftirq, which
is not counted as this user scheduler time.   I suppose the issue is not
veth-specific, almost all newly available for unprivileged users code
pathes are vulnerable to DoS attacks.

Signed-off-by: Vasily Kulikov seg...@openwall.com
-- 
 include/linux/sched.h   |3 +++
 kernel/user_namespace.c |   26 ++
 2 files changed, 29 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 206bb08..479940e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -706,6 +706,9 @@ struct user_struct {
 #ifdef CONFIG_EPOLL
atomic_long_t epoll_watches; /* The number of file descriptors 
currently watched */
 #endif
+#ifdef CONFIG_USER_NS
+   atomic_t user_namespaces; /* How many user_ns does this user created? */
+#endif
 #ifdef CONFIG_POSIX_MQUEUE
/* protected by mq_lock */
unsigned long mq_bytes; /* How many bytes can be allocated to mqueue? */
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 2b042c4..a52c4e8 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -45,6 +45,16 @@ static void set_cred_user_ns(struct cred *cred, struct 
user_namespace *user_ns)
cred-user_ns = user_ns;
 }
 
+static long get_user_ns_depth(struct user_namespace *ns)
+{
+   long depth;
+
+   for (depth = 1; ns != init_user_ns; ns = ns-parent)
+   depth++;
+
+   return depth;
+}
+
 /*
  * Create a new user namespace, deriving the creator from the user in the
  * passed credentials, and replacing that user with the new root user for the
@@ -56,6 +66,7 @@ static void set_cred_user_ns(struct cred *cred, struct 
user_namespace *user_ns)
 int create_user_ns(struct cred *new)
 {
struct user_namespace *ns, *parent_ns = new-user_ns;
+   struct user_struct *user = current-cred-user;
kuid_t owner = new-euid;
kgid_t group = new-egid;
int ret;
@@ -68,6 +79,18 @@ int create_user_ns(struct cred *new)
!kgid_has_mapping(parent_ns, group))
return -EPERM;
 
+   /* Too long user_ns chains, might overflow kernel stack on kref_put() */
+   if (get_user_ns_depth(parent_ns)  100)
+   return -ENOMEM;
+
+   atomic_inc(user-user_namespaces);
+   /* FIXME: probably it's better to configure the number
+*instead of hardcoding 1000 */
+   if (atomic_read(user-user_namespaces)  1000) {
+   atomic_dec(user-user_namespaces);
+   return -ENOMEM;
+   }
+
ns = kmem_cache_zalloc(user_ns_cachep, GFP_KERNEL);
if (!ns)
return -ENOMEM;
@@ -108,10 +131,13 @@ void free_user_ns(struct kref *kref)
 {
struct user_namespace *parent, *ns =
container_of(kref, struct user_namespace, kref);
+   struct user_struct *user = find_user(ns-owner);
 
parent = ns-parent;
proc_free_inum(ns-proc_inum);
kmem_cache_free

Re: [PATCH/RFC] user_ns: fix missing limiting of user_ns counts

2012-12-28 Thread Vasily Kulikov
On Fri, Dec 28, 2012 at 18:43 +, Al Viro wrote:
 On Fri, Dec 28, 2012 at 09:56:27PM +0400, Vasily Kulikov wrote:
  The included patch is a basic fix for both or them.  Both values are
  hardcoded here to 100 max depth and 1000 max in total.  I'm not sure how
  better to make them configurable.  Looks like it needs some sysctl value
  like kernel.max_user_ns_per_user, but also something more configurable
  like new rlimit'ish limit may be created for user_ns needs.  E.g. in
  case root wants one user to contain hundreds of private containers
  (container owner user), but he doesn't want anybody to fill the kernel
  with hundreds of containers multiplied by number of system users (equals
  to thousands).
 
 I'm sorry, but this is not a solution.  Kernel is not x86-only; there are
 architectures with far bigger minimal stack frame size.  E.g. on sparc64
 every fucking stack frame is at least 176 bytes.  So your 100 calls deep
 call chain will happily overflow the damn stack all by itself - kernel
 stack on sparc64 is 16Kb total, including struct thread_info living there.

Understood.  How to properly fix it then?  Looks like there are quite
many kernel structures which may reference other structures which
indirectly reference each other via kref, IOW it is not user_ns specific
issue.  With unprivileged user_ns the way it should be freed must be
somehow changed.

Thanks,

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] user_ns: fix missing limiting of user_ns counts

2012-12-28 Thread Vasily Kulikov
On Fri, Dec 28, 2012 at 20:05 -0800, Eric W. Biederman wrote:
 Vasily Kulikov seg...@openwall.com writes:
 
  Currently there is completely no limiting in number of user namespaces
  created by unprivileged users.  One can freely create thousands of
  user_ns'es and exhaust kernel memory without even bumping in
  RLIMIT_NPROC or similar.
 
 First for a proper sense of scale it will take roughly 14,000 to consume
 a megabyte.  So it will take hundreds of millions of user namespaces to
 eat up all of kernel memory.

Yes, but you can freely create *any* number of nested userns by a loop:

for() {
unshare()
write to /proc/self/{u,g}id_map
}

  The code needs several checks.  First, noone should be able to create
  user_ns of arbitrary depth.  Besides kernel stack overflow one could
  create too big depth to DoS processes belonging to other users by
  forcing them to loop a long time in cap_capable called from some
  ns_capable() (e.g. in case one does smth like ls -R /proc).
 
 Where do you get a ns_capable call from ls -R /proc ?

E.g. if procfs is mounted with hidepid=2 then ls does
ptrace_may_access() check. 

Thanks,

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [kernel-hardening] [PATCH 1/2] fs: add link restrictions

2012-08-12 Thread Vasily Kulikov
On Sat, Aug 11, 2012 at 23:34 -0700, Kees Cook wrote:
> On Wed, Aug 8, 2012 at 5:19 AM, Vasily Kulikov  wrote:
> > Hi Kees,
> >
> > On Wed, Jul 25, 2012 at 17:29 -0700, Kees Cook wrote:
> >> +/**
> >> + * safe_hardlink_source - Check for safe hardlink conditions
> >> + * @inode: the source inode to hardlink from
> >> + *
> >> + * Return false if at least one of the following conditions:
> >> + *- inode is not a regular file
> >> + *- inode is setuid
> >> + *- inode is setgid and group-exec
> >> + *- access failure for read and write
> >> + *
> >> + * Otherwise returns true.
> >> + */
> >> +static bool safe_hardlink_source(struct inode *inode)
> >> +{
> >> + umode_t mode = inode->i_mode;
> >> +
> >> + /* Special files should not get pinned to the filesystem. */
> >> + if (!S_ISREG(mode))
> >> + return false;
> >> +
> >> + /* Setuid files should not get pinned to the filesystem. */
> >> + if (mode & S_ISUID)
> >> + return false;
> >
> > We don't want to make hardlinks of SUID files, but we still allow to create
> > hardlinks to SUID'ish cap'ed files.  Probably check whether the inode is
> > setcap'ed?
> 
> Excellent idea. It doesn't look like there is anything "simple" to do
> this already. It'd be close to get_file_caps() but without the bprm.
> Maybe just get_vfs_caps_from_disk() and a walk of the caps? What would
> you recommend?

Yes, I think get_vfs_caps_from_disk() plus identifying whether any permitted
or inheritable capability is set or effective bit is set.  IOW, if there is
_anything_ related to cababilities in the file, it is protected.

> > Probably we can enhance this further and allow LSMs to define whether this
> > particular file is special in LSM's point of view (IOW, it can be able to 
> > move
> > a process to another security domain which is served by LSM).
> 
> Yeah. Perhaps implementing the needed check above with a new security
> check and have commoncaps do the vfs fetch with LSMs able to override?

Yep.  Then uid and gid checks from the above code should move to commoncaps
function as default candidates for "privileged files" checks.

Thanks,

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [kernel-hardening] [PATCH 1/2] fs: add link restrictions

2012-08-12 Thread Vasily Kulikov
On Sat, Aug 11, 2012 at 23:34 -0700, Kees Cook wrote:
 On Wed, Aug 8, 2012 at 5:19 AM, Vasily Kulikov seg...@openwall.com wrote:
  Hi Kees,
 
  On Wed, Jul 25, 2012 at 17:29 -0700, Kees Cook wrote:
  +/**
  + * safe_hardlink_source - Check for safe hardlink conditions
  + * @inode: the source inode to hardlink from
  + *
  + * Return false if at least one of the following conditions:
  + *- inode is not a regular file
  + *- inode is setuid
  + *- inode is setgid and group-exec
  + *- access failure for read and write
  + *
  + * Otherwise returns true.
  + */
  +static bool safe_hardlink_source(struct inode *inode)
  +{
  + umode_t mode = inode-i_mode;
  +
  + /* Special files should not get pinned to the filesystem. */
  + if (!S_ISREG(mode))
  + return false;
  +
  + /* Setuid files should not get pinned to the filesystem. */
  + if (mode  S_ISUID)
  + return false;
 
  We don't want to make hardlinks of SUID files, but we still allow to create
  hardlinks to SUID'ish cap'ed files.  Probably check whether the inode is
  setcap'ed?
 
 Excellent idea. It doesn't look like there is anything simple to do
 this already. It'd be close to get_file_caps() but without the bprm.
 Maybe just get_vfs_caps_from_disk() and a walk of the caps? What would
 you recommend?

Yes, I think get_vfs_caps_from_disk() plus identifying whether any permitted
or inheritable capability is set or effective bit is set.  IOW, if there is
_anything_ related to cababilities in the file, it is protected.

  Probably we can enhance this further and allow LSMs to define whether this
  particular file is special in LSM's point of view (IOW, it can be able to 
  move
  a process to another security domain which is served by LSM).
 
 Yeah. Perhaps implementing the needed check above with a new security
 check and have commoncaps do the vfs fetch with LSMs able to override?

Yep.  Then uid and gid checks from the above code should move to commoncaps
function as default candidates for privileged files checks.

Thanks,

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [kernel-hardening] [PATCH 1/2] fs: add link restrictions

2012-08-08 Thread Vasily Kulikov
Hi Kees,

On Wed, Jul 25, 2012 at 17:29 -0700, Kees Cook wrote:
> +/**
> + * safe_hardlink_source - Check for safe hardlink conditions
> + * @inode: the source inode to hardlink from
> + *
> + * Return false if at least one of the following conditions:
> + *- inode is not a regular file
> + *- inode is setuid
> + *- inode is setgid and group-exec
> + *- access failure for read and write
> + *
> + * Otherwise returns true.
> + */
> +static bool safe_hardlink_source(struct inode *inode)
> +{
> + umode_t mode = inode->i_mode;
> +
> + /* Special files should not get pinned to the filesystem. */
> + if (!S_ISREG(mode))
> + return false;
> +
> + /* Setuid files should not get pinned to the filesystem. */
> + if (mode & S_ISUID)
> + return false;

We don't want to make hardlinks of SUID files, but we still allow to create
hardlinks to SUID'ish cap'ed files.  Probably check whether the inode is
setcap'ed?

Probably we can enhance this further and allow LSMs to define whether this
particular file is special in LSM's point of view (IOW, it can be able to move
a process to another security domain which is served by LSM).

> +
> + /* Executable setgid files should not get pinned to the filesystem. */
> + if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP))
> + return false;
> +
> + /* Hardlinking to unreadable or unwritable sources is dangerous. */
> + if (inode_permission(inode, MAY_READ | MAY_WRITE))
> + return false;
> +
> + return true;
> +}

Thanks,

-- 
Vasily
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [kernel-hardening] [PATCH 1/2] fs: add link restrictions

2012-08-08 Thread Vasily Kulikov
Hi Kees,

On Wed, Jul 25, 2012 at 17:29 -0700, Kees Cook wrote:
 +/**
 + * safe_hardlink_source - Check for safe hardlink conditions
 + * @inode: the source inode to hardlink from
 + *
 + * Return false if at least one of the following conditions:
 + *- inode is not a regular file
 + *- inode is setuid
 + *- inode is setgid and group-exec
 + *- access failure for read and write
 + *
 + * Otherwise returns true.
 + */
 +static bool safe_hardlink_source(struct inode *inode)
 +{
 + umode_t mode = inode-i_mode;
 +
 + /* Special files should not get pinned to the filesystem. */
 + if (!S_ISREG(mode))
 + return false;
 +
 + /* Setuid files should not get pinned to the filesystem. */
 + if (mode  S_ISUID)
 + return false;

We don't want to make hardlinks of SUID files, but we still allow to create
hardlinks to SUID'ish cap'ed files.  Probably check whether the inode is
setcap'ed?

Probably we can enhance this further and allow LSMs to define whether this
particular file is special in LSM's point of view (IOW, it can be able to move
a process to another security domain which is served by LSM).

 +
 + /* Executable setgid files should not get pinned to the filesystem. */
 + if ((mode  (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP))
 + return false;
 +
 + /* Hardlinking to unreadable or unwritable sources is dangerous. */
 + if (inode_permission(inode, MAY_READ | MAY_WRITE))
 + return false;
 +
 + return true;
 +}

Thanks,

-- 
Vasily
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] ip: fix error handling in ip_finish_output2()

2012-08-06 Thread Vasily Kulikov
__neigh_create() returns either a pointer to struct neighbour or PTR_ERR().
But the caller expects it to return either a pointer or NULL.  Replace
the NULL check with IS_ERR() check.

The bug was introduced in a263b3093641fb1ec377582c90986a7fd0625184.

Signed-off-by: Vasily Kulikov 
---
 Compile tested only.

 net/ipv4/ip_output.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index ba39a52..76dde25 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -197,7 +197,7 @@ static inline int ip_finish_output2(struct sk_buff *skb)
neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
if (unlikely(!neigh))
neigh = __neigh_create(_tbl, , dev, false);
-   if (neigh) {
+   if (!IS_ERR(neigh)) {
int res = dst_neigh_output(dst, neigh, skb);
 
rcu_read_unlock_bh();
-- 
1.7.0.4
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] ip: fix error handling in ip_finish_output2()

2012-08-06 Thread Vasily Kulikov
__neigh_create() returns either a pointer to struct neighbour or PTR_ERR().
But the caller expects it to return either a pointer or NULL.  Replace
the NULL check with IS_ERR() check.

The bug was introduced in a263b3093641fb1ec377582c90986a7fd0625184.

Signed-off-by: Vasily Kulikov seg...@openwall.com
---
 Compile tested only.

 net/ipv4/ip_output.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index ba39a52..76dde25 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -197,7 +197,7 @@ static inline int ip_finish_output2(struct sk_buff *skb)
neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
if (unlikely(!neigh))
neigh = __neigh_create(arp_tbl, nexthop, dev, false);
-   if (neigh) {
+   if (!IS_ERR(neigh)) {
int res = dst_neigh_output(dst, neigh, skb);
 
rcu_read_unlock_bh();
-- 
1.7.0.4
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/