Re: [PATCH] bcache: dont reset bio opf in bch_data_insert_start

2021-01-28 Thread Dongsheng Yang

Hi Christoph,

On 2021/1/28 (Thursday) 5:10 PM, Dongsheng Yang wrote:

Hi Christoph,

On 2021/1/28 (Thursday) 1:37 AM, Christoph Hellwig wrote:

But the old code is also completely broken.  We can't just OR in
the op, as that implicitly assumes the old op was 0 (REQ_OP_READ).



Yes, indeed, there is an assumption that the op can only be 0
(REQ_OP_READ) or 1 (REQ_OP_WRITE).

REQ_OP_WRITE comes from cached_dev_submit_bio(), i.e. bios submitted by
the upper-layer user.

REQ_OP_READ comes from bcache itself, e.g. cached_dev_read_done(): on a
cache miss we read the data from the backing device and then want to
insert it into the cache device, so a read bio carrying that data
reaches here and we need to set the bio op to REQ_OP_WRITE and send the
bio to the cache device.
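
For illustration, a minimal sketch of the two ways to switch the op
(using the standard REQ_OP_* encoding, where REQ_OP_READ is 0):

    /* OR-ing in the op only works because the old op is REQ_OP_READ (0): */
    n->bi_opf |= REQ_OP_WRITE;      /* read -> write, other flags kept */

    /* The general-purpose form clears the op bits first, so it is safe
     * for any old op while still keeping the REQ_* flags: */
    n->bi_opf = (n->bi_opf & ~REQ_OP_MASK) | REQ_OP_WRITE;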


Please fix this to explicitly set the exact op and flags that you want
instead of this fragile magic.


This commit only wants to fix the logic bug introduced in ad0d9e76a412
("bcache: use bio op accessors"), so it is more like a partial revert.

I agree that we can make it clearer and more explicit.

But I found there is no accessor to set the op only; besides,
bio_set_op_attrs() is marked as obsolete.


There are other places doing similar things, for example:

blk_rq_map_kern():

bio->bi_opf &= ~REQ_OP_MASK;

bio->bi_opf |= req_op(rq);


So what about the following:

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index c7cadaafa947..bacc7366002f 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -244,7 +244,14 @@ static void bch_data_insert_start(struct closure *cl)
     trace_bcache_cache_insert(k);
     bch_keylist_push(&op->insert_keys);

-   bio_set_op_attrs(n, REQ_OP_WRITE, 0);
+   /*
+    * n here would be REQ_OP_READ, if
+    * we are inserting data read from
+    * backing device in cache miss or
+    * inserting data in movinggc.
+    */
+   n->bi_opf &= ~REQ_OP_MASK;
+   n->bi_opf |= REQ_OP_WRITE;
     bch_submit_bbio(n, op->c, k, 0);
     } while (n != bio);


Another solution is to introduce an accessor that sets the op only,
something like bio_set_op(). Then we could keep the bcache patch as it
was to fix the bug.


And send another patch to introduce bio_set_op():

diff --git a/block/blk-map.c b/block/blk-map.c
index 6e804892d5ec..83bc33a59fa5 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -587,9 +587,7 @@ static int __blk_rq_map_user_iov(struct request *rq,
if (IS_ERR(bio))
return PTR_ERR(bio);

-   bio->bi_opf &= ~REQ_OP_MASK;
-   bio->bi_opf |= req_op(rq);
-
+   bio_set_op(bio, req_op(rq));
orig_bio = bio;

/*
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index eb734f7ddaac..d8839300805e 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -244,7 +244,13 @@ static void bch_data_insert_start(struct closure *cl)
trace_bcache_cache_insert(k);
bch_keylist_push(&op->insert_keys);

-   n->bi_opf |= REQ_OP_WRITE;
+   /*
+* n here would be REQ_OP_READ, if
+* we are inserting data read from
+* backing device in cache miss or
+* inserting data in movinggc.
+*/
+   bio_set_op(n, REQ_OP_WRITE);
bch_submit_bbio(n, op->c, k, 0);
} while (n != bio);

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index b3fc5d3dd8ea..2affd3269bdc 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -439,6 +439,12 @@ static inline void bio_set_op_attrs(struct bio 
*bio, unsigned op,

bio->bi_opf = op | op_flags;
 }

+static inline void bio_set_op(struct bio *bio, unsigned op)
+{
+   bio->bi_opf &= ~REQ_OP_MASK;
+   bio->bi_opf |= op;
+}
+
 static inline bool op_is_write(unsigned int op)
 {
return (op & 1);




Thanx

Yang



Re: [PATCH] bcache: dont reset bio opf in bch_data_insert_start

2021-01-28 Thread Dongsheng Yang

Hi Christoph,

On 2021/1/28 (Thursday) 1:37 AM, Christoph Hellwig wrote:

But the old code is also completely broken.  We can't just OR in
the op, as that implicitly assumes the old op was 0 (REQ_OP_READ).



Yes, indeed, there is an assumption that the op can only be 0
(REQ_OP_READ) or 1 (REQ_OP_WRITE).

REQ_OP_WRITE comes from cached_dev_submit_bio(), i.e. bios submitted by
the upper-layer user.

REQ_OP_READ comes from bcache itself, e.g. cached_dev_read_done(): on a
cache miss we read the data from the backing device and then want to
insert it into the cache device, so a read bio carrying that data
reaches here and we need to set the bio op to REQ_OP_WRITE and send the
bio to the cache device.


Please fix this to explicitly set the exact op and flags that you want
instead of this fragile magic.


This commit only wants to fix the logic bug introduced in ad0d9e76a412
("bcache: use bio op accessors"), so it is more like a partial revert.

I agree that we can make it clearer and more explicit.

But I found there is no accessor to set the op only; besides,
bio_set_op_attrs() is marked as obsolete.


There are other places doing similar things, for example:

blk_rq_map_kern():

bio->bi_opf &= ~REQ_OP_MASK;

bio->bi_opf |= req_op(rq);


So what about the following:

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index c7cadaafa947..bacc7366002f 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -244,7 +244,14 @@ static void bch_data_insert_start(struct closure *cl)
    trace_bcache_cache_insert(k);
    bch_keylist_push(&op->insert_keys);

-   bio_set_op_attrs(n, REQ_OP_WRITE, 0);
+   /*
+    * n here would be REQ_OP_READ, if
+    * we are inserting data read from
+    * backing device in cache miss or
+    * inserting data in movinggc.
+    */
+   n->bi_opf &= ~REQ_OP_MASK;
+   n->bi_opf |= REQ_OP_WRITE;
    bch_submit_bbio(n, op->c, k, 0);
    } while (n != bio);


Thanx

Yang



Re: [PATCH] bcache: dont reset bio opf in bch_data_insert_start

2021-01-27 Thread Dongsheng Yang



On 2021/1/26 (Tuesday) 12:34 PM, Coly Li wrote:

On 1/26/21 12:32 PM, Dongsheng Yang wrote:

On 2021/1/25 (Monday) 12:53 PM, Coly Li wrote:

On 1/25/21 12:29 PM, Dongsheng Yang wrote:

commit ad0d9e76a412 ("bcache: use bio op accessors") made bi_opf be
set via bio_set_op_attrs(). But there is a logical problem in this
commit:

    trace_bcache_cache_insert(k);
    bch_keylist_push(&op->insert_keys);

-   n->bi_rw |= REQ_WRITE;
+   bio_set_op_attrs(n, REQ_OP_WRITE, 0);
    bch_submit_bbio(n, op->c, k, 0);
    } while (n != bio);

The old code added REQ_WRITE into bio n and kept the other flags; the
new code sets REQ_OP_WRITE in bi_opf but resets all the other flags.

This problem was discovered in our performance testing:
(1) start a fio with 1M x 128depth for read on /dev/nvme0n1p1
(2) start a fio with 1M x 128depth for write on /dev/escache0 (cache
device is /dev/nvme0n1p2)

We found the read BW is 2000+ MB/s, but the write BW is only
0-100 MB/s. After some debugging, we found the problem is that IO
submission for writes is very slow.

bch_data_insert_start() inserts a bio to /dev/nvme0n1p1, but since
cached_dev is a stacked device the bio will be added into
current->bio_list and the submission returns. Then __submit_bio_noacct()
will submit the new bio in bio_list to /dev/nvme0n1p1. This operation
turns out to be slow in blk_mq_submit_bio() -> rq_qos_throttle(q, bio);

The rq_qos_throttle() will call wbt_should_throttle(),
static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio
*bio)
{
  switch (bio_op(bio)) {
  case REQ_OP_WRITE:
  /*
   * Don't throttle WRITE_ODIRECT
   */
  if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
  (REQ_SYNC | REQ_IDLE))
  return false;
... ...
}

Since bio_set_op_attrs() resets (REQ_SYNC | REQ_IDLE), this write
bio will be considered a non-direct write.

After this fix, the bio to nvme will be flagged as (REQ_SYNC | REQ_IDLE),
and fio for writing will get about 1000 MB/s bandwidth.
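
To illustrate the difference (a minimal sketch, not part of the patch,
assuming the standard REQ_OP_* and REQ_* definitions):

    /*
     * Starting from bi_opf == REQ_OP_READ | REQ_SYNC | REQ_IDLE:
     *
     *   bio_set_op_attrs(n, REQ_OP_WRITE, 0);
     *       -> bi_opf == REQ_OP_WRITE (REQ_SYNC and REQ_IDLE are lost)
     *
     *   n->bi_opf |= REQ_OP_WRITE;
     *       -> bi_opf == REQ_OP_WRITE | REQ_SYNC | REQ_IDLE (flags kept)
     */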

Fixes: ad0d9e76a412470800c04fc4b56fc86c02d6

It should be,
Fixes: ad0d9e76a412 ("bcache: use bio op accessors")


Signed-off-by: Dongsheng Yang

Please CC the fixed patch's author, Mike Christie.


Hi Coly,

     Should I send a V2 for the commit message update, or could you help
to fix it when you take it from the mailing list?


Yes, please fix it in the v2 version. And let's wait for a response from
Mike; maybe he has a better suggestion for the fix.



Okay. Actually, Mike is in the cc list of my first mail (though not noted
in the commit message), so he can receive my patch.

But anyway, I will send a v2.



Thanks.

Coly Li



[PATCH V2] bcache: dont reset bio opf in bch_data_insert_start

2021-01-26 Thread Dongsheng Yang
commit ad0d9e76a412 ("bcache: use bio op accessors") made bi_opf be
set via bio_set_op_attrs(). But there is a logical problem in this
commit:

    trace_bcache_cache_insert(k);
    bch_keylist_push(&op->insert_keys);

-   n->bi_rw |= REQ_WRITE;
+   bio_set_op_attrs(n, REQ_OP_WRITE, 0);
    bch_submit_bbio(n, op->c, k, 0);
    } while (n != bio);

The old code added REQ_WRITE into bio n and kept the other flags; the
new code sets REQ_OP_WRITE in bi_opf but resets all the other flags.

This problem was discovered in our performance testing:
(1) start a fio with 1M x 128depth for read on /dev/nvme0n1p1
(2) start a fio with 1M x 128depth for write on /dev/escache0 (cache
device is /dev/nvme0n1p2)

We found the read BW is 2000+ MB/s, but the write BW is only
0-100 MB/s. After some debugging, we found the problem is that IO
submission for writes is very slow.

bch_data_insert_start() inserts a bio to /dev/nvme0n1p1, but since
cached_dev is a stacked device the bio will be added into
current->bio_list and the submission returns. Then __submit_bio_noacct()
will submit the new bio in bio_list to /dev/nvme0n1p1. This operation
turns out to be slow in blk_mq_submit_bio() -> rq_qos_throttle(q, bio);

The rq_qos_throttle() will call wbt_should_throttle(),
static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
{
switch (bio_op(bio)) {
case REQ_OP_WRITE:
/*
 * Don't throttle WRITE_ODIRECT
 */
if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
(REQ_SYNC | REQ_IDLE))
return false;
... ...
}

Since bio_set_op_attrs() resets (REQ_SYNC | REQ_IDLE), this write
bio will be considered a non-direct write.

After this fix, the bio to nvme will be flagged as (REQ_SYNC | REQ_IDLE),
and fio for writing will get about 1000 MB/s bandwidth.

Fixes: ad0d9e76a412 ("bcache: use bio op accessors")
Close: EAS-60259
CC: Mike Christie 
Signed-off-by: Dongsheng Yang 
---
 drivers/md/bcache/request.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index c7cadaafa947..eb734f7ddaac 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -244,7 +244,7 @@ static void bch_data_insert_start(struct closure *cl)
trace_bcache_cache_insert(k);
bch_keylist_push(&op->insert_keys);
 
-   bio_set_op_attrs(n, REQ_OP_WRITE, 0);
+   n->bi_opf |= REQ_OP_WRITE;
bch_submit_bbio(n, op->c, k, 0);
} while (n != bio);
 
-- 
2.25.1



[PATCH V3] bcache: dont reset bio opf in bch_data_insert_start

2021-01-26 Thread Dongsheng Yang
commit ad0d9e76a412 ("bcache: use bio op accessors") made bi_opf be
set via bio_set_op_attrs(). But there is a logical problem in this
commit:

    trace_bcache_cache_insert(k);
    bch_keylist_push(&op->insert_keys);

-   n->bi_rw |= REQ_WRITE;
+   bio_set_op_attrs(n, REQ_OP_WRITE, 0);
    bch_submit_bbio(n, op->c, k, 0);
    } while (n != bio);

The old code added REQ_WRITE into bio n and kept the other flags; the
new code sets REQ_OP_WRITE in bi_opf but resets all the other flags.

This problem was discovered in our performance testing:
(1) start a fio with 1M x 128depth for read on /dev/nvme0n1p1
(2) start a fio with 1M x 128depth for write on /dev/escache0 (cache
device is /dev/nvme0n1p2)

We found the read BW is 2000+ MB/s, but the write BW is only
0-100 MB/s. After some debugging, we found the problem is that IO
submission for writes is very slow.

bch_data_insert_start() inserts a bio to /dev/nvme0n1p1, but since
cached_dev is a stacked device the bio will be added into
current->bio_list and the submission returns. Then __submit_bio_noacct()
will submit the new bio in bio_list to /dev/nvme0n1p1. This operation
turns out to be slow in blk_mq_submit_bio() -> rq_qos_throttle(q, bio);

The rq_qos_throttle() will call wbt_should_throttle(),
static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
{
switch (bio_op(bio)) {
case REQ_OP_WRITE:
/*
 * Don't throttle WRITE_ODIRECT
 */
if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
(REQ_SYNC | REQ_IDLE))
return false;
... ...
}

Since bio_set_op_attrs() resets (REQ_SYNC | REQ_IDLE), this write
bio will be considered a non-direct write.

After this fix, the bio to nvme will be flagged as (REQ_SYNC | REQ_IDLE),
and fio for writing will get about 1000 MB/s bandwidth.

Fixes: ad0d9e76a412 ("bcache: use bio op accessors")
CC: Mike Christie 
Signed-off-by: Dongsheng Yang 
---
V2 -> V3:
removed an unused Close: line from the commit message.

 drivers/md/bcache/request.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index c7cadaafa947..eb734f7ddaac 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -244,7 +244,7 @@ static void bch_data_insert_start(struct closure *cl)
trace_bcache_cache_insert(k);
bch_keylist_push(&op->insert_keys);
 
-   bio_set_op_attrs(n, REQ_OP_WRITE, 0);
+   n->bi_opf |= REQ_OP_WRITE;
bch_submit_bbio(n, op->c, k, 0);
} while (n != bio);
 
-- 
2.25.1



Re: [PATCH] bcache: dont reset bio opf in bch_data_insert_start

2021-01-26 Thread Dongsheng Yang



On 2021/1/25 (Monday) 12:53 PM, Coly Li wrote:

On 1/25/21 12:29 PM, Dongsheng Yang wrote:

commit ad0d9e76a412 ("bcache: use bio op accessors") made bi_opf be
set via bio_set_op_attrs(). But there is a logical problem in this
commit:

    trace_bcache_cache_insert(k);
    bch_keylist_push(&op->insert_keys);

-   n->bi_rw |= REQ_WRITE;
+   bio_set_op_attrs(n, REQ_OP_WRITE, 0);
    bch_submit_bbio(n, op->c, k, 0);
    } while (n != bio);

The old code added REQ_WRITE into bio n and kept the other flags; the
new code sets REQ_OP_WRITE in bi_opf but resets all the other flags.

This problem was discovered in our performance testing:
(1) start a fio with 1M x 128depth for read on /dev/nvme0n1p1
(2) start a fio with 1M x 128depth for write on /dev/escache0 (cache
device is /dev/nvme0n1p2)

We found the read BW is 2000+ MB/s, but the write BW is only
0-100 MB/s. After some debugging, we found the problem is that IO
submission for writes is very slow.

bch_data_insert_start() inserts a bio to /dev/nvme0n1p1, but since
cached_dev is a stacked device the bio will be added into
current->bio_list and the submission returns. Then __submit_bio_noacct()
will submit the new bio in bio_list to /dev/nvme0n1p1. This operation
turns out to be slow in blk_mq_submit_bio() -> rq_qos_throttle(q, bio);

The rq_qos_throttle() will call wbt_should_throttle(),
static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
{
 switch (bio_op(bio)) {
 case REQ_OP_WRITE:
 /*
  * Don't throttle WRITE_ODIRECT
  */
 if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
 (REQ_SYNC | REQ_IDLE))
 return false;
... ...
}

Since bio_set_op_attrs() resets (REQ_SYNC | REQ_IDLE), this write
bio will be considered a non-direct write.

After this fix, the bio to nvme will be flagged as (REQ_SYNC | REQ_IDLE),
and fio for writing will get about 1000 MB/s bandwidth.

Fixes: ad0d9e76a412470800c04fc4b56fc86c02d6

It should be,
Fixes: ad0d9e76a412 ("bcache: use bio op accessors")


Signed-off-by: Dongsheng Yang

Please CC the fixed patch's author, Mike Christie.



Hi Coly,

    Should I send a V2 for the commit message update, or could you help
to fix it when you take it from the mailing list?


Thanx

Yang


---
  drivers/md/bcache/request.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index c7cadaafa947..eb734f7ddaac 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -244,7 +244,7 @@ static void bch_data_insert_start(struct closure *cl)
trace_bcache_cache_insert(k);
bch_keylist_push(&op->insert_keys);
  
-		bio_set_op_attrs(n, REQ_OP_WRITE, 0);

+   n->bi_opf |= REQ_OP_WRITE;
bch_submit_bbio(n, op->c, k, 0);
} while (n != bio);

The fix is OK to me; I'd like to see Mike Christie's opinion too.

Thanks for the catch.

Coly Li


[PATCH] bcache: dont reset bio opf in bch_data_insert_start

2021-01-24 Thread Dongsheng Yang
commit ad0d9e76a412 ("bcache: use bio op accessors") made bi_opf be
set via bio_set_op_attrs(). But there is a logical problem in this
commit:

    trace_bcache_cache_insert(k);
    bch_keylist_push(&op->insert_keys);

-   n->bi_rw |= REQ_WRITE;
+   bio_set_op_attrs(n, REQ_OP_WRITE, 0);
    bch_submit_bbio(n, op->c, k, 0);
    } while (n != bio);

The old code added REQ_WRITE into bio n and kept the other flags; the
new code sets REQ_OP_WRITE in bi_opf but resets all the other flags.

This problem was discovered in our performance testing:
(1) start a fio with 1M x 128depth for read on /dev/nvme0n1p1
(2) start a fio with 1M x 128depth for write on /dev/escache0 (cache
device is /dev/nvme0n1p2)

We found the read BW is 2000+ MB/s, but the write BW is only
0-100 MB/s. After some debugging, we found the problem is that IO
submission for writes is very slow.

bch_data_insert_start() inserts a bio to /dev/nvme0n1p1, but since
cached_dev is a stacked device the bio will be added into
current->bio_list and the submission returns. Then __submit_bio_noacct()
will submit the new bio in bio_list to /dev/nvme0n1p1. This operation
turns out to be slow in blk_mq_submit_bio() -> rq_qos_throttle(q, bio);

The rq_qos_throttle() will call wbt_should_throttle(),
static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
{
switch (bio_op(bio)) {
case REQ_OP_WRITE:
/*
 * Don't throttle WRITE_ODIRECT
 */
if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
(REQ_SYNC | REQ_IDLE))
return false;
... ...
}

Since bio_set_op_attrs() resets (REQ_SYNC | REQ_IDLE), this write
bio will be considered a non-direct write.

After this fix, the bio to nvme will be flagged as (REQ_SYNC | REQ_IDLE),
and fio for writing will get about 1000 MB/s bandwidth.

Fixes: ad0d9e76a412470800c04fc4b56fc86c02d6
Signed-off-by: Dongsheng Yang 
---
 drivers/md/bcache/request.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index c7cadaafa947..eb734f7ddaac 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -244,7 +244,7 @@ static void bch_data_insert_start(struct closure *cl)
trace_bcache_cache_insert(k);
bch_keylist_push(&op->insert_keys);
 
-   bio_set_op_attrs(n, REQ_OP_WRITE, 0);
+   n->bi_opf |= REQ_OP_WRITE;
bch_submit_bbio(n, op->c, k, 0);
} while (n != bio);
 
-- 
2.25.1



Re: [PATCH] bcache: consider the fragmentation when update the writeback rate

2020-12-08 Thread Dongsheng Yang



On 2020/12/9 (Wednesday) 12:48 PM, Dongdong Tao wrote:

Hi Dongsheng,

I'm working on it; as the next step I'm gathering some testing data to
upload (very sorry for the delay...).
Thanks for the comment.
One of the main concerns in alleviating this issue through the writeback
process is that we need to minimize the impact on client IO
performance.
writeback_percent is 10 by default, and starting writeback when dirty
buckets reach 10 percent might be a bit too aggressive, as
writeback_cutoff_sync is 70 percent.
So I chose to start the writeback when dirty buckets reach 50 percent,
so that this patch only takes effect once the dirty bucket percentage
is above that.


Agreed that it would be too aggressive to reuse writeback_percent, and
that would be less flexible.

Okay, let's wait for your testing results.


Thanx



Thanks,
Dongdong




On Wed, Dec 9, 2020 at 10:27 AM Dongsheng Yang
 wrote:


On 2020/11/3 (Tuesday) 8:42 PM, Dongdong Tao wrote:

From: dongdong tao 

The current way of calculating the writeback rate only considers the
dirty sectors. This usually works fine when the fragmentation is not
high, but it gives us an unreasonably small rate when very few dirty
sectors consume a lot of dirty buckets. In some cases the dirty buckets
can reach CUTOFF_WRITEBACK_SYNC while the dirty data (sectors) has not
even reached writeback_percent; the writeback rate will still be the
minimum value (4k), and thus all the writes get stuck in a
non-writeback mode because of the slow writeback.

This patch tries to accelerate the writeback rate when the
fragmentation is high. It calculates the proportional_scaled value
based on:
(dirty_sectors / writeback_rate_p_term_inverse) * fragment
As we can see, higher fragmentation results in a larger
proportional_scaled value and thus a larger writeback rate.
The fragment value is calculated as:
(dirty_buckets * bucket_size) / dirty_sectors
If you think about it, the value of fragment will always be
inside [1, bucket_size].
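
As a rough worked example (numbers picked purely for illustration): with
a bucket size of 1024 sectors, 40000 dirty buckets holding only 80000
dirty sectors give

    fragment = (40000 * 1024) / 80000 = 512

so the proportional term is scaled up by 512 compared with looking at
the dirty sectors alone.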

This patch only considers the fragmentation when the number of
dirty_buckets has reached a dirty threshold (configurable via
writeback_fragment_percent, default 50), so bcache keeps the original
behaviour until the dirty buckets reach the threshold.

Signed-off-by: dongdong tao 
---
   drivers/md/bcache/bcache.h|  1 +
   drivers/md/bcache/sysfs.c |  6 ++
   drivers/md/bcache/writeback.c | 21 +
   3 files changed, 28 insertions(+)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 1d57f48307e6..87632f7032b6 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -374,6 +374,7 @@ struct cached_dev {
 	unsigned int		writeback_metadata:1;
 	unsigned int		writeback_running:1;
 	unsigned char		writeback_percent;
+	unsigned char		writeback_fragment_percent;
 	unsigned int		writeback_delay;
 
 	uint64_t		writeback_rate_target;
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 554e3afc9b68..69499113aef8 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -115,6 +115,7 @@ rw_attribute(stop_when_cache_set_failed);
   rw_attribute(writeback_metadata);
   rw_attribute(writeback_running);
   rw_attribute(writeback_percent);
+rw_attribute(writeback_fragment_percent);


Hi Dongdong and Coly,

  What is the status of this patch? In my opinion, it is a problem we
need to solve, but can we just reuse the writeback_percent parameter
rather than introduce a new writeback_fragment_percent?

That means the semantics of writeback_percent would apply to both the
dirty data percentage and the dirty bucket percentage.

When we find there are more dirty buckets than (c->nbuckets *
writeback_percent), start the writeback.
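
A rough sketch of that idea (the dirty_buckets counter and the
start_writeback() helper here are hypothetical, just to illustrate; this
is not existing bcache code):

    uint64_t bucket_thresh = c->nbuckets * dc->writeback_percent / 100;

    if (dirty_buckets > bucket_thresh)   /* dirty_buckets: hypothetical counter */
        start_writeback(dc);             /* hypothetical helper */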


Thanx

Yang


   rw_attribute(writeback_delay);
   rw_attribute(writeback_rate);

@@ -197,6 +198,7 @@ SHOW(__bch_cached_dev)
   var_printf(writeback_running,   "%i");
   var_print(writeback_delay);
   var_print(writeback_percent);
+ var_print(writeback_fragment_percent);
 	sysfs_hprint(writeback_rate,
 		     wb ? atomic_long_read(&dc->writeback_rate.rate) << 9 : 0);
 	sysfs_printf(io_errors, "%i", atomic_read(&dc->io_errors));
@@ -308,6 +310,9 @@ STORE(__cached_dev)
   sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent,
   0, bch_cutoff_writeback);

+ sysfs_strtoul_clamp(writeback_fragment_percent, 
dc->writeback_fragment_percent,
+ 0, bch_cutoff_writeback_sync);
+
 	if (attr == &sysfs_writeback_rate) {
 		ssize_t ret;
 		long int v = atomic_long_read(&dc->writeback_rate.rate);
@@ -498,6 +503,7 @@ static struct attribute *bch_cached_dev_files[] = {
 	&sysfs_writeback_running,
 	&sysfs_writeback_delay,
 	&sysfs_writeback_percent,
+ _writeb

Re: [PATCH] bcache: consider the fragmentation when update the writeback rate

2020-12-08 Thread Dongsheng Yang



On 2020/11/3 (Tuesday) 8:42 PM, Dongdong Tao wrote:

From: dongdong tao 

The current way of calculating the writeback rate only considers the
dirty sectors. This usually works fine when the fragmentation is not
high, but it gives us an unreasonably small rate when very few dirty
sectors consume a lot of dirty buckets. In some cases the dirty buckets
can reach CUTOFF_WRITEBACK_SYNC while the dirty data (sectors) has not
even reached writeback_percent; the writeback rate will still be the
minimum value (4k), and thus all the writes get stuck in a
non-writeback mode because of the slow writeback.

This patch tries to accelerate the writeback rate when the
fragmentation is high. It calculates the proportional_scaled value
based on:
(dirty_sectors / writeback_rate_p_term_inverse) * fragment
As we can see, higher fragmentation results in a larger
proportional_scaled value and thus a larger writeback rate.
The fragment value is calculated as:
(dirty_buckets * bucket_size) / dirty_sectors
If you think about it, the value of fragment will always be
inside [1, bucket_size].

This patch only considers the fragmentation when the number of
dirty_buckets has reached a dirty threshold (configurable via
writeback_fragment_percent, default 50), so bcache keeps the original
behaviour until the dirty buckets reach the threshold.

Signed-off-by: dongdong tao 
---
  drivers/md/bcache/bcache.h|  1 +
  drivers/md/bcache/sysfs.c |  6 ++
  drivers/md/bcache/writeback.c | 21 +
  3 files changed, 28 insertions(+)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 1d57f48307e6..87632f7032b6 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -374,6 +374,7 @@ struct cached_dev {
 	unsigned int		writeback_metadata:1;
 	unsigned int		writeback_running:1;
 	unsigned char		writeback_percent;
+	unsigned char		writeback_fragment_percent;
 	unsigned int		writeback_delay;
  
  	uint64_t		writeback_rate_target;

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 554e3afc9b68..69499113aef8 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -115,6 +115,7 @@ rw_attribute(stop_when_cache_set_failed);
  rw_attribute(writeback_metadata);
  rw_attribute(writeback_running);
  rw_attribute(writeback_percent);
+rw_attribute(writeback_fragment_percent);



Hi Dongdong and Coly,

    What is the status of this patch? In my opinion, it is a problem we
need to solve, but can we just reuse the writeback_percent parameter
rather than introduce a new writeback_fragment_percent?

That means the semantics of writeback_percent would apply to both the
dirty data percentage and the dirty bucket percentage.

When we find there are more dirty buckets than (c->nbuckets *
writeback_percent), start the writeback.



Thanx

Yang


  rw_attribute(writeback_delay);
  rw_attribute(writeback_rate);
  
@@ -197,6 +198,7 @@ SHOW(__bch_cached_dev)

var_printf(writeback_running,   "%i");
var_print(writeback_delay);
var_print(writeback_percent);
+   var_print(writeback_fragment_percent);
sysfs_hprint(writeback_rate,
 		     wb ? atomic_long_read(&dc->writeback_rate.rate) << 9 : 0);
 	sysfs_printf(io_errors, "%i", atomic_read(&dc->io_errors));
@@ -308,6 +310,9 @@ STORE(__cached_dev)
sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent,
0, bch_cutoff_writeback);
  
+	sysfs_strtoul_clamp(writeback_fragment_percent, dc->writeback_fragment_percent,

+   0, bch_cutoff_writeback_sync);
+
 	if (attr == &sysfs_writeback_rate) {
 		ssize_t ret;
 		long int v = atomic_long_read(&dc->writeback_rate.rate);
@@ -498,6 +503,7 @@ static struct attribute *bch_cached_dev_files[] = {
 	&sysfs_writeback_running,
 	&sysfs_writeback_delay,
 	&sysfs_writeback_percent,
+	&sysfs_writeback_fragment_percent,
 	&sysfs_writeback_rate,
 	&sysfs_writeback_rate_update_seconds,
 	&sysfs_writeback_rate_i_term_inverse,
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 3c74996978da..34babc89fdf3 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -88,6 +88,26 @@ static void __update_writeback_rate(struct cached_dev *dc)
int64_t integral_scaled;
uint32_t new_rate;
  
+	/*

+* We need to consider the number of dirty buckets as well
+* when calculating the proportional_scaled, Otherwise we might
+* have an unreasonable small writeback rate at a highly fragmented 
situation
+* when very few dirty sectors consumed a lot dirty buckets, the
+* worst case is when dirty_data reached writeback_percent and
+* dirty buckets reached to cutoff_writeback_sync, but the rate
+* still will be at 

Re: [PATCH] rbd: fix spelling mistake: "reregisteration" -> "re-registration"

2018-03-20 Thread Dongsheng Yang

Hi Colin,

On 03/19/2018 09:33 PM, Colin King wrote:

From: Colin Ian King 

Trivial fix to a spelling mistake in rbd_warn message text

Signed-off-by: Colin Ian King 
---
  drivers/block/rbd.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 1e03b04819c8..7b97240ff15e 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3887,7 +3887,7 @@ static void rbd_reregister_watch(struct work_struct *work)
  
  	ret = rbd_dev_refresh(rbd_dev);

if (ret)
-   rbd_warn(rbd_dev, "reregisteration refresh failed: %d", ret);
+   rbd_warn(rbd_dev, "re-registration refresh failed: %d", ret);


Hmmm, I am not sure that is a spelling mistake, because the function
name is rbd_reregister_watch().


Thanx
Yang

  }
  
  /*





Re: [PATCH] rbd: fix spelling mistake: "reregisteration" -> "re-registration"

2018-03-20 Thread Dongsheng Yang

Hi Colin,

On 03/19/2018 09:33 PM, Colin King wrote:

From: Colin Ian King 

Trivial fix to a spelling mistake in rbd_warn message text

Signed-off-by: Colin Ian King 
---
  drivers/block/rbd.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 1e03b04819c8..7b97240ff15e 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3887,7 +3887,7 @@ static void rbd_reregister_watch(struct work_struct *work)
  
  	ret = rbd_dev_refresh(rbd_dev);

if (ret)
-   rbd_warn(rbd_dev, "reregisteration refresh failed: %d", ret);
+   rbd_warn(rbd_dev, "re-registration refresh failed: %d", ret);


Hmmm, I am not sure that is a spelling mistake, because the function
name is rbd_reregister_watch().


Thanx
Yang

  }
  
  /*





[tip:sched/core] sched/cpuacct: Split usage accounting into user_usage and sys_usage

2016-03-31 Thread tip-bot for Dongsheng Yang
Commit-ID:  d740037fac7052e49450f6fa1454f1144a103b55
Gitweb: http://git.kernel.org/tip/d740037fac7052e49450f6fa1454f1144a103b55
Author: Dongsheng Yang <yangds.f...@cn.fujitsu.com>
AuthorDate: Tue, 22 Mar 2016 16:37:08 +0800
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Thu, 31 Mar 2016 10:48:54 +0200

sched/cpuacct: Split usage accounting into user_usage and sys_usage

Sometimes, cpuacct.usage is not detailed enough to see how much CPU
usage a group had. We want to know how much time it used in user mode
and how much in kernel mode.

This patch introduces more files to give this information:

 # ls /sys/fs/cgroup/cpuacct/cpuacct.usage*
 /sys/fs/cgroup/cpuacct/cpuacct.usage
 /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu
 /sys/fs/cgroup/cpuacct/cpuacct.usage_user
 /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_user
 /sys/fs/cgroup/cpuacct/cpuacct.usage_sys
 /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_sys

... while keeping the ABI with the existing counter.
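
As an illustration, a small userspace reader could look like this (a
sketch only; it assumes the legacy cpuacct hierarchy is mounted at
/sys/fs/cgroup/cpuacct):

    #include <stdio.h>

    int main(void)
    {
        unsigned long long user_ns = 0, sys_ns = 0;
        FILE *f;

        f = fopen("/sys/fs/cgroup/cpuacct/cpuacct.usage_user", "r");
        if (!f || fscanf(f, "%llu", &user_ns) != 1)
            return 1;
        fclose(f);

        f = fopen("/sys/fs/cgroup/cpuacct/cpuacct.usage_sys", "r");
        if (!f || fscanf(f, "%llu", &sys_ns) != 1)
            return 1;
        fclose(f);

        /* Values are cumulative CPU time in nanoseconds. */
        printf("user: %llu ns, sys: %llu ns\n", user_ns, sys_ns);
        return 0;
    }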

Signed-off-by: Dongsheng Yang <yangds.f...@cn.fujitsu.com>
[ Ported to newer kernels. ]
Signed-off-by: Zhao Lei <zhao...@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Acked-by: Tejun Heo <t...@kernel.org>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Mike Galbraith <efa...@gmx.de>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Tejun Heo <hte...@gmail.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Link: 
http://lkml.kernel.org/r/aa171da036b520b51c79549e9b3215d29473f19d.1458635566.git.zhao...@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 kernel/sched/cpuacct.c | 140 +++--
 1 file changed, 113 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index e39bd4f..df947e0 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -25,11 +25,22 @@ enum cpuacct_stat_index {
CPUACCT_STAT_NSTATS,
 };
 
+enum cpuacct_usage_index {
+   CPUACCT_USAGE_USER, /* ... user mode */
+   CPUACCT_USAGE_SYSTEM,   /* ... kernel mode */
+
+   CPUACCT_USAGE_NRUSAGE,
+};
+
+struct cpuacct_usage {
+   u64 usages[CPUACCT_USAGE_NRUSAGE];
+};
+
 /* track cpu usage of a group of tasks and its child groups */
 struct cpuacct {
struct cgroup_subsys_state css;
/* cpuusage holds pointer to a u64-type object on every cpu */
-   u64 __percpu *cpuusage;
+   struct cpuacct_usage __percpu *cpuusage;
struct kernel_cpustat __percpu *cpustat;
 };
 
@@ -49,7 +60,7 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca)
return css_ca(ca->css.parent);
 }
 
-static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage);
+static DEFINE_PER_CPU(struct cpuacct_usage, root_cpuacct_cpuusage);
 static struct cpuacct root_cpuacct = {
 	.cpustat	= &kernel_cpustat,
 	.cpuusage	= &root_cpuacct_cpuusage,
@@ -68,7 +79,7 @@ cpuacct_css_alloc(struct cgroup_subsys_state *parent_css)
if (!ca)
goto out;
 
-   ca->cpuusage = alloc_percpu(u64);
+   ca->cpuusage = alloc_percpu(struct cpuacct_usage);
if (!ca->cpuusage)
goto out_free_ca;
 
@@ -96,20 +107,37 @@ static void cpuacct_css_free(struct cgroup_subsys_state 
*css)
kfree(ca);
 }
 
-static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
+static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
+enum cpuacct_usage_index index)
 {
-   u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+   struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
u64 data;
 
+   /*
+* We allow index == CPUACCT_USAGE_NRUSAGE here to read
+* the sum of suages.
+*/
+   BUG_ON(index > CPUACCT_USAGE_NRUSAGE);
+
 #ifndef CONFIG_64BIT
/*
 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
 */
 	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
-   data = *cpuusage;
+#endif
+
+   if (index == CPUACCT_USAGE_NRUSAGE) {
+   int i = 0;
+
+   data = 0;
+   for (i = 0; i < CPUACCT_USAGE_NRUSAGE; i++)
+   data += cpuusage->usages[i];
+   } else {
+   data = cpuusage->usages[index];
+   }
+
+#ifndef CONFIG_64BIT
 	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
-   data = *cpuusage;
 #endif
 
return data;
@@ -117,69 +145,103 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int 
cpu)
 
 static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
 {
-   u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+   struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+   int i;
 
 #ifndef CONFIG_64BIT
/*
 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
 */
raw_spin_lock_irq(_rq(cpu)->loc

[tip:sched/core] sched/cpuacct: Split usage accounting into user_usage and sys_usage

2016-03-31 Thread tip-bot for Dongsheng Yang
Commit-ID:  d740037fac7052e49450f6fa1454f1144a103b55
Gitweb: http://git.kernel.org/tip/d740037fac7052e49450f6fa1454f1144a103b55
Author: Dongsheng Yang 
AuthorDate: Tue, 22 Mar 2016 16:37:08 +0800
Committer:  Ingo Molnar 
CommitDate: Thu, 31 Mar 2016 10:48:54 +0200

sched/cpuacct: Split usage accounting into user_usage and sys_usage

Sometimes, cpuacct.usage is not detailed enough to see how much CPU
usage a group had. We want to know how much time it used in user mode
and how much in kernel mode.

This patch introduces more files to give this information:

 # ls /sys/fs/cgroup/cpuacct/cpuacct.usage*
 /sys/fs/cgroup/cpuacct/cpuacct.usage
 /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu
 /sys/fs/cgroup/cpuacct/cpuacct.usage_user
 /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_user
 /sys/fs/cgroup/cpuacct/cpuacct.usage_sys
 /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_sys

... while keeping the ABI with the existing counter.

Signed-off-by: Dongsheng Yang 
[ Ported to newer kernels. ]
Signed-off-by: Zhao Lei 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Tejun Heo 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/aa171da036b520b51c79549e9b3215d29473f19d.1458635566.git.zhao...@cn.fujitsu.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/cpuacct.c | 140 +++--
 1 file changed, 113 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index e39bd4f..df947e0 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -25,11 +25,22 @@ enum cpuacct_stat_index {
CPUACCT_STAT_NSTATS,
 };
 
+enum cpuacct_usage_index {
+   CPUACCT_USAGE_USER, /* ... user mode */
+   CPUACCT_USAGE_SYSTEM,   /* ... kernel mode */
+
+   CPUACCT_USAGE_NRUSAGE,
+};
+
+struct cpuacct_usage {
+   u64 usages[CPUACCT_USAGE_NRUSAGE];
+};
+
 /* track cpu usage of a group of tasks and its child groups */
 struct cpuacct {
struct cgroup_subsys_state css;
/* cpuusage holds pointer to a u64-type object on every cpu */
-   u64 __percpu *cpuusage;
+   struct cpuacct_usage __percpu *cpuusage;
struct kernel_cpustat __percpu *cpustat;
 };
 
@@ -49,7 +60,7 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca)
return css_ca(ca->css.parent);
 }
 
-static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage);
+static DEFINE_PER_CPU(struct cpuacct_usage, root_cpuacct_cpuusage);
 static struct cpuacct root_cpuacct = {
 	.cpustat	= &kernel_cpustat,
 	.cpuusage	= &root_cpuacct_cpuusage,
@@ -68,7 +79,7 @@ cpuacct_css_alloc(struct cgroup_subsys_state *parent_css)
if (!ca)
goto out;
 
-   ca->cpuusage = alloc_percpu(u64);
+   ca->cpuusage = alloc_percpu(struct cpuacct_usage);
if (!ca->cpuusage)
goto out_free_ca;
 
@@ -96,20 +107,37 @@ static void cpuacct_css_free(struct cgroup_subsys_state 
*css)
kfree(ca);
 }
 
-static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
+static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
+enum cpuacct_usage_index index)
 {
-   u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+   struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
u64 data;
 
+   /*
+* We allow index == CPUACCT_USAGE_NRUSAGE here to read
+* the sum of suages.
+*/
+   BUG_ON(index > CPUACCT_USAGE_NRUSAGE);
+
 #ifndef CONFIG_64BIT
/*
 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
 */
 	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
-   data = *cpuusage;
+#endif
+
+   if (index == CPUACCT_USAGE_NRUSAGE) {
+   int i = 0;
+
+   data = 0;
+   for (i = 0; i < CPUACCT_USAGE_NRUSAGE; i++)
+   data += cpuusage->usages[i];
+   } else {
+   data = cpuusage->usages[index];
+   }
+
+#ifndef CONFIG_64BIT
 	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
-   data = *cpuusage;
 #endif
 
return data;
@@ -117,69 +145,103 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int 
cpu)
 
 static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
 {
-   u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+   struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+   int i;
 
 #ifndef CONFIG_64BIT
/*
 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
 */
 	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
-   *cpuusage = val;
+#endif
+
+   for (i = 0; i < CPUACCT_USAGE_NRUSAGE; i++)
+   cpuusage->usages[i] = val;
+
+#ifndef CONFIG_64BIT
 	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
-   *cpuusage = val;
 #endif
 }
 
 /* return total cpu usage (in nanoseconds) of a group */
-static u64 cpuus

Re: [PATCH 0/5] drivers/mtd: make several functions return bool

2016-03-24 Thread Dongsheng Yang

ccing: Brian and Richard

Hi Yao,
Is that really necessary? I am not sure how much benefit we can 
achieve from this change.

Could you explain more?

Yang

On 03/25/2016 10:41 AM, Yaowei Bai wrote:

This series only makes several functions return bool to improve
readability; no other functional changes.

Yaowei Bai (5):
   drivers/mtd: mtd_is_partition can be boolean
   drivers/mtd: cfi_interleave_supported can be boolean
   drivers/mtd: map_bankwidth_supported can be boolean
   drivers/mtd: mtd_nand_has_bch can be boolean
   drivers/mtd/nand: nand_opcode_8bits can be boolean

  drivers/mtd/mtdpart.c  | 6 +++---
  include/linux/mtd/cfi.h| 6 +++---
  include/linux/mtd/map.h| 6 +++---
  include/linux/mtd/nand.h   | 6 +++---
  include/linux/mtd/nand_bch.h   | 4 ++--
  include/linux/mtd/partitions.h | 2 +-
  6 files changed, 15 insertions(+), 15 deletions(-)





Re: [PATCH 0/5] drivers/mtd: make several functions return bool

2016-03-24 Thread Dongsheng Yang

ccing: Brian and Richard

Hi Yao,
Is that really necessary? I am not sure how much benefit we can 
achieve from this change.

Could you explain more?

Yang

On 03/25/2016 10:41 AM, Yaowei Bai wrote:

This series only makes several functions return bool to improve
readability; no other functional changes.

Yaowei Bai (5):
   drivers/mtd: mtd_is_partition can be boolean
   drivers/mtd: cfi_interleave_supported can be boolean
   drivers/mtd: map_bankwidth_supported can be boolean
   drivers/mtd: mtd_nand_has_bch can be boolean
   drivers/mtd/nand: nand_opcode_8bits can be boolean

  drivers/mtd/mtdpart.c  | 6 +++---
  include/linux/mtd/cfi.h| 6 +++---
  include/linux/mtd/map.h| 6 +++---
  include/linux/mtd/nand.h   | 6 +++---
  include/linux/mtd/nand_bch.h   | 4 ++--
  include/linux/mtd/partitions.h | 2 +-
  6 files changed, 15 insertions(+), 15 deletions(-)





[tip:sched/urgent] sched/cpuacct: Rename parameter in cpuusage_write() for readability

2016-03-21 Thread tip-bot for Dongsheng Yang
Commit-ID:  1a736b77a3f50910843d076623204ba6e5057dc1
Gitweb: http://git.kernel.org/tip/1a736b77a3f50910843d076623204ba6e5057dc1
Author: Dongsheng Yang <yangds.f...@cn.fujitsu.com>
AuthorDate: Mon, 21 Dec 2015 19:14:42 +0800
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Mon, 21 Mar 2016 10:59:29 +0100

sched/cpuacct: Rename parameter in cpuusage_write() for readability

The name of the 'reset' parameter to cpuusage_write() is quite confusing,
because the only valid value we allow is '0', so !reset is actually the
case that resets ...

Rename it to 'val' and explain it in a comment that we only allow 0.

Signed-off-by: Dongsheng Yang <yangds.f...@cn.fujitsu.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: cgro...@vger.kernel.org
Cc: t...@kernel.org
Link: 
http://lkml.kernel.org/r/1450696483-2864-1-git-send-email-yangds.f...@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 kernel/sched/cpuacct.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index dd7cbb5..9c2bbf7 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -145,13 +145,16 @@ static u64 cpuusage_read(struct cgroup_subsys_state *css, 
struct cftype *cft)
 }
 
 static int cpuusage_write(struct cgroup_subsys_state *css, struct cftype *cft,
- u64 reset)
+ u64 val)
 {
struct cpuacct *ca = css_ca(css);
int err = 0;
int i;
 
-   if (reset) {
+   /*
+* Only allow '0' here to do a reset.
+*/
+   if (val) {
err = -EINVAL;
goto out;
}


[tip:sched/urgent] sched/cpuacct: Rename parameter in cpuusage_write() for readability

2016-03-21 Thread tip-bot for Dongsheng Yang
Commit-ID:  1a736b77a3f50910843d076623204ba6e5057dc1
Gitweb: http://git.kernel.org/tip/1a736b77a3f50910843d076623204ba6e5057dc1
Author: Dongsheng Yang 
AuthorDate: Mon, 21 Dec 2015 19:14:42 +0800
Committer:  Ingo Molnar 
CommitDate: Mon, 21 Mar 2016 10:59:29 +0100

sched/cpuacct: Rename parameter in cpuusage_write() for readability

The name of the 'reset' parameter to cpuusage_write() is quite confusing,
because the only valid value we allow is '0', so !reset is actually the
case that resets ...

Rename it to 'val' and explain it in a comment that we only allow 0.

Signed-off-by: Dongsheng Yang 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: cgro...@vger.kernel.org
Cc: t...@kernel.org
Link: 
http://lkml.kernel.org/r/1450696483-2864-1-git-send-email-yangds.f...@cn.fujitsu.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/cpuacct.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index dd7cbb5..9c2bbf7 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -145,13 +145,16 @@ static u64 cpuusage_read(struct cgroup_subsys_state *css, 
struct cftype *cft)
 }
 
 static int cpuusage_write(struct cgroup_subsys_state *css, struct cftype *cft,
- u64 reset)
+ u64 val)
 {
struct cpuacct *ca = css_ca(css);
int err = 0;
int i;
 
-   if (reset) {
+   /*
+* Only allow '0' here to do a reset.
+*/
+   if (val) {
err = -EINVAL;
goto out;
}


[tip:sched/core] sched/core: Remove duplicated sched_group_set_shares() prototype

2016-02-29 Thread tip-bot for Dongsheng Yang
Commit-ID:  41d93397334f3c3374810f45e7bcce9007d1a7bb
Gitweb: http://git.kernel.org/tip/41d93397334f3c3374810f45e7bcce9007d1a7bb
Author: Dongsheng Yang <yangds.f...@cn.fujitsu.com>
AuthorDate: Wed, 13 Jan 2016 16:42:38 +0800
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Mon, 29 Feb 2016 09:53:05 +0100

sched/core: Remove duplicated sched_group_set_shares() prototype

Signed-off-by: Dongsheng Yang <yangds.f...@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Cc: <lize...@huawei.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Mike Galbraith <efa...@gmx.de>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Link: 
http://lkml.kernel.org/r/1452674558-31897-1-git-send-email-yangds.f...@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 kernel/sched/sched.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 91f0a77..3dc7b8b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -318,7 +318,6 @@ extern void init_tg_cfs_entry(struct task_group *tg, struct 
cfs_rq *cfs_rq,
struct sched_entity *se, int cpu,
struct sched_entity *parent);
 extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
-extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
 
 extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
 extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);


[tip:sched/core] sched/core: Remove duplicated sched_group_set_shares() prototype

2016-02-29 Thread tip-bot for Dongsheng Yang
Commit-ID:  41d93397334f3c3374810f45e7bcce9007d1a7bb
Gitweb: http://git.kernel.org/tip/41d93397334f3c3374810f45e7bcce9007d1a7bb
Author: Dongsheng Yang 
AuthorDate: Wed, 13 Jan 2016 16:42:38 +0800
Committer:  Ingo Molnar 
CommitDate: Mon, 29 Feb 2016 09:53:05 +0100

sched/core: Remove duplicated sched_group_set_shares() prototype

Signed-off-by: Dongsheng Yang 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1452674558-31897-1-git-send-email-yangds.f...@cn.fujitsu.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/sched.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 91f0a77..3dc7b8b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -318,7 +318,6 @@ extern void init_tg_cfs_entry(struct task_group *tg, struct 
cfs_rq *cfs_rq,
struct sched_entity *se, int cpu,
struct sched_entity *parent);
 extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
-extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
 
 extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
 extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);


Re: [Propose] Isolate core_pattern in mnt namespace.

2016-01-04 Thread Dongsheng Yang

On 12/24/2015 12:36 AM, Eric W. Biederman wrote:

Dongsheng Yang  writes:

[...]

Hi Eric,
Happy new year and sorry for the late reply.


Given the other constraints on an implementation the pid namespace looks
by far the one best suited to host such a sysctl if it is possible to
implement safely.


So you think it's better to isolate core_pattern in the pid namespace,
am I right?

But core_file_path and user_mode_helper_path in core_pattern are much
more related to the mnt namespace, IMO.

Could you explain it a bit more?

Thanx
Yang


Eric


.







Re: [Propose] Isolate core_pattern in mnt namespace.

2016-01-04 Thread Dongsheng Yang

On 12/24/2015 12:36 AM, Eric W. Biederman wrote:

Dongsheng Yang <yangds.f...@cn.fujitsu.com> writes:

[...]

Hi Eric,
Happy new year and sorry for the late reply.


Given the other constraints on an implementation the pid namespace looks
by far the one best suited to host such a sysctl if it is possible to
implement safely.


So you think it's better to isolate core_pattern in the pid namespace,
am I right?

But core_file_path and user_mode_helper_path in core_pattern are much
more related to the mnt namespace, IMO.

Could you explain it a bit more?

Thanx
Yang


Eric


.







Re: [Propose] Isolate core_pattern in mnt namespace.

2015-12-22 Thread Dongsheng Yang

On 12/22/2015 05:52 AM, Eric W. Biederman wrote:

Dongsheng Yang  writes:


On 12/20/2015 05:47 PM, Eric W. Biederman wrote:

Dongsheng Yang  writes:


On 12/20/2015 10:37 AM, Al Viro wrote:

On Sun, Dec 20, 2015 at 10:14:29AM +0800, Dongsheng Yang wrote:

On 12/17/2015 07:23 PM, Dongsheng Yang wrote:

Hi guys,
   We are working on making core dump behaviour isolated in
container. But the problem is, the /proc/sys/kernel/core_pattern
is a kernel wide setting, not belongs to a container.

   So we want to add core_pattern into mnt namespace. What
do you think about it?


Hi Eric,
I found your patch about "net: Implement the per network namespace
sysctl infrastructure", I want to do the similar thing
in mnt namespace. Is that suggested way?


Why mnt namespace and not something else?


Hi Al,

Well, because core_pattern indicates the path to store the core file.
In different mnt namespaces, we would like to set the path to
different values.

In addition, let's consider the other namespaces:
UTS ns: contains information about the kernel and arch, not proper for core_pattern.
IPC ns: communication information, not proper for core_pattern.
PID ns: core_pattern is not related to pids.
net ns: obviously no.
user ns: not proper either.

Then I believe it's better to do this in the mnt namespace. Of course,
core_pattern is just one example; after this infrastructure is finished,
we can implement more sysctls as per-mnt if necessary, I think.

Al, what do you think about this idea?


The hard part is not the sysctl.  The hard part is starting the usermode
helper, in an environment that it can deal with.  The mount namespace
really provides you with no help there.


Do you mean the core dump helper? I think I don't want to touch it
in my development; I think I can use a non-pipe way to get what I want.
Let me try to explain what I want here.

(1). Introduce a --core-path option in the docker run command to specify
the path on the host where core files of one container are stored.
E.g.: docker run --core-path=/core/test --name=test IMAGE

(2). When the container starts, docker attaches a volume to it, similar
to "-v /core/test:/var/lib/docker/coredump". That means the path
/var/lib/docker/coredump in the container is a link to /core/test on the host.

(3). Set /proc/sys/kernel/core_pattern in the container to
"/var/lib/docker/coredump". But that should not affect the core_pattern
of the host or of other containers.

Then I think I can collect the core files from each container and save
them in the paths where I want.


For your case that sounds like it would work.  Unfortunately, for this to
be generally applicable and to let the OS in the container control its
fate, the core dump pattern needs to be supported.

Otherwise something clever in userspace that can be written now should
be sufficient to fill the gap.  There is enough information for the user
mode helper to implement the policy you would like today.


Hi Eric,

To make sure I understand your point correctly:
do you mean we can write a userspace helper on the host, such as
/usr/libexec/docker-pipe, to get what I want?

Yes, I would say that, for my case, it would work. This helper can get
the dump data from containers and dispatch it to different paths such
as /var/lib/docker/cores//.

But there would be two problems with this solution.
(1). It may affect core dumps on the host. Normally, other processes on
 the host would not be happy to use a docker-pipe helper for themselves,
 but the host has to share core_pattern with the containers and can't
 configure it by itself.

(2). If some containers don't want to pass their core files
to the host, they can't set core_pattern in this solution.

IMO, we can already get core files on the host today, by either the
non-pipe way I described above or the pipe way you suggested. But the
problem is that both of these methods affect the core_pattern of the
host and of other containers.

So, I think the key point here is just isolating the core-dump-related
sysctl in the mnt namespace.

Thanx
Yang


Eric


.







Re: [Propose] Isolate core_pattern in mnt namespace.

2015-12-22 Thread Dongsheng Yang
On 12/22/2015 11:12 AM, Kamezawa Hiroyuki wrote:
> On 2015/12/22 6:52, Eric W. Biederman wrote:
>> Dongsheng Yang  writes:
>>
>>> On 12/20/2015 05:47 PM, Eric W. Biederman wrote:
>>>> Dongsheng Yang  writes:
>>>>
>>>>> On 12/20/2015 10:37 AM, Al Viro wrote:
>>>>>> On Sun, Dec 20, 2015 at 10:14:29AM +0800, Dongsheng Yang wrote:
>>>>>>> On 12/17/2015 07:23 PM, Dongsheng Yang wrote:
>>>>>>>> Hi guys,
>>>>>>>> We are working on making core dump behaviour isolated in
>>>>>>>> container. But the problem is, the /proc/sys/kernel/core_pattern
>>>>>>>> is a kernel wide setting, not belongs to a container.
>>>>>>>>
>>>>>>>> So we want to add core_pattern into mnt namespace. What
>>>>>>>> do you think about it?
>>>>>>>
>>>>>>> Hi Eric,
>>>>>>> I found your patch about "net: Implement the per network 
>>>>>>> namespace
>>>>>>> sysctl infrastructure", I want to do the similar thing
>>>>>>> in mnt namespace. Is that suggested way?
>>>>>>
>>>>>> Why mnt namespace and not something else?
>>>>>
>>>>> Hi Al,
>>>>>
>>>>> Well, because core_pattern indicates the path to store core file.
>>>>> In different mnt namespace, we would like to change the path with
>>>>> different value.
>>>>>
>>>>> In addition, Let's considering other namespaces:
>>>>> UTS ns: contains informations of kernel and arch, not proper for 
>>>>> core_pattern.
>>>>> IPC ns: communication informations, not proper for core_pattern
>>>>> PID ns: core_pattern is not related with pid
>>>>> net ns: obviousely no.
>>>>> user ns: not proper too.
>>>>>
>>>>> Then I believe it's better to do this in mnt namespace. of course,
>>>>> core_pattern is just one example. After this infrastructure finished,
>>>>> we can implement more sysctls as per-mnt if necessary, I think.
>>>>>
>>>>> Al, what do you think about this idea?
>>>>
>>>> The hard part is not the sysctl.  The hard part is starting the usermode
>>>> helper, in an environment that it can deal with.  The mount namespace
>>>> really provides you with no help there.
>>>
>>> Do you mean the core dump helper? But I think I don't want to touch it
>>> in my development. I think I can use non-pipe way to get what I want,
>>> Let me try to explain what I want here.
>>>
>>> (1). introduce a --core-path option in docker run command to specify the
>>> path in host to store core file in one container.
>>> E.g: docker run --core-path=/core/test --name=test IMAGE
>>>
>>> (2). When the container starting, docker attach a volume to it, similar
>>> with "-v /core/test:/var/lib/docker/coredump". That means, the path of
>>> /var/lib/docker/coredump in container is a link to /core/test in host.
>>>
>>> (3). Set the /proc/sys/kernel/core_pattern in container as
>>> "/var/lib/docker/coredump". But that should not affect the core_pattern
>>> in host or other containers.
>>>
>>> Then I think I can collect the core files from each container and save
>>> them in the paths where I want.
>>
>> For your case that sounds like it would work.  Unfortunately for this to
>> be generally applicable and to let the OS in the contianer control it's
>> fate the core dump pattern needs to be supported.
>>
>> Otherwise something clever in userspace that can be written now should
>> be sufficient to fill the gap.  There is enough information for the user
>> mode helper to implement the policy you would like today.
>>
> Let me clarify my understanding.
> 
> 1) running user-mode-helper in a container.
> It's not supported by the kernel. user-mode-helper always works on a host.
> 
> 2) running user mode helper in a host.
>It's supported in the newest distro(FC23). (abrt supports container.)
>Summary is here. https://github.com/abrt/abrt/wiki/Containers-and-chroots
> 
> If a guest user doesn't want to pass a core to the host owner, core_pattern
> should be configurable but it can't.

Agreed, then we have to make core_pattern namespaced in the mnt
namespace.

Thanx
Yang
> 
> Thanks,
> -Kame
> 
> 
> 
> 
> .
> 





Re: [RESEND PATCH 2/2] cpuacct: split usage into user_usage and sys_usage.

2015-12-22 Thread Dongsheng Yang

On 12/22/2015 05:33 AM, Tejun Heo wrote:

On Mon, Dec 21, 2015 at 07:14:43PM +0800, Dongsheng Yang wrote:

Sometimes, cpuacct.usage is not detailed enough for the user
to see how much usage a group used. We want to know how
much time it spent in user mode and how much in kernel mode.


cpuusage is being phased out.  If you need these stats, please
implement it on cpu side.


Hi TJ, thanks. Then I will look at introducing this feature on the cpu
side.

Thanx
Yang


Thanks.









Re: [Propose] Isolate core_pattern in mnt namespace.

2015-12-22 Thread Dongsheng Yang

On 12/22/2015 05:52 AM, Eric W. Biederman wrote:

Dongsheng Yang <yangds.f...@cn.fujitsu.com> writes:


On 12/20/2015 05:47 PM, Eric W. Biederman wrote:

Dongsheng Yang <yangds.f...@cn.fujitsu.com> writes:


On 12/20/2015 10:37 AM, Al Viro wrote:

On Sun, Dec 20, 2015 at 10:14:29AM +0800, Dongsheng Yang wrote:

On 12/17/2015 07:23 PM, Dongsheng Yang wrote:

Hi guys,
   We are working on making core dump behaviour isolated in
container. But the problem is, the /proc/sys/kernel/core_pattern
is a kernel wide setting, not belongs to a container.

   So we want to add core_pattern into mnt namespace. What
do you think about it?


Hi Eric,
I found your patch about "net: Implement the per network namespace
sysctl infrastructure", I want to do the similar thing
in mnt namespace. Is that suggested way?


Why mnt namespace and not something else?


Hi Al,

Well, because core_pattern indicates the path to store core file.
In different mnt namespace, we would like to change the path with
different value.

In addition, Let's considering other namespaces:
UTS ns: contains informations of kernel and arch, not proper for core_pattern.
IPC ns: communication informations, not proper for core_pattern
PID ns: core_pattern is not related with pid
net ns: obviousely no.
user ns: not proper too.

Then I believe it's better to do this in mnt namespace. of course,
core_pattern is just one example. After this infrastructure finished,
we can implement more sysctls as per-mnt if necessary, I think.

Al, what do you think about this idea?


The hard part is not the sysctl.  The hard part is starting the usermode
helper, in an environment that it can deal with.  The mount namespace
really provides you with no help there.


Do you mean the core dump helper? But I think I don't want to touch it
in my development. I think I can use non-pipe way to get what I want,
Let me try to explain what I want here.

(1). introduce a --core-path option in docker run command to specify the
path in host to store core file in one container.
E.g: docker run --core-path=/core/test --name=test IMAGE

(2). When the container starting, docker attach a volume to it, similar
with "-v /core/test:/var/lib/docker/coredump". That means, the path of
/var/lib/docker/coredump in container is a link to /core/test in host.

(3). Set the /proc/sys/kernel/core_pattern in container as
"/var/lib/docker/coredump". But that should not affect the core_pattern
in host or other containers.

Then I think I can collect the core files from each container and save
them in the paths where I want.


For your case that sounds like it would work.  Unfortunately for this to
be generally applicable and to let the OS in the contianer control it's
fate the core dump pattern needs to be supported.

Otherwise something clever in userspace that can be written now should
be sufficient to fill the gap.  There is enough information for the user
mode helper to implement the policy you would like today.


Hi Eric,

To make sure I understand your point correctly:
Do you mean we can write a userspace helper on the host, such as
/usr/libexec/docker-pipe, to get what I want?

Yes, I would say that would work for my case. The helper can receive the
dump data from containers and dispatch it to different paths such
as /var/lib/docker/cores//.
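
A minimal sketch of such a helper, assuming the host sets core_pattern to
"|/usr/libexec/docker-pipe %P %e" (the helper path, the cgroup parsing and
the /var/lib/docker/cores layout are only illustrative here, not an
existing docker feature):

  #!/bin/sh
  # /usr/libexec/docker-pipe PID EXE - run by the kernel, core image on stdin
  PID="$1"      # %P: pid of the crashing task in the initial pid namespace
  EXE="$2"      # %e: executable name
  # Guess a container id from the crashing task's cgroup path; the exact
  # format depends on the docker/cgroup setup.
  CID=$(awk -F/ '/docker/ {print $NF; exit}' "/proc/$PID/cgroup" 2>/dev/null)
  DIR="/var/lib/docker/cores/${CID:-host}"
  mkdir -p "$DIR"
  cat > "$DIR/core.$EXE.$PID"   # the core dump arrives on stdin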

But there would be two problems with this solution.
(1). It may affect core dumps on the host. Normally, other processes on
 the host would not be happy to use a docker-pipe helper for themselves,
 but the host has to share the core_pattern with containers and cannot
 configure it independently.

(2). If some containers don't want to pass their core files
to the host, they cannot set their own core_pattern in this solution.

IMO, we can already get core files on the host, by either the non-pipe way I
described above or the pipe way you suggested. But the problem is that both
of these methods affect the core_pattern of the host and of other
containers.

So I think the key point here is simply isolating the core-dump-related
sysctls in the mnt namespace.
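
If core_pattern were per mount namespace, the desired behaviour would be
roughly the following (a sketch of the proposed semantics, not something
that works today; right now the second write would change the global value):

  # on the host
  echo '/var/crash/core.%e.%p' > /proc/sys/kernel/core_pattern

  # inside a container (its own mnt namespace)
  echo '/var/lib/docker/coredump/core.%e.%p' > /proc/sys/kernel/core_pattern

  # back on the host, the value should be unchanged
  cat /proc/sys/kernel/core_pattern    # still /var/crash/core.%e.%p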

Thanx
Yang


Eric


.









[RESEND PATCH 1/2] cpuacct: rename parameter in cpuusage_write for readability

2015-12-21 Thread Dongsheng Yang
The name 'reset' is a little confusing to read: it suggests that
if we want to reset the usage, we return -EINVAL. That's not true.

Actually, we want to say that we only allow the user to do a reset. This
patch renames 'reset' to 'val' and adds a comment, making the code
more readable.
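
For illustration (not part of the patch), the write interface only accepts
a reset with '0', assuming the cpuacct controller is mounted under
/sys/fs/cgroup/cpuacct:

  echo 0 > /sys/fs/cgroup/cpuacct/cpuacct.usage     # resets the counters
  echo 100 > /sys/fs/cgroup/cpuacct/cpuacct.usage   # rejected with -EINVAL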

Signed-off-by: Dongsheng Yang 
---
 kernel/sched/cpuacct.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index dd7cbb5..9c2bbf7 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -145,13 +145,16 @@ static u64 cpuusage_read(struct cgroup_subsys_state *css, 
struct cftype *cft)
 }
 
 static int cpuusage_write(struct cgroup_subsys_state *css, struct cftype *cft,
- u64 reset)
+ u64 val)
 {
struct cpuacct *ca = css_ca(css);
int err = 0;
int i;
 
-   if (reset) {
+   /*
+* Only allow '0' here to do a reset.
+*/
+   if (val) {
err = -EINVAL;
goto out;
}
-- 
1.8.4.2





[RESEND PATCH 2/2] cpuacct: split usage into user_usage and sys_usage.

2015-12-21 Thread Dongsheng Yang
Sometimes, cpuacct.usage is not detailed enough for the user
to see how much usage a group used. We want to know how
much time it spent in user mode and how much in kernel mode.

This patch introduces some more files to give the user this information.
 # ls /sys/fs/cgroup/cpuacct/cpuacct.usage*
/sys/fs/cgroup/cpuacct/cpuacct.usage 
/sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_user
/sys/fs/cgroup/cpuacct/cpuacct.usage_percpu  
/sys/fs/cgroup/cpuacct/cpuacct.usage_sys
/sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_sys  
/sys/fs/cgroup/cpuacct/cpuacct.usage_user
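
For illustration (not part of the patch), the split counters can then be
read like this; cpuacct.usage is expected to equal usage_user plus
usage_sys:

  cat /sys/fs/cgroup/cpuacct/cpuacct.usage_user   # ns spent in user mode
  cat /sys/fs/cgroup/cpuacct/cpuacct.usage_sys    # ns spent in kernel mode
  cat /sys/fs/cgroup/cpuacct/cpuacct.usage        # total, user + sys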

Signed-off-by: Dongsheng Yang 
---
 kernel/sched/cpuacct.c | 140 ++---
 1 file changed, 120 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 9c2bbf7..b3e8971 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -25,11 +25,22 @@ enum cpuacct_stat_index {
CPUACCT_STAT_NSTATS,
 };
 
+enum cpuacct_usage_index {
+   CPUACCT_USAGE_USER, /* ... user mode */
+   CPUACCT_USAGE_SYSTEM,   /* ... kernel mode */
+
+   CPUACCT_USAGE_NRUSAGE,
+};
+
+struct cpuacct_usage {
+   u64 usages[CPUACCT_USAGE_NRUSAGE];
+};
+
 /* track cpu usage of a group of tasks and its child groups */
 struct cpuacct {
struct cgroup_subsys_state css;
/* cpuusage holds pointer to a u64-type object on every cpu */
-   u64 __percpu *cpuusage;
+   struct cpuacct_usage __percpu *cpuusage;
struct kernel_cpustat __percpu *cpustat;
 };
 
@@ -49,7 +60,7 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca)
return css_ca(ca->css.parent);
 }
 
-static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage);
+static DEFINE_PER_CPU(struct cpuacct_usage, root_cpuacct_cpuusage);
 static struct cpuacct root_cpuacct = {
.cpustat= &kernel_cpustat,
.cpuusage   = &root_cpuacct_cpuusage,
@@ -68,7 +79,7 @@ cpuacct_css_alloc(struct cgroup_subsys_state *parent_css)
if (!ca)
goto out;
 
-   ca->cpuusage = alloc_percpu(u64);
+   ca->cpuusage = alloc_percpu(struct cpuacct_usage);
if (!ca->cpuusage)
goto out_free_ca;
 
@@ -96,54 +107,104 @@ static void cpuacct_css_free(struct cgroup_subsys_state 
*css)
kfree(ca);
 }
 
-static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
+static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
+enum cpuacct_usage_index index)
 {
-   u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-   u64 data;
+   struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+   u64 data = 0;
+   int i = 0;
+
+   /*
+* We allow index == CPUACCT_USAGE_NRUSAGE here to read
+* the sum of usages.
+*/
+   BUG_ON(index > CPUACCT_USAGE_NRUSAGE);
+
+   if (index == CPUACCT_USAGE_NRUSAGE) {
+   raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+   for (i = 0; i < CPUACCT_USAGE_NRUSAGE; i++)
+   data += cpuusage->usages[i];
+   raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+
+   goto out;
+   }
 
 #ifndef CONFIG_64BIT
/*
 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
 */
raw_spin_lock_irq(&cpu_rq(cpu)->lock);
-   data = *cpuusage;
+   data = cpuusage->usages[index];
raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
 #else
-   data = *cpuusage;
+   data = cpuusage->usages[index];
 #endif
 
+out:
return data;
 }
 
-static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
+static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu,
+  enum cpuacct_usage_index index, u64 val)
 {
-   u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+   struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+   int i = 0;
+
+   /*
+* We allow index == CPUACCT_USAGE_NRUSAGE here to write
+* val to each index of usages.
+*/
+   BUG_ON(index > CPUACCT_USAGE_NRUSAGE);
+
+   if (index == CPUACCT_USAGE_NRUSAGE) {
+   raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+   for (i = 0; i < CPUACCT_USAGE_NRUSAGE; i++)
+   cpuusage->usages[i] = val;
+   raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+
+   return;
+   }
 
 #ifndef CONFIG_64BIT
/*
 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
 */
raw_spin_lock_irq(&cpu_rq(cpu)->lock);
-   *cpuusage = val;
+   cpuusage->usages[index] = val;
raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
 #else
-   *cpuusage = val;
+   cpuusage->usages[index] = val;
 #endif
 }
 
 /* return total cpu usage (in nanoseconds) of a group */
-static u64 cpuusage_read(struct cgroup_subsys_state *css, struct cftype *cft)
+static u64 __cpuusage_read(struct 




Re: [Propose] Isolate core_pattern in mnt namespace.

2015-12-20 Thread Dongsheng Yang

On 12/20/2015 05:47 PM, Eric W. Biederman wrote:

Dongsheng Yang  writes:


On 12/20/2015 10:37 AM, Al Viro wrote:

On Sun, Dec 20, 2015 at 10:14:29AM +0800, Dongsheng Yang wrote:

On 12/17/2015 07:23 PM, Dongsheng Yang wrote:

Hi guys,
  We are working on making core dump behaviour isolated in
container. But the problem is, the /proc/sys/kernel/core_pattern
is a kernel wide setting, not belongs to a container.

  So we want to add core_pattern into mnt namespace. What
do you think about it?


Hi Eric,
I found your patch about "net: Implement the per network namespace
sysctl infrastructure", I want to do the similar thing
in mnt namespace. Is that suggested way?


Why mnt namespace and not something else?


Hi Al,

Well, because core_pattern indicates the path to store core file.
In different mnt namespace, we would like to change the path with
different value.

In addition, Let's considering other namespaces:
UTS ns: contains informations of kernel and arch, not proper for core_pattern.
IPC ns: communication informations, not proper for core_pattern
PID ns: core_pattern is not related with pid
net ns: obviousely no.
user ns: not proper too.

Then I believe it's better to do this in mnt namespace. of course,
core_pattern is just one example. After this infrastructure finished,
we can implement more sysctls as per-mnt if necessary, I think.

Al, what do you think about this idea?


The hard part is not the sysctl.  The hard part is starting the usermode
helper, in an environment that it can deal with.  The mount namespace
really provides you with no help there.


Do you mean the core dump helper? I don't think I want to touch it
in my development; I think I can use the non-pipe way to get what I want.
Let me try to explain what I want here.

(1). Introduce a --core-path option in the docker run command to specify the
path on the host where core files from one container are stored.
E.g.: docker run --core-path=/core/test --name=test IMAGE

(2). When the container starts, docker attaches a volume to it, similar
to "-v /core/test:/var/lib/docker/coredump". That means the path
/var/lib/docker/coredump in the container maps to /core/test on the host.

(3). Set /proc/sys/kernel/core_pattern in the container to
"/var/lib/docker/coredump". But that should not affect the core_pattern
of the host or of other containers.

Then I think I can collect the core files from each container and save
them in the paths where I want.
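
Put together, the flow above would look roughly like this (the --core-path
option in step (1) is only the proposed flag, it does not exist in docker
today, and step (3) only stays container-local once core_pattern is
namespaced):

  # (1)+(2): start the container with a host directory bound inside it
  docker run --core-path=/core/test --name=test IMAGE        # proposed option
  # roughly equivalent today with a plain bind mount:
  docker run -v /core/test:/var/lib/docker/coredump --name=test IMAGE

  # (3): inside the container, point core_pattern at the bound path
  echo '/var/lib/docker/coredump/core.%e.%p' > /proc/sys/kernel/core_pattern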

Thanx
Yang


Eric











Re: [Propose] Isolate core_pattern in mnt namespace.

2015-12-19 Thread Dongsheng Yang

On 12/20/2015 10:37 AM, Al Viro wrote:

On Sun, Dec 20, 2015 at 10:14:29AM +0800, Dongsheng Yang wrote:

On 12/17/2015 07:23 PM, Dongsheng Yang wrote:

Hi guys,
 We are working on making core dump behaviour isolated in
container. But the problem is, the /proc/sys/kernel/core_pattern
is a kernel wide setting, not belongs to a container.

 So we want to add core_pattern into mnt namespace. What
do you think about it?


Hi Eric,
I found your patch about "net: Implement the per network namespace
sysctl infrastructure", I want to do the similar thing
in mnt namespace. Is that suggested way?


Why mnt namespace and not something else?


Hi Al,

Well, because core_pattern indicates the path where core files are stored.
In different mnt namespaces, we would like to set the path to
different values.

In addition, let's consider the other namespaces:
UTS ns: contains information about the kernel and arch, not proper for
core_pattern.
IPC ns: communication information, not proper for core_pattern.
PID ns: core_pattern is not related to pids.
net ns: obviously no.
user ns: not proper either.

Then I believe it's better to do this in the mnt namespace. Of course,
core_pattern is just one example. After this infrastructure is finished,
we can implement more sysctls as per-mnt if necessary, I think.

Al, what do you think about this idea?

Yang



.







Re: [Propose] Isolate core_pattern in mnt namespace.

2015-12-19 Thread Dongsheng Yang

On 12/17/2015 07:23 PM, Dongsheng Yang wrote:

Hi guys,
 We are working on making core dump behaviour isolated in
container. But the problem is, the /proc/sys/kernel/core_pattern
is a kernel wide setting, not belongs to a container.

 So we want to add core_pattern into mnt namespace. What
do you think about it?


Hi Eric,
I found your patch about "net: Implement the per network namespace
sysctl infrastructure"; I want to do a similar thing in the mnt
namespace. Is that the suggested way?

Thanx
Yang


Yang


--
To unsubscribe from this list: send the line "unsubscribe cgroups" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html











[Propose] Isolate core_pattern in mnt namespace.

2015-12-17 Thread Dongsheng Yang

Hi guys,
We are working on making core dump behaviour isolated in a
container. But the problem is, /proc/sys/kernel/core_pattern
is a kernel-wide setting; it does not belong to a container.

So we want to add core_pattern into mnt namespace. What
do you think about it?

Yang






[PATCH 1/2] cpuacct: rename parameter in cpuusage_write for readability

2015-12-14 Thread Dongsheng Yang
The name 'reset' is a little confusing to read: it suggests that
if we want to reset the usage, we return -EINVAL. That's not true.

Actually, we want to say that we only allow the user to do a reset. This
patch renames 'reset' to 'val' and adds a comment, making the code
more readable.

Signed-off-by: Dongsheng Yang 
---
 kernel/sched/cpuacct.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index dd7cbb5..9c2bbf7 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -145,13 +145,16 @@ static u64 cpuusage_read(struct cgroup_subsys_state *css, 
struct cftype *cft)
 }
 
 static int cpuusage_write(struct cgroup_subsys_state *css, struct cftype *cft,
- u64 reset)
+ u64 val)
 {
struct cpuacct *ca = css_ca(css);
int err = 0;
int i;
 
-   if (reset) {
+   /*
+* Only allow '0' here to do a reset.
+*/
+   if (val) {
err = -EINVAL;
goto out;
}
-- 
1.8.4.2





[PATCH 2/2] cpuacct: split usage into user_usage and sys_usage.

2015-12-14 Thread Dongsheng Yang
Sometimes, cpuacct.usage is not detailed enough for the user
to see how much usage a group used. We want to know how
much time it spent in user mode and how much in kernel mode.

This patch introduces some more files to give the user this information.
 # ls /sys/fs/cgroup/cpuacct/cpuacct.usage*
/sys/fs/cgroup/cpuacct/cpuacct.usage 
/sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_user
/sys/fs/cgroup/cpuacct/cpuacct.usage_percpu  
/sys/fs/cgroup/cpuacct/cpuacct.usage_sys
/sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_sys  
/sys/fs/cgroup/cpuacct/cpuacct.usage_user

Signed-off-by: Dongsheng Yang 
---
 kernel/sched/cpuacct.c | 140 ++---
 1 file changed, 120 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 9c2bbf7..b3e8971 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -25,11 +25,22 @@ enum cpuacct_stat_index {
CPUACCT_STAT_NSTATS,
 };
 
+enum cpuacct_usage_index {
+   CPUACCT_USAGE_USER, /* ... user mode */
+   CPUACCT_USAGE_SYSTEM,   /* ... kernel mode */
+
+   CPUACCT_USAGE_NRUSAGE,
+};
+
+struct cpuacct_usage {
+   u64 usages[CPUACCT_USAGE_NRUSAGE];
+};
+
 /* track cpu usage of a group of tasks and its child groups */
 struct cpuacct {
struct cgroup_subsys_state css;
/* cpuusage holds pointer to a u64-type object on every cpu */
-   u64 __percpu *cpuusage;
+   struct cpuacct_usage __percpu *cpuusage;
struct kernel_cpustat __percpu *cpustat;
 };
 
@@ -49,7 +60,7 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca)
return css_ca(ca->css.parent);
 }
 
-static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage);
+static DEFINE_PER_CPU(struct cpuacct_usage, root_cpuacct_cpuusage);
 static struct cpuacct root_cpuacct = {
.cpustat= &kernel_cpustat,
.cpuusage   = &root_cpuacct_cpuusage,
@@ -68,7 +79,7 @@ cpuacct_css_alloc(struct cgroup_subsys_state *parent_css)
if (!ca)
goto out;
 
-   ca->cpuusage = alloc_percpu(u64);
+   ca->cpuusage = alloc_percpu(struct cpuacct_usage);
if (!ca->cpuusage)
goto out_free_ca;
 
@@ -96,54 +107,104 @@ static void cpuacct_css_free(struct cgroup_subsys_state 
*css)
kfree(ca);
 }
 
-static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
+static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
+enum cpuacct_usage_index index)
 {
-   u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-   u64 data;
+   struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+   u64 data = 0;
+   int i = 0;
+
+   /*
+* We allow index == CPUACCT_USAGE_NRUSAGE here to read
+* the sum of usages.
+*/
+   BUG_ON(index > CPUACCT_USAGE_NRUSAGE);
+
+   if (index == CPUACCT_USAGE_NRUSAGE) {
+   raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+   for (i = 0; i < CPUACCT_USAGE_NRUSAGE; i++)
+   data += cpuusage->usages[i];
+   raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+
+   goto out;
+   }
 
 #ifndef CONFIG_64BIT
/*
 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
 */
raw_spin_lock_irq(&cpu_rq(cpu)->lock);
-   data = *cpuusage;
+   data = cpuusage->usages[index];
raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
 #else
-   data = *cpuusage;
+   data = cpuusage->usages[index];
 #endif
 
+out:
return data;
 }
 
-static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
+static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu,
+  enum cpuacct_usage_index index, u64 val)
 {
-   u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+   struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+   int i = 0;
+
+   /*
+* We allow index == CPUACCT_USAGE_NRUSAGE here to write
+* val to each index of usages.
+*/
+   BUG_ON(index > CPUACCT_USAGE_NRUSAGE);
+
+   if (index == CPUACCT_USAGE_NRUSAGE) {
+   raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+   for (i = 0; i < CPUACCT_USAGE_NRUSAGE; i++)
+   cpuusage->usages[i] = val;
+   raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+
+   return;
+   }
 
 #ifndef CONFIG_64BIT
/*
 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
 */
raw_spin_lock_irq(&cpu_rq(cpu)->lock);
-   *cpuusage = val;
+   cpuusage->usages[index] = val;
raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
 #else
-   *cpuusage = val;
+   cpuusage->usages[index] = val;
 #endif
 }
 
 /* return total cpu usage (in nanoseconds) of a group */
-static u64 cpuusage_read(struct cgroup_subsys_state *css, struct cftype *cft)
+static u64 __cpuusage_read(struct 




Re: piping core dump to a program escapes container

2015-12-09 Thread Dongsheng Yang

On 12/09/2015 11:29 AM, Eric W. Biederman wrote:

Dongsheng Yang  writes:


On 12/09/2015 10:26 AM, Dongsheng Yang wrote:

On 10/25/2015 05:54 AM, Shayan Pooya wrote:

I noticed the following core_pattern behavior in my linux box while
running docker containers. I am not sure if it is bug, but it is
inconsistent and not documented.

If the core_pattern is set on the host, the containers will observe
and use the pattern for dumping cores (there is no per cgroup
core_pattern). According to core(5) for setting core_pattern one can:

1. echo "/tmp/cores/core.%e.%p" > /proc/sys/kernel/core_pattern
2. echo "|/bin/custom_core /tmp/cores/ %e %p " >
/proc/sys/kernel/core_pattern

The former pattern evaluates the /tmp/cores path in the container's
filesystem namespace. Which means, the host does not see a core file
in /tmp/cores.

However, the latter evaluates the /bin/custom_core path in the global
filesystem namespace. Moreover, if /bin/core decides to write the core
to a path (/tmp/cores in this case as shown by the arg to
custom_core), the path will be evaluated in the global filesystem
namespace as well.

The latter behaviour is counter-intuitive and error-prone as the
container can fill up the core-file directory which it does not have
direct access to (which means the core is also not accessible for
debugging if someone only has access to the container).



From a container perspective it is perhaps counter intuitive from

the perspective of the operator of the machine nothing works specially
about core_pattern and it works as designed with no unusual danages.


Hi Shayan,
  We found the same problem with what you described here.
Is there any document for this behaviour? I want to know is
that intentional or as you said a 'bug'. Maybe that's intentional
to provide a way for admin to collect core dumps from all containers as
Richard said. I am interested in it too.

Anyone can help here?


In addition, is that a good idea to make core_pattern to be seperated
in different namespace?


The behavior was the best we could do at the time last time this issue
was examined.There is enough information available to be able to
write a core dumping program that can reliably place your core dumps
in your container.

There has not yet been an obvious namespace in which to stick
core_pattern, and even worse exactly how to appropriate launch a process
in a container has not been figured out.


Hi Eric,
Could you provide a reference to these discussions? In
addition, is there already an infrastructure to do this kind of thing?

Thanx
Yang


If those tricky problems can be solved we can have a core_pattern in a
container.  What we have now is the best we have been able to figure out
so far.

Eric




Yang


Yang


Currently, I work around this issue by detecting that the process is
crashing from a container (by comparing the namespace pid to the
global pid) and refuse to dump the core if it is from a container.

Tested on Ubuntu (kernel 3.16) and Fedora (kernel 4.1).
--
To unsubscribe from this list: send the line "unsubscribe cgroups" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
.




.







Re: piping core dump to a program escapes container

2015-12-09 Thread Dongsheng Yang

On 12/09/2015 04:34 PM, Bruno Prémont wrote:

On Tue, 08 Dec 2015 21:29:13 -0600 Eric W. Biederman wrote:

Dongsheng Yang  writes:


On 12/09/2015 10:26 AM, Dongsheng Yang wrote:

On 10/25/2015 05:54 AM, Shayan Pooya wrote:

I noticed the following core_pattern behavior in my linux box while
running docker containers. I am not sure if it is bug, but it is
inconsistent and not documented.

If the core_pattern is set on the host, the containers will observe
and use the pattern for dumping cores (there is no per cgroup
core_pattern). According to core(5) for setting core_pattern one can:

1. echo "/tmp/cores/core.%e.%p" > /proc/sys/kernel/core_pattern
2. echo "|/bin/custom_core /tmp/cores/ %e %p " >
/proc/sys/kernel/core_pattern

The former pattern evaluates the /tmp/cores path in the container's
filesystem namespace. Which means, the host does not see a core file
in /tmp/cores.

However, the latter evaluates the /bin/custom_core path in the global
filesystem namespace. Moreover, if /bin/core decides to write the core
to a path (/tmp/cores in this case as shown by the arg to
custom_core), the path will be evaluated in the global filesystem
namespace as well.

The latter behaviour is counter-intuitive and error-prone as the
container can fill up the core-file directory which it does not have
direct access to (which means the core is also not accessible for
debugging if someone only has access to the container).


 From a container perspective it is perhaps counter intuitive from
the perspective of the operator of the machine nothing works specially
about core_pattern and it works as designed with no unusual danages.


Hi Shayan,
  We found the same problem with what you described here.
Is there any document for this behaviour? I want to know is
that intentional or as you said a 'bug'. Maybe that's intentional
to provide a way for admin to collect core dumps from all containers as
Richard said. I am interested in it too.

Anyone can help here?


In addition, is that a good idea to make core_pattern to be seperated
in different namespace?


The behavior was the best we could do at the time last time this issue
was examined.There is enough information available to be able to
write a core dumping program that can reliably place your core dumps
in your container.

There has not yet been an obvious namespace in which to stick
core_pattern, and even worse exactly how to appropriate launch a process
in a container has not been figured out.

If those tricky problems can be solved we can have a core_pattern in a
container.  What we have now is the best we have been able to figure out
so far.


Isn't the second option dangerous if its run in global namespace and
settable from some other namespace/container?

If a process inside a container can set /proc/sys/kernel/core_pattern
then it could e.g. set it to
   echo "|/bin/rm -rf / /tmp/cores/ %e %p " > /proc/sys/kernel/core_pattern
and kill the host (eventually itself included).
Other command lines could do different bad things.


Yes, if you don't give privileges to a container, that file is read-only
to it. But if you do give containers privileges, that is true, as you said.
Still, that's similar to giving privileges to any process
running on the machine. So I think it's not a problem.

Yang



Something that would sound reasonable is to have the core dumping
helper process run under the namespaces the process which wrote to
/proc/sys/kernel/core_pattern had.
When some of those namespaces are gone, falling back to the namespaces
of the process for which core is to be dumped might seem reasonable
(or just not dumping core at all as is done when core_pipe_limit is
exceeded).

The value of core_pattern (and other core_* sysctls)  should probably belong
to the mount namespace the proc filesystem used for setting its value
was in - or the matching namespace of calling process when set via syscall.

Bruno


.







Re: piping core dump to a program escapes container

2015-12-09 Thread Dongsheng Yang

On 12/09/2015 02:32 PM, Eric W. Biederman wrote:

Dongsheng Yang  writes:


On 12/09/2015 11:29 AM, Eric W. Biederman wrote:

Dongsheng Yang  writes:


[...]


There has not yet been an obvious namespace in which to stick
core_pattern, and even worse exactly how to appropriate launch a process
in a container has not been figured out.

If those tricky problems can be solved we can have a core_pattern in a
container.  What we have now is the best we have been able to figure out
so far.


Thanx Eric, but if I want to make docker works rely on this behaviour,
is that reliable?

I mean, I want to make a docker container to dump the
core file to a specified path in host by a pipe way. But I am afraid
this behaviour would be changed later. Any suggestion?


The kernel rules say if there is a behavior someone depends on and that
behavior changes and breaks userspace that is a regression and it is not
allowed.

As developers we try not to create regressions.  But some days it
requires someone testing/using the functional enough to catdch an issue.

That said the real issue you are likely to run into when developing this
as part of docker is that docker doesn't get to own the core pattern.
It doesn't make sense for any one application to, as it is a kernel wide
setting.


Agreed.

To have different app or container specific policies for core
dumping likely requires either solving the problems I mentioned with
containers or in userspace a solution so there can be an
/etc/core_pattern.d/ with different configuration and different scripts
that somehow know how to select which core files they want and dump them
sanely.


We will try to solve the problems you mentioned, but it doesn't sound easy.
Anyway, I need to read some of the old discussions first, I think.

Thanx
Yang


Eric











Re: piping core dump to a program escapes container

2015-12-09 Thread Dongsheng Yang

On 12/09/2015 04:34 PM, Bruno Prémont wrote:

On Tue, 08 Dec 2015 21:29:13 -0600 Eric W. Biederman wrote:

Dongsheng Yang <yangds.f...@cn.fujitsu.com> writes:


On 12/09/2015 10:26 AM, Dongsheng Yang wrote:

On 10/25/2015 05:54 AM, Shayan Pooya wrote:

I noticed the following core_pattern behavior in my linux box while
running docker containers. I am not sure if it is a bug, but it is
inconsistent and not documented.

If the core_pattern is set on the host, the containers will observe
and use the pattern for dumping cores (there is no per cgroup
core_pattern). According to core(5) for setting core_pattern one can:

1. echo "/tmp/cores/core.%e.%p" > /proc/sys/kernel/core_pattern
2. echo "|/bin/custom_core /tmp/cores/ %e %p " >
/proc/sys/kernel/core_pattern

The former pattern evaluates the /tmp/cores path in the container's
filesystem namespace, which means the host does not see a core file
in /tmp/cores.

However, the latter evaluates the /bin/custom_core path in the global
filesystem namespace. Moreover, if /bin/core decides to write the core
to a path (/tmp/cores in this case as shown by the arg to
custom_core), the path will be evaluated in the global filesystem
namespace as well.

The latter behaviour is counter-intuitive and error-prone as the
container can fill up the core-file directory which it does not have
direct access to (which means the core is also not accessible for
debugging if someone only has access to the container).


From a container perspective it is perhaps counter-intuitive; from
the perspective of the operator of the machine nothing works specially
about core_pattern and it works as designed with no unusual dangers.


Hi Shayan,
  We found the same problem as what you described here.
Is there any document for this behaviour? I want to know whether
that is intentional or, as you said, a 'bug'. Maybe it's intentional,
to provide a way for the admin to collect core dumps from all containers as
Richard said. I am interested in it too.

Anyone can help here?


In addition, is it a good idea to make core_pattern separated
in different namespaces?


The behavior was the best we could do the last time this issue
was examined.  There is enough information available to be able to
write a core dumping program that can reliably place your core dumps
in your container.

There has not yet been an obvious namespace in which to stick
core_pattern, and even worse exactly how to appropriately launch a process
in a container has not been figured out.

If those tricky problems can be solved we can have a core_pattern in a
container.  What we have now is the best we have been able to figure out
so far.
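
For illustration, a minimal sketch (not an existing tool) of the kind of core
dumping program mentioned above: the pipe helper runs in the init namespaces,
but it can still place the dump inside the crashing task's own filesystem
through /proc/<pid>/root. It assumes it is registered as
"|/usr/local/bin/core_in_container %P" on a kernel new enough to expand %P
(3.12+):

#include <stdio.h>

int main(int argc, char **argv)
{
	char path[128];
	char buf[65536];
	FILE *out;
	size_t n;

	if (argc < 2)
		return 1;

	/* argv[1] is %P, the pid in the initial pid namespace;
	 * /proc/<pid>/root resolves inside that task's mount namespace,
	 * i.e. inside the container's filesystem. */
	snprintf(path, sizeof(path), "/proc/%s/root/tmp/core.%s",
		 argv[1], argv[1]);

	out = fopen(path, "w");
	if (!out)
		return 1;

	/* the kernel feeds the core image to the pipe helper on stdin */
	while ((n = fread(buf, 1, sizeof(buf), stdin)) > 0)
		fwrite(buf, 1, n, out);

	fclose(out);
	return 0;
}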


Isn't the second option dangerous if it's run in the global namespace and
settable from some other namespace/container?

If a process inside a container can set /proc/sys/kernel/core_pattern
then it could e.g. set it to
   echo "|/bin/rm -rf / /tmp/cores/ %e %p " > /proc/sys/kernel/core_pattern
and kill the host (eventually itself included).
Other command lines could do different bad things.


Yes, if you don't give privileges to a container, that's read-only
to it. But if you give containers privileges, that's true as you said.
But that's similar to giving a privilege to any process
running on the machine. So I think it's not a problem.

Yang



Something that would sound reasonable is to have the core dumping
helper process run under the namespaces the process which wrote to
/proc/sys/kernel/core_pattern had.
When some of those namespaces are gone, falling back to the namespaces
of the process for which core is to be dumped might seem reasonable
(or just not dumping core at all as is done when core_pipe_limit is
exceeded).

The value of core_pattern (and other core_* sysctls) should probably belong
to the mount namespace the proc filesystem used for setting its value
was in - or to the matching namespace of the calling process when set via syscall.

Bruno







--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: piping core dump to a program escapes container

2015-12-09 Thread Dongsheng Yang

On 12/09/2015 11:29 AM, Eric W. Biederman wrote:

Dongsheng Yang <yangds.f...@cn.fujitsu.com> writes:


On 12/09/2015 10:26 AM, Dongsheng Yang wrote:

On 10/25/2015 05:54 AM, Shayan Pooya wrote:

I noticed the following core_pattern behavior in my linux box while
running docker containers. I am not sure if it is a bug, but it is
inconsistent and not documented.

If the core_pattern is set on the host, the containers will observe
and use the pattern for dumping cores (there is no per cgroup
core_pattern). According to core(5) for setting core_pattern one can:

1. echo "/tmp/cores/core.%e.%p" > /proc/sys/kernel/core_pattern
2. echo "|/bin/custom_core /tmp/cores/ %e %p " >
/proc/sys/kernel/core_pattern

The former pattern evaluates the /tmp/cores path in the container's
filesystem namespace, which means the host does not see a core file
in /tmp/cores.

However, the latter evaluates the /bin/custom_core path in the global
filesystem namespace. Moreover, if /bin/core decides to write the core
to a path (/tmp/cores in this case as shown by the arg to
custom_core), the path will be evaluated in the global filesystem
namespace as well.

The latter behaviour is counter-intuitive and error-prone as the
container can fill up the core-file directory which it does not have
direct access to (which means the core is also not accessible for
debugging if someone only has access to the container).



From a container perspective it is perhaps counter-intuitive; from
the perspective of the operator of the machine nothing works specially
about core_pattern and it works as designed with no unusual dangers.


Hi Shayan,
  We found the same problem as what you described here.
Is there any document for this behaviour? I want to know whether
that is intentional or, as you said, a 'bug'. Maybe it's intentional,
to provide a way for the admin to collect core dumps from all containers as
Richard said. I am interested in it too.

Anyone can help here?


In addition, is it a good idea to make core_pattern separated
in different namespaces?


The behavior was the best we could do the last time this issue
was examined.  There is enough information available to be able to
write a core dumping program that can reliably place your core dumps
in your container.

There has not yet been an obvious namespace in which to stick
core_pattern, and even worse exactly how to appropriately launch a process
in a container has not been figured out.


Hi Eric,
Could you provide a reference to these discussions? In
addition, is there already an infrastructure to do this kind of thing?

Thanx
Yang


If those tricky problems can be solved we can have a core_pattern in a
container.  What we have now is the best we have been able to figure out
so far.

Eric




Yang


Yang


Currently, I work around this issue by detecting that the process is
crashing from a container (by comparing the namespace pid to the
global pid) and refusing to dump the core if it is from a container.

Tested on Ubuntu (kernel 3.16) and Fedora (kernel 4.1).
--
To unsubscribe from this list: send the line "unsubscribe cgroups" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: piping core dump to a program escapes container

2015-12-08 Thread Dongsheng Yang

On 12/09/2015 11:29 AM, Eric W. Biederman wrote:

Dongsheng Yang  writes:


[...]


There has not yet been an obvious namespace in which to stick
core_pattern, and even worse exactly how to appropriately launch a process
in a container has not been figured out.

If those tricky problems can be solved we can have a core_pattern in a
container.  What we have now is the best we have been able to figure out
so far.


Thanx Eric, but if I want to make docker work relying on this behaviour,
is that reliable?

I mean, I want to make a docker container dump the
core file to a specified path on the host via a pipe. But I am afraid
this behaviour would be changed later. Any suggestion?

Yang


Eric




Yang


Yang


Currently, I work around this issue by detecting that the process is
crashing from a container (by comparing the namespace pid to the
global pid) and refusing to dump the core if it is from a container.

Tested on Ubuntu (kernel 3.16) and Fedora (kernel 4.1).
--
To unsubscribe from this list: send the line "unsubscribe cgroups" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: piping core dump to a program escapes container

2015-12-08 Thread Dongsheng Yang

On 12/09/2015 10:26 AM, Dongsheng Yang wrote:

On 10/25/2015 05:54 AM, Shayan Pooya wrote:

I noticed the following core_pattern behavior in my linux box while
running docker containers. I am not sure if it is a bug, but it is
inconsistent and not documented.

If the core_pattern is set on the host, the containers will observe
and use the pattern for dumping cores (there is no per cgroup
core_pattern). According to core(5) for setting core_pattern one can:

1. echo "/tmp/cores/core.%e.%p" > /proc/sys/kernel/core_pattern
2. echo "|/bin/custom_core /tmp/cores/ %e %p " >
/proc/sys/kernel/core_pattern

The former pattern evaluates the /tmp/cores path in the container's
filesystem namespace, which means the host does not see a core file
in /tmp/cores.

However, the latter evaluates the /bin/custom_core path in the global
filesystem namespace. Moreover, if /bin/core decides to write the core
to a path (/tmp/cores in this case as shown by the arg to
custom_core), the path will be evaluated in the global filesystem
namespace as well.

The latter behaviour is counter-intuitive and error-prone as the
container can fill up the core-file directory which it does not have
direct access to (which means the core is also not accessible for
debugging if someone only has access to the container).


Hi Shayan,
 We found the same problem as what you described here.
Is there any document for this behaviour? I want to know whether
that is intentional or, as you said, a 'bug'. Maybe it's intentional,
to provide a way for the admin to collect core dumps from all containers as
Richard said. I am interested in it too.

Anyone can help here?


In addition, is it a good idea to make core_pattern separated
in different namespaces?

Yang


Yang


Currently, I work around this issue by detecting that the process is
crashing from a container (by comparing the namespace pid to the
global pid) and refusing to dump the core if it is from a container.

Tested on Ubuntu (kernel 3.16) and Fedora (kernel 4.1).
--
To unsubscribe from this list: send the line "unsubscribe cgroups" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: piping core dump to a program escapes container

2015-12-08 Thread Dongsheng Yang

On 10/25/2015 05:54 AM, Shayan Pooya wrote:

I noticed the following core_pattern behavior in my linux box while
running docker containers. I am not sure if it is a bug, but it is
inconsistent and not documented.

If the core_pattern is set on the host, the containers will observe
and use the pattern for dumping cores (there is no per cgroup
core_pattern). According to core(5) for setting core_pattern one can:

1. echo "/tmp/cores/core.%e.%p" > /proc/sys/kernel/core_pattern
2. echo "|/bin/custom_core /tmp/cores/ %e %p " > /proc/sys/kernel/core_pattern

The former pattern evaluates the /tmp/cores path in the container's
filesystem namespace, which means the host does not see a core file
in /tmp/cores.

However, the latter evaluates the /bin/custom_core path in the global
filesystem namespace. Moreover, if /bin/core decides to write the core
to a path (/tmp/cores in this case as shown by the arg to
custom_core), the path will be evaluated in the global filesystem
namespace as well.

The latter behaviour is counter-intuitive and error-prone as the
container can fill up the core-file directory which it does not have
direct access to (which means the core is also not accessible for
debugging if someone only has access to the container).


Hi Shayan,
We found the same problem as what you described here.
Is there any document for this behaviour? I want to know whether
that is intentional or, as you said, a 'bug'. Maybe it's intentional,
to provide a way for the admin to collect core dumps from all containers as
Richard said. I am interested in it too.

Anyone can help here?

Yang


Currently, I work around this issue by detecting that the process is
crashing from a container (by comparing the namespace pid to the
global pid) and refusing to dump the core if it is from a container.

Tested on Ubuntu (kernel 3.16) and Fedora (kernel 4.1).
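
For illustration, a minimal sketch of such a filter (not Shayan's actual tool).
It assumes the helper is registered as "|/usr/local/bin/core_filter %p %P" on a
kernel that expands %P: when the pid seen inside the crashing task's own pid
namespace differs from the pid in the initial namespace, the task is in a
container and the dump is refused.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	long ns_pid, init_pid;
	char path[64];
	char buf[65536];
	FILE *out;
	size_t n;

	if (argc < 3)
		return 1;

	ns_pid = strtol(argv[1], NULL, 10);	/* %p: pid in the task's own pid ns */
	init_pid = strtol(argv[2], NULL, 10);	/* %P: pid in the initial pid ns */

	/* differing pids => child pid namespace (container): skip the dump */
	if (ns_pid != init_pid)
		return 0;

	snprintf(path, sizeof(path), "/tmp/cores/core.%ld", init_pid);
	out = fopen(path, "w");
	if (!out)
		return 1;

	/* the core image arrives on stdin of the pipe helper */
	while ((n = fread(buf, 1, sizeof(buf), stdin)) > 0)
		fwrite(buf, 1, n, out);

	fclose(out);
	return 0;
}
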
--
To unsubscribe from this list: send the line "unsubscribe cgroups" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v13 2/5] gennvm: Generic NVM manager

2015-10-28 Thread Dongsheng Yang

On 10/28/2015 08:30 AM, Matias Bjørling wrote:

The implementation for Open-Channel SSDs is divided into media

[...]

+   lun->reserved_blocks = 2; /* for GC only */
+   lun->vlun.id = i;
+   lun->vlun.lun_id = i % dev->luns_per_chnl;
+   lun->vlun.chnl_id = i / dev->luns_per_chnl;


Please use do_div(). '%' is not supported on some platforms, as
the kbuild robot pointed out in v12.

Yang
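
For illustration, a sketch of the div_u64_rem() form such a change could take
(hypothetical helper, not part of the patch, assuming the index is or may
become a 64-bit type; div_u64_rem() avoids the 64-bit '/' and '%' that 32-bit
platforms cannot link against):

#include <linux/math64.h>

/* split a 64-bit lun index into channel id and per-channel lun id
 * without using 64-bit '/' or '%' */
static void gen_split_lun_id(u64 i, u32 luns_per_chnl,
			     u32 *chnl_id, u32 *lun_id)
{
	u32 rem;

	*chnl_id = div_u64_rem(i, luns_per_chnl, &rem);	/* i / luns_per_chnl */
	*lun_id = rem;					/* i % luns_per_chnl */
}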


+   lun->vlun.nr_free_blocks = dev->blks_per_lun;
+   }
+   return 0;
+}
+
+static int gennvm_block_bb(u32 lun_id, void *bb_bitmap, unsigned int nr_blocks,
+   void *private)
+{
+   struct gen_nvm *gn = private;
+   struct gen_lun *lun = &gn->luns[lun_id];
+   struct nvm_block *block;
+   int i;
+
+   if (unlikely(bitmap_empty(bb_bitmap, nr_blocks)))
+   return 0;
+
+   i = -1;
+   while ((i = find_next_bit(bb_bitmap, nr_blocks, i + 1)) <
+   nr_blocks) {
+   block = &lun->vlun.blocks[i];
+   if (!block) {
+   pr_err("gen_nvm: BB data is out of bounds.\n");
+   return -EINVAL;
+   }
+   list_move_tail(&block->list, &lun->bb_list);
+   }
+
+   return 0;
+}
+
+static int gennvm_block_map(u64 slba, u32 nlb, __le64 *entries, void *private)
+{
+   struct nvm_dev *dev = private;
+   struct gen_nvm *gn = dev->mp;
+   sector_t max_pages = dev->total_pages * (dev->sec_size >> 9);
+   u64 elba = slba + nlb;
+   struct gen_lun *lun;
+   struct nvm_block *blk;
+   u64 i;
+   int lun_id;
+
+   if (unlikely(elba > dev->total_pages)) {
+   pr_err("gen_nvm: L2P data from device is out of bounds!\n");
+   return -EINVAL;
+   }
+
+   for (i = 0; i < nlb; i++) {
+   u64 pba = le64_to_cpu(entries[i]);
+
+   if (unlikely(pba >= max_pages && pba != U64_MAX)) {
+   pr_err("gen_nvm: L2P data entry is out of bounds!\n");
+   return -EINVAL;
+   }
+
+   /* Address zero is a special one. The first page on a disk is
+* protected. It often holds internal device boot
+* information.
+*/
+   if (!pba)
+   continue;
+
+   /* resolve block from physical address */
+   lun_id = div_u64(pba, dev->sec_per_lun);
+   lun = &gn->luns[lun_id];
+
+   /* Calculate block offset into lun */
+   pba = pba - (dev->sec_per_lun * lun_id);
+   blk = &lun->vlun.blocks[div_u64(pba, dev->sec_per_blk)];
+
+   if (!blk->type) {
+   /* at this point, we don't know anything about the
+* block. It's up to the FTL on top to re-establish the
+* block state
+*/
+   list_move_tail(&blk->list, &lun->used_list);
+   blk->type = 1;
+   lun->vlun.nr_free_blocks--;
+   }
+   }
+
+   return 0;
+}
+
+static int gennvm_blocks_init(struct nvm_dev *dev, struct gen_nvm *gn)
+{
+   struct gen_lun *lun;
+   struct nvm_block *block;
+   sector_t lun_iter, blk_iter, cur_block_id = 0;
+   int ret;
+
+   gennvm_for_each_lun(gn, lun, lun_iter) {
+   lun->vlun.blocks = vzalloc(sizeof(struct nvm_block) *
+   dev->blks_per_lun);
+   if (!lun->vlun.blocks)
+   return -ENOMEM;
+
+   for (blk_iter = 0; blk_iter < dev->blks_per_lun; blk_iter++) {
+   block = &lun->vlun.blocks[blk_iter];
+
+   INIT_LIST_HEAD(&block->list);
+
+   block->lun = &lun->vlun;
+   block->id = cur_block_id++;
+
+   /* First block is reserved for device */
+   if (unlikely(lun_iter == 0 && blk_iter == 0))
+   continue;
+
+   list_add_tail(&block->list, &lun->free_list);
+   }
+
+   if (dev->ops->get_bb_tbl) {
+   ret = dev->ops->get_bb_tbl(dev->q, lun->vlun.id,
+   dev->blks_per_lun, gennvm_block_bb, gn);
+   if (ret)
+   pr_err("gen_nvm: could not read BB table\n");
+   }
+   }
+
+   if (dev->ops->get_l2p_tbl) {
+   ret = dev->ops->get_l2p_tbl(dev->q, 0, dev->total_pages,
+   gennvm_block_map, dev);
+   if (ret) {
+   pr_err("gen_nvm: could not read L2P table.\n");
+   pr_warn("gen_nvm: default block initialization");
+   }
+   }
+
+   return 0;
+}
+
+static int 


Re: [PATCH] mtd: mtdram: check offs and len in mtdram->erase

2015-10-02 Thread Dongsheng Yang

On 10/03/2015 01:38 AM, Brian Norris wrote:

On Fri, Oct 02, 2015 at 03:31:33PM +0530, Sudip Mukherjee wrote:

On Fri, Oct 02, 2015 at 05:39:02PM +0800, Dongsheng Yang wrote:

On 10/01/2015 12:41 AM, Sudip Mukherjee wrote:

We should prevent users from erasing an mtd device with an unaligned offset
or length.

Signed-off-by: Sudip Mukherjee 
---

I am not sure if I should add the Signed-off-by of
Dongsheng Yang  . He is the original author
and he should get the credit for that.


But I had sent a patch out to fix this problem before your v1.

http://lists.infradead.org/pipermail/linux-mtd/2015-September/062234.html

I didn't know that. I think your v1 was applied.


Sorry if I left any confusion. Dongsheng's v1 was applied and reverted.
v2 is still under review (and was sent slightly before (?) your v1).


Yea, sorry, I should have mentioned it earlier. But I was and am still
on vacation, so I did not point it out in time.

Sudip, any comment or test for my patch there is always welcome.

Thanx
Yang

Feel free to comment there.

Brian



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mtd: mtdram: check offs and len in mtdram->erase

2015-10-02 Thread Dongsheng Yang

On 10/01/2015 12:41 AM, Sudip Mukherjee wrote:

We should prevent users from erasing an mtd device with an unaligned offset
or length.

Signed-off-by: Sudip Mukherjee 
---

I am not sure if I should add the Signed-off-by of
Dongsheng Yang  . He is the original author
and he should get the credit for that.


But I had sent a patch out to fix this problem before your v1.

http://lists.infradead.org/pipermail/linux-mtd/2015-September/062234.html

Yang


  drivers/mtd/devices/mtdram.c | 27 +++
  1 file changed, 27 insertions(+)

diff --git a/drivers/mtd/devices/mtdram.c b/drivers/mtd/devices/mtdram.c
index 8e28508..21b6a05 100644
--- a/drivers/mtd/devices/mtdram.c
+++ b/drivers/mtd/devices/mtdram.c
@@ -32,8 +32,35 @@ MODULE_PARM_DESC(erase_size, "Device erase block size in 
KiB");
  // We could store these in the mtd structure, but we only support 1 device..
  static struct mtd_info *mtd_info;

+static int check_offs_len(struct mtd_info *mtd, loff_t ofs, uint64_t len)
+{
+   int ret = 0;
+   uint64_t temp_len, rem;
+
+   /* Start address must align on block boundary */
+   temp_len = ofs;
+   rem = do_div(temp_len, mtd->erasesize);
+   if (rem) {
+   pr_debug("%s: unaligned address\n", __func__);
+   ret = -EINVAL;
+   }
+
+   /* Length must align on block boundary */
+   temp_len = len;
+   rem = do_div(temp_len, mtd->erasesize);
+
+   if (rem) {
+   pr_debug("%s: length not block aligned\n", __func__);
+   ret = -EINVAL;
+   }
+
+   return ret;
+}
+
  static int ram_erase(struct mtd_info *mtd, struct erase_info *instr)
  {
+   if (check_offs_len(mtd, instr->addr, instr->len))
+   return -EINVAL;
memset((char *)mtd->priv + instr->addr, 0xff, instr->len);
instr->state = MTD_ERASE_DONE;
mtd_erase_callback(instr);



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 4.3-rc1 build error on CentOS 5.11 "scripts/sign-file.c:23:25: fatal error: openssl/cms.h: No such file or directory"

2015-09-13 Thread Dongsheng Yang

On 09/12/2015 07:22 AM, Davidlohr Bueso wrote:

On Fri, 11 Sep 2015, Vinson Lee wrote:


Hi.

With the latest Linux 4.3-rc1, I am hitting this build error on CentOS
5.11.

 HOSTCC  scripts/sign-file
scripts/sign-file.c:23:25: fatal error: openssl/cms.h: No such file or
directory
#include 


fwiw/rant, I have run into kernel build issues recently due to lack of
openssl libs.
The solution is trivial, in my case just installing my distro's openssl-devel
package,


Even after I installed openssl-devel, I can compile it, but I met a
problem in `make modules_install`

At main.c:178:
- SSL error:02001002:system library:fopen:No such file or directory: 
bss_file.c:169

- SSL error:2006D080:BIO routines:BIO_new_file:no such file: bss_file.c:172
sign-file: : No such file or directory


but this will probably break things for more people. And this
is with default
configs... annoying.

Thanks,
Davidlohr
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH v7 1/5] lightnvm: Support for Open-Channel SSDs

2015-09-04 Thread Dongsheng Yang

On 09/04/2015 04:05 PM, Matias Bjørling wrote:


So here is a suggestion: call register_bm again
if we find nvm_dev->bm == NULL in create_target(). And if it is still
NULL after that, return an error "nvm: no compatible bm was found"
and stop target creation. Otherwise, there would be a NULL pointer
dereference problem.

That's a real problem I met in my testing and I made this change
in my local tree. I hope that's useful to you.


Hi Yang,
Similar to this?


Okey, I attached the two changes from my local tree. I hope they are
useful to you.

Yang


diff --git i/drivers/lightnvm/core.c w/drivers/lightnvm/core.c
index 5e4c2b8..0d2e5e3 100644
--- i/drivers/lightnvm/core.c
+++ w/drivers/lightnvm/core.c
@@ -262,8 +262,9 @@ int nvm_init(struct nvm_dev *dev)
 }

 if (!ret) {
-   pr_info("nvm: no compatible bm was found.\n");
-   return 0;
+   pr_info("nvm: %s was not initialized due to no
compatible bm.\n",
+   dev->name);
+   return -EINVAL;
 }

 pr_info("nvm: registered %s with luns: %u blocks: %lu sector
size: %d\n",






>From 2060232d379328679b22753587d16249f01fa219 Mon Sep 17 00:00:00 2001
From: Dongsheng Yang 
Date: Fri, 4 Sep 2015 08:10:13 +0900
Subject: [PATCH 2/2] lightNVM: register bm in nvm_create_target if dev->bm is
 NULL

When we create a target, we need to make sure dev->bm is not NULL.
If it's NULL, try to register a bm again. If we still fail to find
a proper bm for this dev, return an error rather than continuing
into a NULL pointer dereference later.

Signed-off-by: Dongsheng Yang 
---
 drivers/lightnvm/core.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 5e4c2b8..9c75ea4 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -293,10 +293,30 @@ static int nvm_create_target(struct nvm_dev *dev, char *ttname, char *tname,
 		int lun_begin, int lun_end)
 {
 	struct request_queue *tqueue;
+	struct nvm_bm_type *bt;
 	struct gendisk *tdisk;
 	struct nvm_tgt_type *tt;
 	struct nvm_target *t;
 	void *targetdata;
+	int ret = 0;
+
+	if (!dev->bm) {
+		/* register with device with a supported BM */
+		list_for_each_entry(bt, _bms, list) {
+			ret = bt->register_bm(dev);
+			if (ret < 0)
+				return ret; /* initialization failed */
+			if (ret > 0) {
+				dev->bm = bt;
+				break; /* successfully initialized */
+			}
+		}
+
+		if (!ret) {
+			pr_info("nvm: no compatible bm was found.\n");
+			return -ENODEV;
+		}
+	}
 
 	tt = nvm_find_target_type(ttname);
 	if (!tt) {
-- 
1.8.4.2

>From 699d279ee0dbf3db5a4e7a78d52fb93e954294a1 Mon Sep 17 00:00:00 2001
From: Dongsheng Yang 
Date: Mon, 31 Aug 2015 17:22:23 -0400
Subject: [PATCH 1/2] lightNVM: fix a compatibility problem in compiling.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

In some old gcc versions, such as [gcc version 4.4.7 20120313 (Red Hat 4.4.7-4)],
there is a compile error with this kind of code:

struct test {
	union {
		int data;
	};
};

int main()
{
struct test ins = {
.data = 1,
};
return 0;
}

 # gcc test.c
 # test.c: In function ‘main’:
 # test.c:12: error: unknown field ‘data’ specified in initializer

This patch fixes this problem by initializing it in a compatible way.
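
For illustration, a standalone sketch (hypothetical test.c, not part of this
patch) of the compatible style the diff below switches to: initialize the
variable empty, then assign the anonymous-union member afterwards.

struct test {
	union {
		int data;
	};
};

int main(void)
{
	struct test ins = {};	/* empty initializer instead of .data = 1 */

	ins.data = 1;		/* assign the anonymous-union member */
	return ins.data == 1 ? 0 : 1;
}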

Signed-off-by: Dongsheng Yang 
---
 drivers/block/nvme-lightnvm.c | 58 +++
 1 file changed, 26 insertions(+), 32 deletions(-)

diff --git a/drivers/block/nvme-lightnvm.c b/drivers/block/nvme-lightnvm.c
index 8ad84c9..d1dbc67 100644
--- a/drivers/block/nvme-lightnvm.c
+++ b/drivers/block/nvme-lightnvm.c
@@ -184,13 +184,13 @@ static int init_chnls(struct request_queue *q, struct nvm_id *nvm_id,
 	struct nvme_nvm_id_chnl *src = nvme_nvm_id->chnls;
 	struct nvm_id_chnl *dst = nvm_id->chnls;
 	struct nvme_ns *ns = q->queuedata;
-	struct nvme_nvm_command c = {
-		.nvm_identify.opcode = nvme_nvm_admin_identify,
-		.nvm_identify.nsid = cpu_to_le32(ns->ns_id),
-	};
+	struct nvme_nvm_command c = {};
 	unsigned int len = nvm_id->nchannels;
 	int i, end, ret, off = 0;
 
+	c.nvm_identify.opcode = nvme_nvm_admin_identify;
+	c.nvm_identify.nsid = cpu_to_le32(ns->ns_id);
+
 	while (len) {
 		end = min_t(u32, NVME_NVM_CHNLS_PR_REQ, len);
 
@@ -230,13 +230,12 @@ static int nvme_nvm_identify(struct request_queue *q, struct nvm_id *nvm_id)
 {
 	struct nvme_ns *ns = q->queuedata;
 	struct nvme_nvm_id *nvme_nvm_id;
-	struct nvme_nvm_command c = {
-		.nvm_identify.opcode = nvme_nvm_admin_identify,
-		.nvm_identify.nsid = cpu_to_le32(ns->ns_id),
-		.nvm_identify.chnl_off = 0,
-	};
+	struct nvme_nvm_command c = {};
 	int ret;
 
+	c.nvm_identify.opcode = nvme_nvm_admin_identify;
+	c.nvm_identify.nsid = cpu_to_le32(ns->n

Re: [PATCH v7 1/5] lightnvm: Support for Open-Channel SSDs

2015-09-04 Thread Dongsheng Yang

On 09/02/2015 06:48 PM, Matias Bjørling wrote:

+
+	/* register with device with a supported BM */
+	list_for_each_entry(bt, _bms, list) {
+		ret = bt->register_bm(dev);
+		if (ret < 0)
+			goto err; /* initialization failed */
+		if (ret > 0) {
+			dev->bm = bt;
+			break; /* successfully initialized */
+		}
+	}


Why just search it from head to tail? Can the user specify it
in nvm_create_target()?


Hi Yang,

Currently only the rrpc and a couple of out-of-tree block managers are
built. The register_bm only tries to find a block manager that supports
the device; when it finds one, that one is initialized. It is an open
question how we choose the right block manager, e.g. when a proprietary
and an open-source block manager are both in place. Priorities might be a way
to go, or mark certain block managers as a catch-all?

Hopefully we will get away with only a single or two block managers in
the future, so we won't have one for each type of device.


+
+	if (!ret) {
+		pr_info("nvm: no compatible bm was found.\n");
+		return 0;
+	}


If we allow an nvm_device to be registered with no bm, we would get
a NULL pointer dereference problem in later use.



Yes, definitely.


So here is a suggestion: call register_bm again
if we find nvm_dev->bm == NULL in create_target(). And if it is still
NULL after that, return an error "nvm: no compatible bm was found"
and stop target creation. Otherwise, there would be a NULL pointer
dereference problem.

That's a real problem I met in my testing and I made this change
in my local tree. I hope that's useful to you.

Thanx
Yang

In the case that happens, I envision it should be
possible to register a block manager after a device is loaded, and then
any outstanding devices (which do not have a registered block
manager) will be probed again.


As mentioned above, why do we have to choose a bm for the nvm in nvm_register?


Without a block manager, we don't know the structure of the device and
how to interact with it. I want to initialize that as soon as possible,
so that layers on top can start interacting.



Thanx
Yang




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH v7 0/5] Support for Open-Channel SSDs

2015-09-01 Thread Dongsheng Yang

On 08/07/2015 10:29 PM, Matias Bjørling wrote:

These patches implement support for Open-Channel SSDs.

Applies against axboe's linux-block/for-4.3/drivers and can be found
in the lkml_v7 branch at https://github.com/OpenChannelSSD/linux

Any feedback is greatly appreciated.


Hi Matias,
After reading your code, I think that's a great idea.
I tried it with null_nvm and qemu-nvm. I have two questions
here.
(1) Why do we name it lightnvm? IIUC, this framework
can work for other flashes, not only the NVMe protocol.
(2) There are gc and bm, but where is the wear leveling?
In hardware?

Thanx
Yang


Changes since v6:
  - Multipage support (Javier Gonzalez)
  - General code cleanups
  - Fixed memleak on register failure

Changes since v5:
Feedback from Christoph Hellwig.
  - Created new null_nvm from null_blk to register itself as a lightnvm
device.
  - Changed the interface of register/unregister to only take disk_name.
The gendisk alloc in nvme is kept. Most instantiations will
involve the device gendisk, therefore wait with refactoring to a
later time.
  - Renamed global parameters in core.c and rrpc.c

Changes since v4:
  - Remove gendisk->nvm dependency
  - Remove device driver rq private field dependency.
  - Update submission and completion. The flow is now
  Target -> Block Manager -> Device Driver, replacing callbacks in
  device driver.
  - Abstracted out the block manager into its own module. Other block
managers can now be implemented. For example to support fully
host-based SSDs.
  - No longer exposes the device driver gendisk to user-space.
  - Management is moved into /sys/modules/lnvm/parameters/configure_debug

Changes since v3:

  - Remove dependency on REQ_NVM_GC
  - Refactor nvme integration to use nvme_submit_sync_cmd for
internal commands.
  - Fix race condition bug on multiple threads on RRPC target.
  - Rename sysfs entry under the block device from nvm to lightnvm.
The configuration is found in /sys/block/*/lightnvm/

Changes since v2:

  Feedback from Paul Bolle:
  - Fix license to GPLv2, documentation, compilation.
  Feedback from Keith Busch:
  - nvme: Move lightnvm out and into nvme-lightnvm.c.
  - nvme: Set controller css on lightnvm command set.
  - nvme: Remove OACS.
  Feedback from Christoph Hellwig:
  - lightnvm: Move out of block layer into /drivers/lightnvm/core.c
  - lightnvm: refactor request->phys_sector into device drivers.
  - lightnvm: refactor prep/unprep into device drivers.
  - lightnvm: move nvm_dev from request_queue to gendisk.

  New
  - Bad block table support (From Javier).
  - Update maintainers file.

Changes since v1:

  - Split LightNVM into two parts: a get/put interface for flash
blocks and the respective targets that implement flash translation
layer logic.
  - Updated the patches according to the LightNVM specification changes.
  - Added interface to add/remove targets for a block device.

Thanks to Jens Axboe, Christoph Hellwig, Keith Busch, Paul Bolle,
Javier Gonzalez and Jesper Madsen for discussions and contributions.

Matias Bjørling (5):
   lightnvm: Support for Open-Channel SSDs
   lightnvm: Hybrid Open-Channel SSD RRPC target
   lightnvm: Hybrid Open-Channel SSD block manager
   null_nvm: Lightnvm test driver
   nvme: LightNVM support

  MAINTAINERS   |8 +
  drivers/Kconfig   |2 +
  drivers/Makefile  |5 +
  drivers/block/Makefile|2 +-
  drivers/block/nvme-core.c |   23 +-
  drivers/block/nvme-lightnvm.c |  568 ++
  drivers/lightnvm/Kconfig  |   36 ++
  drivers/lightnvm/Makefile |8 +
  drivers/lightnvm/bm_hb.c  |  366 
  drivers/lightnvm/bm_hb.h  |   46 ++
  drivers/lightnvm/core.c   |  591 +++
  drivers/lightnvm/null_nvm.c   |  481 +++
  drivers/lightnvm/rrpc.c   | 1296 +
  drivers/lightnvm/rrpc.h   |  236 
  include/linux/lightnvm.h  |  334 +++
  include/linux/nvme.h  |6 +
  include/uapi/linux/nvme.h |3 +
  17 files changed, 4007 insertions(+), 4 deletions(-)
  create mode 100644 drivers/block/nvme-lightnvm.c
  create mode 100644 drivers/lightnvm/Kconfig
  create mode 100644 drivers/lightnvm/Makefile
  create mode 100644 drivers/lightnvm/bm_hb.c
  create mode 100644 drivers/lightnvm/bm_hb.h
  create mode 100644 drivers/lightnvm/core.c
  create mode 100644 drivers/lightnvm/null_nvm.c
  create mode 100644 drivers/lightnvm/rrpc.c
  create mode 100644 drivers/lightnvm/rrpc.h
  create mode 100644 include/linux/lightnvm.h





Re: [PATCH v7 1/5] lightnvm: Support for Open-Channel SSDs

2015-09-01 Thread Dongsheng Yang

On 08/07/2015 10:29 PM, Matias Bjørling wrote:

Open-channel SSDs are devices that share responsibilities with the host
in order to implement and maintain features that typical SSDs keep
strictly in firmware. These include (i) the Flash Translation Layer
(FTL), (ii) bad block management, and (iii) hardware units such as the
flash controller, the interface controller, and large amounts of flash
chips. In this way, Open-channel SSDs expose direct access to their
physical flash storage, while keeping a subset of the internal features
of SSDs.

LightNVM is a specification that gives support to Open-channel SSDs.
LightNVM allows the host to manage data placement, garbage collection,
and parallelism. Device specific responsibilities such as bad block
management, FTL extensions to support atomic IOs, or metadata
persistence are still handled by the device.

The implementation of LightNVM consists of two parts: core and
(multiple) targets. The core implements functionality shared across
targets. This is initialization, teardown and statistics. The targets
implement the interface that exposes physical flash to user-space
applications. Examples of such targets include key-value store,
object-store, as well as traditional block devices, which can be
application-specific.

Contributions in this patch from:

   Javier Gonzalez 
   Jesper Madsen 

Signed-off-by: Matias Bjørling 
---
  MAINTAINERS   |   8 +
  drivers/Kconfig   |   2 +
  drivers/Makefile  |   5 +
  drivers/lightnvm/Kconfig  |  16 ++
  drivers/lightnvm/Makefile |   5 +
  drivers/lightnvm/core.c   | 590 ++
  include/linux/lightnvm.h  | 335 ++
  7 files changed, 961 insertions(+)
  create mode 100644 drivers/lightnvm/Kconfig
  create mode 100644 drivers/lightnvm/Makefile
  create mode 100644 drivers/lightnvm/core.c
  create mode 100644 include/linux/lightnvm.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 2d3d55c..d149104 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6162,6 +6162,14 @@ S:   Supported
  F:drivers/nvdimm/pmem.c
  F:include/linux/pmem.h

+LIGHTNVM PLATFORM SUPPORT
+M: Matias Bjorling 
+W: http://github/OpenChannelSSD
+S: Maintained
+F: drivers/lightnvm/
+F: include/linux/lightnvm.h
+F: include/uapi/linux/lightnvm.h
+
  LINUX FOR IBM pSERIES (RS/6000)
  M:Paul Mackerras 
  W:http://www.ibm.com/linux/ltc/projects/ppc
diff --git a/drivers/Kconfig b/drivers/Kconfig
index 6e973b8..3992902 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -42,6 +42,8 @@ source "drivers/net/Kconfig"

  source "drivers/isdn/Kconfig"

+source "drivers/lightnvm/Kconfig"
+
  # input before char - char/joystick depends on it. As does USB.

  source "drivers/input/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index b64b49f..75978ab 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -63,6 +63,10 @@ obj-$(CONFIG_FB_I810)   += video/fbdev/i810/
  obj-$(CONFIG_FB_INTEL)  += video/fbdev/intelfb/

  obj-$(CONFIG_PARPORT) += parport/
+
+# lightnvm/ comes before block to initialize bm before usage
+obj-$(CONFIG_NVM)  += lightnvm/
+
  obj-y += base/ block/ misc/ mfd/ nfc/
  obj-$(CONFIG_LIBNVDIMM)   += nvdimm/
  obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/
@@ -165,3 +169,4 @@ obj-$(CONFIG_RAS)   += ras/
  obj-$(CONFIG_THUNDERBOLT) += thunderbolt/
  obj-$(CONFIG_CORESIGHT)   += hwtracing/coresight/
  obj-$(CONFIG_ANDROID) += android/
+
diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig
new file mode 100644
index 000..1f8412c
--- /dev/null
+++ b/drivers/lightnvm/Kconfig
@@ -0,0 +1,16 @@
+#
+# Open-Channel SSD NVM configuration
+#
+
+menuconfig NVM
+   bool "Open-Channel SSD target support"
+   depends on BLOCK
+   help
+ Say Y here to enable Open-channel SSDs.
+
+ Open-Channel SSDs implement a set of extensions to SSDs that
+ expose direct access to the underlying non-volatile memory.
+
+ If you say N, all options in this submenu will be skipped and disabled;
+ only do this if you know what you are doing.
+
diff --git a/drivers/lightnvm/Makefile b/drivers/lightnvm/Makefile
new file mode 100644
index 000..38185e9
--- /dev/null
+++ b/drivers/lightnvm/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for Open-Channel SSDs.
+#
+
+obj-$(CONFIG_NVM)  := core.o
diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
new file mode 100644
index 000..6499922
--- /dev/null
+++ b/drivers/lightnvm/core.c
@@ -0,0 +1,590 @@
+/*
+ * Copyright (C) 2015 IT University of Copenhagen
+ * Initial release: Matias Bjorling 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in 


Re: [PATCH 1/2] ubifs: Remove dead xattr code

2015-08-26 Thread Dongsheng Yang

On 08/26/2015 10:15 PM, Josh Cartwright wrote:

On Thu, Aug 20, 2015 at 10:48:38AM +0800, Dongsheng Yang wrote:

On 08/20/2015 04:35 AM, Richard Weinberger wrote:

This is a partial revert of commit d7f0b70d30ffb9bbe6b8a3e1035cf0b79965ef53
("UBIFS: Add security.* XATTR support for the UBIFS").


Hi Richard,
What about fully reverting this commit? In ubifs, we
*can* support any xattr namespace, including user, trusted, security
or anything else prefixed by any word. But we have a check_namespace()
in xattr.c to limit what we want to support. That said, if we want to
"Add security.* XATTR support for the UBIFS", all we need to do is
extend check_namespace() to let the security namespace pass.
And yes, check_namespace() has been supporting the security namespace.


Is this good enough?  Yes, it'd mean that the xattrs end up on disk, but
then who's responsible for invoking the selected LSMs' inode_init_security()
hooks?
AFAICT, we'd still need to invoke security_inode_init_security for newly
created inodes (which, Richard's proposed commit still does).


Oh, right. My bad, I missed security_inode_init_security().
Besides allowing the security.* prefix in xattr, we still need to call
security_inode_init_security() in ubifs_create(). That's true.

So what we need to remove is only the ubifs_xattr_handlers.
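
For reference, a rough sketch of that creation-time call, reusing the
existing init_xattrs() callback from xattr.c; this is illustration only,
not a patch:

static int example_init_security(struct inode *dir, struct inode *inode,
				 const struct qstr *qstr)
{
	/*
	 * Ask the active LSM for the initial security.* attributes of the
	 * new inode; init_xattrs() then stores them through ubifs' own
	 * xattr code.
	 */
	return security_inode_init_security(inode, dir, qstr,
					    &init_xattrs, 0);
}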

Thanx Josh, you are right.

And Richard, sorry for my bad mind.

Reviewed-by: Dongsheng Yang 

Thanx
Yang


Thanks,

   Josh (who, admittedly, is neither a filesystem nor security module guy :)





Re: [PATCH 2/2] ubifs: Allow O_DIRECT

2015-08-24 Thread Dongsheng Yang

On 08/24/2015 04:03 PM, Christoph Hellwig wrote:

On Mon, Aug 24, 2015 at 11:02:42AM +0300, Artem Bityutskiy wrote:

Back when we were writing UBIFS, we did not need direct IO, so we did
not implement it. But yes, probably someone who cares could just try
implementing this feature.


So I think the answer here is to implement a real version instead of
adding hacks to pretend it is supported.


Actually, I have a plan for it, but it is targeted at 4.4 or 4.5.

Yang

.





Re: [PATCH 2/2] ubifs: Allow O_DIRECT

2015-08-24 Thread Dongsheng Yang

On 08/24/2015 03:17 PM, Dongsheng Yang wrote:

On 08/24/2015 03:18 PM, Artem Bityutskiy wrote:

On Thu, 2015-08-20 at 15:34 +0300, Artem Bityutskiy wrote:

On Thu, 2015-08-20 at 13:40 +0200, Richard Weinberger wrote:

Basically, we need to see what is the "common practice" here, and
follow it. This requires a little research. What would be the most
popular Linux FS which does not support direct I/O? Can we check
what it does?


All popular filesystems seem to support direct IO.
That's the problem: applications do not expect O_DIRECT to fail.

My intention was to do it like exofs:


Fair enough, thanks!

Signed-off-by: Artem Bityutskiy 


Richard, you mentioned this was suggested by Dave; could you please point
to the discussion, if possible?


http://lists.infradead.org/pipermail/linux-mtd/2015-August/060702.html

That's from a discussion where I want to introduce ubifs into xfstests.


And Artem, I cc-ed that discussion to you, maybe you can find it in your
Inbox.


Yang


Artem.


Re: [PATCH 2/2] ubifs: Allow O_DIRECT

2015-08-24 Thread Dongsheng Yang

On 08/24/2015 03:18 PM, Artem Bityutskiy wrote:

On Thu, 2015-08-20 at 15:34 +0300, Artem Bityutskiy wrote:

On Thu, 2015-08-20 at 13:40 +0200, Richard Weinberger wrote:

Basically, we need to see what is the "common practice" here, and
follow it. This requires a little research. What would be the most
popular Linux FS which does not support direct I/O? Can we check
what it does?


All popular filesystems seem to support direct IO.
That's the problem: applications do not expect O_DIRECT to fail.

My intention was to do it like exofs:


Fair enough, thanks!

Signed-off-by: Artem Bityutskiy 


Richard, you mentioned this was suggested by Dave; could you please point
to the discussion, if possible?


http://lists.infradead.org/pipermail/linux-mtd/2015-August/060702.html

That's from a discussion where I want to introduce ubifs into xfstests.

Yang


Artem.



Re: [PATCH 2/2] ubifs: Allow O_DIRECT

2015-08-20 Thread Dongsheng Yang

On 08/20/2015 02:42 PM, Richard Weinberger wrote:

Yang, (Sorry if I've used your last name lately)


Haha, that's fine. My friends in China all call me Dongsheng. :)


Am 20.08.2015 um 05:00 schrieb Dongsheng Yang:

On 08/20/2015 04:35 AM, Richard Weinberger wrote:

Currently UBIFS does not support direct IO, but some applications
blindly use the O_DIRECT flag.
Instead of failing upon open() we can do better and fall back
to buffered IO.


Hmm, to be honest, I am not sure we have to do it as Dave
suggested. I think that's just a work-around for the current fstests.

IMHO, performing buffered IO when the user requests direct IO, without
any warning, does not sound like a good idea. Maybe adding a warning would
make it better.


Well, how would you inform the user?
A printk() to dmesg is useless as the vast majority of open()
callers do not check dmesg... :)

Major filesystems implement ->direct_IO these days and having
a "return 0"-stub seems to be legit.
For example exofs does too. So, I really don't consider it a work around.


Hmmm, then I am okay with this idea now.



I think we need more discussion about AIO in ubifs, and actually
I have a plan for it. But I have not listed all the cons and pros of
it so far.


Sure, having a real ->direct_IO would be the best solution.
My patch won't block this.


Yes, agreed. So let's return 0 for now.

Yang


Thanks,
//richard
.






Re: [PATCH 2/2] ubifs: Allow O_DIRECT

2015-08-19 Thread Dongsheng Yang

On 08/20/2015 04:35 AM, Richard Weinberger wrote:

Currently UBIFS does not support direct IO, but some applications
blindly use the O_DIRECT flag.
Instead of failing upon open() we can do better and fall back
to buffered IO.


Hmm, to be honest, I am not sure we have to do it as Dave
suggested. I think that's just a work-around for the current fstests.

IMHO, performing buffered IO when the user requests direct IO, without
any warning, does not sound like a good idea. Maybe adding a warning would
make it better.
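
A minimal sketch of the "fall back but warn once" variant I have in mind;
the prototype matches the patch quoted below, and this is illustration
only, not a tested change:

/*
 * Illustration only: same fallback as the patch below, but with a
 * one-time warning so users at least learn that O_DIRECT is being
 * emulated with buffered IO.
 */
static ssize_t ubifs_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
			       loff_t offset)
{
	pr_warn_once("ubifs: O_DIRECT is not supported, falling back to buffered IO\n");
	return 0;	/* 0 tells the VFS to use the buffered IO path */
}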

I think we need more discussion about AIO in ubifs, and actually
I have a plan for it. But I have not listed all the cons and pros of
it so far.

Artem, what's your opinion?

Yang


Cc: Dongsheng Yang 
Cc: dedeki...@gmail.com
Suggested-by: Dave Chinner 
Signed-off-by: Richard Weinberger 
---
  fs/ubifs/file.c | 10 ++
  1 file changed, 10 insertions(+)

diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index a3dfe2a..a61fe86 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1540,6 +1540,15 @@ static int ubifs_file_mmap(struct file *file, struct 
vm_area_struct *vma)
return 0;
  }

+/*
+ * For now fall back to buffered IO.
+ */
+static ssize_t ubifs_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
+  loff_t offset)
+{
+   return 0;
+}
+
  const struct address_space_operations ubifs_file_address_operations = {
.readpage   = ubifs_readpage,
.writepage  = ubifs_writepage,
@@ -1548,6 +1557,7 @@ const struct address_space_operations 
ubifs_file_address_operations = {
.invalidatepage = ubifs_invalidatepage,
.set_page_dirty = ubifs_set_page_dirty,
.releasepage= ubifs_releasepage,
+   .direct_IO  = ubifs_direct_IO,
  };

  const struct inode_operations ubifs_file_inode_operations = {





Re: [PATCH 1/2] ubifs: Remove dead xattr code

2015-08-19 Thread Dongsheng Yang

On 08/20/2015 04:35 AM, Richard Weinberger wrote:

This is a partial revert of commit d7f0b70d30ffb9bbe6b8a3e1035cf0b79965ef53
("UBIFS: Add security.* XATTR support for the UBIFS").


Hi Richard,
What about fully reverting this commit? In ubifs, we
*can* support any xattr namespace, including user, trusted, security
or anything else prefixed by any word. But we have a check_namespace()
in xattr.c to limit what we want to support. That said, if we want to
"Add security.* XATTR support for the UBIFS", all we need to do is
extend check_namespace() to let the security namespace pass.
And yes, check_namespace() has been supporting the security namespace.
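
Roughly, the prefix check in question looks like the sketch below; this is
from memory and only for illustration, the exact mainline code may differ:

/*
 * From-memory sketch of the kind of prefix check ubifs' check_namespace()
 * performs on an xattr name before the xattr is accepted.
 */
static int check_namespace_sketch(const struct qstr *nm)
{
	if (nm->len > UBIFS_MAX_NLEN)
		return -ENAMETOOLONG;

	if (!strncmp(nm->name, XATTR_TRUSTED_PREFIX, XATTR_TRUSTED_PREFIX_LEN))
		return TRUSTED_XATTR;
	if (!strncmp(nm->name, XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN))
		return USER_XATTR;
	if (!strncmp(nm->name, XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN))
		return SECURITY_XATTR;

	return -EOPNOTSUPP;
}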

So, IMHO, we do not depend on the generic mechanism at all, and we can
fully revert this commit.

But it is strange to me: why did we pick this commit for ubifs? Artem, is there
something I am missing?

Yang


As UBIFS does not use generic inode xattr inode operations
the code behind sb->s_xattr is never called.
Remove that dead code for now.

Cc: Subodh Nijsure 
Cc: Marc Kleine-Budde 
Cc: Brad Mouring 
Cc: Gratian Crisan 
Cc: Artem Bityutskiy 
Reported-by: Andreas Grünbacher 
Signed-off-by: Richard Weinberger 
---
After the merge window (and my vacation) I'll have the time to
re-introduce/rework security xattr support.

Thanks,
//richard
---
  fs/ubifs/super.c |  1 -
  fs/ubifs/ubifs.h |  1 -
  fs/ubifs/xattr.c | 40 
  3 files changed, 42 deletions(-)

diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 9547a278..c71edca 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -2037,7 +2037,6 @@ static int ubifs_fill_super(struct super_block *sb, void 
*data, int silent)
if (c->max_inode_sz > MAX_LFS_FILESIZE)
sb->s_maxbytes = c->max_inode_sz = MAX_LFS_FILESIZE;
	sb->s_op = &ubifs_super_operations;
-   sb->s_xattr = ubifs_xattr_handlers;

	mutex_lock(&c->umount_mutex);
err = mount_ubifs(c);
diff --git a/fs/ubifs/ubifs.h b/fs/ubifs/ubifs.h
index de75902..33b6ee7 100644
--- a/fs/ubifs/ubifs.h
+++ b/fs/ubifs/ubifs.h
@@ -1470,7 +1470,6 @@ extern spinlock_t ubifs_infos_lock;
  extern atomic_long_t ubifs_clean_zn_cnt;
  extern struct kmem_cache *ubifs_inode_slab;
  extern const struct super_operations ubifs_super_operations;
-extern const struct xattr_handler *ubifs_xattr_handlers[];
  extern const struct address_space_operations ubifs_file_address_operations;
  extern const struct file_operations ubifs_file_operations;
  extern const struct inode_operations ubifs_file_inode_operations;
diff --git a/fs/ubifs/xattr.c b/fs/ubifs/xattr.c
index 96f3448..b512b14 100644
--- a/fs/ubifs/xattr.c
+++ b/fs/ubifs/xattr.c
@@ -582,46 +582,6 @@ out_free:
return err;
  }

-static size_t security_listxattr(struct dentry *d, char *list, size_t 
list_size,
-const char *name, size_t name_len, int flags)
-{
-   const int prefix_len = XATTR_SECURITY_PREFIX_LEN;
-   const size_t total_len = prefix_len + name_len + 1;
-
-   if (list && total_len <= list_size) {
-   memcpy(list, XATTR_SECURITY_PREFIX, prefix_len);
-   memcpy(list + prefix_len, name, name_len);
-   list[prefix_len + name_len] = '\0';
-   }
-
-   return total_len;
-}
-
-static int security_getxattr(struct dentry *d, const char *name, void *buffer,
- size_t size, int flags)
-{
-   return ubifs_getxattr(d, name, buffer, size);
-}
-
-static int security_setxattr(struct dentry *d, const char *name,
-const void *value, size_t size, int flags,
-int handler_flags)
-{
-   return ubifs_setxattr(d, name, value, size, flags);
-}
-
-static const struct xattr_handler ubifs_xattr_security_handler = {
-   .prefix = XATTR_SECURITY_PREFIX,
-   .list   = security_listxattr,
-   .get= security_getxattr,
-   .set= security_setxattr,
-};
-
-const struct xattr_handler *ubifs_xattr_handlers[] = {
-   &ubifs_xattr_security_handler,
-   NULL,
-};
-
  static int init_xattrs(struct inode *inode, const struct xattr *xattr_array,
  void *fs_info)
  {






Re: [PATCH] ubifs: Kill unneeded locking in ubifs_init_security

2015-07-26 Thread Dongsheng Yang

On 07/08/2015 05:46 PM, Richard Weinberger wrote:

Fixes the following lockdep splat:
[1.244527] =
[1.245193] [ INFO: possible recursive locking detected ]
[1.245193] 4.2.0-rc1+ #37 Not tainted
[1.245193] -
[1.245193] cp/742 is trying to acquire lock:
[1.245193]  (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [] 
ubifs_init_security+0x29/0xb0
[1.245193]
[1.245193] but task is already holding lock:
[1.245193]  (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [] 
path_openat+0x3af/0x1280
[1.245193]
[1.245193] other info that might help us debug this:
[1.245193]  Possible unsafe locking scenario:
[1.245193]
[1.245193]CPU0
[1.245193]
[1.245193]   lock(&sb->s_type->i_mutex_key#9);
[1.245193]   lock(&sb->s_type->i_mutex_key#9);
[1.245193]
[1.245193]  *** DEADLOCK ***
[1.245193]
[1.245193]  May be due to missing lock nesting notation
[1.245193]
[1.245193] 2 locks held by cp/742:
[1.245193]  #0:  (sb_writers#5){.+.+.+}, at: [] 
mnt_want_write+0x1f/0x50
[1.245193]  #1:  (&sb->s_type->i_mutex_key#9){+.+.+.}, at: 
[] path_openat+0x3af/0x1280
[1.245193]
[1.245193] stack backtrace:
[1.245193] CPU: 2 PID: 742 Comm: cp Not tainted 4.2.0-rc1+ #37
[1.245193] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.7.5-0-ge51488c-20140816_022509-build35 04/01/2014
[1.245193]  8252d530 88007b023a38 814f6f49 
810b56c5
[1.245193]  88007c30cc80 88007b023af8 810a150d 
88007b023a68
[1.245193]  8101302a 8800 0008f447e23f 
8252d500
[1.245193] Call Trace:
[1.245193]  [] dump_stack+0x4c/0x65
[1.245193]  [] ? console_unlock+0x1c5/0x510
[1.245193]  [] __lock_acquire+0x1a6d/0x1ea0
[1.245193]  [] ? __lock_is_held+0x58/0x80
[1.245193]  [] lock_acquire+0xd3/0x270
[1.245193]  [] ? ubifs_init_security+0x29/0xb0
[1.245193]  [] mutex_lock_nested+0x6b/0x3a0
[1.245193]  [] ? ubifs_init_security+0x29/0xb0
[1.245193]  [] ? ubifs_init_security+0x29/0xb0
[1.245193]  [] ubifs_init_security+0x29/0xb0
[1.245193]  [] ubifs_create+0xa6/0x1f0
[1.245193]  [] ? path_openat+0x3af/0x1280
[1.245193]  [] vfs_create+0x95/0xc0
[1.245193]  [] path_openat+0x7cc/0x1280
[1.245193]  [] ? __lock_acquire+0x543/0x1ea0
[1.245193]  [] ? sched_clock_cpu+0x90/0xc0
[1.245193]  [] ? calc_global_load_tick+0x60/0x90
[1.245193]  [] ? sched_clock_cpu+0x90/0xc0
[1.245193]  [] ? __alloc_fd+0xaf/0x180
[1.245193]  [] do_filp_open+0x75/0xd0
[1.245193]  [] ? _raw_spin_unlock+0x26/0x40
[1.245193]  [] ? __alloc_fd+0xaf/0x180
[1.245193]  [] do_sys_open+0x129/0x200
[1.245193]  [] SyS_open+0x19/0x20
[1.245193]  [] entry_SYSCALL_64_fastpath+0x12/0x6f

While the lockdep splat is a false positive, because path_openat holds i_mutex
of the parent directory and ubifs_init_security() tries to acquire i_mutex
of a new inode, it reveals that taking i_mutex in ubifs_init_security() is
in vain because it is only being called in the inode allocation path
and therefore nobody else can see the inode yet.


Yes, makes sense to me.
Reviewed and Tested.

Thanx
Yang


Reported-by: Boris Brezillon 
Signed-off-by: Richard Weinberger 
---
  fs/ubifs/xattr.c | 3 ---
  1 file changed, 3 deletions(-)

diff --git a/fs/ubifs/xattr.c b/fs/ubifs/xattr.c
index 96f3448..fd65b3f 100644
--- a/fs/ubifs/xattr.c
+++ b/fs/ubifs/xattr.c
@@ -652,11 +652,8 @@ int ubifs_init_security(struct inode *dentry, struct inode 
*inode,
  {
int err;

-   mutex_lock(&inode->i_mutex);
	err = security_inode_init_security(inode, dentry, qstr,
					   &init_xattrs, 0);
-   mutex_unlock(&inode->i_mutex);
-
if (err) {
struct ubifs_info *c = dentry->i_sb->s_fs_info;
ubifs_err(c, "cannot initialize security for inode %lu, error 
%d",






Re: [PATCH V2] cgroup: Introduce cgroup_detach_task().

2014-08-25 Thread Dongsheng Yang
On Tue, Aug 26, 2014 at 9:35 AM, Li Zefan  wrote:
> On 2014/8/25 23:00, Dongsheng Yang wrote:
>> On Mon, Aug 25, 2014 at 10:47 PM, Tejun Heo  wrote:
>>> On Mon, Aug 25, 2014 at 10:46:03PM +0800, Dongsheng Yang wrote:
>>>> My point here is that attaching and detaching are a pair of operations.
>>>
>>> There is no detaching from a cgroup.  A task is always attached to a
>>> cgroup whether that's a root or non-root cgroup.
>>
>> Okay, I should not think of it as attaching and detaching. Just treat them as
>> a move between root and non-root cgroup.
>>
>> It sounds reasonable to me now.
>>
>
> I from time to time have to explain this to other people.

Ha, we usually want to find a detach function when we see a function
named cgroup_attach_task().
>


Re: [PATCH V2] cgroup: Introduce cgroup_detach_task().

2014-08-25 Thread Dongsheng Yang
On Mon, Aug 25, 2014 at 10:47 PM, Tejun Heo  wrote:
> On Mon, Aug 25, 2014 at 10:46:03PM +0800, Dongsheng Yang wrote:
>> My point here is that attaching and detaching are a pair of operations.
>
> There is no detaching from a cgroup.  A task is always attached to a
> cgroup whether that's a root or non-root cgroup.

Okay, I should not think of it as attaching and detaching. Just treat them as
a move between root and non-root cgroup.

It sounds reasonable to me now.

Thanx.
>
> Thanks.
>
> --
> tejun


Re: [PATCH] cgroup: Enable controllers for default hierarchy by default.

2014-08-25 Thread Dongsheng Yang
Hi tj,

On Mon, Aug 25, 2014 at 10:40 PM, Tejun Heo  wrote:
> Hello, Dongsheng.
>
> On Mon, Aug 25, 2014 at 07:27:32PM +0800, Dongsheng Yang wrote:
>> When we create a cgroup in the unified hierarchy, we have to enable
>> controllers in cgrp_dfl_root.subtree_control manually. From
>> my practice, I did not find the benefit of disabling controllers
>> in cgrp_dfl_root by default.
>
> Hehe, I actually enjoyed the frankness.  No fudging around, basically
> just "I used it for a bit and it didn't feel useful to me".
>
>> As I am a newbie to cgroup, please correct me if I am wrong.
>
> I don't think being new to cgroup is important here.  Regardless of
> the specific area, you should be able to build and argue the
> rationales for the changes you propose.  This is important because not
> only the changes need to be justified but also the process will help
> you actually think about and analyze the changes that you're
> proposing.  If you can't build strong enough rationales for a given
> change, it probably shouldn't be proposed.
>
> Here, your rationale is nothing more than "I played with it 5 mins and
> it seemed weird to me".  There was no effort in understanding the
> overall design, use cases or implications.  You're just driving by
> shooting random patches expecting other people to do what you should
> have done before submitting the patches.  If you want to change the
> behavior, please first study and think about it and ask specific
> questions when unsure.

Yes, I ran into something I was unsure about in my studying. And I thought sending a
patch to raise my question was an acceptable way. Sorry for that.

> Submitting basically random patches expecting
> others to argue for or against it is one of the worst ways to learn
> about a subsystem.  Please don't ever do this regardless of the
> subsystem in question.

Shame on me, and sorry for troubling you. I really have some questions from
my study, but I should not have posted them via a patch.

I need to do more reading and thinking.
If I finally cannot answer my questions by myself or with Google,
then I will ask you experts.

Thanx
>
> Nacked-by: Tejun Heo 
>
> Thanks.
>
> --
> tejun


Re: [PATCH V2] cgroup: Introduce cgroup_detach_task().

2014-08-25 Thread Dongsheng Yang
Hi tj.

On Mon, Aug 25, 2014 at 10:12 PM, Tejun Heo  wrote:
> On Mon, Aug 25, 2014 at 07:32:10PM +0800, Dongsheng Yang wrote:
>> Currently, the only way to detach a task from a cgroup is to move
>> it to another one. That does not look natural to me.
>
> Ummm... how is this different from moving it to the root cgroup?
> "just because" usually isn't a good enough rationale.
>

Thanx for your reply :).

My point here is that attaching and detaching are a pair of operations,
and the things involved in them should be a process and a cgroup. But
currently, when we want to detach a process from a cgroup (A), we have
to attach it to another cgroup (B). I find that hard to understand.
People would ask: "When I just want to detach a process from A, why do
I have to care about another cgroup?"

So I created this patch to make these operations easier to understand,
and to make attaching and detaching true opposites in action.

I think maybe we could use it in cgroup v2, which is still in development. :)

That is my whole reason for this patch. If you think it makes no sense,
that is okay with me.
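
For illustration, here is a minimal usage sketch of the proposed (and
unmerged) interface; the cgroup path and the pid are made-up examples.
Only the plain-pid write below works today; the "-pid" write is what the
proposal would add.

#include <stdio.h>

/* write one entry into a cgroup.procs file */
static int write_procs(const char *procs_path, const char *entry)
{
	FILE *f = fopen(procs_path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", entry);
	return fclose(f);
}

int main(void)
{
	/* attach pid 1234 to cgroup A (the existing interface) */
	write_procs("/sys/fs/cgroup/cpu/A/cgroup.procs", "1234");

	/*
	 * under the proposal, "+1234" would also attach, and "-1234" would
	 * detach pid 1234 from A without naming another cgroup
	 */
	return write_procs("/sys/fs/cgroup/cpu/A/cgroup.procs", "-1234") ? 1 : 0;
}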

Thanx

> Thanks.
>
> --
> tejun


Re: [PATCH] cgroup: Introduce cgroup_detach_task().

2014-08-25 Thread Dongsheng Yang

Sorry for the noise :(.

This patch is not the correct version. Please ignore it; I have sent a
v2 for it.


Thanx

On 08/25/2014 07:28 PM, Dongsheng Yang wrote:

Currently, the only way to detach a task from a cgroup is to move it
to another one. That does not look natural to me.

Inspired by cgroup_subtree_control_write(), this patch allows the user
to attach or detach a process to/from a cgroup by echoing "+/-pid" to
cgroup.procs. In addition, we keep the old method: echoing "pid"
without "+/-" to cgroup.procs still behaves as an attach.

Signed-off-by: Dongsheng Yang 
---
  kernel/cgroup.c | 62 +++--
  1 file changed, 60 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7dc8788..11ef4ac 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2348,6 +2348,43 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
return ret;
  }
  
+/**
+ * cgroup_detach_task - detach a task or a whole threadgroup from a cgroup
+ * @src_cgrp: the cgroup to detach from
+ * @leader: the task or the leader of the threadgroup to be detached
+ * @threadgroup: detach the whole threadgroup?
+ *
+ * Call holding cgroup_mutex and threadgroup_lock of @leader.
+ */
+static int cgroup_detach_task(struct cgroup *src_cgrp __maybe_unused,
+ struct task_struct *leader, bool threadgroup)
+{
+   LIST_HEAD(preloaded_csets);
+   struct task_struct *task;
+   int ret;
+
+   /* look up all src csets */
+   down_read(&css_set_rwsem);
+   rcu_read_lock();
+   task = leader;
+   do {
+   cgroup_migrate_add_src(task_css_set(task), &cgrp_dfl_root.cgrp,
+  &preloaded_csets);
+   if (!threadgroup)
+   break;
+   } while_each_thread(leader, task);
+   rcu_read_unlock();
+   up_read(&css_set_rwsem);
+
+   /* prepare dst csets and commit */
+   ret = cgroup_migrate_prepare_dst(&cgrp_dfl_root.cgrp, &preloaded_csets);
+   if (!ret)
+   ret = cgroup_migrate(&cgrp_dfl_root.cgrp, leader, threadgroup);
+
+   cgroup_migrate_finish(&preloaded_csets);
+   return ret;
+}
+
  /*
   * Find the task_struct of the task to attach by vpid and pass it along to the
   * function to attach either it or all tasks in its threadgroup. Will lock
@@ -2361,8 +2398,26 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
struct cgroup *cgrp;
pid_t pid;
int ret;
+   bool attach;
  
-	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)

+   /*
+* Parse input - space separated list of subsystem names prefixed
+* with either + or -.
+*/
+   buf = strstrip(buf);
+   if (*buf == '+') {
+   attach = true;
+   buf++;
+   } else if (*buf == '-') {
+   attach = false;
+   buf++;
+   } else {
+   if (!isdigit(*buf))
+   return -EINVAL;
+   attach = true;
+   }
+
+   if (kstrtoint(buf, 0, &pid) || pid < 0)
return -EINVAL;
  
  	cgrp = cgroup_kn_lock_live(of->kn);

@@ -2426,7 +2481,10 @@ retry_find_task:
}
}
  
-	ret = cgroup_attach_task(cgrp, tsk, threadgroup);

+   if (attach)
+   ret = cgroup_attach_task(cgrp, tsk, threadgroup);
+   else
+   ret = cgroup_detach_task(cgrp, tsk, threadgroup);
  
  	threadgroup_unlock(tsk);
  




[PATCH V2] cgroup: Introduce cgroup_detach_task().

2014-08-25 Thread Dongsheng Yang
Currently, the only way to detach a task from a cgroup is to move it
to another one. That does not look natural to me.

Inspired by cgroup_subtree_control_write(), this patch allows the user
to attach or detach a process to/from a cgroup by echoing "+/-pid" to
cgroup.procs. In addition, we keep the old method: echoing "pid"
without "+/-" to cgroup.procs still behaves as an attach.

Signed-off-by: Dongsheng Yang 
---

From v1:
*Sorry for an incorrect comment in v1.

 kernel/cgroup.c | 61 +++--
 1 file changed, 59 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7dc8788..a4bb604 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2348,6 +2348,43 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
return ret;
 }
 
+/**
+ * cgroup_detach_task - detach a task or a whole threadgroup from a cgroup
+ * @src_cgrp: the cgroup to detach from
+ * @leader: the task or the leader of the threadgroup to be detached
+ * @threadgroup: detach the whole threadgroup?
+ *
+ * Call holding cgroup_mutex and threadgroup_lock of @leader.
+ */
+static int cgroup_detach_task(struct cgroup *src_cgrp __maybe_unused,
+ struct task_struct *leader, bool threadgroup)
+{
+   LIST_HEAD(preloaded_csets);
+   struct task_struct *task;
+   int ret;
+
+   /* look up all src csets */
+   down_read(&css_set_rwsem);
+   rcu_read_lock();
+   task = leader;
+   do {
+   cgroup_migrate_add_src(task_css_set(task), &cgrp_dfl_root.cgrp,
+  &preloaded_csets);
+   if (!threadgroup)
+   break;
+   } while_each_thread(leader, task);
+   rcu_read_unlock();
+   up_read(&css_set_rwsem);
+
+   /* prepare dst csets and commit */
+   ret = cgroup_migrate_prepare_dst(&cgrp_dfl_root.cgrp, &preloaded_csets);
+   if (!ret)
+   ret = cgroup_migrate(&cgrp_dfl_root.cgrp, leader, threadgroup);
+
+   cgroup_migrate_finish(&preloaded_csets);
+   return ret;
+}
+
 /*
  * Find the task_struct of the task to attach by vpid and pass it along to the
  * function to attach either it or all tasks in its threadgroup. Will lock
@@ -2361,8 +2398,25 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
struct cgroup *cgrp;
pid_t pid;
int ret;
+   bool attach;
 
-   if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
+   /*
+* Parse input - pid prefixed with either + or -.
+*/
+   buf = strstrip(buf);
+   if (*buf == '+') {
+   attach = true;
+   buf++;
+   } else if (*buf == '-') {
+   attach = false;
+   buf++;
+   } else {
+   if (!isdigit(*buf))
+   return -EINVAL;
+   attach = true;
+   }
+
+   if (kstrtoint(buf, 0, &pid) || pid < 0)
return -EINVAL;
 
cgrp = cgroup_kn_lock_live(of->kn);
@@ -2426,7 +2480,10 @@ retry_find_task:
}
}
 
-   ret = cgroup_attach_task(cgrp, tsk, threadgroup);
+   if (attach)
+   ret = cgroup_attach_task(cgrp, tsk, threadgroup);
+   else
+   ret = cgroup_detach_task(cgrp, tsk, threadgroup);
 
threadgroup_unlock(tsk);
 
-- 
1.8.4.2



[PATCH] cgroup: Introduce cgroup_detach_task().

2014-08-25 Thread Dongsheng Yang
Currently, the only way to detach a task from a cgroup is to move it
to another one. That does not look natural to me.

Inspired by cgroup_subtree_control_write(), this patch allows the user
to attach or detach a process to/from a cgroup by echoing "+/-pid" to
cgroup.procs. In addition, we keep the old method: echoing "pid"
without "+/-" to cgroup.procs still behaves as an attach.

Signed-off-by: Dongsheng Yang 
---
 kernel/cgroup.c | 62 +++--
 1 file changed, 60 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7dc8788..11ef4ac 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2348,6 +2348,43 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
return ret;
 }
 
+/**
+ * cgroup_detach_task - detach a task or a whole threadgroup from a cgroup
+ * @src_cgrp: the cgroup to detach from
+ * @leader: the task or the leader of the threadgroup to be detached
+ * @threadgroup: detach the whole threadgroup?
+ *
+ * Call holding cgroup_mutex and threadgroup_lock of @leader.
+ */
+static int cgroup_detach_task(struct cgroup *src_cgrp __maybe_unused,
+ struct task_struct *leader, bool threadgroup)
+{
+   LIST_HEAD(preloaded_csets);
+   struct task_struct *task;
+   int ret;
+
+   /* look up all src csets */
+   down_read(&css_set_rwsem);
+   rcu_read_lock();
+   task = leader;
+   do {
+   cgroup_migrate_add_src(task_css_set(task), &cgrp_dfl_root.cgrp,
+  &preloaded_csets);
+   if (!threadgroup)
+   break;
+   } while_each_thread(leader, task);
+   rcu_read_unlock();
+   up_read(&css_set_rwsem);
+
+   /* prepare dst csets and commit */
+   ret = cgroup_migrate_prepare_dst(&cgrp_dfl_root.cgrp, &preloaded_csets);
+   if (!ret)
+   ret = cgroup_migrate(&cgrp_dfl_root.cgrp, leader, threadgroup);
+
+   cgroup_migrate_finish(&preloaded_csets);
+   return ret;
+}
+
 /*
  * Find the task_struct of the task to attach by vpid and pass it along to the
  * function to attach either it or all tasks in its threadgroup. Will lock
@@ -2361,8 +2398,26 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
struct cgroup *cgrp;
pid_t pid;
int ret;
+   bool attach;
 
-   if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
+   /*
+* Parse input - space separated list of subsystem names prefixed
+* with either + or -.
+*/
+   buf = strstrip(buf);
+   if (*buf == '+') {
+   attach = true;
+   buf++;
+   } else if (*buf == '-') {
+   attach = false;
+   buf++;
+   } else {
+   if (!isdigit(*buf))
+   return -EINVAL;
+   attach = true;
+   }
+
+   if (kstrtoint(buf, 0, &pid) || pid < 0)
return -EINVAL;
 
cgrp = cgroup_kn_lock_live(of->kn);
@@ -2426,7 +2481,10 @@ retry_find_task:
}
}
 
-   ret = cgroup_attach_task(cgrp, tsk, threadgroup);
+   if (attach)
+   ret = cgroup_attach_task(cgrp, tsk, threadgroup);
+   else
+   ret = cgroup_detach_task(cgrp, tsk, threadgroup);
 
threadgroup_unlock(tsk);
 
-- 
1.8.4.2



[PATCH] cgroup: Enable controllers for default hierarchy by default.

2014-08-25 Thread Dongsheng Yang
When we create a cgroup in the unified hierarchy, we have to enable
controllers in cgrp_dfl_root.subtree_control manually. In my practice,
I did not see the benefit of disabling controllers in cgrp_dfl_root by
default (a userspace sketch of what this means follows the patch).

As I am a newbie to cgroup, please correct me if I am wrong.

Thanx.

Signed-off-by: Dongsheng Yang 
---
 kernel/cgroup.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7dc8788..70996ee 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1267,12 +1267,9 @@ static int rebind_subsystems(struct cgroup_root *dst_root, unsigned int ss_mask)
src_root->cgrp.subtree_control &= ~(1 << ssid);
cgroup_refresh_child_subsys_mask(&src_root->cgrp);
 
-   /* default hierarchy doesn't enable controllers by default */
dst_root->subsys_mask |= 1 << ssid;
-   if (dst_root != &cgrp_dfl_root) {
-   dst_root->cgrp.subtree_control |= 1 << ssid;
-   cgroup_refresh_child_subsys_mask(&dst_root->cgrp);
-   }
+   dst_root->cgrp.subtree_control |= 1 << ssid;
+   cgroup_refresh_child_subsys_mask(&dst_root->cgrp);
 
if (ss->bind)
ss->bind(css);
@@ -4990,6 +4987,8 @@ int __init cgroup_init(void)
}
}
 
+   cgrp_dfl_root.cgrp.subtree_control = cgrp_dfl_root.subsys_mask;
+
cgroup_kobj = kobject_create_and_add("cgroup", fs_kobj);
if (!cgroup_kobj)
return -ENOMEM;
-- 
1.8.4.2
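
For context, here is a minimal userspace sketch of what the commit
message above calls enabling controllers manually on the unified
hierarchy; the mount point, controller names, and child cgroup name are
assumptions and may differ on a given system.

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

int main(void)
{
	FILE *f;

	/* controllers must be enabled in the parent before children can use them */
	f = fopen("/sys/fs/cgroup/cgroup.subtree_control", "w");
	if (!f)
		return 1;
	fprintf(f, "+cpu +memory\n");
	if (fclose(f))
		return 1;

	/* only now do new child cgroups get the cpu and memory controllers */
	return mkdir("/sys/fs/cgroup/test", 0755) ? 1 : 0;
}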



[PATCH] cgroup: fix a typo in comment.

2014-08-25 Thread Dongsheng Yang
There is no function named cgroup_enable_task_cg_links().
The correct function name in this comment should be
cgroup_enable_task_cg_lists().

Signed-off-by: Dongsheng Yang 
---
 kernel/cgroup.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7dc8788..64bbb56 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5161,7 +5161,7 @@ void cgroup_post_fork(struct task_struct *child)
int i;
 
/*
-* This may race against cgroup_enable_task_cg_links().  As that
+* This may race against cgroup_enable_task_cg_lists().  As that
 * function sets use_task_css_set_links before grabbing
 * tasklist_lock and we just went through tasklist_lock to add
 * @child, it's guaranteed that either we see the set
@@ -5176,7 +5176,7 @@ void cgroup_post_fork(struct task_struct *child)
 * when implementing operations which need to migrate all tasks of
 * a cgroup to another.
 *
-* Note that if we lose to cgroup_enable_task_cg_links(), @child
+* Note that if we lose to cgroup_enable_task_cg_lists(), @child
 * will remain in init_css_set.  This is safe because all tasks are
 * in the init_css_set before cg_links is enabled and there's no
 * operation which transfers all tasks out of init_css_set.
-- 
1.8.4.2


