Re: [PATCH] bcache: dont reset bio opf in bch_data_insert_start
Hi Christoph,

On Thursday, 2021/1/28 5:10 PM, Dongsheng Yang wrote:

Hi Christoph:

On Thursday, 2021/1/28 1:37 AM, Christoph Hellwig wrote:

But the old code is also completely broken. We can't just OR in the op, as that implicitly assumes the old op was 0 (REQ_OP_READ).

Yes, indeed, there is an assumption that the op can only be 0 (REQ_OP_READ) or 1 (REQ_OP_WRITE). REQ_OP_WRITE comes from cached_dev_submit_bio(), i.e. bios submitted by the upper user. REQ_OP_READ comes from bcache itself, e.g. cached_dev_read_done(): on a cache miss we read the data from the backing device and then want to insert it into the cache device, so a read bio carrying that data reaches this point, and we need to set its bio_op to REQ_OP_WRITE and send the bio to the cache device.

Please fix this to explicitly set the exact op and flags that you want instead of this fragile magic.

This commit only wants to fix the logic bug introduced in ad0d9e76a412 ("bcache: use bio op accessors"); it is more of a partial revert. I agree that we can make it clearer and more explicit. But I found there is no accessor to set the op only; besides, bio_set_op_attrs() was marked as obsolete. There are others doing a similar thing, e.g.:

blk_rq_map_kern():
	bio->bi_opf &= ~REQ_OP_MASK;
	bio->bi_opf |= req_op(rq);

So what about the below:

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index c7cadaafa947..bacc7366002f 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -244,7 +244,14 @@ static void bch_data_insert_start(struct closure *cl)
 		trace_bcache_cache_insert(k);
 		bch_keylist_push(&op->insert_keys);
 
-		bio_set_op_attrs(n, REQ_OP_WRITE, 0);
+		/*
+		 * n here would be REQ_OP_READ, if
+		 * we are inserting data read from
+		 * backing device in cache miss or
+		 * inserting data in movinggc.
+		 */
+		n->bi_opf &= ~REQ_OP_MASK;
+		n->bi_opf |= REQ_OP_WRITE;
 		bch_submit_bbio(n, op->c, k, 0);
 	} while (n != bio);

Another solution is to introduce an accessor that sets the op only, something like bio_set_op(). Then we keep the bcache patch as it was to fix the bug, and send another patch to introduce bio_set_op():

diff --git a/block/blk-map.c b/block/blk-map.c
index 6e804892d5ec..83bc33a59fa5 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -587,9 +587,7 @@ static int __blk_rq_map_user_iov(struct request *rq,
 	if (IS_ERR(bio))
 		return PTR_ERR(bio);
 
-	bio->bi_opf &= ~REQ_OP_MASK;
-	bio->bi_opf |= req_op(rq);
-
+	bio_set_op(bio, req_op(rq));
 	orig_bio = bio;
 
 	/*
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index eb734f7ddaac..d8839300805e 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -244,7 +244,13 @@ static void bch_data_insert_start(struct closure *cl)
 		trace_bcache_cache_insert(k);
 		bch_keylist_push(&op->insert_keys);
 
-		n->bi_opf |= REQ_OP_WRITE;
+		/*
+		 * n here would be REQ_OP_READ, if
+		 * we are inserting data read from
+		 * backing device in cache miss or
+		 * inserting data in movinggc.
+		 */
+		bio_set_op(n, REQ_OP_WRITE);
 		bch_submit_bbio(n, op->c, k, 0);
 	} while (n != bio);

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index b3fc5d3dd8ea..2affd3269bdc 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -439,6 +439,12 @@ static inline void bio_set_op_attrs(struct bio *bio, unsigned op,
 	bio->bi_opf = op | op_flags;
 }
 
+static inline void bio_set_op(struct bio *bio, unsigned op)
+{
+	bio->bi_opf &= ~REQ_OP_MASK;
+	bio->bi_opf |= op;
+}
+
 static inline bool op_is_write(unsigned int op)
 {
 	return (op & 1);

Thanx
Yang
Re: [PATCH] bcache: dont reset bio opf in bch_data_insert_start
Hi Christoph:

On Thursday, 2021/1/28 1:37 AM, Christoph Hellwig wrote:

But the old code is also completely broken. We can't just OR in the op, as that implicitly assumes the old op was 0 (REQ_OP_READ).

Yes, indeed, there is an assumption that the op can only be 0 (REQ_OP_READ) or 1 (REQ_OP_WRITE). REQ_OP_WRITE comes from cached_dev_submit_bio(), i.e. bios submitted by the upper user. REQ_OP_READ comes from bcache itself, e.g. cached_dev_read_done(): on a cache miss we read the data from the backing device and then want to insert it into the cache device, so a read bio carrying that data reaches this point, and we need to set its bio_op to REQ_OP_WRITE and send the bio to the cache device.

Please fix this to explicitly set the exact op and flags that you want instead of this fragile magic.

This commit only wants to fix the logic bug introduced in ad0d9e76a412 ("bcache: use bio op accessors"); it is more of a partial revert. I agree that we can make it clearer and more explicit. But I found there is no accessor to set the op only; besides, bio_set_op_attrs() was marked as obsolete. There are others doing a similar thing, e.g.:

blk_rq_map_kern():
	bio->bi_opf &= ~REQ_OP_MASK;
	bio->bi_opf |= req_op(rq);

So what about the below:

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index c7cadaafa947..bacc7366002f 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -244,7 +244,14 @@ static void bch_data_insert_start(struct closure *cl)
 		trace_bcache_cache_insert(k);
 		bch_keylist_push(&op->insert_keys);
 
-		bio_set_op_attrs(n, REQ_OP_WRITE, 0);
+		/*
+		 * n here would be REQ_OP_READ, if
+		 * we are inserting data read from
+		 * backing device in cache miss or
+		 * inserting data in movinggc.
+		 */
+		n->bi_opf &= ~REQ_OP_MASK;
+		n->bi_opf |= REQ_OP_WRITE;
 		bch_submit_bbio(n, op->c, k, 0);
 	} while (n != bio);

Thanx
Yang
Re: [PATCH] bcache: dont reset bio opf in bch_data_insert_start
On Tuesday, 2021/1/26 12:34 PM, Coly Li wrote:

On 1/26/21 12:32 PM, Dongsheng Yang wrote:

On Monday, 2021/1/25 12:53 PM, Coly Li wrote:

On 1/25/21 12:29 PM, Dongsheng Yang wrote:

Commit ad0d9e76a412 ("bcache: use bio op accessors") made bi_opf be set via bio_set_op_attrs(). But there is a logical problem in this commit:

 		trace_bcache_cache_insert(k);
 		bch_keylist_push(&op->insert_keys);
 
-		n->bi_rw |= REQ_WRITE;
+		bio_set_op_attrs(n, REQ_OP_WRITE, 0);
 		bch_submit_bbio(n, op->c, k, 0);
 	} while (n != bio);

The old code added REQ_WRITE to bio n and kept the other flags; the new code sets REQ_OP_WRITE in bi_opf but resets all the other flags.

This problem was discovered in our performance testing:
(1) start a fio with 1M x 128 depth for read on /dev/nvme0n1p1
(2) start a fio with 1M x 128 depth for write on /dev/escache0 (cache device is /dev/nvme0n1p2)

We found the read bandwidth was 2000+ MB/s, but the write bandwidth was only 0-100 MB/s. After some debugging, we found the problem was that I/O submission on the write path is very slow. bch_data_insert_start() submits a bio to /dev/nvme0n1p1, but since cached_dev submits a stacked bio, the bio is added to current->bio_list and the call returns. Then __submit_bio_noacct() submits the new bio in bio_list to /dev/nvme0n1p1. This operation is slow in blk_mq_submit_bio() -> rq_qos_throttle(q, bio). rq_qos_throttle() calls wbt_should_throttle():

static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
{
	switch (bio_op(bio)) {
	case REQ_OP_WRITE:
		/*
		 * Don't throttle WRITE_ODIRECT
		 */
		if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
		    (REQ_SYNC | REQ_IDLE))
			return false;
	...
	}

Because bio_set_op_attrs() reset (REQ_SYNC | REQ_IDLE), this write bio is considered a non-direct write. After this fix, the bio to nvme is flagged (REQ_SYNC | REQ_IDLE), and fio for writing gets about 1000 MB/s bandwidth.

Fixes: ad0d9e76a412470800c04fc4b56fc86c02d6

It should be:

Fixes: ad0d9e76a412 ("bcache: use bio op accessors")

Signed-off-by: Dongsheng Yang

Please CC the fixed patch author Mike Christie.

Hi Coly,

Should I send a V2 with the commit message updated? Or can you fix it when you take it from the mailing list?

Yes, please fix it in the v2 version. And let's wait for a response from Mike; maybe he has a better suggestion for the fix.

Okay. Actually, Mike is in the CC list of my first mail (though that is not noted in the commit message), so he can receive my patch. But anyway, I will send a v2.

Thanks.

Coly Li
[PATCH V2] bcache: dont reset bio opf in bch_data_insert_start
Commit ad0d9e76a412 ("bcache: use bio op accessors") made bi_opf be set via bio_set_op_attrs(). But there is a logical problem in this commit:

 		trace_bcache_cache_insert(k);
 		bch_keylist_push(&op->insert_keys);
 
-		n->bi_rw |= REQ_WRITE;
+		bio_set_op_attrs(n, REQ_OP_WRITE, 0);
 		bch_submit_bbio(n, op->c, k, 0);
 	} while (n != bio);

The old code added REQ_WRITE to bio n and kept the other flags; the new code sets REQ_OP_WRITE in bi_opf but resets all the other flags.

This problem was discovered in our performance testing:
(1) start a fio with 1M x 128 depth for read on /dev/nvme0n1p1
(2) start a fio with 1M x 128 depth for write on /dev/escache0 (cache device is /dev/nvme0n1p2)

We found the read bandwidth was 2000+ MB/s, but the write bandwidth was only 0-100 MB/s. After some debugging, we found the problem was that I/O submission on the write path is very slow. bch_data_insert_start() submits a bio to /dev/nvme0n1p1, but since cached_dev submits a stacked bio, the bio is added to current->bio_list and the call returns. Then __submit_bio_noacct() submits the new bio in bio_list to /dev/nvme0n1p1. This operation is slow in blk_mq_submit_bio() -> rq_qos_throttle(q, bio). rq_qos_throttle() calls wbt_should_throttle():

static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
{
	switch (bio_op(bio)) {
	case REQ_OP_WRITE:
		/*
		 * Don't throttle WRITE_ODIRECT
		 */
		if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
		    (REQ_SYNC | REQ_IDLE))
			return false;
	...
	}

Because bio_set_op_attrs() reset (REQ_SYNC | REQ_IDLE), this write bio is considered a non-direct write. After this fix, the bio to nvme is flagged (REQ_SYNC | REQ_IDLE), and fio for writing gets about 1000 MB/s bandwidth.
Fixes: ad0d9e76a412 ("bcache: use bio op accessors")
Close: EAS-60259
CC: Mike Christie
Signed-off-by: Dongsheng Yang
---
 drivers/md/bcache/request.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index c7cadaafa947..eb734f7ddaac 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -244,7 +244,7 @@ static void bch_data_insert_start(struct closure *cl)
 		trace_bcache_cache_insert(k);
 		bch_keylist_push(&op->insert_keys);
 
-		bio_set_op_attrs(n, REQ_OP_WRITE, 0);
+		n->bi_opf |= REQ_OP_WRITE;
 		bch_submit_bbio(n, op->c, k, 0);
 	} while (n != bio);
-- 
2.25.1
[PATCH V3] bcache: dont reset bio opf in bch_data_insert_start
Commit ad0d9e76a412 ("bcache: use bio op accessors") made bi_opf be set via bio_set_op_attrs(). But there is a logical problem in this commit:

 		trace_bcache_cache_insert(k);
 		bch_keylist_push(&op->insert_keys);
 
-		n->bi_rw |= REQ_WRITE;
+		bio_set_op_attrs(n, REQ_OP_WRITE, 0);
 		bch_submit_bbio(n, op->c, k, 0);
 	} while (n != bio);

The old code added REQ_WRITE to bio n and kept the other flags; the new code sets REQ_OP_WRITE in bi_opf but resets all the other flags.

This problem was discovered in our performance testing:
(1) start a fio with 1M x 128 depth for read on /dev/nvme0n1p1
(2) start a fio with 1M x 128 depth for write on /dev/escache0 (cache device is /dev/nvme0n1p2)

We found the read bandwidth was 2000+ MB/s, but the write bandwidth was only 0-100 MB/s. After some debugging, we found the problem was that I/O submission on the write path is very slow. bch_data_insert_start() submits a bio to /dev/nvme0n1p1, but since cached_dev submits a stacked bio, the bio is added to current->bio_list and the call returns. Then __submit_bio_noacct() submits the new bio in bio_list to /dev/nvme0n1p1. This operation is slow in blk_mq_submit_bio() -> rq_qos_throttle(q, bio). rq_qos_throttle() calls wbt_should_throttle():

static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
{
	switch (bio_op(bio)) {
	case REQ_OP_WRITE:
		/*
		 * Don't throttle WRITE_ODIRECT
		 */
		if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
		    (REQ_SYNC | REQ_IDLE))
			return false;
	...
	}

Because bio_set_op_attrs() reset (REQ_SYNC | REQ_IDLE), this write bio is considered a non-direct write. After this fix, the bio to nvme is flagged (REQ_SYNC | REQ_IDLE), and fio for writing gets about 1000 MB/s bandwidth.

Fixes: ad0d9e76a412 ("bcache: use bio op accessors")
CC: Mike Christie
Signed-off-by: Dongsheng Yang
---
V3 -> V2: removed an unused Close: line from the commit message.
 drivers/md/bcache/request.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index c7cadaafa947..eb734f7ddaac 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -244,7 +244,7 @@ static void bch_data_insert_start(struct closure *cl)
 		trace_bcache_cache_insert(k);
 		bch_keylist_push(&op->insert_keys);
 
-		bio_set_op_attrs(n, REQ_OP_WRITE, 0);
+		n->bi_opf |= REQ_OP_WRITE;
 		bch_submit_bbio(n, op->c, k, 0);
 	} while (n != bio);
-- 
2.25.1
Re: [PATCH] bcache: dont reset bio opf in bch_data_insert_start
On Monday, 2021/1/25 12:53 PM, Coly Li wrote:

On 1/25/21 12:29 PM, Dongsheng Yang wrote:

Commit ad0d9e76a412 ("bcache: use bio op accessors") made bi_opf be set via bio_set_op_attrs(). But there is a logical problem in this commit:

 		trace_bcache_cache_insert(k);
 		bch_keylist_push(&op->insert_keys);
 
-		n->bi_rw |= REQ_WRITE;
+		bio_set_op_attrs(n, REQ_OP_WRITE, 0);
 		bch_submit_bbio(n, op->c, k, 0);
 	} while (n != bio);

The old code added REQ_WRITE to bio n and kept the other flags; the new code sets REQ_OP_WRITE in bi_opf but resets all the other flags.

This problem was discovered in our performance testing:
(1) start a fio with 1M x 128 depth for read on /dev/nvme0n1p1
(2) start a fio with 1M x 128 depth for write on /dev/escache0 (cache device is /dev/nvme0n1p2)

We found the read bandwidth was 2000+ MB/s, but the write bandwidth was only 0-100 MB/s. After some debugging, we found the problem was that I/O submission on the write path is very slow. bch_data_insert_start() submits a bio to /dev/nvme0n1p1, but since cached_dev submits a stacked bio, the bio is added to current->bio_list and the call returns. Then __submit_bio_noacct() submits the new bio in bio_list to /dev/nvme0n1p1. This operation is slow in blk_mq_submit_bio() -> rq_qos_throttle(q, bio). rq_qos_throttle() calls wbt_should_throttle():

static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
{
	switch (bio_op(bio)) {
	case REQ_OP_WRITE:
		/*
		 * Don't throttle WRITE_ODIRECT
		 */
		if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
		    (REQ_SYNC | REQ_IDLE))
			return false;
	...
	}

Because bio_set_op_attrs() reset (REQ_SYNC | REQ_IDLE), this write bio is considered a non-direct write. After this fix, the bio to nvme is flagged (REQ_SYNC | REQ_IDLE), and fio for writing gets about 1000 MB/s bandwidth.

Fixes: ad0d9e76a412470800c04fc4b56fc86c02d6

It should be:

Fixes: ad0d9e76a412 ("bcache: use bio op accessors")

Signed-off-by: Dongsheng Yang

Please CC the fixed patch author Mike Christie.

Hi Coly,

Should I send a V2 with the commit message updated? Or can you fix it when you take it from the mailing list?

Thanx
Yang

---
 drivers/md/bcache/request.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index c7cadaafa947..eb734f7ddaac 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -244,7 +244,7 @@ static void bch_data_insert_start(struct closure *cl)
 		trace_bcache_cache_insert(k);
 		bch_keylist_push(&op->insert_keys);
 
-		bio_set_op_attrs(n, REQ_OP_WRITE, 0);
+		n->bi_opf |= REQ_OP_WRITE;
 		bch_submit_bbio(n, op->c, k, 0);
 	} while (n != bio);

The fix is OK to me; I'd like to see an opinion from Mike Christie too. Thanks for the catch.

Coly Li
[PATCH] bcache: dont reset bio opf in bch_data_insert_start
Commit ad0d9e76a412 ("bcache: use bio op accessors") made bi_opf be set via bio_set_op_attrs(). But there is a logical problem in this commit:

 		trace_bcache_cache_insert(k);
 		bch_keylist_push(&op->insert_keys);
 
-		n->bi_rw |= REQ_WRITE;
+		bio_set_op_attrs(n, REQ_OP_WRITE, 0);
 		bch_submit_bbio(n, op->c, k, 0);
 	} while (n != bio);

The old code added REQ_WRITE to bio n and kept the other flags; the new code sets REQ_OP_WRITE in bi_opf but resets all the other flags.

This problem was discovered in our performance testing:
(1) start a fio with 1M x 128 depth for read on /dev/nvme0n1p1
(2) start a fio with 1M x 128 depth for write on /dev/escache0 (cache device is /dev/nvme0n1p2)

We found the read bandwidth was 2000+ MB/s, but the write bandwidth was only 0-100 MB/s. After some debugging, we found the problem was that I/O submission on the write path is very slow. bch_data_insert_start() submits a bio to /dev/nvme0n1p1, but since cached_dev submits a stacked bio, the bio is added to current->bio_list and the call returns. Then __submit_bio_noacct() submits the new bio in bio_list to /dev/nvme0n1p1. This operation is slow in blk_mq_submit_bio() -> rq_qos_throttle(q, bio). rq_qos_throttle() calls wbt_should_throttle():

static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
{
	switch (bio_op(bio)) {
	case REQ_OP_WRITE:
		/*
		 * Don't throttle WRITE_ODIRECT
		 */
		if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
		    (REQ_SYNC | REQ_IDLE))
			return false;
	...
	}

Because bio_set_op_attrs() reset (REQ_SYNC | REQ_IDLE), this write bio is considered a non-direct write. After this fix, the bio to nvme is flagged (REQ_SYNC | REQ_IDLE), and fio for writing gets about 1000 MB/s bandwidth.
Fixes: ad0d9e76a412470800c04fc4b56fc86c02d6
Signed-off-by: Dongsheng Yang
---
 drivers/md/bcache/request.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index c7cadaafa947..eb734f7ddaac 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -244,7 +244,7 @@ static void bch_data_insert_start(struct closure *cl)
 		trace_bcache_cache_insert(k);
 		bch_keylist_push(&op->insert_keys);
 
-		bio_set_op_attrs(n, REQ_OP_WRITE, 0);
+		n->bi_opf |= REQ_OP_WRITE;
 		bch_submit_bbio(n, op->c, k, 0);
 	} while (n != bio);
-- 
2.25.1
Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
On Wednesday, 2020/12/9 12:48 PM, Dongdong Tao wrote:

Hi Dongsheng,

I'm working on it; next I'm gathering some testing data to upload (very sorry for the delay...). Thanks for the comment. One of the main concerns in alleviating this issue through the writeback process is that we need to minimize the impact on client IO performance. writeback_percent is 10 by default; starting writeback when dirty buckets reach 10 percent might be a bit too aggressive, as writeback_cutoff_sync is 70 percent. So I chose to start the writeback when dirty buckets reach 50 percent, so that this patch only takes effect after the dirty bucket percentage is above that.

Agreed that reusing writeback_percent would be too aggressive, and less flexible. Okay, let's wait for your testing results.

Thanx

Thanks,
Dongdong

On Wed, Dec 9, 2020 at 10:27 AM Dongsheng Yang wrote:

On Tuesday, 2020/11/3 8:42 PM, Dongdong Tao wrote:

From: dongdong tao

The current way to calculate the writeback rate only considers the dirty sectors. This usually works fine when the fragmentation is not high, but it gives us an unreasonably small rate when we are in a situation where very few dirty sectors consume a lot of dirty buckets. In some cases, the dirty buckets can reach CUTOFF_WRITEBACK_SYNC while the dirty data (sectors) has not even reached writeback_percent; the writeback rate will still be the minimum value (4k), thus causing all writes to be stuck in a non-writeback mode because of the slow writeback.

This patch tries to accelerate the writeback rate when the fragmentation is high. It calculates the proportional_scaled value as:

  (dirty_sectors / writeback_rate_p_term_inverse) * fragment

As we can see, higher fragmentation results in a larger proportional_scaled value, and thus a larger writeback rate. The fragment value is calculated as:

  (dirty_buckets * bucket_size) / dirty_sectors

If you think about it, the value of fragment will always be inside [1, bucket_size].

This patch only considers the fragmentation when the number of dirty_buckets has reached a dirty threshold (configurable via writeback_fragment_percent, default 50), so bcache retains its original behaviour before the dirty buckets reach the threshold.

Signed-off-by: dongdong tao
---
 drivers/md/bcache/bcache.h    |  1 +
 drivers/md/bcache/sysfs.c     |  6 ++
 drivers/md/bcache/writeback.c | 21 +
 3 files changed, 28 insertions(+)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 1d57f48307e6..87632f7032b6 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -374,6 +374,7 @@ struct cached_dev {
 	unsigned int		writeback_metadata:1;
 	unsigned int		writeback_running:1;
 	unsigned char		writeback_percent;
+	unsigned char		writeback_fragment_percent;
 	unsigned int		writeback_delay;
 
 	uint64_t		writeback_rate_target;
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 554e3afc9b68..69499113aef8 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -115,6 +115,7 @@ rw_attribute(stop_when_cache_set_failed);
 rw_attribute(writeback_metadata);
 rw_attribute(writeback_running);
 rw_attribute(writeback_percent);
+rw_attribute(writeback_fragment_percent);

Hi Dongdong and Coly,

What is the status of this patch? In my opinion, it is a problem we need to solve, but can we just reuse the writeback_percent parameter rather than introduce a new writeback_fragment_percent? That means the semantics of writeback_percent would act on both the dirty data percent and the dirty bucket percent: when we find there are more dirty buckets than (c->nbuckets * writeback_percent), start the writeback.

Thanx
Yang

 rw_attribute(writeback_delay);
 rw_attribute(writeback_rate);
@@ -197,6 +198,7 @@ SHOW(__bch_cached_dev)
 	var_printf(writeback_running, "%i");
 	var_print(writeback_delay);
 	var_print(writeback_percent);
+	var_print(writeback_fragment_percent);
 	sysfs_hprint(writeback_rate,
 		     wb ? atomic_long_read(&dc->writeback_rate.rate) << 9 : 0);
 	sysfs_printf(io_errors, "%i", atomic_read(&dc->io_errors));
@@ -308,6 +310,9 @@ STORE(__cached_dev)
 	sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent,
 			    0, bch_cutoff_writeback);
 
+	sysfs_strtoul_clamp(writeback_fragment_percent, dc->writeback_fragment_percent,
+			    0, bch_cutoff_writeback_sync);
+
 	if (attr == &sysfs_writeback_rate) {
 		ssize_t ret;
 		long int v = atomic_long_read(&dc->writeback_rate.rate);
@@ -498,6 +503,7 @@ static struct attribute *bch_cached_dev_files[] = {
 	&sysfs_writeback_running,
 	&sysfs_writeback_delay,
 	&sysfs_writeback_percent,
+	&sysfs_writeb
Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
On Tuesday, 2020/11/3 8:42 PM, Dongdong Tao wrote:

From: dongdong tao

The current way to calculate the writeback rate only considers the dirty sectors. This usually works fine when the fragmentation is not high, but it gives us an unreasonably small rate when we are in a situation where very few dirty sectors consume a lot of dirty buckets. In some cases, the dirty buckets can reach CUTOFF_WRITEBACK_SYNC while the dirty data (sectors) has not even reached writeback_percent; the writeback rate will still be the minimum value (4k), thus causing all writes to be stuck in a non-writeback mode because of the slow writeback.

This patch tries to accelerate the writeback rate when the fragmentation is high. It calculates the proportional_scaled value as:

  (dirty_sectors / writeback_rate_p_term_inverse) * fragment

As we can see, higher fragmentation results in a larger proportional_scaled value, and thus a larger writeback rate. The fragment value is calculated as:

  (dirty_buckets * bucket_size) / dirty_sectors

If you think about it, the value of fragment will always be inside [1, bucket_size].

This patch only considers the fragmentation when the number of dirty_buckets has reached a dirty threshold (configurable via writeback_fragment_percent, default 50), so bcache retains its original behaviour before the dirty buckets reach the threshold.

Signed-off-by: dongdong tao
---
 drivers/md/bcache/bcache.h    |  1 +
 drivers/md/bcache/sysfs.c     |  6 ++
 drivers/md/bcache/writeback.c | 21 +
 3 files changed, 28 insertions(+)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 1d57f48307e6..87632f7032b6 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -374,6 +374,7 @@ struct cached_dev {
 	unsigned int		writeback_metadata:1;
 	unsigned int		writeback_running:1;
 	unsigned char		writeback_percent;
+	unsigned char		writeback_fragment_percent;
 	unsigned int		writeback_delay;
 
 	uint64_t		writeback_rate_target;
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 554e3afc9b68..69499113aef8 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -115,6 +115,7 @@ rw_attribute(stop_when_cache_set_failed);
 rw_attribute(writeback_metadata);
 rw_attribute(writeback_running);
 rw_attribute(writeback_percent);
+rw_attribute(writeback_fragment_percent);

Hi Dongdong and Coly,

What is the status of this patch? In my opinion, it is a problem we need to solve, but can we just reuse the writeback_percent parameter rather than introduce a new writeback_fragment_percent? That means the semantics of writeback_percent would act on both the dirty data percent and the dirty bucket percent: when we find there are more dirty buckets than (c->nbuckets * writeback_percent), start the writeback.

Thanx
Yang

 rw_attribute(writeback_delay);
 rw_attribute(writeback_rate);
@@ -197,6 +198,7 @@ SHOW(__bch_cached_dev)
 	var_printf(writeback_running, "%i");
 	var_print(writeback_delay);
 	var_print(writeback_percent);
+	var_print(writeback_fragment_percent);
 	sysfs_hprint(writeback_rate,
 		     wb ? atomic_long_read(&dc->writeback_rate.rate) << 9 : 0);
 	sysfs_printf(io_errors, "%i", atomic_read(&dc->io_errors));
@@ -308,6 +310,9 @@ STORE(__cached_dev)
 	sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent,
 			    0, bch_cutoff_writeback);
 
+	sysfs_strtoul_clamp(writeback_fragment_percent, dc->writeback_fragment_percent,
+			    0, bch_cutoff_writeback_sync);
+
 	if (attr == &sysfs_writeback_rate) {
 		ssize_t ret;
 		long int v = atomic_long_read(&dc->writeback_rate.rate);
@@ -498,6 +503,7 @@ static struct attribute *bch_cached_dev_files[] = {
 	&sysfs_writeback_running,
 	&sysfs_writeback_delay,
 	&sysfs_writeback_percent,
+	&sysfs_writeback_fragment_percent,
 	&sysfs_writeback_rate,
 	&sysfs_writeback_rate_update_seconds,
 	&sysfs_writeback_rate_i_term_inverse,
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 3c74996978da..34babc89fdf3 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -88,6 +88,26 @@ static void __update_writeback_rate(struct cached_dev *dc)
 	int64_t integral_scaled;
 	uint32_t new_rate;
 
+	/*
+	 * We need to consider the number of dirty buckets as well
+	 * when calculating the proportional_scaled, Otherwise we might
+	 * have an unreasonable small writeback rate at a highly fragmented situation
+	 * when very few dirty sectors consumed a lot dirty buckets, the
+	 * worst case is when dirty_data reached writeback_percent and
+	 * dirty buckets reached to cutoff_writeback_sync, but the rate
+	 * still will be at
Re: [PATCH] rbd: fix spelling mistake: "reregisteration" -> "re-registration"
Hi Colin,

On 03/19/2018 09:33 PM, Colin King wrote:

From: Colin Ian King

Trivial fix to a spelling mistake in rbd_warn message text.

Signed-off-by: Colin Ian King
---
 drivers/block/rbd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 1e03b04819c8..7b97240ff15e 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3887,7 +3887,7 @@ static void rbd_reregister_watch(struct work_struct *work)
 
 	ret = rbd_dev_refresh(rbd_dev);
 	if (ret)
-		rbd_warn(rbd_dev, "reregisteration refresh failed: %d", ret);
+		rbd_warn(rbd_dev, "re-registration refresh failed: %d", ret);

Hmmm, I am not sure that is a spelling mistake, because the function name is rbd_reregister_watch().

Thanx
Yang

 }
 
 /*
[tip:sched/core] sched/cpuacct: Split usage accounting into user_usage and sys_usage
Commit-ID: d740037fac7052e49450f6fa1454f1144a103b55
Gitweb: http://git.kernel.org/tip/d740037fac7052e49450f6fa1454f1144a103b55
Author: Dongsheng Yang <yangds.f...@cn.fujitsu.com>
AuthorDate: Tue, 22 Mar 2016 16:37:08 +0800
Committer: Ingo Molnar <mi...@kernel.org>
CommitDate: Thu, 31 Mar 2016 10:48:54 +0200

sched/cpuacct: Split usage accounting into user_usage and sys_usage

Sometimes, cpuacct.usage is not detailed enough to see how much CPU usage a group had. We want to know how much time it used in user mode and how much in kernel mode.

This patch introduces more files to give this information:

 # ls /sys/fs/cgroup/cpuacct/cpuacct.usage*
 /sys/fs/cgroup/cpuacct/cpuacct.usage
 /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu
 /sys/fs/cgroup/cpuacct/cpuacct.usage_user
 /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_user
 /sys/fs/cgroup/cpuacct/cpuacct.usage_sys
 /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_sys

... while keeping the ABI with the existing counter.

Signed-off-by: Dongsheng Yang <yangds.f...@cn.fujitsu.com>
[ Ported to newer kernels. ]
Signed-off-by: Zhao Lei <zhao...@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Acked-by: Tejun Heo <t...@kernel.org>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Mike Galbraith <efa...@gmx.de>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Tejun Heo <hte...@gmail.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Link: http://lkml.kernel.org/r/aa171da036b520b51c79549e9b3215d29473f19d.1458635566.git.zhao...@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 kernel/sched/cpuacct.c | 140 +++--
 1 file changed, 113 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index e39bd4f..df947e0 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -25,11 +25,22 @@ enum cpuacct_stat_index {
 	CPUACCT_STAT_NSTATS,
 };
 
+enum cpuacct_usage_index {
+	CPUACCT_USAGE_USER,	/* ... user mode */
+	CPUACCT_USAGE_SYSTEM,	/* ... kernel mode */
+
+	CPUACCT_USAGE_NRUSAGE,
+};
+
+struct cpuacct_usage {
+	u64	usages[CPUACCT_USAGE_NRUSAGE];
+};
+
 /* track cpu usage of a group of tasks and its child groups */
 struct cpuacct {
 	struct cgroup_subsys_state css;
 	/* cpuusage holds pointer to a u64-type object on every cpu */
-	u64 __percpu	*cpuusage;
+	struct cpuacct_usage __percpu	*cpuusage;
 	struct kernel_cpustat __percpu	*cpustat;
 };
 
@@ -49,7 +60,7 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca)
 	return css_ca(ca->css.parent);
 }
 
-static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage);
+static DEFINE_PER_CPU(struct cpuacct_usage, root_cpuacct_cpuusage);
 static struct cpuacct root_cpuacct = {
 	.cpustat	= &kernel_cpustat,
 	.cpuusage	= &root_cpuacct_cpuusage,
 };
@@ -68,7 +79,7 @@ cpuacct_css_alloc(struct cgroup_subsys_state *parent_css)
 	if (!ca)
 		goto out;
 
-	ca->cpuusage = alloc_percpu(u64);
+	ca->cpuusage = alloc_percpu(struct cpuacct_usage);
 	if (!ca->cpuusage)
 		goto out_free_ca;
 
@@ -96,20 +107,37 @@ static void cpuacct_css_free(struct cgroup_subsys_state *css)
 	kfree(ca);
 }
 
-static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
+static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
+				 enum cpuacct_usage_index index)
 {
-	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+	struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
 	u64 data;
 
+	/*
+	 * We allow index == CPUACCT_USAGE_NRUSAGE here to read
+	 * the sum of suages.
+	 */
+	BUG_ON(index > CPUACCT_USAGE_NRUSAGE);
+
 #ifndef CONFIG_64BIT
 	/*
 	 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
 	 */
 	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
-	data = *cpuusage;
+#endif
+
+	if (index == CPUACCT_USAGE_NRUSAGE) {
+		int i = 0;
+
+		data = 0;
+		for (i = 0; i < CPUACCT_USAGE_NRUSAGE; i++)
+			data += cpuusage->usages[i];
+	} else {
+		data = cpuusage->usages[index];
+	}
+
+#ifndef CONFIG_64BIT
 	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
-	data = *cpuusage;
 #endif
 
 	return data;
@@ -117,69 +145,103 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
 
 static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
 {
-	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+	struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+	int i;
 
 #ifndef CONFIG_64BIT
 	/*
 	 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
 	 */
 	raw_spin_lock_irq(&cpu_rq(cpu)->loc
[tip:sched/core] sched/cpuacct: Split usage accounting into user_usage and sys_usage
Commit-ID: d740037fac7052e49450f6fa1454f1144a103b55 Gitweb: http://git.kernel.org/tip/d740037fac7052e49450f6fa1454f1144a103b55 Author: Dongsheng Yang AuthorDate: Tue, 22 Mar 2016 16:37:08 +0800 Committer: Ingo Molnar CommitDate: Thu, 31 Mar 2016 10:48:54 +0200 sched/cpuacct: Split usage accounting into user_usage and sys_usage Sometimes, cpuacct.usage is not detailed enough to see how much CPU usage a group had. We want to know how much time it used in user mode and how much in kernel mode. This patch introduces more files to give this information: # ls /sys/fs/cgroup/cpuacct/cpuacct.usage* /sys/fs/cgroup/cpuacct/cpuacct.usage /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu /sys/fs/cgroup/cpuacct/cpuacct.usage_user /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_user /sys/fs/cgroup/cpuacct/cpuacct.usage_sys /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_sys ... while keeping the ABI with the existing counter. Signed-off-by: Dongsheng Yang [ Ported to newer kernels. ] Signed-off-by: Zhao Lei Signed-off-by: Peter Zijlstra (Intel) Acked-by: Tejun Heo Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Tejun Heo Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/aa171da036b520b51c79549e9b3215d29473f19d.1458635566.git.zhao...@cn.fujitsu.com Signed-off-by: Ingo Molnar --- kernel/sched/cpuacct.c | 140 +++-- 1 file changed, 113 insertions(+), 27 deletions(-) diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c index e39bd4f..df947e0 100644 --- a/kernel/sched/cpuacct.c +++ b/kernel/sched/cpuacct.c @@ -25,11 +25,22 @@ enum cpuacct_stat_index { CPUACCT_STAT_NSTATS, }; +enum cpuacct_usage_index { + CPUACCT_USAGE_USER, /* ... user mode */ + CPUACCT_USAGE_SYSTEM, /* ... 
kernel mode */ + + CPUACCT_USAGE_NRUSAGE, +}; + +struct cpuacct_usage { + u64 usages[CPUACCT_USAGE_NRUSAGE]; +}; + /* track cpu usage of a group of tasks and its child groups */ struct cpuacct { struct cgroup_subsys_state css; /* cpuusage holds pointer to a u64-type object on every cpu */ - u64 __percpu *cpuusage; + struct cpuacct_usage __percpu *cpuusage; struct kernel_cpustat __percpu *cpustat; }; @@ -49,7 +60,7 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca) return css_ca(ca->css.parent); } -static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage); +static DEFINE_PER_CPU(struct cpuacct_usage, root_cpuacct_cpuusage); static struct cpuacct root_cpuacct = { .cpustat = &root_cpuacct_cpustat, .cpuusage = &root_cpuacct_cpuusage, }; @@ -68,7 +79,7 @@ cpuacct_css_alloc(struct cgroup_subsys_state *parent_css) if (!ca) goto out; - ca->cpuusage = alloc_percpu(u64); + ca->cpuusage = alloc_percpu(struct cpuacct_usage); if (!ca->cpuusage) goto out_free_ca; @@ -96,20 +107,37 @@ static void cpuacct_css_free(struct cgroup_subsys_state *css) kfree(ca); } -static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu) +static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu, +enum cpuacct_usage_index index) { - u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu); + struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu); u64 data; + /* +* We allow index == CPUACCT_USAGE_NRUSAGE here to read +* the sum of usages. +*/ + BUG_ON(index > CPUACCT_USAGE_NRUSAGE); + #ifndef CONFIG_64BIT /* * Take rq->lock to make 64-bit read safe on 32-bit platforms. 
*/ raw_spin_lock_irq(&cpu_rq(cpu)->lock); - data = *cpuusage; +#endif + + if (index == CPUACCT_USAGE_NRUSAGE) { + int i = 0; + + data = 0; + for (i = 0; i < CPUACCT_USAGE_NRUSAGE; i++) + data += cpuusage->usages[i]; + } else { + data = cpuusage->usages[index]; + } + +#ifndef CONFIG_64BIT raw_spin_unlock_irq(&cpu_rq(cpu)->lock); -#else - data = *cpuusage; #endif return data; @@ -117,69 +145,103 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu) static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val) { - u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu); + struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu); + int i; #ifndef CONFIG_64BIT /* * Take rq->lock to make 64-bit write safe on 32-bit platforms. */ raw_spin_lock_irq(&cpu_rq(cpu)->lock); - *cpuusage = val; +#endif + + for (i = 0; i < CPUACCT_USAGE_NRUSAGE; i++) + cpuusage->usages[i] = val; + +#ifndef CONFIG_64BIT raw_spin_unlock_irq(&cpu_rq(cpu)->lock); -#else - *cpuusage = val; #endif } /* return total cpu usage (in nanoseconds) of a group */ -static u64 cpuus
Re: [PATCH 0/5] drivers/mtd: make several functions return bool
CCing: Brian and Richard Hi Yao, Is that really necessary? I am not sure how much benefit we can achieve from this change. Could you explain more? Yang On 03/25/2016 10:41 AM, Yaowei Bai wrote: This series only makes several functions return bool to improve readability, no other functional changes. Yaowei Bai (5): drivers/mtd: mtd_is_partition can be boolean drivers/mtd: cfi_interleave_supported can be boolean drivers/mtd: map_bankwidth_supported can be boolean drivers/mtd: mtd_nand_has_bch can be boolean drivers/mtd/nand: nand_opcode_8bits can be boolean drivers/mtd/mtdpart.c | 6 +++--- include/linux/mtd/cfi.h | 6 +++--- include/linux/mtd/map.h | 6 +++--- include/linux/mtd/nand.h | 6 +++--- include/linux/mtd/nand_bch.h | 4 ++-- include/linux/mtd/partitions.h | 2 +- 6 files changed, 15 insertions(+), 15 deletions(-)
[tip:sched/urgent] sched/cpuacct: Rename parameter in cpuusage_write() for readability
Commit-ID: 1a736b77a3f50910843d076623204ba6e5057dc1 Gitweb: http://git.kernel.org/tip/1a736b77a3f50910843d076623204ba6e5057dc1 Author: Dongsheng Yang <yangds.f...@cn.fujitsu.com> AuthorDate: Mon, 21 Dec 2015 19:14:42 +0800 Committer: Ingo Molnar <mi...@kernel.org> CommitDate: Mon, 21 Mar 2016 10:59:29 +0100 sched/cpuacct: Rename parameter in cpuusage_write() for readability The name of the 'reset' parameter to cpuusage_write() is quite confusing, because the only valid value we allow is '0', so !reset is actually the case that resets ... Rename it to 'val' and explain it in a comment that we only allow 0. Signed-off-by: Dongsheng Yang <yangds.f...@cn.fujitsu.com> Cc: Linus Torvalds <torva...@linux-foundation.org> Cc: Peter Zijlstra <pet...@infradead.org> Cc: Thomas Gleixner <t...@linutronix.de> Cc: cgro...@vger.kernel.org Cc: t...@kernel.org Link: http://lkml.kernel.org/r/1450696483-2864-1-git-send-email-yangds.f...@cn.fujitsu.com Signed-off-by: Ingo Molnar <mi...@kernel.org> --- kernel/sched/cpuacct.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c index dd7cbb5..9c2bbf7 100644 --- a/kernel/sched/cpuacct.c +++ b/kernel/sched/cpuacct.c @@ -145,13 +145,16 @@ static u64 cpuusage_read(struct cgroup_subsys_state *css, struct cftype *cft) } static int cpuusage_write(struct cgroup_subsys_state *css, struct cftype *cft, - u64 reset) + u64 val) { struct cpuacct *ca = css_ca(css); int err = 0; int i; - if (reset) { + /* +* Only allow '0' here to do a reset. +*/ + if (val) { err = -EINVAL; goto out; }
[tip:sched/urgent] sched/cpuacct: Rename parameter in cpuusage_write() for readability
Commit-ID: 1a736b77a3f50910843d076623204ba6e5057dc1 Gitweb: http://git.kernel.org/tip/1a736b77a3f50910843d076623204ba6e5057dc1 Author: Dongsheng Yang AuthorDate: Mon, 21 Dec 2015 19:14:42 +0800 Committer: Ingo Molnar CommitDate: Mon, 21 Mar 2016 10:59:29 +0100 sched/cpuacct: Rename parameter in cpuusage_write() for readability The name of the 'reset' parameter to cpuusage_write() is quite confusing, because the only valid value we allow is '0', so !reset is actually the case that resets ... Rename it to 'val' and explain it in a comment that we only allow 0. Signed-off-by: Dongsheng Yang Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: cgro...@vger.kernel.org Cc: t...@kernel.org Link: http://lkml.kernel.org/r/1450696483-2864-1-git-send-email-yangds.f...@cn.fujitsu.com Signed-off-by: Ingo Molnar --- kernel/sched/cpuacct.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c index dd7cbb5..9c2bbf7 100644 --- a/kernel/sched/cpuacct.c +++ b/kernel/sched/cpuacct.c @@ -145,13 +145,16 @@ static u64 cpuusage_read(struct cgroup_subsys_state *css, struct cftype *cft) } static int cpuusage_write(struct cgroup_subsys_state *css, struct cftype *cft, - u64 reset) + u64 val) { struct cpuacct *ca = css_ca(css); int err = 0; int i; - if (reset) { + /* +* Only allow '0' here to do a reset. +*/ + if (val) { err = -EINVAL; goto out; }
[tip:sched/core] sched/core: Remove duplicated sched_group_set_shares() prototype
Commit-ID: 41d93397334f3c3374810f45e7bcce9007d1a7bb Gitweb: http://git.kernel.org/tip/41d93397334f3c3374810f45e7bcce9007d1a7bb Author: Dongsheng Yang <yangds.f...@cn.fujitsu.com> AuthorDate: Wed, 13 Jan 2016 16:42:38 +0800 Committer: Ingo Molnar <mi...@kernel.org> CommitDate: Mon, 29 Feb 2016 09:53:05 +0100 sched/core: Remove duplicated sched_group_set_shares() prototype Signed-off-by: Dongsheng Yang <yangds.f...@cn.fujitsu.com> Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org> Cc: <lize...@huawei.com> Cc: Linus Torvalds <torva...@linux-foundation.org> Cc: Mike Galbraith <efa...@gmx.de> Cc: Peter Zijlstra <pet...@infradead.org> Cc: Thomas Gleixner <t...@linutronix.de> Link: http://lkml.kernel.org/r/1452674558-31897-1-git-send-email-yangds.f...@cn.fujitsu.com Signed-off-by: Ingo Molnar <mi...@kernel.org> --- kernel/sched/sched.h | 1 - 1 file changed, 1 deletion(-) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 91f0a77..3dc7b8b 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -318,7 +318,6 @@ extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq, struct sched_entity *se, int cpu, struct sched_entity *parent); extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b); -extern int sched_group_set_shares(struct task_group *tg, unsigned long shares); extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b); extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
[tip:sched/core] sched/core: Remove duplicated sched_group_set_shares() prototype
Commit-ID: 41d93397334f3c3374810f45e7bcce9007d1a7bb Gitweb: http://git.kernel.org/tip/41d93397334f3c3374810f45e7bcce9007d1a7bb Author: Dongsheng Yang AuthorDate: Wed, 13 Jan 2016 16:42:38 +0800 Committer: Ingo Molnar CommitDate: Mon, 29 Feb 2016 09:53:05 +0100 sched/core: Remove duplicated sched_group_set_shares() prototype Signed-off-by: Dongsheng Yang Signed-off-by: Peter Zijlstra (Intel) Cc: Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1452674558-31897-1-git-send-email-yangds.f...@cn.fujitsu.com Signed-off-by: Ingo Molnar --- kernel/sched/sched.h | 1 - 1 file changed, 1 deletion(-) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 91f0a77..3dc7b8b 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -318,7 +318,6 @@ extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq, struct sched_entity *se, int cpu, struct sched_entity *parent); extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b); -extern int sched_group_set_shares(struct task_group *tg, unsigned long shares); extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b); extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
Re: [Propose] Isolate core_pattern in mnt namespace.
On 12/24/2015 12:36 AM, Eric W. Biederman wrote: Dongsheng Yang writes: [...] Hi Eric, Happy new year and sorry for the late reply. Given the other constraints on an implementation the pid namespace looks by far the one best suited to host such a sysctl if it is possible to implement safely. So you think it's better to isolate the core_pattern in pid_namespace, am I right? But, core_file_path and user_mode_helper_path in core_pattern are much more related with mnt_namespace IMO. Could you help to explain it more? Thanx Yang Eric . -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Propose] Isolate core_pattern in mnt namespace.
On 12/22/2015 05:52 AM, Eric W. Biederman wrote: Dongsheng Yang writes: On 12/20/2015 05:47 PM, Eric W. Biederman wrote: Dongsheng Yang writes: On 12/20/2015 10:37 AM, Al Viro wrote: On Sun, Dec 20, 2015 at 10:14:29AM +0800, Dongsheng Yang wrote: On 12/17/2015 07:23 PM, Dongsheng Yang wrote: Hi guys, We are working on making core dump behaviour isolated in a container. But the problem is, the /proc/sys/kernel/core_pattern is a kernel-wide setting, not one belonging to a container. So we want to add core_pattern into the mnt namespace. What do you think about it? Hi Eric, I found your patch about "net: Implement the per network namespace sysctl infrastructure", I want to do the similar thing in the mnt namespace. Is that the suggested way? Why mnt namespace and not something else? Hi Al, Well, because core_pattern indicates the path to store the core file. In a different mnt namespace, we would like to change the path to a different value. In addition, let's consider the other namespaces: UTS ns: contains information about the kernel and arch, not proper for core_pattern. IPC ns: communication information, not proper for core_pattern. PID ns: core_pattern is not related to pid. net ns: obviously no. user ns: not proper either. Then I believe it's better to do this in the mnt namespace. Of course, core_pattern is just one example. After this infrastructure is finished, we can implement more sysctls as per-mnt if necessary, I think. Al, what do you think about this idea? The hard part is not the sysctl. The hard part is starting the usermode helper, in an environment that it can deal with. The mount namespace really provides you with no help there. Do you mean the core dump helper? But I think I don't want to touch it in my development. I think I can use a non-pipe way to get what I want. Let me try to explain what I want here. (1). Introduce a --core-path option in the docker run command to specify the path on the host to store core files for one container. E.g: docker run --core-path=/core/test --name=test IMAGE (2). 
When the container is starting, docker attaches a volume to it, similar to "-v /core/test:/var/lib/docker/coredump". That means the path /var/lib/docker/coredump in the container is a link to /core/test on the host. (3). Set the /proc/sys/kernel/core_pattern in the container as "/var/lib/docker/coredump". But that should not affect the core_pattern on the host or in other containers. Then I think I can collect the core files from each container and save them in the paths where I want. For your case that sounds like it would work. Unfortunately for this to be generally applicable and to let the OS in the container control its fate the core dump pattern needs to be supported. Otherwise something clever in userspace that can be written now should be sufficient to fill the gap. There is enough information for the user mode helper to implement the policy you would like today. Hi Eric, To make sure I understand your point correctly: Do you mean we can write a userspace helper on the host such as /usr/libexec/docker-pipe to get what I want? Yes, I would say, for my case, it would work. This helper can get the dump data from containers and dispatch it to different paths such as /var/lib/docker/cores//. But there would be two problems in this solution. (1). It may affect core dumps on the host. Normally, other processes on the host would not be happy to use a helper of docker-pipe for themselves. But the host has to share the core_pattern with containers and can't configure it itself. (2). If some containers don't want to pass their core files to the host, they can't set the core_pattern in this solution. IMO, we can get core files on the host currently, by either the non-pipe way I described above or the pipe way you suggested. But the problem is that both of these methods would affect the core_pattern on the host and other containers. So, I think the key point here is just isolating the core-dump-related sysctls in the mnt namespace. Thanx Yang Eric . 
Re: [Propose] Isolate core_pattern in mnt namespace.
On 12/22/2015 11:12 AM, Kamezawa Hiroyuki wrote: > On 2015/12/22 6:52, Eric W. Biederman wrote: >> Dongsheng Yang writes: >> >>> On 12/20/2015 05:47 PM, Eric W. Biederman wrote: >>>> Dongsheng Yang writes: >>>> >>>>> On 12/20/2015 10:37 AM, Al Viro wrote: >>>>>> On Sun, Dec 20, 2015 at 10:14:29AM +0800, Dongsheng Yang wrote: >>>>>>> On 12/17/2015 07:23 PM, Dongsheng Yang wrote: >>>>>>>> Hi guys, >>>>>>>> We are working on making core dump behaviour isolated in >>>>>>>> container. But the problem is, the /proc/sys/kernel/core_pattern >>>>>>>> is a kernel wide setting, not belongs to a container. >>>>>>>> >>>>>>>> So we want to add core_pattern into mnt namespace. What >>>>>>>> do you think about it? >>>>>>> >>>>>>> Hi Eric, >>>>>>> I found your patch about "net: Implement the per network >>>>>>> namespace >>>>>>> sysctl infrastructure", I want to do the similar thing >>>>>>> in mnt namespace. Is that suggested way? >>>>>> >>>>>> Why mnt namespace and not something else? >>>>> >>>>> Hi Al, >>>>> >>>>> Well, because core_pattern indicates the path to store core file. >>>>> In different mnt namespace, we would like to change the path with >>>>> different value. >>>>> >>>>> In addition, Let's considering other namespaces: >>>>> UTS ns: contains informations of kernel and arch, not proper for >>>>> core_pattern. >>>>> IPC ns: communication informations, not proper for core_pattern >>>>> PID ns: core_pattern is not related with pid >>>>> net ns: obviousely no. >>>>> user ns: not proper too. >>>>> >>>>> Then I believe it's better to do this in mnt namespace. of course, >>>>> core_pattern is just one example. After this infrastructure finished, >>>>> we can implement more sysctls as per-mnt if necessary, I think. >>>>> >>>>> Al, what do you think about this idea? >>>> >>>> The hard part is not the sysctl. The hard part is starting the usermode >>>> helper, in an environment that it can deal with. The mount namespace >>>> really provides you with no help there. 
>>> >>> Do you mean the core dump helper? But I think I don't want to touch it >>> in my development. I think I can use non-pipe way to get what I want, >>> Let me try to explain what I want here. >>> >>> (1). introduce a --core-path option in docker run command to specify the >>> path in host to store core file in one container. >>> E.g: docker run --core-path=/core/test --name=test IMAGE >>> >>> (2). When the container starting, docker attach a volume to it, similar >>> with "-v /core/test:/var/lib/docker/coredump". That means, the path of >>> /var/lib/docker/coredump in container is a link to /core/test in host. >>> >>> (3). Set the /proc/sys/kernel/core_pattern in container as >>> "/var/lib/docker/coredump". But that should not affect the core_pattern >>> in host or other containers. >>> >>> Then I think I can collect the core files from each container and save >>> them in the paths where I want. >> >> For your case that sounds like it would work. Unfortunately for this to >> be generally applicable and to let the OS in the contianer control it's >> fate the core dump pattern needs to be supported. >> >> Otherwise something clever in userspace that can be written now should >> be sufficient to fill the gap. There is enough information for the user >> mode helper to implement the policy you would like today. >> > Let me clarify my understanding. > > 1) running user-mode-helper in a container. > It's not supported by the kernel. user-mode-helper always works on a host. > > 2) running user mode helper in a host. >It's supported in the newest distro(FC23). (abrt supports container.) >Summary is here. https://github.com/abrt/abrt/wiki/Containers-and-chroots > > If a guest user doesn't want to pass a core to the host owner, core_pattern > should be configurable but it can't. Agreed, then we have to make the core_pattern namespace-ed in mnt namespace. Thanx Yang > > Thanks, > -Kame > > > > > . 
Re: [RESEND PATCH 2/2] cpuacct: split usage into user_usage and sys_usage.
On 12/22/2015 05:33 AM, Tejun Heo wrote: On Mon, Dec 21, 2015 at 07:14:43PM +0800, Dongsheng Yang wrote: Sometimes, cpuacct.usage is not detailed enough for the user to see how much usage a group used. We want to know how much time it used in user mode and how much in kernel mode. cpuusage is being phased out. If you need these stats, please implement it on the cpu side. Hi TJ, thanx, then I will look at introducing this feature on the cpu side then. Thanx Yang Thanks.
Re: [RESEND PATCH 2/2] cpuacct: split usage into user_usage and sys_usage.
On 12/22/2015 05:33 AM, Tejun Heo wrote: On Mon, Dec 21, 2015 at 07:14:43PM +0800, Dongsheng Yang wrote: Sometimes, cpuacct.usage is not detailed enough for the user to see how much usage a group used. We want to know how much time it used in user mode and how much in kernel mode. cpuusage is being phased out. If you need these stats, please implement it on the cpu side. Hi TJ, thanx, then I will look at introducing this feature on the cpu side. Thanx Yang Thanks.
[RESEND PATCH 1/2] cpuacct: rename parameter in cpuusage_write for readability
The name 'reset' is a little confusing to read: it suggests that if we want to reset the usage, we return -EINVAL. That's not true. Actually, we mean that we only allow the user to do a reset. This patch renames 'reset' to 'val' and adds a comment, making the code more readable. Signed-off-by: Dongsheng Yang --- kernel/sched/cpuacct.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c index dd7cbb5..9c2bbf7 100644 --- a/kernel/sched/cpuacct.c +++ b/kernel/sched/cpuacct.c @@ -145,13 +145,16 @@ static u64 cpuusage_read(struct cgroup_subsys_state *css, struct cftype *cft) } static int cpuusage_write(struct cgroup_subsys_state *css, struct cftype *cft, - u64 reset) + u64 val) { struct cpuacct *ca = css_ca(css); int err = 0; int i; - if (reset) { + /* +* Only allow '0' here to do a reset. +*/ + if (val) { err = -EINVAL; goto out; } -- 1.8.4.2
[RESEND PATCH 2/2] cpuacct: split usage into user_usage and sys_usage.
Sometimes, cpuacct.usage is not detailed enough for the user to see how much usage a group used. We want to know how much time it used in user mode and how much in kernel mode. This patch introduces some more files to expose this information to the user. # ls /sys/fs/cgroup/cpuacct/cpuacct.usage* /sys/fs/cgroup/cpuacct/cpuacct.usage /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_user /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu /sys/fs/cgroup/cpuacct/cpuacct.usage_sys /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_sys /sys/fs/cgroup/cpuacct/cpuacct.usage_user Signed-off-by: Dongsheng Yang --- kernel/sched/cpuacct.c | 140 ++--- 1 file changed, 120 insertions(+), 20 deletions(-) diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c index 9c2bbf7..b3e8971 100644 --- a/kernel/sched/cpuacct.c +++ b/kernel/sched/cpuacct.c @@ -25,11 +25,22 @@ enum cpuacct_stat_index { CPUACCT_STAT_NSTATS, }; +enum cpuacct_usage_index { + CPUACCT_USAGE_USER, /* ... user mode */ + CPUACCT_USAGE_SYSTEM, /* ... kernel mode */ + + CPUACCT_USAGE_NRUSAGE, +}; + +struct cpuacct_usage { + u64 usages[CPUACCT_USAGE_NRUSAGE]; +}; + /* track cpu usage of a group of tasks and its child groups */ struct cpuacct { struct cgroup_subsys_state css; /* cpuusage holds pointer to a u64-type object on every cpu */ - u64 __percpu *cpuusage; + struct cpuacct_usage __percpu *cpuusage; struct kernel_cpustat __percpu *cpustat; }; @@ -49,7 +60,7 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca) return css_ca(ca->css.parent); } -static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage); +static DEFINE_PER_CPU(struct cpuacct_usage, root_cpuacct_cpuusage); static struct cpuacct root_cpuacct = { .cpustat = &root_cpuacct_cpustat, .cpuusage = &root_cpuacct_cpuusage, @@ -68,7 +79,7 @@ cpuacct_css_alloc(struct cgroup_subsys_state *parent_css) if (!ca) goto out; - ca->cpuusage = alloc_percpu(u64); + ca->cpuusage = alloc_percpu(struct cpuacct_usage); if (!ca->cpuusage) goto out_free_ca; @@ -96,54 +107,104 @@ static void cpuacct_css_free(struct
cgroup_subsys_state *css) kfree(ca); } -static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu) +static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu, +enum cpuacct_usage_index index) { - u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu); - u64 data; + struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu); + u64 data = 0; + int i = 0; + + /* +* We allow index == CPUACCT_USAGE_NRUSAGE here to read +* the sum of usages. +*/ + BUG_ON(index > CPUACCT_USAGE_NRUSAGE); + + if (index == CPUACCT_USAGE_NRUSAGE) { + raw_spin_lock_irq(&cpu_rq(cpu)->lock); + for (i = 0; i < CPUACCT_USAGE_NRUSAGE; i++) + data += cpuusage->usages[i]; + raw_spin_unlock_irq(&cpu_rq(cpu)->lock); + + goto out; + } #ifndef CONFIG_64BIT /* * Take rq->lock to make 64-bit read safe on 32-bit platforms. */ raw_spin_lock_irq(&cpu_rq(cpu)->lock); - data = *cpuusage; + data = cpuusage->usages[index]; raw_spin_unlock_irq(&cpu_rq(cpu)->lock); #else - data = *cpuusage; + data = cpuusage->usages[index]; #endif +out: return data; } -static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val) +static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, + enum cpuacct_usage_index index, u64 val) { - u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu); + struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu); + int i = 0; + + /* +* We allow index == CPUACCT_USAGE_NRUSAGE here to write +* val to each index of usages. +*/ + BUG_ON(index > CPUACCT_USAGE_NRUSAGE); + + if (index == CPUACCT_USAGE_NRUSAGE) { + raw_spin_lock_irq(&cpu_rq(cpu)->lock); + for (i = 0; i < CPUACCT_USAGE_NRUSAGE; i++) + cpuusage->usages[i] = val; + raw_spin_unlock_irq(&cpu_rq(cpu)->lock); + + return; + } #ifndef CONFIG_64BIT /* * Take rq->lock to make 64-bit write safe on 32-bit platforms.
*/ raw_spin_lock_irq(&cpu_rq(cpu)->lock); - *cpuusage = val; + cpuusage->usages[index] = val; raw_spin_unlock_irq(&cpu_rq(cpu)->lock); #else - *cpuusage = val; + cpuusage->usages[index] = val; #endif } /* return total cpu usage (in nanoseconds) of a group */ -static u64 cpuusage_read(struct cgroup_subsys_state *css, struct cftype *cft) +static u64 __cpuusage_read(struct
Re: [Propose] Isolate core_pattern in mnt namespace.
On 12/20/2015 05:47 PM, Eric W. Biederman wrote: Dongsheng Yang writes: On 12/20/2015 10:37 AM, Al Viro wrote: On Sun, Dec 20, 2015 at 10:14:29AM +0800, Dongsheng Yang wrote: On 12/17/2015 07:23 PM, Dongsheng Yang wrote: Hi guys, We are working on making core dump behaviour isolated in containers. But the problem is that /proc/sys/kernel/core_pattern is a kernel-wide setting, not one belonging to a container. So we want to add core_pattern into the mnt namespace. What do you think about it? Hi Eric, I found your patch about "net: Implement the per network namespace sysctl infrastructure"; I want to do a similar thing in the mnt namespace. Is that the suggested way? Why the mnt namespace and not something else? Hi Al, Well, because core_pattern indicates the path to store the core file. In different mnt namespaces, we would like to set the path to different values. In addition, let's consider the other namespaces: UTS ns: contains information about the kernel and arch, not proper for core_pattern. IPC ns: communication information, not proper for core_pattern. PID ns: core_pattern is not related to pid. net ns: obviously no. user ns: not proper either. Then I believe it's better to do this in the mnt namespace. Of course, core_pattern is just one example. After this infrastructure is finished, we can implement more sysctls as per-mnt if necessary, I think. Al, what do you think about this idea? The hard part is not the sysctl. The hard part is starting the usermode helper, in an environment that it can deal with. The mount namespace really provides you with no help there. Do you mean the core dump helper? But I think I don't want to touch it in my development. I think I can use a non-pipe way to get what I want. Let me try to explain what I want here. (1). Introduce a --core-path option in the docker run command to specify the path on the host to store core files from one container. E.g: docker run --core-path=/core/test --name=test IMAGE (2).
When the container starts, docker attaches a volume to it, similar to "-v /core/test:/var/lib/docker/coredump". That means the path /var/lib/docker/coredump in the container is a link to /core/test on the host. (3). Set /proc/sys/kernel/core_pattern in the container to "/var/lib/docker/coredump". But that should not affect the core_pattern on the host or in other containers. Then I think I can collect the core files from each container and save them in the paths where I want. Thanx Yang Eric
Re: [Propose] Isolate core_pattern in mnt namespace.
On 12/20/2015 10:37 AM, Al Viro wrote: On Sun, Dec 20, 2015 at 10:14:29AM +0800, Dongsheng Yang wrote: On 12/17/2015 07:23 PM, Dongsheng Yang wrote: Hi guys, We are working on making core dump behaviour isolated in containers. But the problem is that /proc/sys/kernel/core_pattern is a kernel-wide setting, not one belonging to a container. So we want to add core_pattern into the mnt namespace. What do you think about it? Hi Eric, I found your patch about "net: Implement the per network namespace sysctl infrastructure"; I want to do a similar thing in the mnt namespace. Is that the suggested way? Why the mnt namespace and not something else? Hi Al, Well, because core_pattern indicates the path to store the core file. In different mnt namespaces, we would like to set the path to different values. In addition, let's consider the other namespaces: UTS ns: contains information about the kernel and arch, not proper for core_pattern. IPC ns: communication information, not proper for core_pattern. PID ns: core_pattern is not related to pid. net ns: obviously no. user ns: not proper either. Then I believe it's better to do this in the mnt namespace. Of course, core_pattern is just one example. After this infrastructure is finished, we can implement more sysctls as per-mnt if necessary, I think. Al, what do you think about this idea? Yang .
Re: [Propose] Isolate core_pattern in mnt namespace.
On 12/17/2015 07:23 PM, Dongsheng Yang wrote: Hi guys, We are working on making core dump behaviour isolated in containers. But the problem is that /proc/sys/kernel/core_pattern is a kernel-wide setting, not one belonging to a container. So we want to add core_pattern into the mnt namespace. What do you think about it? Hi Eric, I found your patch about "net: Implement the per network namespace sysctl infrastructure"; I want to do a similar thing in the mnt namespace. Is that the suggested way? Thanx Yang Yang -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Propose] Isolate core_pattern in mnt namespace.
Hi guys, We are working on making core dump behaviour isolated in containers. But the problem is that /proc/sys/kernel/core_pattern is a kernel-wide setting, not one belonging to a container. So we want to add core_pattern into the mnt namespace. What do you think about it? Yang
[PATCH 1/2] cpuacct: rename parameter in cpuusage_write for readability
The name 'reset' is a little confusing to read: it suggests that if we want to reset the usage, we return -EINVAL. That's not true. Actually, we mean that we only allow the user to do a reset. This patch renames 'reset' to 'val' and adds a comment, making the code more readable. Signed-off-by: Dongsheng Yang --- kernel/sched/cpuacct.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c index dd7cbb5..9c2bbf7 100644 --- a/kernel/sched/cpuacct.c +++ b/kernel/sched/cpuacct.c @@ -145,13 +145,16 @@ static u64 cpuusage_read(struct cgroup_subsys_state *css, struct cftype *cft) } static int cpuusage_write(struct cgroup_subsys_state *css, struct cftype *cft, - u64 reset) + u64 val) { struct cpuacct *ca = css_ca(css); int err = 0; int i; - if (reset) { + /* +* Only allow '0' here to do a reset. +*/ + if (val) { err = -EINVAL; goto out; } -- 1.8.4.2
[PATCH 2/2] cpuacct: split usage into user_usage and sys_usage.
Sometimes, cpuacct.usage is not detailed enough for the user to see how much usage a group used. We want to know how much time it used in user mode and how much in kernel mode. This patch introduces some more files to expose this information to the user. # ls /sys/fs/cgroup/cpuacct/cpuacct.usage* /sys/fs/cgroup/cpuacct/cpuacct.usage /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_user /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu /sys/fs/cgroup/cpuacct/cpuacct.usage_sys /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_sys /sys/fs/cgroup/cpuacct/cpuacct.usage_user Signed-off-by: Dongsheng Yang --- kernel/sched/cpuacct.c | 140 ++--- 1 file changed, 120 insertions(+), 20 deletions(-) diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c index 9c2bbf7..b3e8971 100644 --- a/kernel/sched/cpuacct.c +++ b/kernel/sched/cpuacct.c @@ -25,11 +25,22 @@ enum cpuacct_stat_index { CPUACCT_STAT_NSTATS, }; +enum cpuacct_usage_index { + CPUACCT_USAGE_USER, /* ... user mode */ + CPUACCT_USAGE_SYSTEM, /* ... kernel mode */ + + CPUACCT_USAGE_NRUSAGE, +}; + +struct cpuacct_usage { + u64 usages[CPUACCT_USAGE_NRUSAGE]; +}; + /* track cpu usage of a group of tasks and its child groups */ struct cpuacct { struct cgroup_subsys_state css; /* cpuusage holds pointer to a u64-type object on every cpu */ - u64 __percpu *cpuusage; + struct cpuacct_usage __percpu *cpuusage; struct kernel_cpustat __percpu *cpustat; }; @@ -49,7 +60,7 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca) return css_ca(ca->css.parent); } -static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage); +static DEFINE_PER_CPU(struct cpuacct_usage, root_cpuacct_cpuusage); static struct cpuacct root_cpuacct = { .cpustat = &root_cpuacct_cpustat, .cpuusage = &root_cpuacct_cpuusage, @@ -68,7 +79,7 @@ cpuacct_css_alloc(struct cgroup_subsys_state *parent_css) if (!ca) goto out; - ca->cpuusage = alloc_percpu(u64); + ca->cpuusage = alloc_percpu(struct cpuacct_usage); if (!ca->cpuusage) goto out_free_ca; @@ -96,54 +107,104 @@ static void cpuacct_css_free(struct
cgroup_subsys_state *css) kfree(ca); } -static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu) +static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu, +enum cpuacct_usage_index index) { - u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu); - u64 data; + struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu); + u64 data = 0; + int i = 0; + + /* +* We allow index == CPUACCT_USAGE_NRUSAGE here to read +* the sum of usages. +*/ + BUG_ON(index > CPUACCT_USAGE_NRUSAGE); + + if (index == CPUACCT_USAGE_NRUSAGE) { + raw_spin_lock_irq(&cpu_rq(cpu)->lock); + for (i = 0; i < CPUACCT_USAGE_NRUSAGE; i++) + data += cpuusage->usages[i]; + raw_spin_unlock_irq(&cpu_rq(cpu)->lock); + + goto out; + } #ifndef CONFIG_64BIT /* * Take rq->lock to make 64-bit read safe on 32-bit platforms. */ raw_spin_lock_irq(&cpu_rq(cpu)->lock); - data = *cpuusage; + data = cpuusage->usages[index]; raw_spin_unlock_irq(&cpu_rq(cpu)->lock); #else - data = *cpuusage; + data = cpuusage->usages[index]; #endif +out: return data; } -static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val) +static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, + enum cpuacct_usage_index index, u64 val) { - u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu); + struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu); + int i = 0; + + /* +* We allow index == CPUACCT_USAGE_NRUSAGE here to write +* val to each index of usages. +*/ + BUG_ON(index > CPUACCT_USAGE_NRUSAGE); + + if (index == CPUACCT_USAGE_NRUSAGE) { + raw_spin_lock_irq(&cpu_rq(cpu)->lock); + for (i = 0; i < CPUACCT_USAGE_NRUSAGE; i++) + cpuusage->usages[i] = val; + raw_spin_unlock_irq(&cpu_rq(cpu)->lock); + + return; + } #ifndef CONFIG_64BIT /* * Take rq->lock to make 64-bit write safe on 32-bit platforms.
*/ raw_spin_lock_irq(&cpu_rq(cpu)->lock); - *cpuusage = val; + cpuusage->usages[index] = val; raw_spin_unlock_irq(&cpu_rq(cpu)->lock); #else - *cpuusage = val; + cpuusage->usages[index] = val; #endif } /* return total cpu usage (in nanoseconds) of a group */ -static u64 cpuusage_read(struct cgroup_subsys_state *css, struct cftype *cft) +s
Re: piping core dump to a program escapes container
On 12/09/2015 11:29 AM, Eric W. Biederman wrote: Dongsheng Yang writes: On 12/09/2015 10:26 AM, Dongsheng Yang wrote: On 10/25/2015 05:54 AM, Shayan Pooya wrote: I noticed the following core_pattern behavior in my linux box while running docker containers. I am not sure if it is a bug, but it is inconsistent and not documented. If the core_pattern is set on the host, the containers will observe and use the pattern for dumping cores (there is no per-cgroup core_pattern). According to core(5), for setting core_pattern one can: 1. echo "/tmp/cores/core.%e.%p" > /proc/sys/kernel/core_pattern 2. echo "|/bin/custom_core /tmp/cores/ %e %p " > /proc/sys/kernel/core_pattern The former pattern evaluates the /tmp/cores path in the container's filesystem namespace, which means the host does not see a core file in /tmp/cores. However, the latter evaluates the /bin/custom_core path in the global filesystem namespace. Moreover, if /bin/custom_core decides to write the core to a path (/tmp/cores in this case, as shown by the arg to custom_core), the path will be evaluated in the global filesystem namespace as well. The latter behaviour is counter-intuitive and error-prone, as the container can fill up the core-file directory which it does not have direct access to (which means the core is also not accessible for debugging if someone only has access to the container).

From a container perspective it is perhaps counter-intuitive; from the perspective of the operator of the machine, nothing about core_pattern works specially and it works as designed with no unusual damages.

Hi Shayan, We found the same problem with what you described here. Is there any document for this behaviour? I want to know whether that is intentional or, as you said, a 'bug'. Maybe it is intentional, to provide a way for admins to collect core dumps from all containers, as Richard said. I am interested in it too. Anyone can help here? In addition, is it a good idea to make core_pattern separate per namespace?
The behavior was the best we could do at the time this issue was last examined. There is enough information available to be able to write a core dumping program that can reliably place your core dumps in your container. There has not yet been an obvious namespace in which to stick core_pattern, and even worse, exactly how to appropriately launch a process in a container has not been figured out.

Hi Eric, Could you provide a reference to these discussions? In addition, is there already an infrastructure to do this kind of thing?

Thanx Yang

If those tricky problems can be solved we can have a core_pattern in a container. What we have now is the best we have been able to figure out so far.

Eric

Yang

Yang

Currently, I work around this issue by detecting that the process is crashing from a container (by comparing the namespace pid to the global pid) and refusing to dump the core if it is from a container. Tested on Ubuntu (kernel 3.16) and Fedora (kernel 4.1). -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: piping core dump to a program escapes container
On 12/09/2015 04:34 PM, Bruno Prémont wrote: On Tue, 08 Dec 2015 21:29:13 -0600 Eric W. Biederman wrote: Dongsheng Yang writes: On 12/09/2015 10:26 AM, Dongsheng Yang wrote: On 10/25/2015 05:54 AM, Shayan Pooya wrote: I noticed the following core_pattern behavior in my linux box while running docker containers. I am not sure if it is a bug, but it is inconsistent and not documented. If the core_pattern is set on the host, the containers will observe and use the pattern for dumping cores (there is no per-cgroup core_pattern). According to core(5), for setting core_pattern one can: 1. echo "/tmp/cores/core.%e.%p" > /proc/sys/kernel/core_pattern 2. echo "|/bin/custom_core /tmp/cores/ %e %p " > /proc/sys/kernel/core_pattern The former pattern evaluates the /tmp/cores path in the container's filesystem namespace, which means the host does not see a core file in /tmp/cores. However, the latter evaluates the /bin/custom_core path in the global filesystem namespace. Moreover, if /bin/custom_core decides to write the core to a path (/tmp/cores in this case, as shown by the arg to custom_core), the path will be evaluated in the global filesystem namespace as well. The latter behaviour is counter-intuitive and error-prone, as the container can fill up the core-file directory which it does not have direct access to (which means the core is also not accessible for debugging if someone only has access to the container). From a container perspective it is perhaps counter-intuitive; from the perspective of the operator of the machine, nothing about core_pattern works specially and it works as designed with no unusual damages. Hi Shayan, We found the same problem with what you described here. Is there any document for this behaviour? I want to know whether that is intentional or, as you said, a 'bug'. Maybe it is intentional, to provide a way for admins to collect core dumps from all containers, as Richard said. I am interested in it too. Anyone can help here?
In addition, is it a good idea to make core_pattern separate per namespace?

The behavior was the best we could do at the time this issue was last examined. There is enough information available to be able to write a core dumping program that can reliably place your core dumps in your container. There has not yet been an obvious namespace in which to stick core_pattern, and even worse, exactly how to appropriately launch a process in a container has not been figured out. If those tricky problems can be solved we can have a core_pattern in a container. What we have now is the best we have been able to figure out so far.

Isn't the second option dangerous if it's run in the global namespace and settable from some other namespace/container? If a process inside a container can set /proc/sys/kernel/core_pattern then it could e.g. set it to echo "|/bin/rm -rf / /tmp/cores/ %e %p " > /proc/sys/kernel/core_pattern and kill the host (eventually itself included). Other command lines could do different bad things.

Yes, if you don't give privileges to a container, core_pattern is read-only to it. But if you give containers privileges, that's true, as you said. However, that's similar to giving privileges to any process running on the machine. So I think it's not a problem.

Yang

Something that would sound reasonable is to have the core dumping helper process run under the namespaces of the process which wrote to /proc/sys/kernel/core_pattern. When some of those namespaces are gone, falling back to the namespaces of the process for which the core is to be dumped might seem reasonable (or just not dumping core at all, as is done when core_pipe_limit is exceeded). The value of core_pattern (and the other core_* sysctls) should probably belong to the mount namespace of the proc filesystem used for setting its value - or the matching namespace of the calling process when set via syscall.

Bruno
Re: piping core dump to a program escapes container
On 12/09/2015 02:32 PM, Eric W. Biederman wrote: Dongsheng Yang writes: On 12/09/2015 11:29 AM, Eric W. Biederman wrote: Dongsheng Yang writes: [...] There has not yet been an obvious namespace in which to stick core_pattern, and even worse, exactly how to appropriately launch a process in a container has not been figured out. If those tricky problems can be solved we can have a core_pattern in a container. What we have now is the best we have been able to figure out so far.

Thanx Eric, but if I want to make docker work relying on this behaviour, is that reliable? I mean, I want to make a docker container dump the core file to a specified path on the host via a pipe. But I am afraid this behaviour might be changed later. Any suggestion?

The kernel rules say that if there is a behavior someone depends on, and that behavior changes and breaks userspace, that is a regression and it is not allowed. As developers we try not to create regressions. But some days it requires someone testing/using the functionality enough to catch an issue. That said, the real issue you are likely to run into when developing this as part of docker is that docker doesn't get to own the core pattern. It doesn't make sense for any one application to, as it is a kernel-wide setting.

Agreed.

To have different app- or container-specific policies for core dumping likely requires either solving the problems I mentioned with containers, or, in userspace, a solution so there can be an /etc/core_pattern.d/ with different configuration and different scripts that somehow know how to select which core files they want and dump them sanely.

We would try to solve the problems you mentioned, but that sounds not easy. Anyway, I need to read some old discussions first, I think.

Thanx Yang

Eric
Re: piping core dump to a program escapes container
On 12/09/2015 11:29 AM, Eric W. Biederman wrote: Dongsheng Yang writes: [...] There has not yet been an obvious namespace in which to stick core_pattern, and even worse, exactly how to appropriately launch a process in a container has not been figured out. If those tricky problems can be solved we can have a core_pattern in a container. What we have now is the best we have been able to figure out so far.

Thanx Eric, but if I want to make docker work relying on this behaviour, is that reliable? I mean, I want to make a docker container dump the core file to a specified path on the host via a pipe. But I am afraid this behaviour might be changed later. Any suggestion?

Yang

Eric

Yang

Yang

Currently, I work around this issue by detecting that the process is crashing from a container (by comparing the namespace pid to the global pid) and refusing to dump the core if it is from a container. Tested on Ubuntu (kernel 3.16) and Fedora (kernel 4.1).
Re: piping core dump to a program escapes container
On 12/09/2015 10:26 AM, Dongsheng Yang wrote: On 10/25/2015 05:54 AM, Shayan Pooya wrote: I noticed the following core_pattern behavior in my linux box while running docker containers. I am not sure if it is a bug, but it is inconsistent and not documented. If the core_pattern is set on the host, the containers will observe and use the pattern for dumping cores (there is no per-cgroup core_pattern). According to core(5), for setting core_pattern one can: 1. echo "/tmp/cores/core.%e.%p" > /proc/sys/kernel/core_pattern 2. echo "|/bin/custom_core /tmp/cores/ %e %p " > /proc/sys/kernel/core_pattern The former pattern evaluates the /tmp/cores path in the container's filesystem namespace, which means the host does not see a core file in /tmp/cores. However, the latter evaluates the /bin/custom_core path in the global filesystem namespace. Moreover, if /bin/custom_core decides to write the core to a path (/tmp/cores in this case, as shown by the arg to custom_core), the path will be evaluated in the global filesystem namespace as well. The latter behaviour is counter-intuitive and error-prone, as the container can fill up the core-file directory which it does not have direct access to (which means the core is also not accessible for debugging if someone only has access to the container).

Hi Shayan, We found the same problem with what you described here. Is there any document for this behaviour? I want to know whether that is intentional or, as you said, a 'bug'. Maybe it is intentional, to provide a way for admins to collect core dumps from all containers, as Richard said. I am interested in it too. Anyone can help here? In addition, is it a good idea to make core_pattern separate per namespace?

Yang

Yang

Currently, I work around this issue by detecting that the process is crashing from a container (by comparing the namespace pid to the global pid) and refusing to dump the core if it is from a container. Tested on Ubuntu (kernel 3.16) and Fedora (kernel 4.1).
Re: piping core dump to a program escapes container
On 10/25/2015 05:54 AM, Shayan Pooya wrote: I noticed the following core_pattern behavior in my linux box while running docker containers. I am not sure if it is a bug, but it is inconsistent and not documented. If the core_pattern is set on the host, the containers will observe and use the pattern for dumping cores (there is no per-cgroup core_pattern). According to core(5), for setting core_pattern one can: 1. echo "/tmp/cores/core.%e.%p" > /proc/sys/kernel/core_pattern 2. echo "|/bin/custom_core /tmp/cores/ %e %p " > /proc/sys/kernel/core_pattern The former pattern evaluates the /tmp/cores path in the container's filesystem namespace, which means the host does not see a core file in /tmp/cores. However, the latter evaluates the /bin/custom_core path in the global filesystem namespace. Moreover, if /bin/custom_core decides to write the core to a path (/tmp/cores in this case, as shown by the arg to custom_core), the path will be evaluated in the global filesystem namespace as well. The latter behaviour is counter-intuitive and error-prone, as the container can fill up the core-file directory which it does not have direct access to (which means the core is also not accessible for debugging if someone only has access to the container).

Hi Shayan, We found the same problem with what you described here. Is there any document for this behaviour? I want to know whether that is intentional or, as you said, a 'bug'. Maybe it is intentional, to provide a way for admins to collect core dumps from all containers, as Richard said. I am interested in it too. Anyone can help here?

Yang

Currently, I work around this issue by detecting that the process is crashing from a container (by comparing the namespace pid to the global pid) and refusing to dump the core if it is from a container. Tested on Ubuntu (kernel 3.16) and Fedora (kernel 4.1).
Re: piping core dump to a program escapes container
On 12/09/2015 10:26 AM, Dongsheng Yang wrote: On 10/25/2015 05:54 AM, Shayan Pooya wrote: I noticed the following core_pattern behavior on my Linux box while running docker containers. I am not sure if it is a bug, but it is inconsistent and not documented.

If core_pattern is set on the host, the containers will observe and use the pattern for dumping cores (there is no per-cgroup core_pattern). According to core(5), for setting core_pattern one can:

1. echo "/tmp/cores/core.%e.%p" > /proc/sys/kernel/core_pattern
2. echo "|/bin/custom_core /tmp/cores/ %e %p " > /proc/sys/kernel/core_pattern

The former pattern evaluates the /tmp/cores path in the container's filesystem namespace, which means the host does not see a core file in /tmp/cores. However, the latter evaluates the /bin/custom_core path in the global filesystem namespace. Moreover, if /bin/custom_core decides to write the core to a path (/tmp/cores in this case, as shown by the arg to custom_core), that path will be evaluated in the global filesystem namespace as well.

The latter behaviour is counter-intuitive and error-prone, as the container can fill up a core-file directory it has no direct access to (which also means the core is not accessible for debugging if someone only has access to the container).

Hi Shayan, we found the same problem you describe here. Is there any documentation for this behaviour? I want to know whether it is intentional or, as you said, a 'bug'. Maybe it is intentional, to provide a way for the admin to collect core dumps from all containers, as Richard said. I am interested in it too. Can anyone help here?

In addition, would it be a good idea to make core_pattern separate per namespace?

Yang

Currently, I work around this issue by detecting that the process is crashing from a container (by comparing the namespace pid to the global pid) and refusing to dump the core if it is from a container. Tested on Ubuntu (kernel 3.16) and Fedora (kernel 4.1).
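The pid-comparison workaround described above can be sketched as the decision logic of a core_pattern pipe helper. This is a hypothetical sketch, not the poster's actual code: the helper name and the use of the %p (pid in the crashing process's own pid namespace) and %P (pid in the initial namespace, available since Linux 3.12) specifiers are assumptions for illustration.

```c
#include <assert.h>

/*
 * Sketch of the workaround described above, assuming a pipe helper
 * registered as:
 *
 *   echo '|/bin/custom_core %p %P' > /proc/sys/kernel/core_pattern
 *
 * The helper receives both pids as arguments and compares them.
 */
static int is_from_container(long ns_pid, long global_pid)
{
    /* Inside a pid namespace the crashing process sees a different
     * pid for itself than the host does; equal pids mean the process
     * runs in the initial pid namespace. */
    return ns_pid != global_pid;
}

/* Returns nonzero if the helper should refuse the dump instead of
 * reading the core image from stdin and storing it on the host. */
static int should_refuse_core(long ns_pid, long global_pid)
{
    return is_from_container(ns_pid, global_pid);
}
```

Note the limitation the thread already hints at: this refuses container dumps entirely rather than redirecting them into the container's filesystem, because the helper runs in the host's mount namespace.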
Re: [PATCH v13 2/5] gennvm: Generic NVM manager
On 10/28/2015 08:30 AM, Matias Bjørling wrote: The implementation for Open-Channel SSDs is divided into media [...]

+		lun->reserved_blocks = 2; /* for GC only */
+		lun->vlun.id = i;
+		lun->vlun.lun_id = i % dev->luns_per_chnl;
+		lun->vlun.chnl_id = i / dev->luns_per_chnl;

Please use do_div(); the % operator on 64-bit operands is not supported on some platforms, as kbuild pointed out in v12.

Yang

+		lun->vlun.nr_free_blocks = dev->blks_per_lun;
+	}
+	return 0;
+}
+
+static int gennvm_block_bb(u32 lun_id, void *bb_bitmap, unsigned int nr_blocks,
+							void *private)
+{
+	struct gen_nvm *gn = private;
+	struct gen_lun *lun = &gn->luns[lun_id];
+	struct nvm_block *block;
+	int i;
+
+	if (unlikely(bitmap_empty(bb_bitmap, nr_blocks)))
+		return 0;
+
+	i = -1;
+	while ((i = find_next_bit(bb_bitmap, nr_blocks, i + 1)) <
+							nr_blocks) {
+		block = &lun->vlun.blocks[i];
+		if (!block) {
+			pr_err("gen_nvm: BB data is out of bounds.\n");
+			return -EINVAL;
+		}
+		list_move_tail(&block->list, &lun->bb_list);
+	}
+
+	return 0;
+}
+
+static int gennvm_block_map(u64 slba, u32 nlb, __le64 *entries, void *private)
+{
+	struct nvm_dev *dev = private;
+	struct gen_nvm *gn = dev->mp;
+	sector_t max_pages = dev->total_pages * (dev->sec_size >> 9);
+	u64 elba = slba + nlb;
+	struct gen_lun *lun;
+	struct nvm_block *blk;
+	u64 i;
+	int lun_id;
+
+	if (unlikely(elba > dev->total_pages)) {
+		pr_err("gen_nvm: L2P data from device is out of bounds!\n");
+		return -EINVAL;
+	}
+
+	for (i = 0; i < nlb; i++) {
+		u64 pba = le64_to_cpu(entries[i]);
+
+		if (unlikely(pba >= max_pages && pba != U64_MAX)) {
+			pr_err("gen_nvm: L2P data entry is out of bounds!\n");
+			return -EINVAL;
+		}
+
+		/* Address zero is a special one. The first page on a disk is
+		 * protected. It often holds internal device boot
+		 * information.
+		 */
+		if (!pba)
+			continue;
+
+		/* resolve block from physical address */
+		lun_id = div_u64(pba, dev->sec_per_lun);
+		lun = &gn->luns[lun_id];
+
+		/* Calculate block offset into lun */
+		pba = pba - (dev->sec_per_lun * lun_id);
+		blk = &lun->vlun.blocks[div_u64(pba, dev->sec_per_blk)];
+
+		if (!blk->type) {
+			/* at this point, we don't know anything about the
+			 * block. It's up to the FTL on top to re-establish the
+			 * block state
+			 */
+			list_move_tail(&blk->list, &lun->used_list);
+			blk->type = 1;
+			lun->vlun.nr_free_blocks--;
+		}
+	}
+
+	return 0;
+}
+
+static int gennvm_blocks_init(struct nvm_dev *dev, struct gen_nvm *gn)
+{
+	struct gen_lun *lun;
+	struct nvm_block *block;
+	sector_t lun_iter, blk_iter, cur_block_id = 0;
+	int ret;
+
+	gennvm_for_each_lun(gn, lun, lun_iter) {
+		lun->vlun.blocks = vzalloc(sizeof(struct nvm_block) *
+							dev->blks_per_lun);
+		if (!lun->vlun.blocks)
+			return -ENOMEM;
+
+		for (blk_iter = 0; blk_iter < dev->blks_per_lun; blk_iter++) {
+			block = &lun->vlun.blocks[blk_iter];
+
+			INIT_LIST_HEAD(&block->list);
+
+			block->lun = &lun->vlun;
+			block->id = cur_block_id++;
+
+			/* First block is reserved for device */
+			if (unlikely(lun_iter == 0 && blk_iter == 0))
+				continue;
+
+			list_add_tail(&block->list, &lun->free_list);
+		}
+
+		if (dev->ops->get_bb_tbl) {
+			ret = dev->ops->get_bb_tbl(dev->q, lun->vlun.id,
+					dev->blks_per_lun, gennvm_block_bb, gn);
+			if (ret)
+				pr_err("gen_nvm: could not read BB table\n");
+		}
+	}
+
+	if (dev->ops->get_l2p_tbl) {
+		ret = dev->ops->get_l2p_tbl(dev->q, 0, dev->total_pages,
+					gennvm_block_map, dev);
+		if (ret) {
+			pr_err("gen_nvm: could not read L2P table.\n");
+			pr_warn("gen_nvm: default block initialization");
+		}
+	}
+
+	return 0;
+}
+
+static int
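The do_div() remark above concerns 64-bit division: on 32-bit platforms a 64-bit `/` or `%` is compiled into libgcc helper calls (`__udivdi3`/`__umoddi3`) that the kernel does not link against, so kernel code uses `do_div()`/`div_u64_rem()` instead. A minimal userspace sketch of the quotient/remainder split the review asks for, with a stand-in for the kernel's `div_u64_rem()` (the stand-in and the `lun_index_split()` helper name are illustrative, not from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's div_u64_rem() helper:
 * quotient is returned, remainder comes back via the out-pointer. */
static uint64_t div_u64_rem(uint64_t dividend, uint32_t divisor,
                            uint32_t *remainder)
{
    *remainder = (uint32_t)(dividend % divisor);
    return dividend / divisor;
}

/*
 * Split a flat lun index into (chnl_id, lun_id) with one division,
 * instead of the raw `i % luns_per_chnl` / `i / luns_per_chnl` pair
 * the review objects to.
 */
static void lun_index_split(uint64_t i, uint32_t luns_per_chnl,
                            uint32_t *chnl_id, uint32_t *lun_id)
{
    *chnl_id = (uint32_t)div_u64_rem(i, luns_per_chnl, lun_id);
}
```

For example, with luns_per_chnl = 4, index 10 splits into channel 2, lun 2.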
Re: [PATCH] mtd: mtdram: check offs and len in mtdram->erase
On 10/03/2015 01:38 AM, Brian Norris wrote: On Fri, Oct 02, 2015 at 03:31:33PM +0530, Sudip Mukherjee wrote: On Fri, Oct 02, 2015 at 05:39:02PM +0800, Dongsheng Yang wrote: On 10/01/2015 12:41 AM, Sudip Mukherjee wrote: We should prevent users from erasing an mtd device with an unaligned offset or length. Signed-off-by: Sudip Mukherjee --- I am not sure if I should add the Signed-off-by of Dongsheng Yang. He is the original author and he should get the credit for that.

But I had sent a patch out to fix this problem before your v1: http://lists.infradead.org/pipermail/linux-mtd/2015-September/062234.html

I didn't know that. I think your v1 was applied. Sorry if I left any confusion.

Dongsheng's v1 was applied and reverted. v2 is still under review (and was sent slightly before (?) your v1).

Yeah, sorry, I should have mentioned it earlier, but I was and still am on vacation, so I did not point it out in time. Sudip, any comment on or test of my patch there is always welcome.

Thanx
Yang

Feel free to comment there. Brian
Re: [PATCH] mtd: mtdram: check offs and len in mtdram->erase
On 10/01/2015 12:41 AM, Sudip Mukherjee wrote: We should prevent users from erasing an mtd device with an unaligned offset or length. Signed-off-by: Sudip Mukherjee --- I am not sure if I should add the Signed-off-by of Dongsheng Yang. He is the original author and he should get the credit for that.

But I had sent a patch out to fix this problem before your v1: http://lists.infradead.org/pipermail/linux-mtd/2015-September/062234.html

Yang

 drivers/mtd/devices/mtdram.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/drivers/mtd/devices/mtdram.c b/drivers/mtd/devices/mtdram.c
index 8e28508..21b6a05 100644
--- a/drivers/mtd/devices/mtdram.c
+++ b/drivers/mtd/devices/mtdram.c
@@ -32,8 +32,35 @@ MODULE_PARM_DESC(erase_size, "Device erase block size in KiB");
 // We could store these in the mtd structure, but we only support 1 device..
 static struct mtd_info *mtd_info;
 
+static int check_offs_len(struct mtd_info *mtd, loff_t ofs, uint64_t len)
+{
+	int ret = 0;
+	uint64_t temp_len, rem;
+
+	/* Start address must align on block boundary */
+	temp_len = ofs;
+	rem = do_div(temp_len, mtd->erasesize);
+	if (rem) {
+		pr_debug("%s: unaligned address\n", __func__);
+		ret = -EINVAL;
+	}
+
+	/* Length must align on block boundary */
+	temp_len = len;
+	rem = do_div(temp_len, mtd->erasesize);
+	if (rem) {
+		pr_debug("%s: length not block aligned\n", __func__);
+		ret = -EINVAL;
+	}
+
+	return ret;
+}
+
 static int ram_erase(struct mtd_info *mtd, struct erase_info *instr)
 {
+	if (check_offs_len(mtd, instr->addr, instr->len))
+		return -EINVAL;
 	memset((char *)mtd->priv + instr->addr, 0xff, instr->len);
 	instr->state = MTD_ERASE_DONE;
 	mtd_erase_callback(instr);
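A wrinkle in the patch above is that kernel do_div() divides its u64 argument *in place* and returns the remainder, which is why the patch copies the offset/length into temp_len first. A userspace sketch of the same alignment check, with a stand-in for do_div() (the stand-in name and -1 error code are illustrative; division is used rather than a power-of-two mask because erasesize is not guaranteed to be a power of two here):

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in: kernel do_div() divides *n in place and
 * returns the remainder. */
static uint32_t do_div_stub(uint64_t *n, uint32_t base)
{
    uint32_t rem = (uint32_t)(*n % base);

    *n /= base;
    return rem;
}

/* Mirrors check_offs_len(): both the start offset and the length of
 * an erase request must be multiples of the erase block size. */
static int check_offs_len(uint64_t ofs, uint64_t len, uint32_t erasesize)
{
    uint64_t tmp;
    int ret = 0;

    tmp = ofs;                       /* copy: do_div clobbers it */
    if (do_div_stub(&tmp, erasesize))
        ret = -1;                    /* unaligned start address */

    tmp = len;
    if (do_div_stub(&tmp, erasesize))
        ret = -1;                    /* length not block aligned */

    return ret;
}
```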
Re: Linux 4.3-rc1 build error on CentOS 5.11 "scripts/sign-file.c:23:25: fatal error: openssl/cms.h: No such file or directory"
On 09/12/2015 07:22 AM, Davidlohr Bueso wrote: On Fri, 11 Sep 2015, Vinson Lee wrote: Hi. With the latest Linux 4.3-rc1, I am hitting this build error on CentOS 5.11.

  HOSTCC  scripts/sign-file
  scripts/sign-file.c:23:25: fatal error: openssl/cms.h: No such file or directory
  #include <openssl/cms.h>

fwiw/rant, I have run into kernel build issues recently due to lack of openssl libs. The solution is trivial, in my case just installing my distro's openssl-devel package,

Even after I installed openssl-devel, I could compile, but I hit a problem in `make modules_install`:

  At main.c:178:
  - SSL error:02001002:system library:fopen:No such file or directory: bss_file.c:169
  - SSL error:2006D080:BIO routines:BIO_new_file:no such file: bss_file.c:172
  sign-file: : No such file or directory

but this will probably break things for more people. And this is with default configs... annoying.

Thanks, Davidlohr
Re: [PATCH v7 1/5] lightnvm: Support for Open-Channel SSDs
On 09/04/2015 04:05 PM, Matias Bjørling wrote: So here is a suggestion: register_bm again if we find nvm_dev->bm == NULL in create_target(). And if it is still NULL after that, return an error "nvm: no compatible bm was found" and stop target creation. Otherwise, there would be a NULL pointer dereference problem. That's a real problem I met in my testing and I made this change in my local tree. I hope that's useful to you.

Hi Yang, similar to this?

Okay, I attached the two changes from my local tree. I hope they are useful to you.

Yang

diff --git i/drivers/lightnvm/core.c w/drivers/lightnvm/core.c
index 5e4c2b8..0d2e5e3 100644
--- i/drivers/lightnvm/core.c
+++ w/drivers/lightnvm/core.c
@@ -262,8 +262,9 @@ int nvm_init(struct nvm_dev *dev)
 	}
 
 	if (!ret) {
-		pr_info("nvm: no compatible bm was found.\n");
-		return 0;
+		pr_info("nvm: %s was not initialized due to no compatible bm.\n",
+			dev->name);
+		return -EINVAL;
 	}
 
 	pr_info("nvm: registered %s with luns: %u blocks: %lu sector size: %d\n",

From 2060232d379328679b22753587d16249f01fa219 Mon Sep 17 00:00:00 2001
From: Dongsheng Yang
Date: Fri, 4 Sep 2015 08:10:13 +0900
Subject: [PATCH 2/2] lightNVM: register bm in nvm_create_target if dev->bm is NULL

When we create a target, we need to make sure dev->bm is not NULL. If it's NULL, try to register a bm again. If we still fail to find a proper bm for this dev, return an error rather than continue and hit a NULL pointer dereference later.

Signed-off-by: Dongsheng Yang
---
 drivers/lightnvm/core.c | 20
 1 file changed, 20 insertions(+)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 5e4c2b8..9c75ea4 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -293,10 +293,30 @@ static int nvm_create_target(struct nvm_dev *dev, char *ttname, char *tname,
 			     int lun_begin, int lun_end)
 {
 	struct request_queue *tqueue;
+	struct nvm_bm_type *bt;
 	struct gendisk *tdisk;
 	struct nvm_tgt_type *tt;
 	struct nvm_target *t;
 	void *targetdata;
+	int ret = 0;
+
+	if (!dev->bm) {
+		/* register with device with a supported BM */
+		list_for_each_entry(bt, &nvm_bms, list) {
+			ret = bt->register_bm(dev);
+			if (ret < 0)
+				return ret; /* initialization failed */
+			if (ret > 0) {
+				dev->bm = bt;
+				break; /* successfully initialized */
+			}
+		}
+
+		if (!ret) {
+			pr_info("nvm: no compatible bm was found.\n");
+			return -ENODEV;
+		}
+	}
 
 	tt = nvm_find_target_type(ttname);
 	if (!tt) {
-- 
1.8.4.2

From 699d279ee0dbf3db5a4e7a78d52fb93e954294a1 Mon Sep 17 00:00:00 2001
From: Dongsheng Yang
Date: Mon, 31 Aug 2015 17:22:23 -0400
Subject: [PATCH 1/2] lightNVM: fix a compatibility problem in compiling.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

In some old gcc versions, such as gcc 4.4.7 20120313 (Red Hat 4.4.7-4), there is a compile error with this kind of code:

struct test {
	union {
		int data;
	};
};

int main()
{
	struct test ins = {
		.data = 1,
	};
	return 0;
}

# gcc test.c
# test.c: In function 'main':
# test.c:12: error: unknown field 'data' specified in initializer

This patch fixes the problem by initializing the field in a compatible way.

Signed-off-by: Dongsheng Yang
---
 drivers/block/nvme-lightnvm.c | 58 +++
 1 file changed, 26 insertions(+), 32 deletions(-)

diff --git a/drivers/block/nvme-lightnvm.c b/drivers/block/nvme-lightnvm.c
index 8ad84c9..d1dbc67 100644
--- a/drivers/block/nvme-lightnvm.c
+++ b/drivers/block/nvme-lightnvm.c
@@ -184,13 +184,13 @@ static int init_chnls(struct request_queue *q, struct nvm_id *nvm_id,
 	struct nvme_nvm_id_chnl *src = nvme_nvm_id->chnls;
 	struct nvm_id_chnl *dst = nvm_id->chnls;
 	struct nvme_ns *ns = q->queuedata;
-	struct nvme_nvm_command c = {
-		.nvm_identify.opcode = nvme_nvm_admin_identify,
-		.nvm_identify.nsid = cpu_to_le32(ns->ns_id),
-	};
+	struct nvme_nvm_command c = {};
 	unsigned int len = nvm_id->nchannels;
 	int i, end, ret, off = 0;
 
+	c.nvm_identify.opcode = nvme_nvm_admin_identify;
+	c.nvm_identify.nsid = cpu_to_le32(ns->ns_id);
+
 	while (len) {
 		end = min_t(u32, NVME_NVM_CHNLS_PR_REQ, len);
 
@@ -230,13 +230,12 @@ static int nvme_nvm_identify(struct request_queue *q, struct nvm_id *nvm_id)
 {
 	struct nvme_ns *ns = q->queuedata;
 	struct nvme_nvm_id *nvme_nvm_id;
-	struct nvme_nvm_command c = {
-		.nvm_identify.opcode = nvme_nvm_admin_identify,
-		.nvm_identify.nsid = cpu_to_le32(ns->ns_id),
-		.nvm_identify.chnl_off = 0,
-	};
+	struct nvme_nvm_command c = {};
 	int ret;
 
+	c.nvm_identify.opcode = nvme_nvm_admin_identify;
+	c.nvm_identify.nsid = cpu_to_le32(ns->n
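The portability issue patch 1/2 works around — old gcc (pre-4.6) rejecting designated initializers that name a member of an anonymous union — can be reproduced in a few lines. A minimal sketch with illustrative struct and function names (not the real nvme_nvm_command layout); modern compilers accept both forms, while gcc 4.4 only accepts the assignment style the patch switches to:

```c
#include <assert.h>

/* Same shape as the nvme command struct: members inside an anonymous
 * union (the names here are illustrative, not the real ones). */
struct cmd {
    union {
        struct {
            int opcode;
            int nsid;
        } identify;
        long raw;
    };
};

/* Designated-initializer form: fine on gcc >= 4.6, but gcc 4.4 fails
 * with "unknown field 'identify' specified in initializer". */
static struct cmd make_cmd_designated(int opcode, int nsid)
{
    struct cmd c = { .identify.opcode = opcode, .identify.nsid = nsid };

    return c;
}

/* Portable form used by the patch: zero-initialize the whole struct,
 * then assign the anonymous-union members afterwards. */
static struct cmd make_cmd_portable(int opcode, int nsid)
{
    struct cmd c = {0};

    c.identify.opcode = opcode;
    c.identify.nsid = nsid;
    return c;
}
```

Both constructors produce the same command; only the second compiles on the old toolchain the patch targets.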
Re: [PATCH v7 1/5] lightnvm: Support for Open-Channel SSDs
On 09/02/2015 06:48 PM, Matias Bjørling wrote:

+
+	/* register with device with a supported BM */
+	list_for_each_entry(bt, &nvm_bms, list) {
+		ret = bt->register_bm(dev);
+		if (ret < 0)
+			goto err; /* initialization failed */
+		if (ret > 0) {
+			dev->bm = bt;
+			break; /* successfully initialized */
+		}
+	}

Why search it only from head to tail? Can the user specify it in nvm_create_target()?

Hi Yang, currently only the rrpc and a couple of out-of-tree block managers are built. register_bm only tries to find a block manager that supports the device; when it finds one, that one is initialized. It is an open question how we choose the right block manager when, e.g., a proprietary and an open-source block manager are both in place. Priorities might be a way to go? Or mark certain block managers as a catch-all? Hopefully we will get away with only one or two block managers in the future, so we won't have one for each type of device.

+
+	if (!ret) {
+		pr_info("nvm: no compatible bm was found.\n");
+		return 0;
+	}

If we allow an nvm device to be registered with no bm, we would get a NULL pointer dereference problem in later use.

Yes, definitely.

So here is a suggestion: register_bm again if we find nvm_dev->bm == NULL in create_target(). And if it is still NULL after that, return an error "nvm: no compatible bm was found" and stop target creation. Otherwise, there would be a NULL pointer dereference problem. That's a real problem I met in my testing and I made this change in my local tree. I hope that's useful to you.

In the case that happens, I envision it should be possible to register a block manager after a device is loaded, and then any outstanding devices (which do not have a registered block manager) will be probed again.

As mentioned above, why do we have to choose a bm for the nvm device in nvm_register?

Without a block manager, we don't know the structure of the device and how to interact with it. I want to initialize that as soon as possible, so that layers on top can start interacting.
Thanx
Yang
Re: [PATCH v7 0/5] Support for Open-Channel SSDs
On 08/07/2015 10:29 PM, Matias Bjørling wrote: These patches implement support for Open-Channel SSDs. Applies against axboe's linux-block/for-4.3/drivers and can be found in the lkml_v7 branch at https://github.com/OpenChannelSSD/linux Any feedback is greatly appreciated. Hi Matias, After reading your code, I think it's a great idea. I tried it with null_nvm and qemu-nvm. I have two questions here. (1) Why do we name it lightnvm? IIUC, this framework can work for other flash devices, not only the NVMe protocol. (2) There are gc and bm, but where is the wear leveling? Is it in hardware? Thanx Yang Changes since v6: - Multipage support (Javier Gonzalez) - General code cleanups - Fixed memleak on register failure Changes since v5: Feedback from Christoph Hellwig. - Created new null_nvm from null_blk to register itself as a lightnvm device. - Changed the interface of register/unregister to only take disk_name. The gendisk alloc in nvme is kept. Most instantiations will involve the device gendisk, therefore wait with refactoring to a later time. - Renamed global parameters in core.c and rrpc.c Changes since v4: - Remove gendisk->nvm dependency - Remove device driver rq private field dependency. - Update submission and completion. The flow is now Target -> Block Manager -> Device Driver, replacing callbacks in device driver. - Abstracted out the block manager into its own module. Other block managers can now be implemented. For example to support fully host-based SSDs. - No longer exposes the device driver gendisk to user-space. - Management is moved into /sys/modules/lnvm/parameters/configure_debug Changes since v3: - Remove dependency on REQ_NVM_GC - Refactor nvme integration to use nvme_submit_sync_cmd for internal commands. - Fix race condition bug on multiple threads on RRPC target. - Rename sysfs entry under the block device from nvm to lightnvm. 
The configuration is found in /sys/block/*/lightnvm/ Changes since v2: Feedback from Paul Bolle: - Fix license to GPLv2, documentation, compilation. Feedback from Keith Busch: - nvme: Move lightnvm out and into nvme-lightnvm.c. - nvme: Set controller css on lightnvm command set. - nvme: Remove OACS. Feedback from Christoph Hellwig: - lightnvm: Move out of block layer into /drivers/lightnvm/core.c - lightnvm: refactor request->phys_sector into device drivers. - lightnvm: refactor prep/unprep into device drivers. - lightnvm: move nvm_dev from request_queue to gendisk. New - Bad block table support (From Javier). - Update maintainers file. Changes since v1: - Splitted LightNVM into two parts. A get/put interface for flash blocks and the respective targets that implement flash translation layer logic. - Updated the patches according to the LightNVM specification changes. - Added interface to add/remove targets for a block device. Thanks to Jens Axboe, Christoph Hellwig, Keith Busch, Paul Bolle, Javier Gonzalez and Jesper Madsen for discussions and contributions. 
Matias Bjørling (5): lightnvm: Support for Open-Channel SSDs lightnvm: Hybrid Open-Channel SSD RRPC target lightnvm: Hybrid Open-Channel SSD block manager null_nvm: Lightnvm test driver nvme: LightNVM support MAINTAINERS |8 + drivers/Kconfig |2 + drivers/Makefile |5 + drivers/block/Makefile|2 +- drivers/block/nvme-core.c | 23 +- drivers/block/nvme-lightnvm.c | 568 ++ drivers/lightnvm/Kconfig | 36 ++ drivers/lightnvm/Makefile |8 + drivers/lightnvm/bm_hb.c | 366 drivers/lightnvm/bm_hb.h | 46 ++ drivers/lightnvm/core.c | 591 +++ drivers/lightnvm/null_nvm.c | 481 +++ drivers/lightnvm/rrpc.c | 1296 + drivers/lightnvm/rrpc.h | 236 include/linux/lightnvm.h | 334 +++ include/linux/nvme.h |6 + include/uapi/linux/nvme.h |3 + 17 files changed, 4007 insertions(+), 4 deletions(-) create mode 100644 drivers/block/nvme-lightnvm.c create mode 100644 drivers/lightnvm/Kconfig create mode 100644 drivers/lightnvm/Makefile create mode 100644 drivers/lightnvm/bm_hb.c create mode 100644 drivers/lightnvm/bm_hb.h create mode 100644 drivers/lightnvm/core.c create mode 100644 drivers/lightnvm/null_nvm.c create mode 100644 drivers/lightnvm/rrpc.c create mode 100644 drivers/lightnvm/rrpc.h create mode 100644 include/linux/lightnvm.h -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v7 1/5] lightnvm: Support for Open-Channel SSDs
On 08/07/2015 10:29 PM, Matias Bjørling wrote: Open-channel SSDs are devices that share responsibilities with the host in order to implement and maintain features that typical SSDs keep strictly in firmware. These include (i) the Flash Translation Layer (FTL), (ii) bad block management, and (iii) hardware units such as the flash controller, the interface controller, and a large number of flash chips. In this way, Open-Channel SSDs expose direct access to their physical flash storage, while keeping a subset of the internal features of SSDs. LightNVM is a specification that gives support to Open-Channel SSDs. LightNVM allows the host to manage data placement, garbage collection, and parallelism. Device-specific responsibilities such as bad block management, FTL extensions to support atomic IOs, or metadata persistence are still handled by the device. The implementation of LightNVM consists of two parts: core and (multiple) targets. The core implements functionality shared across targets. This is initialization, teardown and statistics. The targets implement the interface that exposes physical flash to user-space applications. Examples of such targets include key-value store, object-store, as well as traditional block devices, which can be application-specific. 
Contributions in this patch from: Javier Gonzalez Jesper Madsen Signed-off-by: Matias Bjørling --- MAINTAINERS | 8 + drivers/Kconfig | 2 + drivers/Makefile | 5 + drivers/lightnvm/Kconfig | 16 ++ drivers/lightnvm/Makefile | 5 + drivers/lightnvm/core.c | 590 ++ include/linux/lightnvm.h | 335 ++ 7 files changed, 961 insertions(+) create mode 100644 drivers/lightnvm/Kconfig create mode 100644 drivers/lightnvm/Makefile create mode 100644 drivers/lightnvm/core.c create mode 100644 include/linux/lightnvm.h diff --git a/MAINTAINERS b/MAINTAINERS index 2d3d55c..d149104 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -6162,6 +6162,14 @@ S: Supported F:drivers/nvdimm/pmem.c F:include/linux/pmem.h +LIGHTNVM PLATFORM SUPPORT +M: Matias Bjorling +W: http://github/OpenChannelSSD +S: Maintained +F: drivers/lightnvm/ +F: include/linux/lightnvm.h +F: include/uapi/linux/lightnvm.h + LINUX FOR IBM pSERIES (RS/6000) M:Paul Mackerras W:http://www.ibm.com/linux/ltc/projects/ppc diff --git a/drivers/Kconfig b/drivers/Kconfig index 6e973b8..3992902 100644 --- a/drivers/Kconfig +++ b/drivers/Kconfig @@ -42,6 +42,8 @@ source "drivers/net/Kconfig" source "drivers/isdn/Kconfig" +source "drivers/lightnvm/Kconfig" + # input before char - char/joystick depends on it. As does USB. 
source "drivers/input/Kconfig" diff --git a/drivers/Makefile b/drivers/Makefile index b64b49f..75978ab 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -63,6 +63,10 @@ obj-$(CONFIG_FB_I810) += video/fbdev/i810/ obj-$(CONFIG_FB_INTEL) += video/fbdev/intelfb/ obj-$(CONFIG_PARPORT) += parport/ + +# lightnvm/ comes before block to initialize bm before usage +obj-$(CONFIG_NVM) += lightnvm/ + obj-y += base/ block/ misc/ mfd/ nfc/ obj-$(CONFIG_LIBNVDIMM) += nvdimm/ obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/ @@ -165,3 +169,4 @@ obj-$(CONFIG_RAS) += ras/ obj-$(CONFIG_THUNDERBOLT) += thunderbolt/ obj-$(CONFIG_CORESIGHT) += hwtracing/coresight/ obj-$(CONFIG_ANDROID) += android/ + diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig new file mode 100644 index 000..1f8412c --- /dev/null +++ b/drivers/lightnvm/Kconfig @@ -0,0 +1,16 @@ +# +# Open-Channel SSD NVM configuration +# + +menuconfig NVM + bool "Open-Channel SSD target support" + depends on BLOCK + help + Say Y here to get to enable Open-channel SSDs. + + Open-Channel SSDs implement a set of extension to SSDs, that + exposes direct access to the underlying non-volatile memory. + + If you say N, all options in this submenu will be skipped and disabled + only do this if you know what you are doing. + diff --git a/drivers/lightnvm/Makefile b/drivers/lightnvm/Makefile new file mode 100644 index 000..38185e9 --- /dev/null +++ b/drivers/lightnvm/Makefile @@ -0,0 +1,5 @@ +# +# Makefile for Open-Channel SSDs. +# + +obj-$(CONFIG_NVM) := core.o diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c new file mode 100644 index 000..6499922 --- /dev/null +++ b/drivers/lightnvm/core.c @@ -0,0 +1,590 @@ +/* + * Copyright (C) 2015 IT University of Copenhagen + * Initial release: Matias Bjorling + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License version + * 2 as published by the Free Software Foundation. 
+ * + * This program is distributed in
Re: [PATCH 1/2] ubifs: Remove dead xattr code
On 08/26/2015 10:15 PM, Josh Cartwright wrote: On Thu, Aug 20, 2015 at 10:48:38AM +0800, Dongsheng Yang wrote: On 08/20/2015 04:35 AM, Richard Weinberger wrote: This is a partial revert of commit d7f0b70d30ffb9bbe6b8a3e1035cf0b79965ef53 ("UBIFS: Add security.* XATTR support for the UBIFS"). Hi Richard, What about fully reverting this commit? In ubifs, we *can* support any xattr namespace, including user, trusted, security, or anything prefixed by any word. But we have a check_namespace() in xattr.c to limit what we want to support. That said, if we want to "Add security.* XATTR support for the UBIFS", all we need to do is extend check_namespace() to let the security namespace pass. And yes, check_namespace() has been supporting the security namespace. Is this good enough? Yes, it'd mean that the xattrs end up on disk, but then who's responsible for invoking the selected LSM's inode_init_security() hooks? AFAICT, we'd still need to invoke security_inode_init_security for newly created inodes (which Richard's proposed commit still does). OH, right. My bad, I missed the security_inode_init_security(). Besides allowing the security.* prefix in xattr, we still need to call security_inode_init_security() in ubifs_create(). That's true. So what we need to remove is only the ubifs_xattr_handlers. Thanx Josh, you are right. And Richard, sorry for my confusion. Reviewed-by: Dongsheng Yang Thanx Yang Thanks, Josh (who, admittedly, is neither a filesystem nor security module guy :)
Re: [PATCH 2/2] ubifs: Allow O_DIRECT
On 08/24/2015 04:03 PM, Christoph Hellwig wrote: On Mon, Aug 24, 2015 at 11:02:42AM +0300, Artem Bityutskiy wrote: Back when we were writing UBIFS, we did not need direct IO, so we did not implement it. But yes, probably someone who cares could just try implementing this feature. So I think the answer here is to implement a real version instead of adding hacks to pretend it is supported. Actually, I had a plan for it. But it's for 4.4 or 4.5 in my plan. Yang
Re: [PATCH 2/2] ubifs: Allow O_DIRECT
On 08/24/2015 03:17 PM, Dongsheng Yang wrote: On 08/24/2015 03:18 PM, Artem Bityutskiy wrote: On Thu, 2015-08-20 at 15:34 +0300, Artem Bityutskiy wrote: On Thu, 2015-08-20 at 13:40 +0200, Richard Weinberger wrote: Basically, we need to see what is the "common practice" here, and follow it. This requires a bit of research. What would be the most popular Linux FS which does not support direct I/O? Can we check what it does? All popular filesystems seem to support direct IO. That's the problem, applications do not expect O_DIRECT to fail. My intention was to do it like exofs: Fair enough, thanks! Signed-off-by: Artem Bityutskiy Richard, you mention this was suggested by Dave, could you please point to the discussion, if possible? http://lists.infradead.org/pipermail/linux-mtd/2015-August/060702.html That's from a discussion where I want to introduce ubifs into xfstests. And Artem, I cc-ed that discussion to you, maybe you can find it in your Inbox. Yang Artem.
Re: [PATCH 2/2] ubifs: Allow O_DIRECT
On 08/24/2015 03:18 PM, Artem Bityutskiy wrote: On Thu, 2015-08-20 at 15:34 +0300, Artem Bityutskiy wrote: On Thu, 2015-08-20 at 13:40 +0200, Richard Weinberger wrote: Basically, we need to see what is the "common practice" here, and follow it. This requires a bit of research. What would be the most popular Linux FS which does not support direct I/O? Can we check what it does? All popular filesystems seem to support direct IO. That's the problem, applications do not expect O_DIRECT to fail. My intention was to do it like exofs: Fair enough, thanks! Signed-off-by: Artem Bityutskiy Richard, you mention this was suggested by Dave, could you please point to the discussion, if possible? http://lists.infradead.org/pipermail/linux-mtd/2015-August/060702.html That's from a discussion where I want to introduce ubifs into xfstests. Yang Artem.
Re: [PATCH 2/2] ubifs: Allow O_DIRECT
On 08/20/2015 02:42 PM, Richard Weinberger wrote: Yang, (Sorry if I've used your last name lately) Haha, that's fine. My friends in China all call me Dongsheng. :) Am 20.08.2015 um 05:00 schrieb Dongsheng Yang: On 08/20/2015 04:35 AM, Richard Weinberger wrote: Currently UBIFS does not support direct IO, but some applications blindly use the O_DIRECT flag. Instead of failing upon open() we can do better and fall back to buffered IO. Hmm, to be honest, I am not sure we have to do it as Dave suggested. I think that's just a work-around for the current fstests. IMHO, performing buffered IO when the user requests direct IO, without any warning, doesn't sound like a good idea. Maybe adding a warning would make it better. Well, how would you inform the user? A printk() to dmesg is useless, as the vast majority of open() callers do not check dmesg... :) Major filesystems implement ->direct_IO these days and having a "return 0"-stub seems to be legit. For example exofs does too. So, I really don't consider it a work-around. Hmmm, then I am okay with this idea now. I think we need more discussion about AIO in ubifs, and actually I have a plan for it. But I have not listed all the pros and cons of it so far. Sure, having a real ->direct_IO would be the best solution. My patch won't block this. Yes, agreed. So let's return 0 for now. Yang Thanks, //richard
Re: [PATCH 2/2] ubifs: Allow O_DIRECT
On 08/20/2015 04:35 AM, Richard Weinberger wrote:
Currently UBIFS does not support direct IO, but some applications blindly use the O_DIRECT flag. Instead of failing upon open() we can do better and fall back to buffered IO.

Hmm, to be honest, I am not sure we have to do it as Dave suggested. I think that's just a work-around for the current fstests. IMHO, performing buffered IO when the user requests direct IO, without any warning, does not sound like a good idea. Maybe adding a warning would make it better.

I think we need more discussion about AIO in ubifs, and actually I have a plan for it. But I have not listed all the cons and pros of it so far.

Artem, what's your opinion?

Yang

Cc: Dongsheng Yang
Cc: dedeki...@gmail.com
Suggested-by: Dave Chinner
Signed-off-by: Richard Weinberger
---
 fs/ubifs/file.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index a3dfe2a..a61fe86 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1540,6 +1540,15 @@ static int ubifs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 
+/*
+ * For now fall back to buffered IO.
+ */
+static ssize_t ubifs_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
+			       loff_t offset)
+{
+	return 0;
+}
+
 const struct address_space_operations ubifs_file_address_operations = {
 	.readpage       = ubifs_readpage,
 	.writepage      = ubifs_writepage,
@@ -1548,6 +1557,7 @@ const struct address_space_operations ubifs_file_address_operations = {
 	.invalidatepage = ubifs_invalidatepage,
 	.set_page_dirty = ubifs_set_page_dirty,
 	.releasepage    = ubifs_releasepage,
+	.direct_IO      = ubifs_direct_IO,
 };
 
 const struct inode_operations ubifs_file_inode_operations = {
Re: [PATCH 1/2] ubifs: Remove dead xattr code
On 08/20/2015 04:35 AM, Richard Weinberger wrote:
This is a partial revert of commit d7f0b70d30ffb9bbe6b8a3e1035cf0b79965ef53 ("UBIFS: Add security.* XATTR support for the UBIFS").

Hi Richard,

What about fully reverting this commit? In ubifs, we *can* support any namespace of xattr, including user, trusted, security or any other one prefixed by any word. But we have a check_namespace() in xattr.c to limit what we want to support. That said, if we want to "Add security.* XATTR support for the UBIFS", all we need to do is extend check_namespace() to let the security namespace pass. And yes, check_namespace() already supports the security namespace. So, IMHO, we do not depend on the generic mechanism at all, and we can fully revert this commit. But it is strange to me why we picked this commit for ubifs. Artem, is there something I am missing?

Yang

As UBIFS does not use the generic inode xattr operations, the code behind sb->s_xattr is never called. Remove that dead code for now.

Cc: Subodh Nijsure
Cc: Marc Kleine-Budde
Cc: Brad Mouring
Cc: Gratian Crisan
Cc: Artem Bityutskiy
Reported-by: Andreas Grünbacher
Signed-off-by: Richard Weinberger
---
After the merge window (and my vacation) I'll have the time to re-introduce/rework security xattr support.

Thanks,
//richard
---
 fs/ubifs/super.c |  1 -
 fs/ubifs/ubifs.h |  1 -
 fs/ubifs/xattr.c | 40 ----------------------------------------
 3 files changed, 42 deletions(-)

diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 9547a278..c71edca 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -2037,7 +2037,6 @@ static int ubifs_fill_super(struct super_block *sb, void *data, int silent)
 	if (c->max_inode_sz > MAX_LFS_FILESIZE)
 		sb->s_maxbytes = c->max_inode_sz = MAX_LFS_FILESIZE;
 	sb->s_op = &ubifs_super_operations;
-	sb->s_xattr = ubifs_xattr_handlers;
 
 	mutex_lock(&c->umount_mutex);
 	err = mount_ubifs(c);
diff --git a/fs/ubifs/ubifs.h b/fs/ubifs/ubifs.h
index de75902..33b6ee7 100644
--- a/fs/ubifs/ubifs.h
+++ b/fs/ubifs/ubifs.h
@@ -1470,7 +1470,6 @@ extern spinlock_t ubifs_infos_lock;
 extern atomic_long_t ubifs_clean_zn_cnt;
 extern struct kmem_cache *ubifs_inode_slab;
 extern const struct super_operations ubifs_super_operations;
-extern const struct xattr_handler *ubifs_xattr_handlers[];
 extern const struct address_space_operations ubifs_file_address_operations;
 extern const struct file_operations ubifs_file_operations;
 extern const struct inode_operations ubifs_file_inode_operations;
diff --git a/fs/ubifs/xattr.c b/fs/ubifs/xattr.c
index 96f3448..b512b14 100644
--- a/fs/ubifs/xattr.c
+++ b/fs/ubifs/xattr.c
@@ -582,46 +582,6 @@ out_free:
 	return err;
 }
 
-static size_t security_listxattr(struct dentry *d, char *list, size_t list_size,
-				 const char *name, size_t name_len, int flags)
-{
-	const int prefix_len = XATTR_SECURITY_PREFIX_LEN;
-	const size_t total_len = prefix_len + name_len + 1;
-
-	if (list && total_len <= list_size) {
-		memcpy(list, XATTR_SECURITY_PREFIX, prefix_len);
-		memcpy(list + prefix_len, name, name_len);
-		list[prefix_len + name_len] = '\0';
-	}
-
-	return total_len;
-}
-
-static int security_getxattr(struct dentry *d, const char *name, void *buffer,
-			     size_t size, int flags)
-{
-	return ubifs_getxattr(d, name, buffer, size);
-}
-
-static int security_setxattr(struct dentry *d, const char *name,
-			     const void *value, size_t size, int flags,
-			     int handler_flags)
-{
-	return ubifs_setxattr(d, name, value, size, flags);
-}
-
-static const struct xattr_handler ubifs_xattr_security_handler = {
-	.prefix = XATTR_SECURITY_PREFIX,
-	.list   = security_listxattr,
-	.get    = security_getxattr,
-	.set    = security_setxattr,
-};
-
-const struct xattr_handler *ubifs_xattr_handlers[] = {
-	&ubifs_xattr_security_handler,
-	NULL,
-};
-
 static int init_xattrs(struct inode *inode, const struct xattr *xattr_array,
 		       void *fs_info)
 {
Re: [PATCH] ubifs: Kill unneeded locking in ubifs_init_security
On 07/08/2015 05:46 PM, Richard Weinberger wrote:
Fixes the following lockdep splat:

[1.244527] =============================================
[1.245193] [ INFO: possible recursive locking detected ]
[1.245193] 4.2.0-rc1+ #37 Not tainted
[1.245193] ---------------------------------------------
[1.245193] cp/742 is trying to acquire lock:
[1.245193] (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [812b3f69] ubifs_init_security+0x29/0xb0
[1.245193]
[1.245193] but task is already holding lock:
[1.245193] (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [81198e7f] path_openat+0x3af/0x1280
[1.245193]
[1.245193] other info that might help us debug this:
[1.245193] Possible unsafe locking scenario:
[1.245193]
[1.245193]        CPU0
[1.245193]        ----
[1.245193]   lock(&sb->s_type->i_mutex_key#9);
[1.245193]   lock(&sb->s_type->i_mutex_key#9);
[1.245193]
[1.245193]  *** DEADLOCK ***
[1.245193]
[1.245193] May be due to missing lock nesting notation
[1.245193]
[1.245193] 2 locks held by cp/742:
[1.245193] #0: (sb_writers#5){.+.+.+}, at: [811ad37f] mnt_want_write+0x1f/0x50
[1.245193] #1: (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [81198e7f] path_openat+0x3af/0x1280
[1.245193]
[1.245193] stack backtrace:
[1.245193] CPU: 2 PID: 742 Comm: cp Not tainted 4.2.0-rc1+ #37
[1.245193] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140816_022509-build35 04/01/2014
[1.245193] 8252d530 88007b023a38 814f6f49 810b56c5
[1.245193] 88007c30cc80 88007b023af8 810a150d 88007b023a68
[1.245193] 8101302a 8800 0008f447e23f 8252d500
[1.245193] Call Trace:
[1.245193] [814f6f49] dump_stack+0x4c/0x65
[1.245193] [810b56c5] ? console_unlock+0x1c5/0x510
[1.245193] [810a150d] __lock_acquire+0x1a6d/0x1ea0
[1.245193] [8109fa78] ? __lock_is_held+0x58/0x80
[1.245193] [810a1a93] lock_acquire+0xd3/0x270
[1.245193] [812b3f69] ? ubifs_init_security+0x29/0xb0
[1.245193] [814fc83b] mutex_lock_nested+0x6b/0x3a0
[1.245193] [812b3f69] ? ubifs_init_security+0x29/0xb0
[1.245193] [812b3f69] ? ubifs_init_security+0x29/0xb0
[1.245193] [812b3f69] ubifs_init_security+0x29/0xb0
[1.245193] [8128e286] ubifs_create+0xa6/0x1f0
[1.245193] [81198e7f] ? path_openat+0x3af/0x1280
[1.245193] [81195d15] vfs_create+0x95/0xc0
[1.245193] [8119929c] path_openat+0x7cc/0x1280
[1.245193] [8109ffe3] ? __lock_acquire+0x543/0x1ea0
[1.245193] [81088f20] ? sched_clock_cpu+0x90/0xc0
[1.245193] [81088c00] ? calc_global_load_tick+0x60/0x90
[1.245193] [81088f20] ? sched_clock_cpu+0x90/0xc0
[1.245193] [811a9cef] ? __alloc_fd+0xaf/0x180
[1.245193] [8119ac55] do_filp_open+0x75/0xd0
[1.245193] [814ffd86] ? _raw_spin_unlock+0x26/0x40
[1.245193] [811a9cef] ? __alloc_fd+0xaf/0x180
[1.245193] [81189bd9] do_sys_open+0x129/0x200
[1.245193] [81189cc9] SyS_open+0x19/0x20
[1.245193] [81500717] entry_SYSCALL_64_fastpath+0x12/0x6f

While the lockdep splat is a false positive, because path_openat() holds the i_mutex of the parent directory and ubifs_init_security() tries to acquire the i_mutex of a new inode, it reveals that taking i_mutex in ubifs_init_security() is in vain: it is only called in the inode allocation path, so nobody else can see the inode yet.

Yes, makes sense to me. Reviewed and Tested.

Thanx
Yang

Reported-by: Boris Brezillon
Signed-off-by: Richard Weinberger
---
 fs/ubifs/xattr.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/ubifs/xattr.c b/fs/ubifs/xattr.c
index 96f3448..fd65b3f 100644
--- a/fs/ubifs/xattr.c
+++ b/fs/ubifs/xattr.c
@@ -652,11 +652,8 @@ int ubifs_init_security(struct inode *dentry, struct inode *inode,
 {
 	int err;
 
-	mutex_lock(&inode->i_mutex);
 	err = security_inode_init_security(inode, dentry, qstr,
 					   &init_xattrs, 0);
-	mutex_unlock(&inode->i_mutex);
-
 	if (err) {
 		struct ubifs_info *c = dentry->i_sb->s_fs_info;
 		ubifs_err(c, "cannot initialize security for inode %lu, error %d",
Re: [PATCH V2] cgroup: Introduce cgroup_detach_task().
On Tue, Aug 26, 2014 at 9:35 AM, Li Zefan wrote:
> On 2014/8/25 23:00, Dongsheng Yang wrote:
>> On Mon, Aug 25, 2014 at 10:47 PM, Tejun Heo wrote:
>>> On Mon, Aug 25, 2014 at 10:46:03PM +0800, Dongsheng Yang wrote:
>>>> My point here is that attaching and detaching are a pair of operations.
>>>
>>> There is no detaching from a cgroup. A task is always attached to a
>>> cgroup whether that's a root or non-root cgroup.
>>
>> Okay, I should not think of it as attaching and detaching. Just treat it as
>> a move between root and non-root cgroups.
>>
>> It sounds reasonable to me now.
>
> I from time to time have to explain this to other people.

Ha, we usually want to find a detach function when we see a function named cgroup_attach_task().

> --
Re: [PATCH V2] cgroup: Introduce cgroup_detach_task().
On Mon, Aug 25, 2014 at 10:47 PM, Tejun Heo wrote:
> On Mon, Aug 25, 2014 at 10:46:03PM +0800, Dongsheng Yang wrote:
>> My point here is that attaching and detaching are a pair of operations.
>
> There is no detaching from a cgroup. A task is always attached to a
> cgroup whether that's a root or non-root cgroup.

Okay, I should not think of it as attaching and detaching. Just treat it as a move between root and non-root cgroups.

It sounds reasonable to me now.

Thanx.

>
> Thanks.
>
> --
> tejun
Re: [PATCH] cgroup: Enable controllers for default hierarchy by default.
Hi tj,

On Mon, Aug 25, 2014 at 10:40 PM, Tejun Heo wrote:
> Hello, Dongsheng.
>
> On Mon, Aug 25, 2014 at 07:27:32PM +0800, Dongsheng Yang wrote:
>> When we create a cgroup in the unified hierarchy, we have to enable
>> controllers in cgrp_dfl_root.subtree_control manually. From
>> my practice, I did not find the benefit of disabling controllers
>> in cgrp_dfl_root by default.
>
> Hehe, I actually enjoyed the frankness. No fudging around, basically
> just "I used it for a bit and it didn't feel useful to me".
>
>> As I am a newbie to cgroup, please correct me if I am wrong.
>
> I don't think being new to cgroup is important here. Regardless of
> the specific area, you should be able to build and argue the
> rationales for the changes you propose. This is important because not
> only do the changes need to be justified, but the process will also help
> you actually think about and analyze the changes that you're
> proposing. If you can't build strong enough rationales for a given
> change, it probably shouldn't be proposed.
>
> Here, your rationale is nothing more than "I played with it 5 mins and
> it seemed weird to me". There was no effort in understanding the
> overall design, use cases or implications. You're just driving by
> shooting random patches expecting other people to do what you should
> have done before submitting the patches. If you want to change the
> behavior, please first study and think about it and ask specific
> questions when unsure.

Yes, I met something I was unsure about in my studying, and I thought sending a patch to show my question was an acceptable way. Sorry for that.

> Submitting basically random patches expecting
> others to argue for or against it is one of the worst ways to learn
> about a subsystem. Please don't ever do this regardless of the
> subsystem in question.

Shame on me, and sorry for troubling you. I really have some questions in my study, but I should not have posted them via a patch. I need more reading and thinking. If in the end I can't answer my questions by myself or via Google, then I will ask you experts.

Thanx

> Nacked-by: Tejun Heo
>
> Thanks.
>
> --
> tejun
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V2] cgroup: Introduce cgroup_detach_task().
Hi tj.

On Mon, Aug 25, 2014 at 10:12 PM, Tejun Heo wrote:
> On Mon, Aug 25, 2014 at 07:32:10PM +0800, Dongsheng Yang wrote:
>> Currently, the only method to detach a task from a cgroup is moving
>> it to others. It looks not natural to me.
>
> Ummm... how is this different from moving it to the root cgroup?
> "just because" usually isn't a good enough rationale.

Thanx for your reply :).

My point here is that attaching and detaching are a pair of operations, and the things involved in these operations should be a process and a cgroup. But currently, when we want to detach a process from a cgroup (A), we have to attach it to another cgroup (B). I think it is not easy to understand. People would say: "Why, when I want to detach a process from A, do I have to care about another cgroup?"

So I created this patch wanting to make these operations easier to understand, and to make attaching and detaching antonymous in action. I think maybe we can use it in the Cgroup V2 which is in development. :)

That's all my reasoning for this patch. But you could think it makes no sense; that's okay with me.

Thanx

> Thanks.
>
> --
> tejun
Re: [PATCH] cgroup: Introduce cgroup_detach_task().
Sorry for the noise. :( This patch is not a correct version. Please ignore it; I have sent a v2 for it.

Thanx

On 08/25/2014 07:28 PM, Dongsheng Yang wrote:
Currently, the only method to detach a task from a cgroup is moving it to another one. It looks not natural to me. Inspired by cgroup_subtree_control_write(), this patch allows the user to attach/detach a process to/from a cgroup by echoing "+/-pid" to cgroup.procs. In addition, we keep the old method, allowing the user to echo "pid" without "+/-" to cgroup.procs for attaching behavior.

Signed-off-by: Dongsheng Yang
---
 kernel/cgroup.c | 62 +++--
 1 file changed, 60 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7dc8788..11ef4ac 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2348,6 +2348,43 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
 	return ret;
 }
 
+/**
+ * cgroup_detach_task - detach a task or a whole threadgroup from a cgroup
+ * @src_cgrp: the cgroup to detach from
+ * @leader: the task or the leader of the threadgroup to be detached
+ * @threadgroup: detach the whole threadgroup?
+ *
+ * Call holding cgroup_mutex and threadgroup_lock of @leader.
+ */
+static int cgroup_detach_task(struct cgroup *src_cgrp __maybe_unused,
+			      struct task_struct *leader, bool threadgroup)
+{
+	LIST_HEAD(preloaded_csets);
+	struct task_struct *task;
+	int ret;
+
+	/* look up all src csets */
+	down_read(&css_set_rwsem);
+	rcu_read_lock();
+	task = leader;
+	do {
+		cgroup_migrate_add_src(task_css_set(task), &cgrp_dfl_root.cgrp,
+				       &preloaded_csets);
+		if (!threadgroup)
+			break;
+	} while_each_thread(leader, task);
+	rcu_read_unlock();
+	up_read(&css_set_rwsem);
+
+	/* prepare dst csets and commit */
+	ret = cgroup_migrate_prepare_dst(&cgrp_dfl_root.cgrp, &preloaded_csets);
+	if (!ret)
+		ret = cgroup_migrate(&cgrp_dfl_root.cgrp, leader, threadgroup);
+
+	cgroup_migrate_finish(&preloaded_csets);
+	return ret;
+}
+
 /*
  * Find the task_struct of the task to attach by vpid and pass it along to the
  * function to attach either it or all tasks in its threadgroup. Will lock
@@ -2361,8 +2398,26 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 	struct cgroup *cgrp;
 	pid_t pid;
 	int ret;
+	bool attach;
 
-	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
+	/*
+	 * Parse input - space separated list of subsystem names prefixed
+	 * with either + or -.
+	 */
+	buf = strstrip(buf);
+	if (*buf == '+') {
+		attach = true;
+		buf++;
+	} else if (*buf == '-') {
+		attach = false;
+		buf++;
+	} else {
+		if (!isdigit(*buf))
+			return -EINVAL;
+		attach = true;
+	}
+
+	if (kstrtoint(buf, 0, &pid) || pid < 0)
 		return -EINVAL;
 
 	cgrp = cgroup_kn_lock_live(of->kn);
@@ -2426,7 +2481,10 @@ retry_find_task:
 		}
 	}
 
-	ret = cgroup_attach_task(cgrp, tsk, threadgroup);
+	if (attach)
+		ret = cgroup_attach_task(cgrp, tsk, threadgroup);
+	else
+		ret = cgroup_detach_task(cgrp, tsk, threadgroup);
 
 	threadgroup_unlock(tsk);
[PATCH V2] cgroup: Introduce cgroup_detach_task().
Currently, the only method to detach a task from a cgroup is moving it to another one. It looks not natural to me. Inspired by cgroup_subtree_control_write(), this patch allows the user to attach/detach a process to/from a cgroup by echoing "+/-pid" to cgroup.procs. In addition, we keep the old method, allowing the user to echo "pid" without "+/-" to cgroup.procs for attaching behavior.

Signed-off-by: Dongsheng Yang
---
From v1:
 *Sorry for an incorrect comment in v1.

 kernel/cgroup.c | 61 +++--
 1 file changed, 59 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7dc8788..a4bb604 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2348,6 +2348,43 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
 	return ret;
 }
 
+/**
+ * cgroup_detach_task - detach a task or a whole threadgroup from a cgroup
+ * @src_cgrp: the cgroup to detach from
+ * @leader: the task or the leader of the threadgroup to be detached
+ * @threadgroup: detach the whole threadgroup?
+ *
+ * Call holding cgroup_mutex and threadgroup_lock of @leader.
+ */
+static int cgroup_detach_task(struct cgroup *src_cgrp __maybe_unused,
+			      struct task_struct *leader, bool threadgroup)
+{
+	LIST_HEAD(preloaded_csets);
+	struct task_struct *task;
+	int ret;
+
+	/* look up all src csets */
+	down_read(&css_set_rwsem);
+	rcu_read_lock();
+	task = leader;
+	do {
+		cgroup_migrate_add_src(task_css_set(task), &cgrp_dfl_root.cgrp,
+				       &preloaded_csets);
+		if (!threadgroup)
+			break;
+	} while_each_thread(leader, task);
+	rcu_read_unlock();
+	up_read(&css_set_rwsem);
+
+	/* prepare dst csets and commit */
+	ret = cgroup_migrate_prepare_dst(&cgrp_dfl_root.cgrp, &preloaded_csets);
+	if (!ret)
+		ret = cgroup_migrate(&cgrp_dfl_root.cgrp, leader, threadgroup);
+
+	cgroup_migrate_finish(&preloaded_csets);
+	return ret;
+}
+
 /*
  * Find the task_struct of the task to attach by vpid and pass it along to the
  * function to attach either it or all tasks in its threadgroup. Will lock
@@ -2361,8 +2398,25 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 	struct cgroup *cgrp;
 	pid_t pid;
 	int ret;
+	bool attach;
 
-	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
+	/*
+	 * Parse input - pid prefixed with either + or -.
+	 */
+	buf = strstrip(buf);
+	if (*buf == '+') {
+		attach = true;
+		buf++;
+	} else if (*buf == '-') {
+		attach = false;
+		buf++;
+	} else {
+		if (!isdigit(*buf))
+			return -EINVAL;
+		attach = true;
+	}
+
+	if (kstrtoint(buf, 0, &pid) || pid < 0)
 		return -EINVAL;
 
 	cgrp = cgroup_kn_lock_live(of->kn);
@@ -2426,7 +2480,10 @@ retry_find_task:
 		}
 	}
 
-	ret = cgroup_attach_task(cgrp, tsk, threadgroup);
+	if (attach)
+		ret = cgroup_attach_task(cgrp, tsk, threadgroup);
+	else
+		ret = cgroup_detach_task(cgrp, tsk, threadgroup);
 
 	threadgroup_unlock(tsk);
--
1.8.4.2
[PATCH] cgroup: Introduce cgroup_detach_task().
Currently, the only way to detach a task from a cgroup is to move it to
another one, which does not feel natural. Inspired by
cgroup_subtree_control_write(), this patch allows the user to attach or
detach a process to/from a cgroup by echoing "+pid" or "-pid" to
cgroup.procs. In addition, the old method is kept: echoing "pid"
without a "+"/"-" prefix still behaves as an attach.

Signed-off-by: Dongsheng Yang <yangds.f...@cn.fujitsu.com>
---
 kernel/cgroup.c | 62 +++--
 1 file changed, 60 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7dc8788..11ef4ac 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2348,6 +2348,43 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
 	return ret;
 }
 
+/**
+ * cgroup_detach_task - detach a task or a whole threadgroup from a cgroup
+ * @src_cgrp: the cgroup to detach from
+ * @leader: the task or the leader of the threadgroup to be detached
+ * @threadgroup: detach the whole threadgroup?
+ *
+ * Call holding cgroup_mutex and threadgroup_lock of @leader.
+ */
+static int cgroup_detach_task(struct cgroup *src_cgrp __maybe_unused,
+			      struct task_struct *leader, bool threadgroup)
+{
+	LIST_HEAD(preloaded_csets);
+	struct task_struct *task;
+	int ret;
+
+	/* look up all src csets */
+	down_read(&css_set_rwsem);
+	rcu_read_lock();
+	task = leader;
+	do {
+		cgroup_migrate_add_src(task_css_set(task), &cgrp_dfl_root.cgrp,
+				       &preloaded_csets);
+		if (!threadgroup)
+			break;
+	} while_each_thread(leader, task);
+	rcu_read_unlock();
+	up_read(&css_set_rwsem);
+
+	/* prepare dst csets and commit */
+	ret = cgroup_migrate_prepare_dst(&cgrp_dfl_root.cgrp, &preloaded_csets);
+	if (!ret)
+		ret = cgroup_migrate(&cgrp_dfl_root.cgrp, leader, threadgroup);
+
+	cgroup_migrate_finish(&preloaded_csets);
+	return ret;
+}
+
 /*
  * Find the task_struct of the task to attach by vpid and pass it along to the
  * function to attach either it or all tasks in its threadgroup. Will lock
@@ -2361,8 +2398,26 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 	struct cgroup *cgrp;
 	pid_t pid;
 	int ret;
+	bool attach;
 
-	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
+	/*
+	 * Parse input - space separated list of subsystem names prefixed
+	 * with either + or -.
+	 */
+	buf = strstrip(buf);
+	if (*buf == '+') {
+		attach = true;
+		buf++;
+	} else if (*buf == '-') {
+		attach = false;
+		buf++;
+	} else {
+		if (!isdigit(*buf))
+			return -EINVAL;
+		attach = true;
+	}
+
+	if (kstrtoint(buf, 0, &pid) || pid < 0)
 		return -EINVAL;
 
 	cgrp = cgroup_kn_lock_live(of->kn);
@@ -2426,7 +2481,10 @@ retry_find_task:
 		}
 	}
 
-	ret = cgroup_attach_task(cgrp, tsk, threadgroup);
+	if (attach)
+		ret = cgroup_attach_task(cgrp, tsk, threadgroup);
+	else
+		ret = cgroup_detach_task(cgrp, tsk, threadgroup);
 
 	threadgroup_unlock(tsk);
-- 
1.8.4.2
[PATCH] cgroup: Enable controllers for default hierarchy by default.
When we create a cgroup in the unified hierarchy, we have to enable
controllers in cgrp_dfl_root.subtree_control manually. In my testing, I
did not find any benefit in disabling controllers in cgrp_dfl_root by
default. As I am a newbie to cgroup, please correct me if I am wrong.
Thanks.

Signed-off-by: Dongsheng Yang <yangds.f...@cn.fujitsu.com>
---
 kernel/cgroup.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7dc8788..70996ee 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1267,12 +1267,9 @@ static int rebind_subsystems(struct cgroup_root *dst_root, unsigned int ss_mask)
 		src_root->cgrp.subtree_control &= ~(1 << ssid);
 		cgroup_refresh_child_subsys_mask(&src_root->cgrp);
 
-		/* default hierarchy doesn't enable controllers by default */
 		dst_root->subsys_mask |= 1 << ssid;
-		if (dst_root != &cgrp_dfl_root) {
-			dst_root->cgrp.subtree_control |= 1 << ssid;
-			cgroup_refresh_child_subsys_mask(&dst_root->cgrp);
-		}
+		dst_root->cgrp.subtree_control |= 1 << ssid;
+		cgroup_refresh_child_subsys_mask(&dst_root->cgrp);
 
 		if (ss->bind)
 			ss->bind(css);
@@ -4990,6 +4987,8 @@ int __init cgroup_init(void)
 		}
 	}
 
+	cgrp_dfl_root.cgrp.subtree_control = cgrp_dfl_root.subsys_mask;
+
 	cgroup_kobj = kobject_create_and_add("cgroup", fs_kobj);
 	if (!cgroup_kobj)
 		return -ENOMEM;
-- 
1.8.4.2
[PATCH] cgroup: fix a typo in comment.
There is no function named cgroup_enable_task_cg_links(). The correct
function name in this comment is cgroup_enable_task_cg_lists().

Signed-off-by: Dongsheng Yang <yangds.f...@cn.fujitsu.com>
---
 kernel/cgroup.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7dc8788..64bbb56 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5161,7 +5161,7 @@ void cgroup_post_fork(struct task_struct *child)
 	int i;
 
 	/*
-	 * This may race against cgroup_enable_task_cg_links(). As that
+	 * This may race against cgroup_enable_task_cg_lists(). As that
 	 * function sets use_task_css_set_links before grabbing
 	 * tasklist_lock and we just went through tasklist_lock to add
 	 * @child, it's guaranteed that either we see the set
@@ -5176,7 +5176,7 @@ void cgroup_post_fork(struct task_struct *child)
 	 * when implementing operations which need to migrate all tasks of
 	 * a cgroup to another.
 	 *
-	 * Note that if we lose to cgroup_enable_task_cg_links(), @child
+	 * Note that if we lose to cgroup_enable_task_cg_lists(), @child
 	 * will remain in init_css_set. This is safe because all tasks are
 	 * in the init_css_set before cg_links is enabled and there's no
 	 * operation which transfers all tasks out of init_css_set.
-- 
1.8.4.2