Re: [Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling
Dima,

I agree that the ploop barrier code is broken in many ways, but I don't think the patch actually fixes it. I hope you would agree that completion of REQ_FUA guarantees only that this particular bio has landed on the disk; it says nothing about flushing previously submitted (and completed) bios. It is also possible that a power outage catches us when this REQ_FUA has already landed on the disk, but the previous bios have not yet.

Hence, for RELOC_{A|S} requests we actually need something like this:

     RELOC_S: R1, W2, FLUSH:WB, WBI:FUA
     RELOC_A: R1, W2, FLUSH:WB, WBI:FUA, W1:NULLIFY:FUA

(i.e. we do need to flush all previously submitted data before starting to update the BAT on disk)

not simply:

     RELOC_S: R1, W2, WBI:FUA
     RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA

Also, the patch makes the meaning of PLOOP_REQ_FORCE_FUA and PLOOP_REQ_FORCE_FLUSH even more obscure than it used to be. I think we could remove them completely (along with that optimization delaying incoming FUA) and re-implement all this stuff from scratch:

1) The final "NULLIFY:FUA" is a piece of cake -- it's enough to set REQ_FUA in preq->req_rw before calling ->submit(preq)

2) For "FLUSH:WB, WBI:FUA" it is actually enough to send the bio updating the BAT on disk as REQ_FLUSH|REQ_FUA -- we can specify it explicitly for RELOC_A|S in ploop_index_update and map_wb_complete

3) For that optimization delaying incoming FUA (what we do now if ploop_req_delay_fua_possible() returns true) we could introduce a new ad-hoc PLOOP_IO_FLUSH_DELAYED enforcing REQ_FLUSH in ploop_index_update and map_wb_complete (the same thing as 2) above). And, yes, let's WARN_ON if we somehow missed its processing.

The only complication I foresee is how to teach kaio to pre-flush in kaio_write_page -- it's doable, but involves kaio_resubmit, which is already pretty convoluted.

Btw, I accidentally noticed an awful, silly bug in kaio_complete_io_state(): we check for REQ_FUA after clearing it! This makes all FUAs on the ordinary kaio_submit path silently lost...

Thanks, Maxim On 06/15/2016 07:49 AM, Dmitry Monakhov wrote: barrier code is broken in many ways: Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly. But request also can goes though ->dio_submit_alloc()->dio_submit_pad and write_page (for indexes) So in case of grow_dev we have following sequance: E_RELOC_DATA_READ: ->set_bit(PLOOP_REQ_FORCE_FUA, >state); ->delta->allocate ->io->submit_allloc: dio_submit_alloc ->dio_submit_pad E_DATA_WBI : data written, time to update index ->delta->allocate_complete:ploop_index_update ->set_bit(PLOOP_REQ_FORCE_FUA, >state); ->write_page ->ploop_map_wb_complete ->ploop_wb_complete_post_process ->set_bit(PLOOP_REQ_FORCE_FUA, >state); E_RELOC_NULLIFY: ->submit() This patch unify barrier handling like follows: - Add assertation to ploop_complete_request for FORCE_{FLUSH,FUA} state - Perform explicit FUA inside index_update for RELOC requests. This makes reloc sequence optimal: RELOC_S: R1, W2, WBI:FUA RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA https://jira.sw.ru/browse/PSBM-47107 Signed-off-by: Dmitry Monakhov--- drivers/block/ploop/dev.c | 10 +++--- drivers/block/ploop/map.c | 29 - 2 files changed, 19 insertions(+), 20 deletions(-) diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c index 96f7850..998fe71 100644 --- a/drivers/block/ploop/dev.c +++ b/drivers/block/ploop/dev.c @@ -1224,6 +1224,11 @@ static void ploop_complete_request(struct ploop_request * preq) __TRACE("Z %p %u\n", preq, preq->req_cluster); + if (!preq->error) { + unsigned long state = READ_ONCE(preq->state); + WARN_ON(state & (1 << PLOOP_REQ_FORCE_FUA)); + WARN_ON(state & (1 < bl.head) { struct bio * bio = preq->bl.head; preq->bl.head = bio->bi_next; @@ -2530,9 +2535,8 @@ restart: top_delta = ploop_top_delta(plo); sbl.head = sbl.tail = preq->aux_bio; - /* Relocated data write required sync before BAT updatee */ - set_bit(PLOOP_REQ_FORCE_FUA, >state); - + /* Relocated data write required sync before BAT updatee +* this will 
happen inside index_update */ if (test_bit(PLOOP_REQ_RELOC_S, >state)) { preq->eng_state = PLOOP_E_DATA_WBI; plo->st.bio_out++; diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c index 3a6365d..c17e598 100644 --- a/drivers/block/ploop/map.c +++ b/drivers/block/ploop/map.c @@ -896,6 +896,7 @@ void ploop_index_update(struct ploop_request * preq) struct ploop_device * plo =
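A minimal sketch of the ordering argument made in the reply above, assuming a toy model of a disk with a volatile write cache. Nothing here is ploop or block-layer code: toy_disk, toy_write() and the TOY_FLUSH/TOY_FUA bits are hypothetical names that only mimic the semantics described in the reply (FUA makes that one write durable; FLUSH drains everything completed earlier). Block 0 plays the role of the relocated data (W2) and block 1 the role of the BAT index (WBI).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_BLOCKS 4

/* Hypothetical flag values; the real REQ_FLUSH/REQ_FUA bits differ. */
#define TOY_FLUSH (1u << 0) /* drain the volatile cache before this write */
#define TOY_FUA   (1u << 1) /* this write itself must reach stable media */

struct toy_disk {
	int media[TOY_BLOCKS];  /* survives power loss */
	int cache[TOY_BLOCKS];  /* volatile write cache */
	bool dirty[TOY_BLOCKS]; /* cache slot holds data not yet on media */
};

static void toy_flush_cache(struct toy_disk *d)
{
	for (int i = 0; i < TOY_BLOCKS; i++) {
		if (d->dirty[i]) {
			d->media[i] = d->cache[i];
			d->dirty[i] = false;
		}
	}
}

/* Without FUA a write "completes" as soon as it sits in the cache. */
static void toy_write(struct toy_disk *d, int blk, int val, unsigned flags)
{
	if (flags & TOY_FLUSH)
		toy_flush_cache(d);
	if (flags & TOY_FUA) {
		d->media[blk] = val;
		d->dirty[blk] = false;
	} else {
		d->cache[blk] = val;
		d->dirty[blk] = true;
	}
}

/* Power loss: whatever is still only in the cache is gone. */
static void toy_power_loss(struct toy_disk *d)
{
	memset(d->dirty, 0, sizeof(d->dirty));
}

/*
 * Write the data, then the index with the given flags, then crash.
 * The on-disk state is consistent if an index that reached the media
 * never points at data that did not.
 */
static bool toy_reloc_consistent_after_crash(unsigned index_flags)
{
	struct toy_disk d = { 0 };

	toy_write(&d, 0, 42, 0);          /* W2: relocated data */
	toy_write(&d, 1, 1, index_flags); /* WBI: index update */
	toy_power_loss(&d);

	return d.media[1] != 1 || d.media[0] == 42;
}
```

In this model toy_reloc_consistent_after_crash(TOY_FUA) is false (the index is durable, the data is not), while TOY_FLUSH | TOY_FUA gives a consistent result, which is the "FLUSH:WB, WBI:FUA" sequence argued for above.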
Re: [Devel] memcg: mem_cgroup_uncharge_page() kernel panic/lockup
Hi, Vladimir! Thanks for a quick response. I created JIRA issue and uploaded the dumps. All the information is included into JIRA issue: https://bugs.openvz.org/browse/OVZ-6756 On Wed, Jun 15, 2016 at 11:47 AM, Vladimir Davydovwrote: > Hi, > > Thanks for the report. > > Could you please > > - file a bug to bugzilla.openvz.org > > - upload the vmcore at >rsync://fe.sw.ru/f837d67c8e2ade8cee3367cb0f880268/ > > On Mon, Jun 13, 2016 at 09:24:33AM +0300, Anatoly Stepanov wrote: >> Hello everyone! >> >> We encounter an issue with mem_cgroup_uncharge_page() function, >> it appears quite often on our clients servers. >> >> Basically the issue sometimes leads to hard-lockup, sometimes to GP fault. >> >> Based on bug reports from clients, the problem shows up when a user >> process calls "execve" or "exit" syscalls. >> As we know in those cases kernel invokes "uncharging" for every page >> when its unmapped from all the mm's. >> >> Kernel dump analysis shows that at the moment of >> mem_cgroup_uncharge_page() "memcg" pointer >> (taken from page_cgroup) seems to be pointing to some random memory area. >> >> On the other hand, if we look at current->mm->css, then memcg instance >> exists and is "online". >> >> This led me to a thought that "page_cgroup->memcg" may be changed by >> some part of memcg code in parallel. >> As far as i understand, the only option here is "reclaim code path" >> (may be i'm wrong) >> >> So, i suppose there might be a race between "memcg uncharge code" and >> "memcg reclaim code". 
>> >> Please, give me your thoughts about it >> thanks >> >> P.S.: >> >> Additional info: >> >> Kernel: rh7-3.10.0-327.10.1.vz7.12.14 >> >> *1st >> BT >> >> PID: 972445 TASK: 88065d53d8d0 CPU: 0 COMMAND: "httpd" >> #0 [880224f37818] machine_kexec at 8105249b >> #1 [880224f37878] crash_kexec at 81103532 >> #2 [880224f37948] oops_end at 81641628 >> #3 [880224f37970] die at 810184cb >> #4 [880224f379a0] do_general_protection at 81640f24 >> #5 [880224f379d0] general_protection at 81640768 >> [exception RIP: mem_cgroup_charge_statistics+19] >> RIP: 811e7733 RSP: 880224f37a80 RFLAGS: 00010202 >> RAX: RBX: 8807b26f0110 RCX: >> RDX: 79726f6765746163 RSI: ea000c9c0440 RDI: 8806a55662f8 >> RBP: 880224f37a80 R8: R9: 03808000 >> R10: 00b8 R11: ea001eaa8980 R12: ea000c9c0440 >> R13: 0001 R14: R15: 8806a5566000 >> ORIG_RAX: CS: 0010 SS: 0018 >> #6 [880224f37a88] __mem_cgroup_uncharge_common at 811e9ddf >> #7 [880224f37ac8] mem_cgroup_uncharge_page at 811ee99a >> #8 [880224f37ad8] page_remove_rmap at 811b9ec9 >> #9 [880224f37b10] unmap_page_range at 811ab580 >> #10 [880224f37bf8] unmap_single_vma at 811aba11 >> #11 [880224f37c30] unmap_vmas at 811ace79 >> #12 [880224f37c68] exit_mmap at 811b663c >> #13 [880224f37d18] mmput at 8107853b >> #14 [880224f37d38] flush_old_exec at 81202547 >> #15 [880224f37d88] load_elf_binary at 8125883c >> #16 [880224f37e58] search_binary_handler at 81201c25 >> #17 [880224f37ea0] do_execve_common at 812032b7 >> #18 [880224f37f30] sys_execve at 81203619 >> #19 [880224f37f50] stub_execve at 81649369 >> RIP: 7f54284b3287 RSP: 7ffda57a0698 RFLAGS: 0297 >> RAX: 003b RBX: 037c5fe8 RCX: >> RDX: 037cf3f8 RSI: 037ce5f8 RDI: 7f5425fcabf1 >> RBP: 7ffda57a0750 R8: 0001 R9: >> >> >> ***2nd >> BT**: >> >> PID: 168440 TASK: 88001e31cc20 CPU: 18 COMMAND: "httpd" >> #0 [88007255f838] machine_kexec at 8105249b >> #1 [88007255f898] crash_kexec at 81103532 >> #2 [88007255f968] oops_end at 81641628 >> #3 [88007255f990] no_context at 8163222b >> #4 [88007255f9e0] 
__bad_area_nosemaphore at 816322c1 >> #5 [88007255fa30] bad_area_nosemaphore at 8163244a >> #6 [88007255fa40] __do_page_fault at 8164443e >> #7 [88007255faa0] trace_do_page_fault at 81644673 >> #8 [88007255fad8] do_async_page_fault at 81643d59 >> #9 [88007255faf0] async_page_fault at 816407f8 >> [exception RIP: memcg_check_events+435] >> RIP: 811e9b53 RSP: 88007255fba0 RFLAGS: 00010246 >> RAX: f81ef81e RBX: 8802106d5000 RCX: >> RDX: f81e RSI: 0002 RDI:
Re: [Devel] [PATCH 1/3] ploop: skip redundant fsync for REQ_FUA in post_submit
ACK-ed, but see a minor nit below

On 06/15/2016 07:49 AM, Dmitry Monakhov wrote:
> Signed-off-by: Dmitry Monakhov
> ---
>  drivers/block/ploop/io_direct.c | 22 +-
>  1 file changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index b844a80..74a554a 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -517,16 +517,18 @@ dio_post_submit(struct ploop_io *io, struct ploop_request * preq)
>  	struct ploop_device *plo = preq->plo;
>  	sector_t sec = (sector_t)preq->iblock << preq->plo->cluster_log;
>  	loff_t clu_siz = 1 << (preq->plo->cluster_log + 9);
> +	int force_sync = preq->req_rw & REQ_FUA;
>  	int err;
>
>  	file_start_write(io->files.file);
>
> -	/* Here io->io_count is even ... */
> -	spin_lock_irq(&plo->lock);
> -	io->io_count++;
> -	set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
> -	spin_unlock_irq(&plo->lock);
> -
> +	if (!force_sync) {
> +		/* Here io->io_count is even ... */
> +		spin_lock_irq(&plo->lock);
> +		io->io_count++;
> +		set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
> +		spin_unlock_irq(&plo->lock);
> +	}
>  	err = io->files.file->f_op->fallocate(io->files.file,
>  					      FALLOC_FL_CONVERT_UNWRITTEN,
>  					      (loff_t)sec << 9, clu_siz);
> @@ -535,9 +537,11 @@ dio_post_submit(struct ploop_io *io, struct ploop_request * preq)
>  	if (!err && (preq->req_rw & REQ_FUA))

s/(preq->req_rw & REQ_FUA)/force_sync

Thanks,
Max

>  		err = io->ops->sync(io);
>
> -	spin_lock_irq(&plo->lock);
> -	io->io_count++;
> -	spin_unlock_irq(&plo->lock);
> +	if (!force_sync) {
> +		spin_lock_irq(&plo->lock);
> +		io->io_count++;
> +		spin_unlock_irq(&plo->lock);
> +	}
>  	/* and here io->io_count is even (+2) again. */
>  	file_end_write(io->files.file);

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH 2/3] ploop: deadcode cleanup
Acked-by: Maxim Patlasov

On 06/15/2016 07:49 AM, Dmitry Monakhov wrote:
> (rw & REQ_FUA) branch is impossible because REQ_FUA was cleared on the line above.
> The logic was moved to ploop_req_delay_fua_possible() a long time ago.
>
> Signed-off-by: Dmitry Monakhov
> ---
>  drivers/block/ploop/io_direct.c | 9 -
>  1 file changed, 9 deletions(-)
>
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index 74a554a..10d2314 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -108,15 +108,6 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
>  	rw &= ~(REQ_FLUSH | REQ_FUA);
>
> -	/* In case of eng_state != COMPLETE, we'll do FUA in
> -	 * ploop_index_update(). Otherwise, we should mark
> -	 * last bio as FUA here. */
> -	if (rw & REQ_FUA) {
> -		rw &= ~REQ_FUA;
> -		if (preq->eng_state == PLOOP_E_COMPLETE)
> -			postfua = 1;
> -	}
> -
>  	bio_list_init();
>
>  	if (iblk == PLOOP_ZERO_INDEX)
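A small sketch of why the removed branch was dead, assuming made-up bit values (TOY_REQ_FLUSH, TOY_REQ_FUA and TOY_REQ_WRITE are hypothetical and not the kernel's definitions). It mirrors only the shape of the removed code: the mask is applied first, so the later test of the same bit can never succeed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values, for illustration only. */
#define TOY_REQ_FLUSH (1UL << 0)
#define TOY_REQ_FUA   (1UL << 1)
#define TOY_REQ_WRITE (1UL << 2)

/* Same shape as the removed code: clear the bits, then test one of them. */
static bool toy_postfua_after_mask(unsigned long rw)
{
	bool postfua = false;

	rw &= ~(TOY_REQ_FLUSH | TOY_REQ_FUA);
	if (rw & TOY_REQ_FUA) /* never true: the bit was just cleared */
		postfua = true;

	return postfua;
}
```

Whatever flags the caller passes in, toy_postfua_after_mask() returns false, so the branch and the variable it sets can go.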
[Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling
barrier code is broken in many ways:
Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly.
But a request can also go through ->dio_submit_alloc()->dio_submit_pad and
write_page (for indexes).
So in case of grow_dev we have the following sequence:

E_RELOC_DATA_READ:
     ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
     ->delta->allocate
     ->io->submit_alloc: dio_submit_alloc
     ->dio_submit_pad
E_DATA_WBI : data written, time to update index
     ->delta->allocate_complete:ploop_index_update
     ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
     ->write_page
     ->ploop_map_wb_complete
     ->ploop_wb_complete_post_process
     ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
E_RELOC_NULLIFY:
     ->submit()

This patch unifies barrier handling as follows:
 - Add an assertion to ploop_complete_request for FORCE_{FLUSH,FUA} state
 - Perform explicit FUA inside index_update for RELOC requests.

This makes the reloc sequence optimal:
RELOC_S: R1, W2, WBI:FUA
RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA

https://jira.sw.ru/browse/PSBM-47107
Signed-off-by: Dmitry Monakhov
---
 drivers/block/ploop/dev.c | 10 +++---
 drivers/block/ploop/map.c | 29 -
 2 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 96f7850..998fe71 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -1224,6 +1224,11 @@ static void ploop_complete_request(struct ploop_request * preq)

 	__TRACE("Z %p %u\n", preq, preq->req_cluster);

+	if (!preq->error) {
+		unsigned long state = READ_ONCE(preq->state);
+		WARN_ON(state & (1 << PLOOP_REQ_FORCE_FUA));
+		WARN_ON(state & (1 << PLOOP_REQ_FORCE_FLUSH));
+	}
 	while (preq->bl.head) {
 		struct bio * bio = preq->bl.head;
 		preq->bl.head = bio->bi_next;
@@ -2530,9 +2535,8 @@ restart:
 		top_delta = ploop_top_delta(plo);
 		sbl.head = sbl.tail = preq->aux_bio;

-		/* Relocated data write required sync before BAT updatee */
-		set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
-
+		/* Relocated data write required sync before BAT updatee
+		 * this will happen inside index_update */
 		if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
 			preq->eng_state = PLOOP_E_DATA_WBI;
 			plo->st.bio_out++;
diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
index 3a6365d..c17e598 100644
--- a/drivers/block/ploop/map.c
+++ b/drivers/block/ploop/map.c
@@ -896,6 +896,7 @@ void ploop_index_update(struct ploop_request * preq)
 	struct ploop_device * plo = preq->plo;
 	struct map_node * m = preq->map;
 	struct ploop_delta * top_delta = map_top_delta(m->parent);
+	int fua = !!(preq->req_rw & REQ_FUA);
 	u32 idx;
 	map_index_t blk;
 	int old_level;
@@ -953,13 +954,13 @@ void ploop_index_update(struct ploop_request * preq)
 	__TRACE("wbi %p %u %p\n", preq, preq->req_cluster, m);
 	plo->st.map_single_writes++;
 	top_delta->ops->map_index(top_delta, m->mn_start, &sec);
-	/* Relocate requires consistent writes, mark such reqs appropriately */
+	/* Relocate requires consistent index update */
 	if (test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
 	    test_bit(PLOOP_REQ_RELOC_S, &preq->state))
-		set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
-
-	top_delta->io.ops->write_page(&top_delta->io, preq, page, sec,
-				      !!(preq->req_rw & REQ_FUA));
+		fua = 1;
+	if (fua)
+		clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
+	top_delta->io.ops->write_page(&top_delta->io, preq, page, sec, fua);
 	put_page(page);
 	return;

@@ -1078,7 +1079,7 @@ static void map_wb_complete(struct map_node * m, int err)
 	int delayed = 0;
 	unsigned int idx;
 	sector_t sec;
-	int fua, force_fua;
+	int fua;

 	/* First, complete processing of written back indices,
 	 * finally instantiate indices in mapping cache.
@@ -1149,7 +1150,6 @@ static void map_wb_complete(struct map_node * m, int err)

 	main_preq = NULL;
 	fua = 0;
-	force_fua = 0;

 	list_for_each_safe(cursor, tmp, &m->io_queue) {
 		struct ploop_request * preq;
@@ -1168,13 +1168,12 @@ static void map_wb_complete(struct map_node * m, int err)
 				break;
 			}

-			if (preq->req_rw & REQ_FUA)
+			if (preq->req_rw & REQ_FUA ||
+			    test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
+			    test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
+
[Devel] [PATCH 2/3] ploop: deadcode cleanup
(rw & REQ_FUA) branch is impossible because REQ_FUA was cleared on the line above.
The logic was moved to ploop_req_delay_fua_possible() a long time ago.

Signed-off-by: Dmitry Monakhov
---
 drivers/block/ploop/io_direct.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 74a554a..10d2314 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -108,15 +108,6 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
 	rw &= ~(REQ_FLUSH | REQ_FUA);

-	/* In case of eng_state != COMPLETE, we'll do FUA in
-	 * ploop_index_update(). Otherwise, we should mark
-	 * last bio as FUA here. */
-	if (rw & REQ_FUA) {
-		rw &= ~REQ_FUA;
-		if (preq->eng_state == PLOOP_E_COMPLETE)
-			postfua = 1;
-	}
-
 	bio_list_init();

 	if (iblk == PLOOP_ZERO_INDEX)
-- 
1.8.3.1
[Devel] [PATCH 1/3] ploop: skip redundant fsync for REQ_FUA in post_submit
Signed-off-by: Dmitry Monakhov
---
 drivers/block/ploop/io_direct.c | 22 +-
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index b844a80..74a554a 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -517,16 +517,18 @@ dio_post_submit(struct ploop_io *io, struct ploop_request * preq)
 	struct ploop_device *plo = preq->plo;
 	sector_t sec = (sector_t)preq->iblock << preq->plo->cluster_log;
 	loff_t clu_siz = 1 << (preq->plo->cluster_log + 9);
+	int force_sync = preq->req_rw & REQ_FUA;
 	int err;

 	file_start_write(io->files.file);

-	/* Here io->io_count is even ... */
-	spin_lock_irq(&plo->lock);
-	io->io_count++;
-	set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
-	spin_unlock_irq(&plo->lock);
-
+	if (!force_sync) {
+		/* Here io->io_count is even ... */
+		spin_lock_irq(&plo->lock);
+		io->io_count++;
+		set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
+		spin_unlock_irq(&plo->lock);
+	}
 	err = io->files.file->f_op->fallocate(io->files.file,
 					      FALLOC_FL_CONVERT_UNWRITTEN,
 					      (loff_t)sec << 9, clu_siz);
@@ -535,9 +537,11 @@ dio_post_submit(struct ploop_io *io, struct ploop_request * preq)
 	if (!err && (preq->req_rw & REQ_FUA))
 		err = io->ops->sync(io);

-	spin_lock_irq(&plo->lock);
-	io->io_count++;
-	spin_unlock_irq(&plo->lock);
+	if (!force_sync) {
+		spin_lock_irq(&plo->lock);
+		io->io_count++;
+		spin_unlock_irq(&plo->lock);
+	}
 	/* and here io->io_count is even (+2) again. */
 	file_end_write(io->files.file);
-- 
1.8.3.1
[Devel] [NEW KERNEL] 3.10.0-327.18.2.vz7.14.15 (rhel7)
Changelog:

OpenVZ kernel rh7-3.10.0-327.18.2.vz7.14.15

* vtty: Container's offline console can be opened before a Container start, survives start/stop/cpt/rst cycles
* fs: deny umounting rootfs
* sysrq: correct the fix to avoid cpu soft lockups on long print triggered by sysrq
* ploop: fix gendisk disk_stats to be seen on a partition
* module licenses and authors cleanup

Generated changelog:

* Wed Jun 15 2016 Konstantin Khorenko [3.10.0-327.18.2.vz7.14.15]
- ve/vtty: Don't free console mapping until no clients left (Cyrill Gorcunov) [PSBM-39463]
- fs: do not allow rootfs umount (Vasily Averin) [PSBM-46437]
- ms/kernel/sysrq: restore touch_nmi_watchdog() in show_state_filter() (Andrey Ryabinin) [PSBM-47486]
- ploop: fix gendisk disk_stats to be seen on partition (Maxim Patlasov) [PSBM-48266]
- modules: set module author for Virtuozzo modules (Konstantin Khorenko) [PSBM-43847]
- ploop: "Parallels loopback device" -> "Virtuozzo loopback device" (Konstantin Khorenko) [PSBM-43847]
- license: put correct copyrights into file headers (Konstantin Khorenko) [PSBM-43847]
- license: drop COPYING.Parallels file (Konstantin Khorenko) [PSBM-43847]

Built packages: http://kojistorage.eng.sw.ru/packages/vzkernel/3.10.0/327.18.2.vz7.14.15/
[Devel] [PATCH RHEL7 COMMIT] ve/vtty: Don't free console mapping until no clients left
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.14 --> commit a9532c96e6a0c64fbb4a128ba6ca99b9081e85cc Author: Cyrill GorcunovDate: Wed Jun 15 13:32:23 2016 +0400 ve/vtty: Don't free console mapping until no clients left Currently on container's stop we free vtty mapping in a force way so that if there is active console hooked from the node it become unusable since then. It was easier to work with when we've been reworking virtual console code. Now lets make console fully functional as it was in pcs6: when opened it must survive container start/stop cycle and checkpoint/restore as well. For this sake we: - drop ve_hook code, it no longer needed - free console @map on final close of the last tty opened https://jira.sw.ru/browse/PSBM-39463 Signed-off-by: Cyrill Gorcunov Reviewed-by: Vladimir Davydov CC: Konstantin Khorenko CC: Igor Sukhih CC: Pavel Emelyanov --- drivers/tty/pty.c | 48 ++-- kernel/ve/vecalls.c | 6 +++--- 2 files changed, 17 insertions(+), 37 deletions(-) diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c index a68102b..1644fdf 100644 --- a/drivers/tty/pty.c +++ b/drivers/tty/pty.c @@ -901,6 +901,13 @@ static void vtty_map_set(vtty_map_t *map, struct tty_struct *tty) map->vttys[tty->index] = tty; } +static void vtty_map_free(vtty_map_t *map) +{ + lockdep_assert_held(_mutex); + idr_remove(_idr, map->veid); + kfree(map); +} + static void vtty_map_clear(struct tty_struct *tty) { vtty_map_t *map = tty->driver_data; @@ -908,28 +915,20 @@ static void vtty_map_clear(struct tty_struct *tty) lockdep_assert_held(_mutex); if (map) { struct tty_struct *p = map->vttys[tty->index]; + int i; WARN_ON(p != (tty->driver == vttys_driver ? 
tty : tty->link)); map->vttys[tty->index] = NULL; tty->driver_data = tty->link->driver_data = NULL; - } -} -static void vtty_map_free(vtty_map_t *map) -{ - int i; - - lockdep_assert_held(_mutex); + for (i = 0; i < MAX_NR_VTTY_CONSOLES; i++) { + if (map->vttys[i]) + break; + } - for (i = 0; i < MAX_NR_VTTY_CONSOLES; i++) { - struct tty_struct *tty = map->vttys[i]; - if (!tty) - continue; - tty->driver_data = tty->link->driver_data = NULL; + if (i >= MAX_NR_VTTY_CONSOLES) + vtty_map_free(map); } - - idr_remove(_idr, map->veid); - kfree(map); } static vtty_map_t *vtty_map_alloc(envid_t veid) @@ -1209,24 +1208,6 @@ void vtty_release(struct tty_struct *tty, struct tty_struct *o_tty, *o_tty_closing = 0; } -static void ve_vtty_fini(void *data) -{ - struct ve_struct *ve = data; - vtty_map_t *map; - - mutex_lock(_mutex); - map = vtty_map_lookup(ve->veid); - if (map) - vtty_map_free(map); - mutex_unlock(_mutex); -} - -static struct ve_hook vtty_hook = { - .fini = ve_vtty_fini, - .priority = HOOK_PRIO_DEFAULT, - .owner = THIS_MODULE, -}; - static int __init vtty_init(void) { #define VTTY_DRIVER_ALLOC_FLAGS\ @@ -1279,7 +1260,6 @@ static int __init vtty_init(void) if (tty_register_driver(vttys_driver)) panic(pr_fmt("Can't register slave vtty driver\n")); - ve_hook_register(VE_SS_CHAIN, _hook); tty_default_fops(_fops); return 0; } diff --git a/kernel/ve/vecalls.c b/kernel/ve/vecalls.c index 457d690..5aa9722 100644 --- a/kernel/ve/vecalls.c +++ b/kernel/ve/vecalls.c @@ -990,6 +990,9 @@ static int ve_configure(envid_t veid, unsigned int key, struct ve_struct *ve; int err = -ENOKEY; + if (key == VE_CONFIGURE_OPEN_TTY) + return vtty_open_master(veid, val); + ve = get_ve_by_id(veid); if (!ve) return -EINVAL; @@ -998,9 +1001,6 @@ static int ve_configure(envid_t veid, unsigned int key, case VE_CONFIGURE_OS_RELEASE: err = init_ve_osrelease(ve, data); break; - case VE_CONFIGURE_OPEN_TTY: - err = vtty_open_master(ve->veid, val); - break; } put_ve(ve); ___ Devel mailing list 
Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
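A minimal user-space sketch of the "free the console map on the final close" rule from the vtty commit above. The names (toy_map, toy_map_clear(), TOY_MAX_CONSOLES) are hypothetical stand-ins for vtty_map_t, vtty_map_clear() and MAX_NR_VTTY_CONSOLES; locking and the idr bookkeeping are left out, and the slot count is an arbitrary assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_MAX_CONSOLES 12 /* arbitrary; stands in for MAX_NR_VTTY_CONSOLES */

struct toy_map {
	void *slots[TOY_MAX_CONSOLES]; /* non-NULL while a tty uses the slot */
};

static bool toy_map_in_use(const struct toy_map *map)
{
	for (size_t i = 0; i < TOY_MAX_CONSOLES; i++) {
		if (map->slots[i])
			return true;
	}
	return false;
}

/*
 * Drop one slot. The map is freed only when no slot is left in use;
 * returns true if this call freed it (the caller must not touch it after).
 */
static bool toy_map_clear(struct toy_map *map, size_t index)
{
	map->slots[index] = NULL;
	if (toy_map_in_use(map))
		return false;
	free(map);
	return true;
}

/* Usage: open nopen consoles, close them in order, count closes until free. */
static int toy_closes_until_free(int nopen)
{
	static int dummy_tty; /* any non-NULL address marks a slot as used */
	struct toy_map *map = calloc(1, sizeof(*map));
	int closes = 0;

	if (!map || nopen < 1 || nopen > TOY_MAX_CONSOLES) {
		free(map);
		return -1;
	}
	for (int i = 0; i < nopen; i++)
		map->slots[i] = &dummy_tty;
	for (int i = 0; i < nopen; i++) {
		closes++;
		if (toy_map_clear(map, (size_t)i))
			break;
	}
	return closes;
}
```

With three consoles open the map survives the first two closes and goes away on the third, which is the behaviour the commit message describes for a console that stays open across container stop.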
[Devel] [PATCH RHEL7 COMMIT] fs: do not allow rootfs umount
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.14
-->
commit 0ae9e4e30b14404b570f62f83220637506be6376
Author: Vasily Averin
Date:   Wed Jun 15 13:16:38 2016 +0400

    fs: do not allow rootfs umount

    In mainline, rootfs is always marked as MNT_LOCKED; sys_umount checks
    this flag and refuses to proceed.
    Our kernels lack the MNT_LOCKED flag, so we use another kind of check
    to prevent incorrect operation.

    v2: use mnt_has_parent()

    https://jira.sw.ru/browse/PSBM-46437
    Signed-off-by: Vasily Averin
    Acked-by: Andrey Vagin
---
 fs/namespace.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index 988320b..4fb935a 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1355,6 +1355,8 @@ SYSCALL_DEFINE2(umount, char __user *, name, int, flags)
 		goto dput_and_out;
 	if (!check_mnt(mnt))
 		goto dput_and_out;
+	if (!mnt_has_parent(mnt))
+		goto dput_and_out;

 	retval = do_umount(mnt, flags);
 dput_and_out:
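A toy illustration of the check added above, assuming the usual convention that a root mount is its own parent. toy_mount, toy_mnt_has_parent() and toy_umount_check() are hypothetical user-space names, and the -EINVAL return value is an assumption about what the refused umount reports, not something stated in the commit.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct toy_mount {
	struct toy_mount *parent; /* a root mount points at itself */
};

static bool toy_mnt_has_parent(const struct toy_mount *mnt)
{
	return mnt != mnt->parent;
}

/* Returns 0 if the umount may proceed, -EINVAL (assumed) if it is refused. */
static int toy_umount_check(const struct toy_mount *mnt)
{
	if (!toy_mnt_has_parent(mnt))
		return -EINVAL;
	return 0;
}

/* Two example mounts: the root of the tree and one mounted under it. */
static struct toy_mount toy_rootfs = { .parent = &toy_rootfs };
static struct toy_mount toy_child  = { .parent = &toy_rootfs };
```

Here toy_umount_check(&toy_rootfs) is refused while toy_umount_check(&toy_child) is allowed, which is the effect the two added lines are meant to have in sys_umount.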
Re: [Devel] memcg: mem_cgroup_uncharge_page() kernel panic/lockup
Hi, Thanks for the report. Could you please - file a bug to bugzilla.openvz.org - upload the vmcore at rsync://fe.sw.ru/f837d67c8e2ade8cee3367cb0f880268/ On Mon, Jun 13, 2016 at 09:24:33AM +0300, Anatoly Stepanov wrote: > Hello everyone! > > We encounter an issue with mem_cgroup_uncharge_page() function, > it appears quite often on our clients servers. > > Basically the issue sometimes leads to hard-lockup, sometimes to GP fault. > > Based on bug reports from clients, the problem shows up when a user > process calls "execve" or "exit" syscalls. > As we know in those cases kernel invokes "uncharging" for every page > when its unmapped from all the mm's. > > Kernel dump analysis shows that at the moment of > mem_cgroup_uncharge_page() "memcg" pointer > (taken from page_cgroup) seems to be pointing to some random memory area. > > On the other hand, if we look at current->mm->css, then memcg instance > exists and is "online". > > This led me to a thought that "page_cgroup->memcg" may be changed by > some part of memcg code in parallel. > As far as i understand, the only option here is "reclaim code path" > (may be i'm wrong) > > So, i suppose there might be a race between "memcg uncharge code" and > "memcg reclaim code". 
> > Please, give me your thoughts about it > thanks > > P.S.: > > Additional info: > > Kernel: rh7-3.10.0-327.10.1.vz7.12.14 > > *1st > BT > > PID: 972445 TASK: 88065d53d8d0 CPU: 0 COMMAND: "httpd" > #0 [880224f37818] machine_kexec at 8105249b > #1 [880224f37878] crash_kexec at 81103532 > #2 [880224f37948] oops_end at 81641628 > #3 [880224f37970] die at 810184cb > #4 [880224f379a0] do_general_protection at 81640f24 > #5 [880224f379d0] general_protection at 81640768 > [exception RIP: mem_cgroup_charge_statistics+19] > RIP: 811e7733 RSP: 880224f37a80 RFLAGS: 00010202 > RAX: RBX: 8807b26f0110 RCX: > RDX: 79726f6765746163 RSI: ea000c9c0440 RDI: 8806a55662f8 > RBP: 880224f37a80 R8: R9: 03808000 > R10: 00b8 R11: ea001eaa8980 R12: ea000c9c0440 > R13: 0001 R14: R15: 8806a5566000 > ORIG_RAX: CS: 0010 SS: 0018 > #6 [880224f37a88] __mem_cgroup_uncharge_common at 811e9ddf > #7 [880224f37ac8] mem_cgroup_uncharge_page at 811ee99a > #8 [880224f37ad8] page_remove_rmap at 811b9ec9 > #9 [880224f37b10] unmap_page_range at 811ab580 > #10 [880224f37bf8] unmap_single_vma at 811aba11 > #11 [880224f37c30] unmap_vmas at 811ace79 > #12 [880224f37c68] exit_mmap at 811b663c > #13 [880224f37d18] mmput at 8107853b > #14 [880224f37d38] flush_old_exec at 81202547 > #15 [880224f37d88] load_elf_binary at 8125883c > #16 [880224f37e58] search_binary_handler at 81201c25 > #17 [880224f37ea0] do_execve_common at 812032b7 > #18 [880224f37f30] sys_execve at 81203619 > #19 [880224f37f50] stub_execve at 81649369 > RIP: 7f54284b3287 RSP: 7ffda57a0698 RFLAGS: 0297 > RAX: 003b RBX: 037c5fe8 RCX: > RDX: 037cf3f8 RSI: 037ce5f8 RDI: 7f5425fcabf1 > RBP: 7ffda57a0750 R8: 0001 R9: > > > ***2nd > BT**: > > PID: 168440 TASK: 88001e31cc20 CPU: 18 COMMAND: "httpd" > #0 [88007255f838] machine_kexec at 8105249b > #1 [88007255f898] crash_kexec at 81103532 > #2 [88007255f968] oops_end at 81641628 > #3 [88007255f990] no_context at 8163222b > #4 [88007255f9e0] __bad_area_nosemaphore at 816322c1 > #5 [88007255fa30] 
bad_area_nosemaphore at 8163244a > #6 [88007255fa40] __do_page_fault at 8164443e > #7 [88007255faa0] trace_do_page_fault at 81644673 > #8 [88007255fad8] do_async_page_fault at 81643d59 > #9 [88007255faf0] async_page_fault at 816407f8 > [exception RIP: memcg_check_events+435] > RIP: 811e9b53 RSP: 88007255fba0 RFLAGS: 00010246 > RAX: f81ef81e RBX: 8802106d5000 RCX: > RDX: f81e RSI: 0002 RDI: 8807aa2642e8 > RBP: 88007255fbf0 R8: 0202 R9: > R10: 0010 R11: 88007255ffd8 R12: 8807aa2642e0 > R13: 0410 R14: 8802073de700 R15: 8802106d5000 > ORIG_RAX: CS: 0010 SS: 0018 > #10 [88007255fbf8] __mem_cgroup_uncharge_common at
Re: [Devel] [PATCH rh7 0/6] ploop: push_backup: implement expiration timeout
Dima, please review the patchset.

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 06/15/2016 03:50 AM, Maxim Patlasov wrote:
> If a ploop request waits for userspace backup tool attention for more than
> plo->tune.push_backup_timeout (42 secs by default), the whole push_backup
> operation is aborted and the initial CBT mask is merged back to CBT.
>
> https://jira.sw.ru/browse/PSBM-48082
>
> ---
>
> Maxim Patlasov (6):
>       ploop: push_backup: introduce pb_set structure
>       ploop: push_backup: factor rb_erase() out
>       ploop: push_backup: extend pb_set
>       ploop: push_backup: add timeout tunable
>       ploop: push_backup: health monitor thread
>       ploop: push_backup: implement timeout functions
>
>  drivers/block/ploop/push_backup.c | 261 +
>  drivers/block/ploop/sysfs.c |2
>  include/linux/ploop/ploop.h |4 -
>  3 files changed, 240 insertions(+), 27 deletions(-)
>
> --
> Signature
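The expiration rule in the cover letter above can be sketched as a plain time comparison. This is only an illustration under assumptions: toy_pb_expired() and TOY_PB_TIMEOUT_SECS are hypothetical names, times are in seconds, and how the real series tracks timestamps is not shown in this message.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PB_TIMEOUT_SECS 42UL /* the default quoted in the cover letter */

/*
 * A request has expired if it has waited for more than timeout seconds.
 * Unsigned subtraction keeps the comparison valid if the clock wraps.
 */
static bool toy_pb_expired(unsigned long now, unsigned long queued_at,
			   unsigned long timeout)
{
	return now - queued_at > timeout;
}
```

A request queued at second 58 is still fine at second 100 (exactly 42 seconds of waiting) and counts as expired at second 101, at which point the whole push_backup operation would be aborted.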
[Devel] [PATCH RHEL7 COMMIT] ms/kernel/sysrq: restore touch_nmi_watchdog() in show_state_filter()
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.14
-->
commit 3eada65cb70c5f2c773fcac7aabc72de5f768bac
Author: Andrey Ryabinin
Date:   Wed Jun 15 12:41:37 2016 +0400

    ms/kernel/sysrq: restore touch_nmi_watchdog() in show_state_filter()

    Commit 60c21d9f08bf ("kernel/sysrq: reset watchdog on all cpus while
    during sysrq-w") shouldn't have removed the touch_nmi_watchdog() call,
    because touch_all_softlockup_watchdogs() resets only softlockup
    watchdogs, but doesn't reset the NMI watchdog used in the hard lockup
    detector. So, bring it back.

    Plus, remove the second touch_all_softlockup_watchdogs() call, which
    becomes redundant, and add a comment.

    This patch is the delta between v2 and v1 of the upstream patch:
    http://lkml.kernel.org/g/1465474805-14641-1-git-send-email-aryabi...@virtuozzo.com

    https://jira.sw.ru/browse/PSBM-47486

    Fixes: 60c21d9f08bf ("kernel/sysrq: reset watchdog on all cpus while during sysrq-w")
    Signed-off-by: Andrey Ryabinin
---
 kernel/sched/core.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d21ccf0..1a3ff8c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5176,14 +5176,16 @@ void show_state_filter(unsigned long state_filter)
 		/*
 		 * reset the NMI-timeout, listing all files on a slow
 		 * console might take a lot of time:
+		 * Also, reset softlockup watchdogs on all CPUs, because
+		 * another CPU might be blocked waiting for us to process
+		 * an IPI.
 		 */
+		touch_nmi_watchdog();
 		touch_all_softlockup_watchdogs();
 		if (!state_filter || (p->state & state_filter))
 			sched_show_task(p);
 	} while_each_thread(g, p);

-	touch_all_softlockup_watchdogs();
-
 #if 0
 	/*
 	 * This results in soft lockups, because it writes too much data to
[Devel] [PATCH RHEL7 COMMIT] ploop: fix gendisk disk_stats to be seen on partition
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.14
-->
commit fc4dd8e6d4df14e5e09b6dacac74fe903d95c929
Author: Maxim Patlasov
Date:   Wed Jun 15 12:38:54 2016 +0400

    ploop: fix gendisk disk_stats to be seen on partition

    Before this patch, any I/O on top of /dev/ploopNp1 was always accounted
    to the main partition (/sys/block/ploopN/stat). The counters for p1
    remained zero.

    The patch fixes the problem by looking up the partition properly.

    https://jira.sw.ru/browse/PSBM-48266

    Signed-off-by: Maxim Patlasov
---
 drivers/block/ploop/dev.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index f87209d..01a5189 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -800,6 +800,7 @@ static void ploop_make_request(struct request_queue *q, struct bio *bio)
 	struct bio * nbio;
 	struct ploop_device * plo = q->queuedata;
 	unsigned long rw = bio_data_dir(bio);
+	struct hd_struct *part;
 	int cpu;
 	LIST_HEAD(drop_list);

@@ -811,8 +812,9 @@ static void ploop_make_request(struct request_queue *q, struct bio *bio)
 	BUG_ON(bio->bi_size & 511);

 	cpu = part_stat_lock();
-	part_stat_inc(cpu, &plo->disk->part0, ios[rw]);
-	part_stat_add(cpu, &plo->disk->part0, sectors[rw], bio_sectors(bio));
+	part = disk_map_sector_rcu(plo->disk, bio->bi_sector);
+	part_stat_inc(cpu, part, ios[rw]);
+	part_stat_add(cpu, part, sectors[rw], bio_sectors(bio));
 	part_stat_unlock();

 	if (unlikely(bio->bi_size == 0)) {
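A user-space sketch of the idea behind the fix above: pick the accounting target from the bio's start sector instead of always using part0. All names (toy_gendisk, toy_part, toy_map_sector(), toy_account_io()) are hypothetical, the partition layout is made up, and the sketch charges only the matched entry; whether the whole-disk counters are bumped as well is left out.

```c
#include <assert.h>
#include <stddef.h>

struct toy_part {
	unsigned long start;    /* first sector of the partition */
	unsigned long nr_sects; /* length in sectors */
	unsigned long ios;      /* number of I/Os accounted here */
	unsigned long sectors;  /* number of sectors accounted here */
};

struct toy_gendisk {
	struct toy_part part0;    /* whole-disk fallback */
	struct toy_part parts[4]; /* real partitions */
	size_t nparts;
};

/* Find the partition containing the sector; fall back to part0. */
static struct toy_part *toy_map_sector(struct toy_gendisk *disk,
				       unsigned long sector)
{
	for (size_t i = 0; i < disk->nparts; i++) {
		struct toy_part *p = &disk->parts[i];

		if (sector >= p->start && sector - p->start < p->nr_sects)
			return p;
	}
	return &disk->part0;
}

static void toy_account_io(struct toy_gendisk *disk, unsigned long sector,
			   unsigned long nsect)
{
	struct toy_part *part = toy_map_sector(disk, sector);

	part->ios++;
	part->sectors += nsect;
}

/* Usage: one partition at sectors [2048, 3048); returns 1 if p1 was charged. */
static int toy_demo_charged_p1(unsigned long sector)
{
	struct toy_gendisk disk = { .nparts = 1 };

	disk.parts[0].start = 2048;
	disk.parts[0].nr_sects = 1000;
	toy_account_io(&disk, sector, 8);
	return disk.parts[0].ios == 1;
}
```

An I/O starting at sector 2050 lands in the p1 counters, while sectors outside the partition (5 or 3048 in this layout) fall back to the whole-disk entry.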
[Devel] [PATCH 3/3] netlink/diag: report flags for netlink sockets
We need to know the flags of netlink sockets in order to dump and
restore them. All flags except NDIAG_FLAG_CB_RUNNING can be read with
getsockopt(), but that requires a socket descriptor and a separate
getsockopt() call for each flag.

With these changes we will also be able to show netlink socket flags
from the ss tool.

In criu, we need to know whether a callback is running or not. When
a socket has some data in a receive queue and doesn't have running
callbacks, we can save all data from the receive queue on dump and queue
it back on restore. If a socket has a running callback, the receive queue
contains only a part of the data, and as soon as we read it, the callback
will generate a new portion. In this case, we can't be sure that all of
the data will fit within the buffer limit on restore.

For now we are going to dump and restore only sockets without running
callbacks.
---
 include/uapi/linux/netlink_diag.h | 10 ++++++++++
 net/netlink/af_netlink.c          |  9 ---------
 net/netlink/af_netlink.h          |  9 +++++++++
 net/netlink/diag.c                | 28 +++++++++++++++++++++++++++-
 4 files changed, 46 insertions(+), 10 deletions(-)

diff --git a/include/uapi/linux/netlink_diag.h b/include/uapi/linux/netlink_diag.h
index 4e31db4..6a9108f 100644
--- a/include/uapi/linux/netlink_diag.h
+++ b/include/uapi/linux/netlink_diag.h
@@ -37,6 +37,7 @@ enum {
 	NETLINK_DIAG_GROUPS,
 	NETLINK_DIAG_RX_RING,
 	NETLINK_DIAG_TX_RING,
+	NETLINK_DIAG_FLAGS,

 	__NETLINK_DIAG_MAX,
 };
@@ -48,5 +49,14 @@ enum {
 #define NDIAG_SHOW_MEMINFO	0x0001 /* show memory info of a socket */
 #define NDIAG_SHOW_GROUPS	0x0002 /* show groups of a netlink socket */
 #define NDIAG_SHOW_RING_CFG	0x0004 /* show ring configuration */
+#define NDIAG_SHOW_FLAGS	0x0008 /* show flags of a netlink socket */
+
+/* flags */
+#define NDIAG_FLAG_CB_RUNNING		0x0001
+#define NDIAG_FLAG_PKTINFO		0x0002
+#define NDIAG_FLAG_BROADCAST_ERROR	0x0004
+#define NDIAG_FLAG_NO_ENOBUFS		0x0008
+#define NDIAG_FLAG_LISTEN_ALL_NSID	0x0010
+#define NDIAG_FLAG_CAP_ACK		0x0020

 #endif
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 113e2ae..ba75f32 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -77,15 +77,6 @@ struct listeners {
 /* state bits */
 #define NETLINK_S_CONGESTED		0x0

-/* flags */
-#define NETLINK_F_KERNEL_SOCKET		0x1
-#define NETLINK_F_RECV_PKTINFO		0x2
-#define NETLINK_F_BROADCAST_SEND_ERROR	0x4
-#define NETLINK_F_RECV_NO_ENOBUFS	0x8
-#define NETLINK_F_LISTEN_ALL_NSID	0x10
-#define NETLINK_F_CAP_ACK		0x20
-#define NETLINK_F_REPAIR		0x40
-
 static inline int netlink_is_kernel(struct sock *sk)
 {
 	return nlk_sk(sk)->flags & NETLINK_F_KERNEL_SOCKET;
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index 577fddf..b3ce345 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -4,6 +4,15 @@
 #include
 #include

+/* flags */
+#define NETLINK_F_KERNEL_SOCKET		0x1
+#define NETLINK_F_RECV_PKTINFO		0x2
+#define NETLINK_F_BROADCAST_SEND_ERROR	0x4
+#define NETLINK_F_RECV_NO_ENOBUFS	0x8
+#define NETLINK_F_LISTEN_ALL_NSID	0x10
+#define NETLINK_F_CAP_ACK		0x20
+#define NETLINK_F_REPAIR		0x40
+
 #define NLGRPSZ(x)	(ALIGN(x, sizeof(unsigned long) * 8) / 8)
 #define NLGRPLONGS(x)	(NLGRPSZ(x)/sizeof(unsigned long))

diff --git a/net/netlink/diag.c b/net/netlink/diag.c
index de8c74a..0aa8744e 100644
--- a/net/netlink/diag.c
+++ b/net/netlink/diag.c
@@ -54,6 +54,27 @@ static int sk_diag_dump_groups(struct sock *sk, struct sk_buff *nlskb)
 		       nlk->groups);
 }

+static int sk_diag_put_flags(struct sock *sk, struct sk_buff *skb)
+{
+	struct netlink_sock *nlk = nlk_sk(sk);
+	u32 flags = 0;
+
+	if (nlk->cb_running)
+		flags |= NDIAG_FLAG_CB_RUNNING;
+	if (nlk->flags & NETLINK_F_RECV_PKTINFO)
+		flags |= NDIAG_FLAG_PKTINFO;
+	if (nlk->flags & NETLINK_F_BROADCAST_SEND_ERROR)
+		flags |= NDIAG_FLAG_BROADCAST_ERROR;
+	if (nlk->flags & NETLINK_F_RECV_NO_ENOBUFS)
+		flags |= NDIAG_FLAG_NO_ENOBUFS;
+	if (nlk->flags & NETLINK_F_LISTEN_ALL_NSID)
+		flags |= NDIAG_FLAG_LISTEN_ALL_NSID;
+	if (nlk->flags & NETLINK_F_CAP_ACK)
+		flags |= NDIAG_FLAG_CAP_ACK;
+
+	return nla_put_u32(skb, NETLINK_DIAG_FLAGS, flags);
+}
+
 static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
 			struct netlink_diag_req *req,
 			u32 portid, u32 seq, u32 flags, int sk_ino)
@@ -91,7 +112,12 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
 	    sk_diag_put_rings_cfg(sk, skb))
 		goto out_nlmsg_trim;

-	return nlmsg_end(skb, nlh);
+
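For illustration, a consumer such as ss or criu could decode the u32 carried in the NETLINK_DIAG_FLAGS attribute roughly as follows. This is a sketch, not code from either tool: the NDIAG_FLAG_* values are copied from the patch and defined locally, while ndiag_flags_to_str() and ndiag_queue_is_dumpable() are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Values copied from the patch; defined locally so the sketch builds anywhere. */
#define NDIAG_FLAG_CB_RUNNING		0x0001
#define NDIAG_FLAG_PKTINFO		0x0002
#define NDIAG_FLAG_BROADCAST_ERROR	0x0004
#define NDIAG_FLAG_NO_ENOBUFS		0x0008
#define NDIAG_FLAG_LISTEN_ALL_NSID	0x0010
#define NDIAG_FLAG_CAP_ACK		0x0020

static const struct {
	uint32_t bit;
	const char *name;
} ndiag_flag_names[] = {
	{ NDIAG_FLAG_CB_RUNNING,	"CB_RUNNING" },
	{ NDIAG_FLAG_PKTINFO,		"PKTINFO" },
	{ NDIAG_FLAG_BROADCAST_ERROR,	"BROADCAST_ERROR" },
	{ NDIAG_FLAG_NO_ENOBUFS,	"NO_ENOBUFS" },
	{ NDIAG_FLAG_LISTEN_ALL_NSID,	"LISTEN_ALL_NSID" },
	{ NDIAG_FLAG_CAP_ACK,		"CAP_ACK" },
};

/* Hypothetical helper: write a comma-separated list of set flags into buf. */
static char *ndiag_flags_to_str(uint32_t flags, char *buf, size_t len)
{
	size_t n = sizeof(ndiag_flag_names) / sizeof(ndiag_flag_names[0]);

	assert(buf && len > 0);
	buf[0] = '\0';
	for (size_t i = 0; i < n; i++) {
		size_t used = strlen(buf);

		if (!(flags & ndiag_flag_names[i].bit))
			continue;
		snprintf(buf + used, len - used, "%s%s",
			 used ? "," : "", ndiag_flag_names[i].name);
	}
	return buf;
}

/* Per the description: dump the queue only when no callback is running. */
static int ndiag_queue_is_dumpable(uint32_t flags)
{
	return !(flags & NDIAG_FLAG_CB_RUNNING);
}
```

A dump tool would call ndiag_queue_is_dumpable() on the reported flags before attempting to save the receive queue, and could print the ndiag_flags_to_str() result for diagnostics.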
[Devel] [PATCH 2/3] netlink: add an ability to restore messages in a receive queue
This patch adds a repair mode for netlink sockets. If a socket is in
repair mode, sendmsg queues messages into that socket's own receive queue.
---
 include/uapi/linux/netlink.h | 19 +++++++++++--------
 net/netlink/af_netlink.c     | 51 ++++++++++++++++++++++++++++++++++++---------------
 2 files changed, 47 insertions(+), 23 deletions(-)

diff --git a/include/uapi/linux/netlink.h b/include/uapi/linux/netlink.h
index 3e34b7d..56ddadf 100644
--- a/include/uapi/linux/netlink.h
+++ b/include/uapi/linux/netlink.h
@@ -101,14 +101,17 @@ struct nlmsgerr {
 	struct nlmsghdr msg;
 };

-#define NETLINK_ADD_MEMBERSHIP	1
-#define NETLINK_DROP_MEMBERSHIP	2
-#define NETLINK_PKTINFO		3
-#define NETLINK_BROADCAST_ERROR	4
-#define NETLINK_NO_ENOBUFS	5
-#define NETLINK_RX_RING		6
-#define NETLINK_TX_RING		7
-#define NETLINK_LISTEN_ALL_NSID	8
+#define NETLINK_ADD_MEMBERSHIP		1
+#define NETLINK_DROP_MEMBERSHIP		2
+#define NETLINK_PKTINFO			3
+#define NETLINK_BROADCAST_ERROR		4
+#define NETLINK_NO_ENOBUFS		5
+#define NETLINK_RX_RING			6
+#define NETLINK_TX_RING			7
+#define NETLINK_LISTEN_ALL_NSID		8
+#define NETLINK_LIST_MEMBERSHIPS	9
+#define NETLINK_CAP_ACK			10
+#define NETLINK_REPAIR			11

 struct nl_pktinfo {
 	__u32	group;
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 79526e5..113e2ae 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -83,6 +83,8 @@ struct listeners {
 #define NETLINK_F_BROADCAST_SEND_ERROR	0x4
 #define NETLINK_F_RECV_NO_ENOBUFS	0x8
 #define NETLINK_F_LISTEN_ALL_NSID	0x10
+#define NETLINK_F_CAP_ACK		0x20
+#define NETLINK_F_REPAIR		0x40

 static inline int netlink_is_kernel(struct sock *sk)
 {
@@ -1744,6 +1746,7 @@ static int netlink_unicast_kernel(struct sock *sk, struct sk_buff *skb,
 int netlink_unicast(struct sock *ssk, struct sk_buff *skb,
 		    u32 portid, int nonblock)
 {
+	struct netlink_sock *nlk = nlk_sk(ssk);
 	struct sock *sk;
 	int err;
 	long timeo;
@@ -1752,19 +1755,24 @@ int netlink_unicast(struct sock *ssk, struct sk_buff *skb,

 	timeo = sock_sndtimeo(ssk, nonblock);
 retry:
-	sk = netlink_getsockbyportid(ssk, portid);
-	if (IS_ERR(sk)) {
-		kfree_skb(skb);
-		return PTR_ERR(sk);
-	}
-	if (netlink_is_kernel(sk))
-		return netlink_unicast_kernel(sk, skb, ssk);
+	if (nlk->flags & NETLINK_F_REPAIR) {
+		sk = ssk;
+		sock_hold(sk);
+	} else {
+		sk = netlink_getsockbyportid(ssk, portid);
+		if (IS_ERR(sk)) {
+			kfree_skb(skb);
+			return PTR_ERR(sk);
+		}
+		if (netlink_is_kernel(sk))
+			return netlink_unicast_kernel(sk, skb, ssk);

-	if (sk_filter(sk, skb)) {
-		err = skb->len;
-		kfree_skb(skb);
-		sock_put(sk);
-		return err;
+		if (sk_filter(sk, skb)) {
+			err = skb->len;
+			kfree_skb(skb);
+			sock_put(sk);
+			return err;
+		}
 	}

 	err = netlink_attachskb(sk, skb, &timeo, ssk);
@@ -2126,6 +2134,13 @@ static int netlink_setsockopt(struct socket *sock, int level, int optname,
 		return -EFAULT;

 	switch (optname) {
+	case NETLINK_REPAIR:
+		if (val)
+			nlk->flags |= NETLINK_F_REPAIR;
+		else
+			nlk->flags &= ~NETLINK_F_REPAIR;
+		err = 0;
+		break;
 	case NETLINK_PKTINFO:
 		if (val)
 			nlk->flags |= NETLINK_F_RECV_PKTINFO;
@@ -2288,6 +2303,7 @@ static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	int err;
 	struct scm_cookie scm;
 	u32 netlink_skb_flags = 0;
+	bool repair = nlk->flags & NETLINK_F_REPAIR;

 	if (msg->msg_flags&MSG_OOB)
 		return -EOPNOTSUPP;
@@ -2307,7 +2323,8 @@ static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		dst_group = ffs(addr->nl_groups);
 		err =  -EPERM;
 		if ((dst_group || dst_portid) &&
-		    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
+		    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND &&
+		    !repair))
 			goto out;
 		netlink_skb_flags |= NETLINK_SKB_DST;
 	} else {
@@ -2336,7 +2353,11 @@ static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	if (skb == NULL)
 		goto out;

-	NETLINK_CB(skb).portid	= nlk->portid;
+
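The destination choice that the patch adds to netlink_unicast() can be modelled in user space as below. This is a deliberately simplified sketch: struct model_sock, model_enqueue() and model_unicast() are hypothetical names, the lookup table stands in for netlink_getsockbyportid(), and filtering, reference counting and kernel sockets are left out.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_QUEUE_LEN 8

/* Hypothetical user-space model of a netlink socket; not a kernel structure. */
struct model_sock {
	unsigned int portid;
	int repair;                       /* models NETLINK_F_REPAIR */
	const char *rxq[MODEL_QUEUE_LEN]; /* models the receive queue */
	size_t rxq_len;
};

/* Append a message to a socket's receive queue; -1 if the queue is full. */
static int model_enqueue(struct model_sock *sk, const char *msg)
{
	if (sk->rxq_len == MODEL_QUEUE_LEN)
		return -1;
	sk->rxq[sk->rxq_len++] = msg;
	return 0;
}

/* Models how netlink_unicast() picks the destination after the patch. */
static int model_unicast(struct model_sock *src, struct model_sock *table,
			 size_t n, unsigned int portid, const char *msg)
{
	assert(src && table && msg);

	/* Repair mode: the message loops back into the sender's own queue. */
	if (src->repair)
		return model_enqueue(src, msg);

	/* Normal mode: deliver to the socket bound to 'portid'. */
	for (size_t i = 0; i < n; i++)
		if (table[i].portid == portid)
			return model_enqueue(&table[i], msg);

	return -1; /* no such peer */
}
```

On restore, a checkpoint tool would enable repair mode, send each saved message once, and then disable repair mode again so that later sends go to real peers.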
[Devel] [PATCH net-next 0/3] [RFC] netlink: prepare to dump and restore data from a receive queue
CRIU can dump queued data for unix and tcp sockets, now it's time for
netlink sockets.

Here are three questions.
* How to dump data from a receive queue
  We can set a peeking offset like we do for unix sockets.

* How to restore data back to a receive queue
  I suggest adding a repair mode like we do for tcp sockets.

* When can we dump data from a receive queue
  I think we can do this only if a socket doesn't have a running
  callback.

Andrey Vagin (3):
  netlink: allow to set peeking offset for sockets
  netlink: add an ability to restore messages in a receive queue
  netlink/diag: report flags for netlink sockets

 include/uapi/linux/netlink.h      |  1 +
 include/uapi/linux/netlink_diag.h | 10 +
 net/netlink/af_netlink.c          | 82 ++-
 net/netlink/af_netlink.h          |  9 +
 net/netlink/diag.c                | 25
 5 files changed, 99 insertions(+), 28 deletions(-)

-- 
2.5.5
[Devel] [PATCH 1/3] netlink: allow to set peeking offset for sockets
This allows us to read a socket's queue without removing skbs from it.
The same logic was implemented for unix and inet sockets and we use it
to dump and restore sockets in CRIU.

An open question is whether sk_peek_off has to be protected by locks.
Currently it isn't protected, and a user of sk_peek_off has to be
sure that nobody else calls recvmsg on the socket.
---
 net/netlink/af_netlink.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index ad65bdd..79526e5 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2372,17 +2372,18 @@ static int netlink_recvmsg(struct kiocb *kiocb, struct socket *sock,
 	struct scm_cookie scm;
 	struct sock *sk = sock->sk;
 	struct netlink_sock *nlk = nlk_sk(sk);
-	int noblock = flags&MSG_DONTWAIT;
 	size_t copied;
 	struct sk_buff *skb, *data_skb;
+	int peeked, skip;
 	int err, ret;

 	if (flags&MSG_OOB)
 		return -EOPNOTSUPP;

 	copied = 0;
+	skip = sk_peek_offset(sk, flags);

-	skb = skb_recv_datagram(sk, flags, noblock, &err);
+	skb = __skb_recv_datagram(sk, flags, &peeked, &skip, &err);
 	if (skb == NULL)
 		goto out;

@@ -2410,14 +2411,19 @@ static int netlink_recvmsg(struct kiocb *kiocb, struct socket *sock,
 	nlk->max_recvmsg_len = min_t(size_t, nlk->max_recvmsg_len,
 				     16384);

-	copied = data_skb->len;
+	copied = data_skb->len - skip;
 	if (len < copied) {
 		msg->msg_flags |= MSG_TRUNC;
 		copied = len;
 	}

 	skb_reset_transport_header(data_skb);
-	err = skb_copy_datagram_iovec(data_skb, 0, msg->msg_iov, copied);
+	err = skb_copy_datagram_iovec(data_skb, skip, msg->msg_iov, copied);
+
+	if (flags & MSG_PEEK)
+		sk_peek_offset_fwd(sk, copied);
+	else
+		sk_peek_offset_bwd(sk, skb->len);

 	if (msg->msg_name) {
 		struct sockaddr_nl *addr = (struct sockaddr_nl *)msg->msg_name;
@@ -2439,7 +2445,7 @@ static int netlink_recvmsg(struct kiocb *kiocb, struct socket *sock,
 	}
 	siocb->scm->creds = *NETLINK_CREDS(skb);
 	if (flags & MSG_TRUNC)
-		copied = data_skb->len;
+		copied = data_skb->len - skip;

 	skb_free_datagram(sk, skb);

@@ -3086,6 +3092,13 @@ int netlink_unregister_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL(netlink_unregister_notifier);

+static int netlink_set_peek_off(struct sock *sk, int val)
+{
+	sk->sk_peek_off = val;
+
+	return 0;
+}
+
 static const struct proto_ops netlink_ops = {
 	.family =	PF_NETLINK,
 	.owner =	THIS_MODULE,
@@ -3105,6 +3118,7 @@ static const struct proto_ops netlink_ops = {
 	.recvmsg =	netlink_recvmsg,
 	.mmap =		netlink_mmap,
 	.sendpage =	sock_no_sendpage,
+	.set_peek_off =	netlink_set_peek_off,
 };

 static const struct net_proto_family netlink_family_ops = {
-- 
2.5.5
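The user-visible side of this mechanism already exists for unix sockets, so it can be demonstrated there. The sketch below assumes Linux, where SO_PEEK_OFF is available for AF_UNIX sockets; peek_at() and peek_demo() are hypothetical helper names, and the fallback value of SO_PEEK_OFF is an assumption for toolchains whose headers do not define it.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SO_PEEK_OFF
#define SO_PEEK_OFF 42 /* assumption: value used by most Linux architectures */
#endif

/*
 * Peek up to 'len' bytes starting 'off' bytes into fd's receive queue.
 * Returns the number of bytes peeked, or -1 on error.
 */
static ssize_t peek_at(int fd, int off, void *buf, size_t len)
{
	if (setsockopt(fd, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off)) < 0)
		return -1;
	return recv(fd, buf, len, MSG_PEEK);
}

/*
 * Queue "hello world" on a unix stream socket pair and peek the five
 * bytes at offset 6 without consuming anything from the queue.
 */
static ssize_t peek_demo(char *out, size_t outlen)
{
	int sv[2];
	ssize_t n = -1;

	assert(out && outlen > 5);
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	if (write(sv[0], "hello world", 11) == 11) {
		n = peek_at(sv[1], 6, out, 5);
		if (n >= 0)
			out[n] = '\0';
	}
	close(sv[0]);
	close(sv[1]);
	return n;
}
```

A dump tool can walk a whole receive queue this way by peeking repeatedly, since each MSG_PEEK read advances the offset. As the description notes, this is only safe while nobody else calls recvmsg on the socket.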