Re: [Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling

2016-06-15 Thread Maxim Patlasov

Dima,

I agree that the ploop barrier code is broken in many ways, but I don't 
think the patch actually fixes it. I hope you would agree that the 
completion of a REQ_FUA bio only guarantees that this particular bio has 
landed on the disk; it says nothing about flushing previously submitted 
(and completed) bio-s. It is also possible that a power outage catches us 
when this REQ_FUA write has already landed on the disk but the previous 
bio-s have not.
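
To make the distinction concrete, here is a minimal sketch in terms of the 
old-style block flags (illustrative only; "bat_bio" stands for the bio that 
updates the BAT):

    /* FUA alone: only this bio is durable once it completes; earlier
     * completed bios may still sit in the device write cache. */
    submit_bio(WRITE | REQ_FUA, bat_bio);

    /* FLUSH + FUA: the device cache is flushed before this bio's data is
     * written, and the write itself reaches stable media -- that is the
     * "FLUSH:WB, WBI:FUA" ordering referred to below. */
    submit_bio(WRITE | REQ_FLUSH | REQ_FUA, bat_bio);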


Hence, for RELOC_{A|S} requests we actually need something like this:

 RELOC_S: R1, W2, FLUSH:WB, WBI:FUA
 RELOC_A: R1, W2, FLUSH:WB, WBI:FUA, W1:NULLIFY:FUA

(i.e. we do need to flush all previously submitted data before starting 
to update the BAT on disk)


not simply:


RELOC_S: R1, W2, WBI:FUA
RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA


Also, the patch makes the meaning of PLOOP_REQ_FORCE_FUA and 
PLOOP_REQ_FORCE_FLUSH even more obscure than it used to be. I think we 
could remove them completely (along with that optimization delaying 
incoming FUA) and re-implement all this stuff from scratch:


1) The final "NULLIFY:FUA" is a piece of cake -- it's enough to set 
REQ_FUA in preq->req_rw before calling ->submit(preq).
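
Roughly (just a sketch; the exact ->submit() plumbing stays as it is today):

    /* E_RELOC_NULLIFY: make the nullify write itself FUA */
    preq->req_rw |= REQ_FUA;
    /* ... then call ->submit(preq) exactly as before ... */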


2) For "FLUSH:WB, WBI:FUA" it is actually enough to send the bio updating 
the BAT on disk as REQ_FLUSH|REQ_FUA -- we can specify it explicitly for 
RELOC_A|S in ploop_index_update and map_wb_complete.
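
In ploop_index_update that would look roughly like this (sketch only; whether 
write_page() keeps its current "fua" argument or grows an explicit "flush" 
one is an open detail):

    int fua = !!(preq->req_rw & REQ_FUA);
    int flush = 0;

    if (test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
        test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
            /* flush previously written data first, then write the
             * index page durably: REQ_FLUSH|REQ_FUA on the BAT bio */
            flush = 1;
            fua = 1;
    }
    top_delta->io.ops->write_page(&top_delta->io, preq, page, sec, flush, fua);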


3) For that optimization delaying incoming FUA (what we do now when 
ploop_req_delay_fua_possible() returns true) we could introduce a new 
ad-hoc PLOOP_IO_FLUSH_DELAYED bit enforcing REQ_FLUSH in ploop_index_update 
and map_wb_complete (the same thing as 2) above). And, yes, let's 
WARN_ON if we somehow miss its processing.
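
A sketch of the idea (PLOOP_IO_FLUSH_DELAYED does not exist yet, the name is 
tentative):

    /* when ploop_req_delay_fua_possible() lets us strip FUA from the
     * data write, remember that a flush is still owed: */
    set_bit(PLOOP_IO_FLUSH_DELAYED, &io->io_state);

    /* later, in ploop_index_update()/map_wb_complete(): */
    if (test_and_clear_bit(PLOOP_IO_FLUSH_DELAYED, &io->io_state))
            rw |= REQ_FLUSH;    /* the BAT update carries the delayed flush */

    /* and on the completion path, catch anything we missed: */
    WARN_ON(test_bit(PLOOP_IO_FLUSH_DELAYED, &io->io_state));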


The only complication I foresee is how to teach kaio to pre-flush in 
kaio_write_page -- it's doable, but it involves kaio_resubmit, which is 
already pretty convoluted.


Btw, I accidentally noticed an awfully silly bug in kaio_complete_io_state(): 
we check for REQ_FUA after clearing it! This means all FUA-s on the 
ordinary kaio_submit path are silently lost...
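
I.e. the bug has this shape (illustrative, not the literal kaio code):

    rw &= ~(REQ_FLUSH | REQ_FUA);   /* FUA is cleared here ...           */
    if (rw & REQ_FUA)               /* ... so this check is always false */
            post_fsync = 1;         /* => the FUA is silently dropped    */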


Thanks,
Maxim


On 06/15/2016 07:49 AM, Dmitry Monakhov wrote:

barrier code is broken in many ways:
Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly.
But a request can also go through ->dio_submit_alloc()->dio_submit_pad and
write_page (for indexes).
So in case of grow_dev we have the following sequence:

E_RELOC_DATA_READ:
  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
  ->delta->allocate
    ->io->submit_alloc: dio_submit_alloc
      ->dio_submit_pad
E_DATA_WBI: data written, time to update index
  ->delta->allocate_complete: ploop_index_update
    ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
    ->write_page
    ->ploop_map_wb_complete
      ->ploop_wb_complete_post_process
        ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
E_RELOC_NULLIFY:
    ->submit()

This patch unifies barrier handling as follows:
- Add an assertion to ploop_complete_request for the FORCE_{FLUSH,FUA} state bits.
- Perform an explicit FUA inside index_update for RELOC requests.

This makes the reloc sequence optimal:
RELOC_S: R1, W2, WBI:FUA
RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA

https://jira.sw.ru/browse/PSBM-47107
Signed-off-by: Dmitry Monakhov 
---
  drivers/block/ploop/dev.c | 10 +++---
  drivers/block/ploop/map.c | 29 -
  2 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 96f7850..998fe71 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -1224,6 +1224,11 @@ static void ploop_complete_request(struct ploop_request 
* preq)
  
  	__TRACE("Z %p %u\n", preq, preq->req_cluster);
  
+   if (!preq->error) {
+           unsigned long state = READ_ONCE(preq->state);
+           WARN_ON(state & (1 << PLOOP_REQ_FORCE_FUA));
+           WARN_ON(state & (1 << PLOOP_REQ_FORCE_FLUSH));
+   }
while (preq->bl.head) {
struct bio * bio = preq->bl.head;
preq->bl.head = bio->bi_next;
@@ -2530,9 +2535,8 @@ restart:
top_delta = ploop_top_delta(plo);
sbl.head = sbl.tail = preq->aux_bio;
  
-		/* Relocated data write required sync before BAT updatee */

-   set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
-
+   /* Relocated data write required sync before BAT updatee
+* this will happen inside index_update */
if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
preq->eng_state = PLOOP_E_DATA_WBI;
plo->st.bio_out++;
diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
index 3a6365d..c17e598 100644
--- a/drivers/block/ploop/map.c
+++ b/drivers/block/ploop/map.c
@@ -896,6 +896,7 @@ void ploop_index_update(struct ploop_request * preq)
struct ploop_device * plo = 

Re: [Devel] memcg: mem_cgroup_uncharge_page() kernel panic/lockup

2016-06-15 Thread Anatoly Stepanov
Hi, Vladimir!

Thanks for a quick response.

I created JIRA issue and uploaded the dumps.

All the information is included into JIRA issue:
https://bugs.openvz.org/browse/OVZ-6756


On Wed, Jun 15, 2016 at 11:47 AM, Vladimir Davydov
 wrote:
> Hi,
>
> Thanks for the report.
>
> Could you please
>
>  - file a bug to bugzilla.openvz.org
>
>  - upload the vmcore at
>rsync://fe.sw.ru/f837d67c8e2ade8cee3367cb0f880268/
>
> On Mon, Jun 13, 2016 at 09:24:33AM +0300, Anatoly Stepanov wrote:
>> Hello everyone!
>>
>> We have encountered an issue with the mem_cgroup_uncharge_page() function;
>> it appears quite often on our clients' servers.
>>
>> Basically, the issue sometimes leads to a hard lockup, sometimes to a GP fault.
>>
>> Based on bug reports from clients, the problem shows up when a user
>> process calls the "execve" or "exit" syscalls.
>> As we know, in those cases the kernel invokes "uncharging" for every page
>> when it is unmapped from all the mm's.
>>
>> Kernel dump analysis shows that at the moment of
>> mem_cgroup_uncharge_page() the "memcg" pointer
>> (taken from page_cgroup) seems to be pointing to some random memory area.
>>
>> On the other hand, if we look at current->mm->css, the memcg instance
>> exists and is "online".
>>
>> This led me to the thought that "page_cgroup->memcg" may be changed by
>> some part of the memcg code in parallel.
>> As far as I understand, the only option here is the "reclaim code path"
>> (maybe I'm wrong).
>>
>> So, I suppose there might be a race between the "memcg uncharge code" and
>> the "memcg reclaim code".
>>
>> Please give me your thoughts about it.
>> Thanks
>>
>> P.S.:
>>
>> Additional info:
>>
>> Kernel: rh7-3.10.0-327.10.1.vz7.12.14
>>
>> *1st
>> BT
>>
>> PID: 972445  TASK: 88065d53d8d0  CPU: 0   COMMAND: "httpd"
>>  #0 [880224f37818] machine_kexec at 8105249b
>>  #1 [880224f37878] crash_kexec at 81103532
>>  #2 [880224f37948] oops_end at 81641628
>>  #3 [880224f37970] die at 810184cb
>>  #4 [880224f379a0] do_general_protection at 81640f24
>>  #5 [880224f379d0] general_protection at 81640768
>> [exception RIP: mem_cgroup_charge_statistics+19]
>> RIP: 811e7733  RSP: 880224f37a80  RFLAGS: 00010202
>> RAX:   RBX: 8807b26f0110  RCX: 
>> RDX: 79726f6765746163  RSI: ea000c9c0440  RDI: 8806a55662f8
>> RBP: 880224f37a80   R8:    R9: 03808000
>> R10: 00b8  R11: ea001eaa8980  R12: ea000c9c0440
>> R13: 0001  R14:   R15: 8806a5566000
>> ORIG_RAX:   CS: 0010  SS: 0018
>>  #6 [880224f37a88] __mem_cgroup_uncharge_common at 811e9ddf
>>  #7 [880224f37ac8] mem_cgroup_uncharge_page at 811ee99a
>>  #8 [880224f37ad8] page_remove_rmap at 811b9ec9
>>  #9 [880224f37b10] unmap_page_range at 811ab580
>> #10 [880224f37bf8] unmap_single_vma at 811aba11
>> #11 [880224f37c30] unmap_vmas at 811ace79
>> #12 [880224f37c68] exit_mmap at 811b663c
>> #13 [880224f37d18] mmput at 8107853b
>> #14 [880224f37d38] flush_old_exec at 81202547
>> #15 [880224f37d88] load_elf_binary at 8125883c
>> #16 [880224f37e58] search_binary_handler at 81201c25
>> #17 [880224f37ea0] do_execve_common at 812032b7
>> #18 [880224f37f30] sys_execve at 81203619
>> #19 [880224f37f50] stub_execve at 81649369
>> RIP: 7f54284b3287  RSP: 7ffda57a0698  RFLAGS: 0297
>> RAX: 003b  RBX: 037c5fe8  RCX: 
>> RDX: 037cf3f8  RSI: 037ce5f8  RDI: 7f5425fcabf1
>> RBP: 7ffda57a0750   R8: 0001   R9: 
>>
>>
>> ***2nd
>> BT**:
>>
>> PID: 168440  TASK: 88001e31cc20  CPU: 18  COMMAND: "httpd"
>>  #0 [88007255f838] machine_kexec at 8105249b
>>  #1 [88007255f898] crash_kexec at 81103532
>>  #2 [88007255f968] oops_end at 81641628
>>  #3 [88007255f990] no_context at 8163222b
>>  #4 [88007255f9e0] __bad_area_nosemaphore at 816322c1
>>  #5 [88007255fa30] bad_area_nosemaphore at 8163244a
>>  #6 [88007255fa40] __do_page_fault at 8164443e
>>  #7 [88007255faa0] trace_do_page_fault at 81644673
>>  #8 [88007255fad8] do_async_page_fault at 81643d59
>>  #9 [88007255faf0] async_page_fault at 816407f8
>> [exception RIP: memcg_check_events+435]
>> RIP: 811e9b53  RSP: 88007255fba0  RFLAGS: 00010246
>> RAX: f81ef81e  RBX: 8802106d5000  RCX: 
>> RDX: f81e  RSI: 0002  RDI: 

Re: [Devel] [PATCH 1/3] ploop: skip redundant fsync for REQ_FUA in post_submit

2016-06-15 Thread Maxim Patlasov

ACK-ed, but see a minor nit below

On 06/15/2016 07:49 AM, Dmitry Monakhov wrote:

Signed-off-by: Dmitry Monakhov 
---
  drivers/block/ploop/io_direct.c | 22 +-
  1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index b844a80..74a554a 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -517,16 +517,18 @@ dio_post_submit(struct ploop_io *io, struct ploop_request 
* preq)
struct ploop_device *plo = preq->plo;
sector_t sec = (sector_t)preq->iblock << preq->plo->cluster_log;
loff_t clu_siz = 1 << (preq->plo->cluster_log + 9);
+   int force_sync = preq->req_rw & REQ_FUA;
int err;
  
  	file_start_write(io->files.file);
  
-	/* Here io->io_count is even ... */

-   spin_lock_irq(&plo->lock);
-   io->io_count++;
-   set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
-   spin_unlock_irq(&plo->lock);
-
+   if (!force_sync) {
+   /* Here io->io_count is even ... */
+   spin_lock_irq(&plo->lock);
+   io->io_count++;
+   set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
+   spin_unlock_irq(&plo->lock);
+   }
err = io->files.file->f_op->fallocate(io->files.file,
  FALLOC_FL_CONVERT_UNWRITTEN,
  (loff_t)sec << 9, clu_siz);
@@ -535,9 +537,11 @@ dio_post_submit(struct ploop_io *io, struct ploop_request 
* preq)
if (!err && (preq->req_rw & REQ_FUA))


s/(preq->req_rw & REQ_FUA)/force_sync

Thanks,
Max


err = io->ops->sync(io);
  
-   spin_lock_irq(&plo->lock);

-   io->io_count++;
-   spin_unlock_irq(&plo->lock);
+   if (!force_sync) {
+   spin_lock_irq(&plo->lock);
+   io->io_count++;
+   spin_unlock_irq(&plo->lock);
+   }
/* and here io->io_count is even (+2) again. */
  
  	file_end_write(io->files.file);




Re: [Devel] [PATCH 2/3] ploop: deadcode cleanup

2016-06-15 Thread Maxim Patlasov

Acked-by: Maxim Patlasov 

On 06/15/2016 07:49 AM, Dmitry Monakhov wrote:

The (rw & REQ_FUA) branch is impossible because REQ_FUA was cleared on the line above.
The logic was moved to ploop_req_delay_fua_possible() a long time ago.

Signed-off-by: Dmitry Monakhov 
---
  drivers/block/ploop/io_direct.c | 9 -
  1 file changed, 9 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 74a554a..10d2314 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -108,15 +108,6 @@ dio_submit(struct ploop_io *io, struct ploop_request * 
preq,
rw &= ~(REQ_FLUSH | REQ_FUA);
  
  
-	/* In case of eng_state != COMPLETE, we'll do FUA in

-* ploop_index_update(). Otherwise, we should mark
-* last bio as FUA here. */
-   if (rw & REQ_FUA) {
-   rw &= ~REQ_FUA;
-   if (preq->eng_state == PLOOP_E_COMPLETE)
-   postfua = 1;
-   }
-
bio_list_init();
  
  	if (iblk == PLOOP_ZERO_INDEX)




[Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling

2016-06-15 Thread Dmitry Monakhov
barrier code is broken in many ways:
Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly.
But a request can also go through ->dio_submit_alloc()->dio_submit_pad and
write_page (for indexes).
So in case of grow_dev we have the following sequence:

E_RELOC_DATA_READ:
  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
  ->delta->allocate
    ->io->submit_alloc: dio_submit_alloc
      ->dio_submit_pad
E_DATA_WBI: data written, time to update index
  ->delta->allocate_complete: ploop_index_update
    ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
    ->write_page
    ->ploop_map_wb_complete
      ->ploop_wb_complete_post_process
        ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
E_RELOC_NULLIFY:
    ->submit()

This patch unifies barrier handling as follows:
- Add an assertion to ploop_complete_request for the FORCE_{FLUSH,FUA} state bits.
- Perform an explicit FUA inside index_update for RELOC requests.

This makes the reloc sequence optimal:
RELOC_S: R1, W2, WBI:FUA
RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA

https://jira.sw.ru/browse/PSBM-47107
Signed-off-by: Dmitry Monakhov 
---
 drivers/block/ploop/dev.c | 10 +++---
 drivers/block/ploop/map.c | 29 -
 2 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 96f7850..998fe71 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -1224,6 +1224,11 @@ static void ploop_complete_request(struct ploop_request 
* preq)
 
__TRACE("Z %p %u\n", preq, preq->req_cluster);
 
+   if (!preq->error) {
+           unsigned long state = READ_ONCE(preq->state);
+           WARN_ON(state & (1 << PLOOP_REQ_FORCE_FUA));
+           WARN_ON(state & (1 << PLOOP_REQ_FORCE_FLUSH));
+   }
while (preq->bl.head) {
struct bio * bio = preq->bl.head;
preq->bl.head = bio->bi_next;
@@ -2530,9 +2535,8 @@ restart:
top_delta = ploop_top_delta(plo);
sbl.head = sbl.tail = preq->aux_bio;
 
-   /* Relocated data write required sync before BAT updatee */
-   set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
-
+   /* Relocated data write required sync before BAT updatee
+* this will happen inside index_update */
if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
preq->eng_state = PLOOP_E_DATA_WBI;
plo->st.bio_out++;
diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
index 3a6365d..c17e598 100644
--- a/drivers/block/ploop/map.c
+++ b/drivers/block/ploop/map.c
@@ -896,6 +896,7 @@ void ploop_index_update(struct ploop_request * preq)
struct ploop_device * plo = preq->plo;
struct map_node * m = preq->map;
struct ploop_delta * top_delta = map_top_delta(m->parent);
+   int fua = !!(preq->req_rw & REQ_FUA);
u32 idx;
map_index_t blk;
int old_level;
@@ -953,13 +954,13 @@ void ploop_index_update(struct ploop_request * preq)
__TRACE("wbi %p %u %p\n", preq, preq->req_cluster, m);
plo->st.map_single_writes++;
top_delta->ops->map_index(top_delta, m->mn_start, );
-   /* Relocate requires consistent writes, mark such reqs appropriately */
+   /* Relocate requires consistent index update */
if (test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
    test_bit(PLOOP_REQ_RELOC_S, &preq->state))
-   set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
-
-   top_delta->io.ops->write_page(&top_delta->io, preq, page, sec,
- !!(preq->req_rw & REQ_FUA));
+   fua = 1;
+   if (fua)
+   clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
+   top_delta->io.ops->write_page(&top_delta->io, preq, page, sec, fua);
put_page(page);
return;
 
@@ -1078,7 +1079,7 @@ static void map_wb_complete(struct map_node * m, int err)
int delayed = 0;
unsigned int idx;
sector_t sec;
-   int fua, force_fua;
+   int fua;
 
/* First, complete processing of written back indices,
 * finally instantiate indices in mapping cache.
@@ -1149,7 +1150,6 @@ static void map_wb_complete(struct map_node * m, int err)
 
main_preq = NULL;
fua = 0;
-   force_fua = 0;
 
list_for_each_safe(cursor, tmp, &m->io_queue) {
struct ploop_request * preq;
@@ -1168,13 +1168,12 @@ static void map_wb_complete(struct map_node * m, int 
err)
break;
}
 
-   if (preq->req_rw & REQ_FUA)
+   if (preq->req_rw & REQ_FUA ||
+   test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
+   test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
+   

[Devel] [PATCH 2/3] ploop: deadcode cleanup

2016-06-15 Thread Dmitry Monakhov
The (rw & REQ_FUA) branch is impossible because REQ_FUA was cleared on the line above.
The logic was moved to ploop_req_delay_fua_possible() a long time ago.

Signed-off-by: Dmitry Monakhov 
---
 drivers/block/ploop/io_direct.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 74a554a..10d2314 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -108,15 +108,6 @@ dio_submit(struct ploop_io *io, struct ploop_request * 
preq,
rw &= ~(REQ_FLUSH | REQ_FUA);
 
 
-   /* In case of eng_state != COMPLETE, we'll do FUA in
-* ploop_index_update(). Otherwise, we should mark
-* last bio as FUA here. */
-   if (rw & REQ_FUA) {
-   rw &= ~REQ_FUA;
-   if (preq->eng_state == PLOOP_E_COMPLETE)
-   postfua = 1;
-   }
-
bio_list_init();
 
if (iblk == PLOOP_ZERO_INDEX)
-- 
1.8.3.1



[Devel] [PATCH 1/3] ploop: skip redundant fsync for REQ_FUA in post_submit

2016-06-15 Thread Dmitry Monakhov
Signed-off-by: Dmitry Monakhov 
---
 drivers/block/ploop/io_direct.c | 22 +-
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index b844a80..74a554a 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -517,16 +517,18 @@ dio_post_submit(struct ploop_io *io, struct ploop_request 
* preq)
struct ploop_device *plo = preq->plo;
sector_t sec = (sector_t)preq->iblock << preq->plo->cluster_log;
loff_t clu_siz = 1 << (preq->plo->cluster_log + 9);
+   int force_sync = preq->req_rw & REQ_FUA;
int err;
 
file_start_write(io->files.file);
 
-   /* Here io->io_count is even ... */
-   spin_lock_irq(&plo->lock);
-   io->io_count++;
-   set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
-   spin_unlock_irq(&plo->lock);
-
+   if (!force_sync) {
+   /* Here io->io_count is even ... */
+   spin_lock_irq(&plo->lock);
+   io->io_count++;
+   set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
+   spin_unlock_irq(&plo->lock);
+   }
err = io->files.file->f_op->fallocate(io->files.file,
  FALLOC_FL_CONVERT_UNWRITTEN,
  (loff_t)sec << 9, clu_siz);
@@ -535,9 +537,11 @@ dio_post_submit(struct ploop_io *io, struct ploop_request 
* preq)
if (!err && (preq->req_rw & REQ_FUA))
err = io->ops->sync(io);
 
-   spin_lock_irq(&plo->lock);
-   io->io_count++;
-   spin_unlock_irq(&plo->lock);
+   if (!force_sync) {
+   spin_lock_irq(&plo->lock);
+   io->io_count++;
+   spin_unlock_irq(&plo->lock);
+   }
/* and here io->io_count is even (+2) again. */
 
file_end_write(io->files.file);
-- 
1.8.3.1



[Devel] [NEW KERNEL] 3.10.0-327.18.2.vz7.14.15 (rhel7)

2016-06-15 Thread builder
Changelog:

OpenVZ kernel rh7-3.10.0-327.18.2.vz7.14.15

* vtty: Container's offline console can be opened before a Container start,
  survives start/stop/cpt/rst cycles
* fs: deny umounting rootfs
* sysrq: correct the fix to avoid cpu soft lockups on long print triggered by
  sysrq
* ploop: fix gendisk disk_stats to be seen on a partition
* module licenses and authors cleanup


Generated changelog:

* Wed Jun 15 2016 Konstantin Khorenko  
[3.10.0-327.18.2.vz7.14.15]
- ve/vtty: Don't free console mapping until no clients left (Cyrill Gorcunov) 
[PSBM-39463]
- fs: do not allow rootfs umount (Vasily Averin) [PSBM-46437]
- ms/kernel/sysrq: restore touch_nmi_watchdog() in show_state_filter() (Andrey 
Ryabinin) [PSBM-47486]
- ploop: fix gendisk disk_stats to be seen on partition (Maxim Patlasov) 
[PSBM-48266]
- modules: set module author for Virtuozzo modules (Konstantin Khorenko) 
[PSBM-43847]
- ploop: "Parallels loopback device" -> "Virtuozzo loopback device" (Konstantin 
Khorenko) [PSBM-43847]
- license: put correct copyrights into file headers (Konstantin Khorenko) 
[PSBM-43847]
- license: drop COPYING.Parallels file (Konstantin Khorenko) [PSBM-43847]


Built packages: 
http://kojistorage.eng.sw.ru/packages/vzkernel/3.10.0/327.18.2.vz7.14.15/


[Devel] [PATCH RHEL7 COMMIT] ve/vtty: Don't free console mapping until no clients left

2016-06-15 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.14
-->
commit a9532c96e6a0c64fbb4a128ba6ca99b9081e85cc
Author: Cyrill Gorcunov 
Date:   Wed Jun 15 13:32:23 2016 +0400

ve/vtty: Don't free console mapping until no clients left

Currently, on a container's stop we free the vtty mapping forcibly,
so if there is an active console hooked from the node it becomes
unusable from then on. That was easier to work with while we were
reworking the virtual console code.

Now let's make the console fully functional, as it was in pcs6:
once opened, it must survive container start/stop cycles
and checkpoint/restore as well.

For this sake we:

 - drop the ve_hook code, it is no longer needed
 - free the console @map on the final close of the last opened tty

https://jira.sw.ru/browse/PSBM-39463

Signed-off-by: Cyrill Gorcunov 
Reviewed-by: Vladimir Davydov 

CC: Konstantin Khorenko 
CC: Igor Sukhih 
CC: Pavel Emelyanov 
---
 drivers/tty/pty.c   | 48 ++--
 kernel/ve/vecalls.c |  6 +++---
 2 files changed, 17 insertions(+), 37 deletions(-)

diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
index a68102b..1644fdf 100644
--- a/drivers/tty/pty.c
+++ b/drivers/tty/pty.c
@@ -901,6 +901,13 @@ static void vtty_map_set(vtty_map_t *map, struct 
tty_struct *tty)
map->vttys[tty->index] = tty;
 }
 
+static void vtty_map_free(vtty_map_t *map)
+{
+   lockdep_assert_held(_mutex);
+   idr_remove(_idr, map->veid);
+   kfree(map);
+}
+
 static void vtty_map_clear(struct tty_struct *tty)
 {
vtty_map_t *map = tty->driver_data;
@@ -908,28 +915,20 @@ static void vtty_map_clear(struct tty_struct *tty)
lockdep_assert_held(_mutex);
if (map) {
struct tty_struct *p = map->vttys[tty->index];
+   int i;
 
WARN_ON(p != (tty->driver == vttys_driver ? tty : tty->link));
map->vttys[tty->index] = NULL;
tty->driver_data = tty->link->driver_data = NULL;
-   }
-}
 
-static void vtty_map_free(vtty_map_t *map)
-{
-   int i;
-
-   lockdep_assert_held(_mutex);
+   for (i = 0; i < MAX_NR_VTTY_CONSOLES; i++) {
+   if (map->vttys[i])
+   break;
+   }
 
-   for (i = 0; i < MAX_NR_VTTY_CONSOLES; i++) {
-   struct tty_struct *tty = map->vttys[i];
-   if (!tty)
-   continue;
-   tty->driver_data = tty->link->driver_data = NULL;
+   if (i >= MAX_NR_VTTY_CONSOLES)
+   vtty_map_free(map);
}
-
-   idr_remove(_idr, map->veid);
-   kfree(map);
 }
 
 static vtty_map_t *vtty_map_alloc(envid_t veid)
@@ -1209,24 +1208,6 @@ void vtty_release(struct tty_struct *tty, struct 
tty_struct *o_tty,
*o_tty_closing = 0;
 }
 
-static void ve_vtty_fini(void *data)
-{
-   struct ve_struct *ve = data;
-   vtty_map_t *map;
-
-   mutex_lock(_mutex);
-   map = vtty_map_lookup(ve->veid);
-   if (map)
-   vtty_map_free(map);
-   mutex_unlock(_mutex);
-}
-
-static struct ve_hook vtty_hook = {
-   .fini   = ve_vtty_fini,
-   .priority   = HOOK_PRIO_DEFAULT,
-   .owner  = THIS_MODULE,
-};
-
 static int __init vtty_init(void)
 {
 #define VTTY_DRIVER_ALLOC_FLAGS\
@@ -1279,7 +1260,6 @@ static int __init vtty_init(void)
if (tty_register_driver(vttys_driver))
panic(pr_fmt("Can't register slave vtty driver\n"));
 
-   ve_hook_register(VE_SS_CHAIN, &vtty_hook);
tty_default_fops(_fops);
return 0;
 }
diff --git a/kernel/ve/vecalls.c b/kernel/ve/vecalls.c
index 457d690..5aa9722 100644
--- a/kernel/ve/vecalls.c
+++ b/kernel/ve/vecalls.c
@@ -990,6 +990,9 @@ static int ve_configure(envid_t veid, unsigned int key,
struct ve_struct *ve;
int err = -ENOKEY;
 
+   if (key == VE_CONFIGURE_OPEN_TTY)
+   return vtty_open_master(veid, val);
+
ve = get_ve_by_id(veid);
if (!ve)
return -EINVAL;
@@ -998,9 +1001,6 @@ static int ve_configure(envid_t veid, unsigned int key,
case VE_CONFIGURE_OS_RELEASE:
err = init_ve_osrelease(ve, data);
break;
-   case VE_CONFIGURE_OPEN_TTY:
-   err = vtty_open_master(ve->veid, val);
-   break;
}
 
put_ve(ve);


[Devel] [PATCH RHEL7 COMMIT] fs: do not allow rootfs umount

2016-06-15 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.14
-->
commit 0ae9e4e30b14404b570f62f83220637506be6376
Author: Vasily Averin 
Date:   Wed Jun 15 13:16:38 2016 +0400

fs: do not allow rootfs umount

In mainline, rootfs is always marked as MNT_LOCKED;
sys_umount checks this flag and fails the request.
Our kernel lacks the MNT_LOCKED flag, so we use another kind of check
to prevent the incorrect operation.

v2: use mnt_has_parent()

https://jira.sw.ru/browse/PSBM-46437

Signed-off-by: Vasily Averin 
Acked-by: Andrey Vagin 
---
 fs/namespace.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index 988320b..4fb935a 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1355,6 +1355,8 @@ SYSCALL_DEFINE2(umount, char __user *, name, int, flags)
goto dput_and_out;
if (!check_mnt(mnt))
goto dput_and_out;
+   if (!mnt_has_parent(mnt))
+   goto dput_and_out;
 
retval = do_umount(mnt, flags);
 dput_and_out:


Re: [Devel] memcg: mem_cgroup_uncharge_page() kernel panic/lockup

2016-06-15 Thread Vladimir Davydov
Hi,

Thanks for the report.

Could you please

 - file a bug to bugzilla.openvz.org

 - upload the vmcore at
   rsync://fe.sw.ru/f837d67c8e2ade8cee3367cb0f880268/

On Mon, Jun 13, 2016 at 09:24:33AM +0300, Anatoly Stepanov wrote:
> Hello everyone!
> 
> We have encountered an issue with the mem_cgroup_uncharge_page() function;
> it appears quite often on our clients' servers.
> 
> Basically, the issue sometimes leads to a hard lockup, sometimes to a GP fault.
> 
> Based on bug reports from clients, the problem shows up when a user
> process calls the "execve" or "exit" syscalls.
> As we know, in those cases the kernel invokes "uncharging" for every page
> when it is unmapped from all the mm's.
> 
> Kernel dump analysis shows that at the moment of
> mem_cgroup_uncharge_page() the "memcg" pointer
> (taken from page_cgroup) seems to be pointing to some random memory area.
> 
> On the other hand, if we look at current->mm->css, the memcg instance
> exists and is "online".
> 
> This led me to the thought that "page_cgroup->memcg" may be changed by
> some part of the memcg code in parallel.
> As far as I understand, the only option here is the "reclaim code path"
> (maybe I'm wrong).
> 
> So, I suppose there might be a race between the "memcg uncharge code" and
> the "memcg reclaim code".
> 
> Please give me your thoughts about it.
> Thanks
> 
> P.S.:
> 
> Additional info:
> 
> Kernel: rh7-3.10.0-327.10.1.vz7.12.14
> 
> *1st
> BT
> 
> PID: 972445  TASK: 88065d53d8d0  CPU: 0   COMMAND: "httpd"
>  #0 [880224f37818] machine_kexec at 8105249b
>  #1 [880224f37878] crash_kexec at 81103532
>  #2 [880224f37948] oops_end at 81641628
>  #3 [880224f37970] die at 810184cb
>  #4 [880224f379a0] do_general_protection at 81640f24
>  #5 [880224f379d0] general_protection at 81640768
> [exception RIP: mem_cgroup_charge_statistics+19]
> RIP: 811e7733  RSP: 880224f37a80  RFLAGS: 00010202
> RAX:   RBX: 8807b26f0110  RCX: 
> RDX: 79726f6765746163  RSI: ea000c9c0440  RDI: 8806a55662f8
> RBP: 880224f37a80   R8:    R9: 03808000
> R10: 00b8  R11: ea001eaa8980  R12: ea000c9c0440
> R13: 0001  R14:   R15: 8806a5566000
> ORIG_RAX:   CS: 0010  SS: 0018
>  #6 [880224f37a88] __mem_cgroup_uncharge_common at 811e9ddf
>  #7 [880224f37ac8] mem_cgroup_uncharge_page at 811ee99a
>  #8 [880224f37ad8] page_remove_rmap at 811b9ec9
>  #9 [880224f37b10] unmap_page_range at 811ab580
> #10 [880224f37bf8] unmap_single_vma at 811aba11
> #11 [880224f37c30] unmap_vmas at 811ace79
> #12 [880224f37c68] exit_mmap at 811b663c
> #13 [880224f37d18] mmput at 8107853b
> #14 [880224f37d38] flush_old_exec at 81202547
> #15 [880224f37d88] load_elf_binary at 8125883c
> #16 [880224f37e58] search_binary_handler at 81201c25
> #17 [880224f37ea0] do_execve_common at 812032b7
> #18 [880224f37f30] sys_execve at 81203619
> #19 [880224f37f50] stub_execve at 81649369
> RIP: 7f54284b3287  RSP: 7ffda57a0698  RFLAGS: 0297
> RAX: 003b  RBX: 037c5fe8  RCX: 
> RDX: 037cf3f8  RSI: 037ce5f8  RDI: 7f5425fcabf1
> RBP: 7ffda57a0750   R8: 0001   R9: 
> 
> 
> ***2nd
> BT**:
> 
> PID: 168440  TASK: 88001e31cc20  CPU: 18  COMMAND: "httpd"
>  #0 [88007255f838] machine_kexec at 8105249b
>  #1 [88007255f898] crash_kexec at 81103532
>  #2 [88007255f968] oops_end at 81641628
>  #3 [88007255f990] no_context at 8163222b
>  #4 [88007255f9e0] __bad_area_nosemaphore at 816322c1
>  #5 [88007255fa30] bad_area_nosemaphore at 8163244a
>  #6 [88007255fa40] __do_page_fault at 8164443e
>  #7 [88007255faa0] trace_do_page_fault at 81644673
>  #8 [88007255fad8] do_async_page_fault at 81643d59
>  #9 [88007255faf0] async_page_fault at 816407f8
> [exception RIP: memcg_check_events+435]
> RIP: 811e9b53  RSP: 88007255fba0  RFLAGS: 00010246
> RAX: f81ef81e  RBX: 8802106d5000  RCX: 
> RDX: f81e  RSI: 0002  RDI: 8807aa2642e8
> RBP: 88007255fbf0   R8: 0202   R9: 
> R10: 0010  R11: 88007255ffd8  R12: 8807aa2642e0
> R13: 0410  R14: 8802073de700  R15: 8802106d5000
> ORIG_RAX:   CS: 0010  SS: 0018
> #10 [88007255fbf8] __mem_cgroup_uncharge_common at 

Re: [Devel] [PATCH rh7 0/6] ploop: push_backup: implement expiration timeout

2016-06-15 Thread Konstantin Khorenko

Dima, please review the patchset.

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 06/15/2016 03:50 AM, Maxim Patlasov wrote:

If a ploop request waits for the userspace backup tool's attention for more
than plo->tune.push_backup_timeout (42 secs by default), the whole
push_backup operation is aborted and the initial CBT mask is merged back into the CBT.

https://jira.sw.ru/browse/PSBM-48082

---

Maxim Patlasov (6):
   ploop: push_backup: introduce pb_set structure
   ploop: push_backup: factor rb_erase() out
   ploop: push_backup: extend pb_set
   ploop: push_backup: add timeout tunable
   ploop: push_backup: health monitor thread
   ploop: push_backup: implement timeout functions


  drivers/block/ploop/push_backup.c |  261 +
  drivers/block/ploop/sysfs.c   |2
  include/linux/ploop/ploop.h   |4 -
  3 files changed, 240 insertions(+), 27 deletions(-)

--
Signature
.




[Devel] [PATCH RHEL7 COMMIT] ms/kernel/sysrq: restore touch_nmi_watchdog() in show_state_filter()

2016-06-15 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.14
-->
commit 3eada65cb70c5f2c773fcac7aabc72de5f768bac
Author: Andrey Ryabinin 
Date:   Wed Jun 15 12:41:37 2016 +0400

ms/kernel/sysrq: restore touch_nmi_watchdog() in show_state_filter()

Commit 60c21d9f08bf ("kernel/sysrq: reset watchdog on all cpus while during
sysrq-w") shouldn't have removed the touch_nmi_watchdog() call, because
touch_all_softlockup_watchdogs() resets only softlockup watchdogs and doesn't
reset the NMI watchdog used in the hard lockup detector.

So, bring it back. Plus, remove the second touch_all_softlockup_watchdogs()
call, which becomes redundant, and add a comment.

This patch is the delta between the v2 and v1 versions of the upstream patch:

http://lkml.kernel.org/g/1465474805-14641-1-git-send-email-aryabi...@virtuozzo.com

https://jira.sw.ru/browse/PSBM-47486

Fixes: 60c21d9f08bf ("kernel/sysrq: reset watchdog on all cpus while during 
sysrq-w")
Signed-off-by: Andrey Ryabinin 
---
 kernel/sched/core.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d21ccf0..1a3ff8c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5176,14 +5176,16 @@ void show_state_filter(unsigned long state_filter)
/*
 * reset the NMI-timeout, listing all files on a slow
 * console might take a lot of time:
+* Also, reset softlockup watchdogs on all CPUs, because
+* another CPU might be blocked waiting for us to process
+* an IPI.
 */
+   touch_nmi_watchdog();
touch_all_softlockup_watchdogs();
if (!state_filter || (p->state & state_filter))
sched_show_task(p);
} while_each_thread(g, p);
 
-   touch_all_softlockup_watchdogs();
-
 #if 0
/*
 * This results in soft lockups, because it writes too much data to


[Devel] [PATCH RHEL7 COMMIT] ploop: fix gendisk disk_stats to be seen on partition

2016-06-15 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.14
-->
commit fc4dd8e6d4df14e5e09b6dacac74fe903d95c929
Author: Maxim Patlasov 
Date:   Wed Jun 15 12:38:54 2016 +0400

ploop: fix gendisk disk_stats to be seen on partition

Before this patch, an I/O on top of /dev/ploopNp1 was always accounted
on the main partition (/sys/block/ploopN/stat); the counters for p1 remained
zero. The patch fixes the problem by mapping the sector to the proper partition.

https://jira.sw.ru/browse/PSBM-48266

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/dev.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index f87209d..01a5189 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -800,6 +800,7 @@ static void ploop_make_request(struct request_queue *q, 
struct bio *bio)
struct bio * nbio;
struct ploop_device * plo = q->queuedata;
unsigned long rw = bio_data_dir(bio);
+   struct hd_struct *part;
int cpu;
LIST_HEAD(drop_list);
 
@@ -811,8 +812,9 @@ static void ploop_make_request(struct request_queue *q, 
struct bio *bio)
BUG_ON(bio->bi_size & 511);
 
cpu = part_stat_lock();
-   part_stat_inc(cpu, >disk->part0, ios[rw]);
-   part_stat_add(cpu, >disk->part0, sectors[rw], bio_sectors(bio));
+   part = disk_map_sector_rcu(plo->disk, bio->bi_sector);
+   part_stat_inc(cpu, part, ios[rw]);
+   part_stat_add(cpu, part, sectors[rw], bio_sectors(bio));
part_stat_unlock();
 
if (unlikely(bio->bi_size == 0)) {


[Devel] [PATCH 3/3] netlink/diag: report flags for netlink sockets

2016-06-15 Thread Andrey Vagin
We need to know the flags for dumping and restoring netlink sockets.

All flags except NDIAG_FLAG_CB_RUNNING can be obtained with the help of
getsockopt(), but in that case we need a socket descriptor and we have to
call getsockopt() once per flag.

With these changes we will be able to show netlink socket flags from the
ss tool.

In criu, we need to know whether a callback is running at the moment or not.

When a socket has some data in its receive queue and doesn't have a running
callback, we can save all the data from the receive queue on dump and queue
it back on restore.

If a socket has a running callback, the receive queue contains only a part
of the data, and as soon as we read it, the callback will generate a new
portion. In this case, we can't be sure that all the data will not exceed
the buffer limit on restore.

For now we are going to dump and restore only sockets without a running callback.
---
 include/uapi/linux/netlink_diag.h | 10 ++
 net/netlink/af_netlink.c  |  9 -
 net/netlink/af_netlink.h  |  9 +
 net/netlink/diag.c| 28 +++-
 4 files changed, 46 insertions(+), 10 deletions(-)

diff --git a/include/uapi/linux/netlink_diag.h 
b/include/uapi/linux/netlink_diag.h
index 4e31db4..6a9108f 100644
--- a/include/uapi/linux/netlink_diag.h
+++ b/include/uapi/linux/netlink_diag.h
@@ -37,6 +37,7 @@ enum {
NETLINK_DIAG_GROUPS,
NETLINK_DIAG_RX_RING,
NETLINK_DIAG_TX_RING,
+   NETLINK_DIAG_FLAGS,
 
__NETLINK_DIAG_MAX,
 };
@@ -48,5 +49,14 @@ enum {
 #define NDIAG_SHOW_MEMINFO 0x0001 /* show memory info of a socket */
 #define NDIAG_SHOW_GROUPS  0x0002 /* show groups of a netlink socket */
 #define NDIAG_SHOW_RING_CFG0x0004 /* show ring configuration */
+#define NDIAG_SHOW_FLAGS   0x0008 /* show flags of a netlink socket */
+
+/* flags */
+#define NDIAG_FLAG_CB_RUNNING  0x0001
+#define NDIAG_FLAG_PKTINFO 0x0002
+#define NDIAG_FLAG_BROADCAST_ERROR 0x0004
+#define NDIAG_FLAG_NO_ENOBUFS  0x0008
+#define NDIAG_FLAG_LISTEN_ALL_NSID 0x0010
+#define NDIAG_FLAG_CAP_ACK 0x0020
 
 #endif
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 113e2ae..ba75f32 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -77,15 +77,6 @@ struct listeners {
 /* state bits */
 #define NETLINK_S_CONGESTED0x0
 
-/* flags */
-#define NETLINK_F_KERNEL_SOCKET0x1
-#define NETLINK_F_RECV_PKTINFO 0x2
-#define NETLINK_F_BROADCAST_SEND_ERROR 0x4
-#define NETLINK_F_RECV_NO_ENOBUFS  0x8
-#define NETLINK_F_LISTEN_ALL_NSID  0x10
-#define NETLINK_F_CAP_ACK  0x20
-#define NETLINK_F_REPAIR   0x40
-
 static inline int netlink_is_kernel(struct sock *sk)
 {
return nlk_sk(sk)->flags & NETLINK_F_KERNEL_SOCKET;
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index 577fddf..b3ce345 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -4,6 +4,15 @@
 #include 
 #include 
 
+/* flags */
+#define NETLINK_F_KERNEL_SOCKET0x1
+#define NETLINK_F_RECV_PKTINFO 0x2
+#define NETLINK_F_BROADCAST_SEND_ERROR 0x4
+#define NETLINK_F_RECV_NO_ENOBUFS  0x8
+#define NETLINK_F_LISTEN_ALL_NSID  0x10
+#define NETLINK_F_CAP_ACK  0x20
+#define NETLINK_F_REPAIR   0x40
+
 #define NLGRPSZ(x) (ALIGN(x, sizeof(unsigned long) * 8) / 8)
 #define NLGRPLONGS(x)  (NLGRPSZ(x)/sizeof(unsigned long))
 
diff --git a/net/netlink/diag.c b/net/netlink/diag.c
index de8c74a..0aa8744e 100644
--- a/net/netlink/diag.c
+++ b/net/netlink/diag.c
@@ -54,6 +54,27 @@ static int sk_diag_dump_groups(struct sock *sk, struct 
sk_buff *nlskb)
   nlk->groups);
 }
 
+static int sk_diag_put_flags(struct sock *sk, struct sk_buff *skb)
+{
+   struct netlink_sock *nlk = nlk_sk(sk);
+   u32 flags = 0;
+
+   if (nlk->cb_running)
+   flags |= NDIAG_FLAG_CB_RUNNING;
+   if (nlk->flags & NETLINK_F_RECV_PKTINFO)
+   flags |= NDIAG_FLAG_PKTINFO;
+   if (nlk->flags & NETLINK_F_BROADCAST_SEND_ERROR)
+   flags |= NDIAG_FLAG_BROADCAST_ERROR;
+   if (nlk->flags & NETLINK_F_RECV_NO_ENOBUFS)
+   flags |= NDIAG_FLAG_NO_ENOBUFS;
+   if (nlk->flags & NETLINK_F_LISTEN_ALL_NSID)
+   flags |= NDIAG_FLAG_LISTEN_ALL_NSID;
+   if (nlk->flags & NETLINK_F_CAP_ACK)
+   flags |= NDIAG_FLAG_CAP_ACK;
+
+   return nla_put_u32(skb, NETLINK_DIAG_FLAGS, flags);
+}
+
 static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
struct netlink_diag_req *req,
u32 portid, u32 seq, u32 flags, int sk_ino)
@@ -91,7 +112,12 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff 
*skb,
sk_diag_put_rings_cfg(sk, skb))
goto out_nlmsg_trim;
 
-   return nlmsg_end(skb, nlh);
+

[Devel] [PATCH 2/3] netlink: add an ability to restore messages in a receive queue

2016-06-15 Thread Andrey Vagin
This patch adds a repair mode for netlink sockets: when a socket is in
repair mode, sendmsg() queues messages into the socket's own receive queue.
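
For context, the intended userspace usage (e.g. from criu on restore) would
be roughly the sketch below; it relies only on the NETLINK_REPAIR option
added by the hunks that follow, everything else is the standard socket API:

	#include <sys/socket.h>
	#include <linux/netlink.h>

	#ifndef SOL_NETLINK
	#define SOL_NETLINK	270
	#endif
	#ifndef NETLINK_REPAIR
	#define NETLINK_REPAIR	11	/* value introduced by this patch */
	#endif

	/* re-queue one saved message into the socket's own receive queue */
	static int netlink_requeue(int fd, const void *msg, size_t len)
	{
		int on = 1, off = 0, ret;

		if (setsockopt(fd, SOL_NETLINK, NETLINK_REPAIR, &on, sizeof(on)))
			return -1;
		ret = send(fd, msg, len, 0);	/* lands in our own receive queue */
		setsockopt(fd, SOL_NETLINK, NETLINK_REPAIR, &off, sizeof(off));
		return ret < 0 ? -1 : 0;
	}
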
---
 include/uapi/linux/netlink.h | 19 ++---
 net/netlink/af_netlink.c | 51 +++-
 2 files changed, 47 insertions(+), 23 deletions(-)

diff --git a/include/uapi/linux/netlink.h b/include/uapi/linux/netlink.h
index 3e34b7d..56ddadf 100644
--- a/include/uapi/linux/netlink.h
+++ b/include/uapi/linux/netlink.h
@@ -101,14 +101,17 @@ struct nlmsgerr {
struct nlmsghdr msg;
 };
 
-#define NETLINK_ADD_MEMBERSHIP 1
-#define NETLINK_DROP_MEMBERSHIP2
-#define NETLINK_PKTINFO3
-#define NETLINK_BROADCAST_ERROR4
-#define NETLINK_NO_ENOBUFS 5
-#define NETLINK_RX_RING6
-#define NETLINK_TX_RING7
-#define NETLINK_LISTEN_ALL_NSID8
+#define NETLINK_ADD_MEMBERSHIP 1
+#define NETLINK_DROP_MEMBERSHIP2
+#define NETLINK_PKTINFO3
+#define NETLINK_BROADCAST_ERROR4
+#define NETLINK_NO_ENOBUFS 5
+#define NETLINK_RX_RING6
+#define NETLINK_TX_RING7
+#define NETLINK_LISTEN_ALL_NSID8
+#define NETLINK_LIST_MEMBERSHIPS   9
+#define NETLINK_CAP_ACK10
+#define NETLINK_REPAIR 11
 
 struct nl_pktinfo {
__u32   group;
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 79526e5..113e2ae 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -83,6 +83,8 @@ struct listeners {
 #define NETLINK_F_BROADCAST_SEND_ERROR 0x4
 #define NETLINK_F_RECV_NO_ENOBUFS  0x8
 #define NETLINK_F_LISTEN_ALL_NSID  0x10
+#define NETLINK_F_CAP_ACK  0x20
+#define NETLINK_F_REPAIR   0x40
 
 static inline int netlink_is_kernel(struct sock *sk)
 {
@@ -1744,6 +1746,7 @@ static int netlink_unicast_kernel(struct sock *sk, struct 
sk_buff *skb,
 int netlink_unicast(struct sock *ssk, struct sk_buff *skb,
u32 portid, int nonblock)
 {
+   struct netlink_sock *nlk = nlk_sk(ssk);
struct sock *sk;
int err;
long timeo;
@@ -1752,19 +1755,24 @@ int netlink_unicast(struct sock *ssk, struct sk_buff 
*skb,
 
timeo = sock_sndtimeo(ssk, nonblock);
 retry:
-   sk = netlink_getsockbyportid(ssk, portid);
-   if (IS_ERR(sk)) {
-   kfree_skb(skb);
-   return PTR_ERR(sk);
-   }
-   if (netlink_is_kernel(sk))
-   return netlink_unicast_kernel(sk, skb, ssk);
+   if (nlk->flags & NETLINK_F_REPAIR) {
+   sk = ssk;
+   sock_hold(sk);
+   } else {
+   sk = netlink_getsockbyportid(ssk, portid);
+   if (IS_ERR(sk)) {
+   kfree_skb(skb);
+   return PTR_ERR(sk);
+   }
+   if (netlink_is_kernel(sk))
+   return netlink_unicast_kernel(sk, skb, ssk);
 
-   if (sk_filter(sk, skb)) {
-   err = skb->len;
-   kfree_skb(skb);
-   sock_put(sk);
-   return err;
+   if (sk_filter(sk, skb)) {
+   err = skb->len;
+   kfree_skb(skb);
+   sock_put(sk);
+   return err;
+   }
}
 
err = netlink_attachskb(sk, skb, &timeo, ssk);
@@ -2126,6 +2134,13 @@ static int netlink_setsockopt(struct socket *sock, int 
level, int optname,
return -EFAULT;
 
switch (optname) {
+   case NETLINK_REPAIR:
+   if (val)
+   nlk->flags |= NETLINK_F_REPAIR;
+   else
+   nlk->flags &= ~NETLINK_F_REPAIR;
+   err = 0;
+   break;
case NETLINK_PKTINFO:
if (val)
nlk->flags |= NETLINK_F_RECV_PKTINFO;
@@ -2288,6 +2303,7 @@ static int netlink_sendmsg(struct kiocb *kiocb, struct 
socket *sock,
int err;
struct scm_cookie scm;
u32 netlink_skb_flags = 0;
+   bool repair = nlk->flags & NETLINK_F_REPAIR;
 
if (msg->msg_flags&MSG_OOB)
return -EOPNOTSUPP;
@@ -2307,7 +2323,8 @@ static int netlink_sendmsg(struct kiocb *kiocb, struct 
socket *sock,
dst_group = ffs(addr->nl_groups);
err =  -EPERM;
if ((dst_group || dst_portid) &&
-   !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
+   !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND &&
+   !repair))
goto out;
netlink_skb_flags |= NETLINK_SKB_DST;
} else {
@@ -2336,7 +2353,11 @@ static int netlink_sendmsg(struct kiocb *kiocb, struct 
socket *sock,
if (skb == NULL)
goto out;
 
-   NETLINK_CB(skb).portid  = nlk->portid;
+  

[Devel] [PATCH net-next 0/3] [RFC] netlink: prepare to dump and restore data from a receive queue

2016-06-15 Thread Andrey Vagin
CRIU can dump queued data for unix and tcp sockets;
now it's time for netlink sockets.

Here are the three questions.
* How to dump data from a receive queue?
  We can set a peeking offset like we do for unix sockets.

* How to restore data back into a receive queue?
  I suggest adding a repair mode like we do for tcp sockets.

* When can we dump data from a receive queue?
  I think we can do this only if a socket doesn't have a running callback.

Andrey Vagin (3):
  netlink: allow to set peeking offset for sockets
  netlink: add an ability to restore messages in a receive queue
  netlink/diag: report flags for netlink sockets

 include/uapi/linux/netlink.h  |  1 +
 include/uapi/linux/netlink_diag.h | 10 +
 net/netlink/af_netlink.c  | 82 ++-
 net/netlink/af_netlink.h  |  9 +
 net/netlink/diag.c| 25 
 5 files changed, 99 insertions(+), 28 deletions(-)

-- 
2.5.5



[Devel] [PATCH 1/3] netlink: allow to set peeking offset for sockets

2016-06-15 Thread Andrey Vagin
This allows us to read a socket's queue without removing skbs from it.

The same logic is already implemented for unix and inet sockets, and we use
it to dump and restore sockets in CRIU.

There is a question whether sk_peek_off has to be protected by locks.
Currently it isn't protected, and a user of sk_peek_off has to make sure
that nobody else calls recvmsg() on the socket.
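
For reference, the dumper-side usage this enables looks roughly like the
sketch below (mirroring what criu already does for unix sockets; SO_PEEK_OFF
is the existing generic socket option, this patch merely wires it up for
netlink):

	#include <sys/socket.h>

	/* peek at queued data without draining the receive queue; each
	 * MSG_PEEK read advances sk_peek_off, so repeated calls walk
	 * through the queue */
	static ssize_t netlink_peek(int fd, void *buf, size_t len)
	{
		int off = 0;	/* 0 enables peek-offset tracking (-1 disables it) */

		if (setsockopt(fd, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off)))
			return -1;
		return recv(fd, buf, len, MSG_PEEK | MSG_DONTWAIT);
	}
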
---
 net/netlink/af_netlink.c | 24 +++-
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index ad65bdd..79526e5 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2372,17 +2372,18 @@ static int netlink_recvmsg(struct kiocb *kiocb, struct 
socket *sock,
struct scm_cookie scm;
struct sock *sk = sock->sk;
struct netlink_sock *nlk = nlk_sk(sk);
-   int noblock = flags&MSG_DONTWAIT;
size_t copied;
struct sk_buff *skb, *data_skb;
+   int peeked, skip;
int err, ret;
 
if (flags&MSG_OOB)
return -EOPNOTSUPP;
 
copied = 0;
+   skip = sk_peek_offset(sk, flags);
 
-   skb = skb_recv_datagram(sk, flags, noblock, &err);
+   skb = __skb_recv_datagram(sk, flags, &peeked, &skip, &err);
if (skb == NULL)
goto out;
 
@@ -2410,14 +2411,19 @@ static int netlink_recvmsg(struct kiocb *kiocb, struct 
socket *sock,
nlk->max_recvmsg_len = min_t(size_t, nlk->max_recvmsg_len,
 16384);
 
-   copied = data_skb->len;
+   copied = data_skb->len - skip;
if (len < copied) {
msg->msg_flags |= MSG_TRUNC;
copied = len;
}
 
skb_reset_transport_header(data_skb);
-   err = skb_copy_datagram_iovec(data_skb, 0, msg->msg_iov, copied);
+   err = skb_copy_datagram_iovec(data_skb, skip, msg->msg_iov, copied);
+
+   if (flags & MSG_PEEK)
+   sk_peek_offset_fwd(sk, copied);
+   else
+   sk_peek_offset_bwd(sk, skb->len);
 
if (msg->msg_name) {
struct sockaddr_nl *addr = (struct sockaddr_nl *)msg->msg_name;
@@ -2439,7 +2445,7 @@ static int netlink_recvmsg(struct kiocb *kiocb, struct 
socket *sock,
}
siocb->scm->creds = *NETLINK_CREDS(skb);
if (flags & MSG_TRUNC)
-   copied = data_skb->len;
+   copied = data_skb->len - skip;
 
skb_free_datagram(sk, skb);
 
@@ -3086,6 +3092,13 @@ int netlink_unregister_notifier(struct notifier_block 
*nb)
 }
 EXPORT_SYMBOL(netlink_unregister_notifier);
 
+static int netlink_set_peek_off(struct sock *sk, int val)
+{
+   sk->sk_peek_off = val;
+
+   return 0;
+}
+
 static const struct proto_ops netlink_ops = {
.family =   PF_NETLINK,
.owner =THIS_MODULE,
@@ -3105,6 +3118,7 @@ static const struct proto_ops netlink_ops = {
.recvmsg =  netlink_recvmsg,
.mmap = netlink_mmap,
.sendpage = sock_no_sendpage,
+   .set_peek_off = netlink_set_peek_off,
 };
 
 static const struct net_proto_family netlink_family_ops = {
-- 
2.5.5
