Re: [Devel] [PATCH rh7] ploop: io_direct: delay f_op->fsync() until index_update for reloc requests

2016-07-06 Thread Maxim Patlasov

Dima,

On 07/06/2016 04:58 AM, Dmitry Monakhov wrote:


Maxim Patlasov  writes:


Commit 9f860e606 introduced an engine to delay fsync: after doing
fallocate(FALLOC_FL_CONVERT_UNWRITTEN), dio_post_submit marks the
io as PLOOP_IO_FSYNC_DELAYED to ensure that the fsync happens
later, when an incoming FLUSH|FUA arrives.

That was deemed important because (PSBM-47026):


This optimization becomes more important due to the fact that customers tend to 
use pcompact heavily => ploop images grow each day.

Now, we can easily re-use the engine to delay fsync for reloc
requests as well. As explained in the description of commit
5aa3fe09:


 1->read_data_from_old_pos
 2->write_to_new_pos
   ->submit_alloc
     ->submit_pad
     ->post_submit->convert_unwritten
 3->update_index ->write_page with FLUSH|FUA
 4->nullify_old_pos
 5->issue_flush

by the time of step 3 the extent conversion is not yet stable because
it belongs to an uncommitted transaction. But instead of doing the fsync
inside ->post_submit, we can fsync later, as the very first step
of write_page for index_update.

NAK from me. What is the advantage of this patch?


The advantage is the following: in the case of BAT multi-updates, instead of 
doing many fsync-s (one per dio_post_submit), we'll do only one (when the 
final ->write_page is called).
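
Schematically, the idea looks like this (a simplified sketch of the idea,
not the patch itself; the helpers mark_conversion() and write_index_page()
are invented here for illustration):

/*
 * Simplified sketch of the delayed-fsync idea; not the actual ploop code.
 * The helper names (mark_conversion, write_index_page) are invented here.
 */
static void mark_conversion(struct ploop_io *io)
{
	/* dio_post_submit(): after converting the unwritten extent,
	 * only mark the io instead of calling f_op->fsync() right away. */
	set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
}

static void write_index_page(struct ploop_io *io, struct ploop_request *preq,
			     struct page *page, sector_t sec)
{
	/* One fsync here covers every conversion marked above, so N BAT
	 * updates in flight cost a single fsync instead of N of them. */
	if (test_and_clear_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state))
		io->ops->sync(io);

	dio_io_page(io, WRITE | REQ_SYNC | REQ_FUA, preq, page, sec);
}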



Does it make the code more optimal? No.


Yes, it does. In the same sense as 9f860e606: saving some fsync-s.


Does it make the main ploop more asynchronous? No.


Correct, the patch optimizes ploop in a different way. It's not about 
making ploop more asynchronous.





If you want to make an optimization, then it is reasonable to
queue a preq with PLOOP_IO_FSYNC_DELAYED to top_io->fsync_queue
before processing the PLOOP_E_DATA_WBI state for a preq with FUA.
So the sequence would look as follows:
->submit_alloc
   ->submit_pad
   ->post_submit->convert_unwritten -> tag PLOOP_IO_FSYNC_DELAYED
->ploop_req_state_process
   case PLOOP_E_DATA_WBI:
   if (preq->start & PLOOP_IO_FSYNC_DELAYED_FL) {
           preq->start &= ~PLOOP_IO_FSYNC_DELAYED_FL
           list_add_tail(&preq->list, &top_io->fsync_queue)
           return;
   }
## Let fsync_thread do its work
->ploop_req_state_process
   case PLOOP_E_DATA_WBI:
   update_index->write_page with FUA (FLUSH is not required because we have
   already done the fsync)


That's another type of optimization: making ploop more asynchronous. I 
thought about it, but didn't come to a conclusion on whether it's worth 
the extra complexity in the ploop state machine and the possible bugs 
introduced with that.


Thanks,
Maxim




https://jira.sw.ru/browse/PSBM-47026

Signed-off-by: Maxim Patlasov 
---
  drivers/block/ploop/dev.c   |4 ++--
  drivers/block/ploop/io_direct.c |   25 -
  drivers/block/ploop/io_kaio.c   |3 ++-
  drivers/block/ploop/map.c   |   17 -
  include/linux/ploop/ploop.h |3 ++-
  5 files changed, 42 insertions(+), 10 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index e5f010b..40768b6 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4097,7 +4097,7 @@ static void ploop_relocate(struct ploop_device * plo)
preq->bl.tail = preq->bl.head = NULL;
preq->req_cluster = 0;
preq->req_size = 0;
-   preq->req_rw = WRITE_SYNC|REQ_FUA;
+   preq->req_rw = WRITE_SYNC;
preq->eng_state = PLOOP_E_ENTRY;
preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_A);
preq->error = 0;
@@ -4401,7 +4401,7 @@ static void ploop_relocblks_process(struct ploop_device 
*plo)
preq->bl.tail = preq->bl.head = NULL;
preq->req_cluster = ~0U; /* uninitialized */
preq->req_size = 0;
-   preq->req_rw = WRITE_SYNC|REQ_FUA;
+   preq->req_rw = WRITE_SYNC;
preq->eng_state = PLOOP_E_ENTRY;
preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_S);
preq->error = 0;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 1086850..0a5fb15 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -1494,13 +1494,36 @@ dio_read_page(struct ploop_io * io, struct 
ploop_request * preq,
  
  static void

  dio_write_page(struct ploop_io * io, struct ploop_request * preq,
-  struct page * page, sector_t sec, unsigned long rw)
+  struct page * page, sector_t sec, unsigned long rw,
+  int do_fsync_if_delayed)
  {
if (!(io->files.file->f_mode & FMODE_WRITE)) {
PLOOP_FAIL_REQUEST(preq, -EBADF);
return;
}
  
+	if (do_fsync_if_delayed &&

+   test_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state)) {
+   struct ploop_device * plo = io->plo;
+   u64 io_count;
+   int err;
+
+   spin_lock_irq(&plo->lock);
+   io_count = io->io_count;
+   spin_unlock_irq(&plo->lock)

Re: [Devel] [PATCH rh7] fs: make overlayfs disabled in CT by default

2016-07-06 Thread Maxim Patlasov

On 07/06/2016 02:26 AM, Vladimir Davydov wrote:


On Tue, Jul 05, 2016 at 04:45:10PM -0700, Maxim Patlasov wrote:

Vova,


On 07/04/2016 11:03 AM, Maxim Patlasov wrote:

On 07/04/2016 08:53 AM, Vladimir Davydov wrote:


On Tue, Jun 28, 2016 at 03:48:54PM -0700, Maxim Patlasov wrote:
...

@@ -643,6 +643,7 @@ static struct cgroup_subsys_state
*ve_create(struct cgroup *cg)
ve->odirect_enable = 2;
  ve->fsync_enable = 2;
+ve->experimental_fs_enable = 2;

For odirect_enable and fsync_enable, 2 means follow the host's config, 1
means enable unconditionally, and 0 means disable unconditionally. But
we don't want to allow a user inside a CT to enable this feature, right?

I thought it's OK to allow a user inside a CT to enable it if the host sysadmin
is OK with it. The same logic as for odirect: by default
ve0->experimental_fs_enable = 0, so whatever a user inside a CT writes to this
knob, the feature is disabled. If the sysadmin writes '1' to ve0->..., the
feature becomes enabled. If a user wants to voluntarily disable it inside the
CT, that's OK too.


This is confusing. Maybe we'd better add a new VE_FEATURE for the
purpose?

Not sure right now. I'll look at it and let you know later.

Technically, it is very easy to implement a new VE_FEATURE for overlayfs. But
this approach is less flexible because we return EPERM from ve_write_64 if the
CT is running, and we'd need to involve the userspace team to make the feature
configurable and (possibly) persistent. Do you think it's worth it for
something we'll get rid of soon anyway (I mean as soon as PSBM-47981 is
resolved)?

Fair enough, not much point in introducing yet another feature for the
purpose, at least right now; a sysctl should do for the beginning.

Come to think of it, do we really need this sysctl inside containers? I
mean, by enabling this sysctl on the host we open a possible system-wide
security hole, which a CT admin won't be able to mitigate by disabling
overlayfs inside her CT. So what would she need it for? To prevent
non-privileged CT users from mounting overlayfs inside a user ns? But
overlayfs is not permitted to be mounted by a userns root anyway, AFAICS.
Maybe just drop the in-CT sysctl then?


Currently, anyone who can log into a CT as root may mount overlayfs and 
then try to exploit its weaknesses. This is a problem.


Until we ensure that overlayfs is production-ready (at least that it does not 
have obvious breaches), let's disable it by default (of course, only if ve != 
ve0). Those who want to play with overlayfs at their own risk can enable it 
by turning on a knob on the host system (ve == ve0).


I don't think that mixing trusted (overlayfs-enabled) CTs and untrusted 
(overlayfs-disabled) CTs on the same physical node is an important use case 
for now, so any simple system-wide knob should work. Essentially, it's the 
same scheme as with odirect: by default it is '0' in ve0 and the root inside 
a CT cannot turn it on; if it is manually set to '1' in ve0, the behavior 
depends on what the per-CT root wants.
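
In pseudo-code, the intended semantics would be something like this (a
minimal sketch only, not the actual patch; host_val stands for
ve0->experimental_fs_enable and ct_val for the per-CT knob):

static bool experimental_fs_allowed(int host_val, int ct_val)
{
	if (!host_val)		/* host-wide master switch off: always denied */
		return false;
	if (ct_val == 0)	/* the CT root opted out voluntarily */
		return false;
	return true;		/* 1 = opt in, 2 (default) = follow the host */
}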


Thanks,
Maxim


Re: [Devel] [PATCH rh7] ploop: io_direct: delay f_op->fsync() until index_update for reloc requests

2016-07-06 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> Commit 9f860e606 introduced an engine to delay fsync: after doing
> fallocate(FALLOC_FL_CONVERT_UNWRITTEN), dio_post_submit marks the
> io as PLOOP_IO_FSYNC_DELAYED to ensure that the fsync happens
> later, when an incoming FLUSH|FUA arrives.
>
> That was deemed important because (PSBM-47026):
>
>> This optimization becomes more important due to the fact that customers tend 
>> to use pcompact heavily => ploop images grow each day.
>
> Now, we can easily re-use the engine to delay fsync for reloc
> requests as well. As explained in the description of commit
> 5aa3fe09:
>
>> 1->read_data_from_old_pos
>> 2->write_to_new_pos
>>   ->submit_alloc
>>     ->submit_pad
>>     ->post_submit->convert_unwritten
>> 3->update_index ->write_page with FLUSH|FUA
>> 4->nullify_old_pos
>> 5->issue_flush
>
> by the time of step 3 the extent conversion is not yet stable because
> it belongs to an uncommitted transaction. But instead of doing the fsync
> inside ->post_submit, we can fsync later, as the very first step
> of write_page for index_update.
NAK from me. What is the advantage of this patch?
Does it make the code more optimal? No.
Does it make the main ploop more asynchronous? No.

If you want to make an optimization, then it is reasonable to
queue a preq with PLOOP_IO_FSYNC_DELAYED to top_io->fsync_queue
before processing the PLOOP_E_DATA_WBI state for a preq with FUA.
So the sequence would look as follows:
->submit_alloc
  ->submit_pad
  ->post_submit->convert_unwritten -> tag PLOOP_IO_FSYNC_DELAYED
->ploop_req_state_process
  case PLOOP_E_DATA_WBI:
  if (preq->start & PLOOP_IO_FSYNC_DELAYED_FL) {
          preq->start &= ~PLOOP_IO_FSYNC_DELAYED_FL
          list_add_tail(&preq->list, &top_io->fsync_queue)
          return;
  }
## Let fsync_thread do its work
->ploop_req_state_process
   case PLOOP_E_DATA_WBI:
   update_index->write_page with FUA (FLUSH is not required because we have
   already done the fsync)

>
> https://jira.sw.ru/browse/PSBM-47026
>
> Signed-off-by: Maxim Patlasov 
> ---
>  drivers/block/ploop/dev.c   |4 ++--
>  drivers/block/ploop/io_direct.c |   25 -
>  drivers/block/ploop/io_kaio.c   |3 ++-
>  drivers/block/ploop/map.c   |   17 -
>  include/linux/ploop/ploop.h |3 ++-
>  5 files changed, 42 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
> index e5f010b..40768b6 100644
> --- a/drivers/block/ploop/dev.c
> +++ b/drivers/block/ploop/dev.c
> @@ -4097,7 +4097,7 @@ static void ploop_relocate(struct ploop_device * plo)
>   preq->bl.tail = preq->bl.head = NULL;
>   preq->req_cluster = 0;
>   preq->req_size = 0;
> - preq->req_rw = WRITE_SYNC|REQ_FUA;
> + preq->req_rw = WRITE_SYNC;
>   preq->eng_state = PLOOP_E_ENTRY;
>   preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_A);
>   preq->error = 0;
> @@ -4401,7 +4401,7 @@ static void ploop_relocblks_process(struct ploop_device 
> *plo)
>   preq->bl.tail = preq->bl.head = NULL;
>   preq->req_cluster = ~0U; /* uninitialized */
>   preq->req_size = 0;
> - preq->req_rw = WRITE_SYNC|REQ_FUA;
> + preq->req_rw = WRITE_SYNC;
>   preq->eng_state = PLOOP_E_ENTRY;
>   preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_S);
>   preq->error = 0;
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index 1086850..0a5fb15 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -1494,13 +1494,36 @@ dio_read_page(struct ploop_io * io, struct 
> ploop_request * preq,
>  
>  static void
>  dio_write_page(struct ploop_io * io, struct ploop_request * preq,
> -struct page * page, sector_t sec, unsigned long rw)
> +struct page * page, sector_t sec, unsigned long rw,
> +int do_fsync_if_delayed)
>  {
>   if (!(io->files.file->f_mode & FMODE_WRITE)) {
>   PLOOP_FAIL_REQUEST(preq, -EBADF);
>   return;
>   }
>  
> + if (do_fsync_if_delayed &&
> + test_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state)) {
> + struct ploop_device * plo = io->plo;
> + u64 io_count;
> + int err;
> +
> + spin_lock_irq(&plo->lock);
> + io_count = io->io_count;
> + spin_unlock_irq(&plo->lock);
> +
> + err = io->ops->sync(io);
> + if (err) {
> + PLOOP_FAIL_REQUEST(preq, -EBADF);
> + return;
> + }
> +
> + spin_lock_irq(&plo->lock);
> + if (io_count == io->io_count && !(io_count & 1))
> + clear_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
> + spin_unlock_irq(&plo->lock);
> + }
> +
>   dio_io_page(io, rw | WRITE | REQ_SYNC, preq, page, sec);
>  }
>  
> diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
> in

[Devel] [PATCH] propogate_mnt: Handle the first propogated copy being a slave

2016-07-06 Thread Vladimir Davydov
From: "Eric W. Biederman" 

When the first propagated copy was a slave the following oops would result:
> BUG: unable to handle kernel NULL pointer dereference at 0010
> IP: [] propagate_one+0xbe/0x1c0
> PGD bacd4067 PUD bac66067 PMD 0
> Oops:  [#1] SMP
> Modules linked in:
> CPU: 1 PID: 824 Comm: mount Not tainted 4.6.0-rc5userns+ #1523
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> task: 8800bb0a8000 ti: 8800bac3c000 task.ti: 8800bac3c000
> RIP: 0010:[]  [] propagate_one+0xbe/0x1c0
> RSP: 0018:8800bac3fd38  EFLAGS: 00010283
> RAX:  RBX: 8800bb77ec00 RCX: 0010
> RDX:  RSI: 8800bb58c000 RDI: 8800bb58c480
> RBP: 8800bac3fd48 R08: 0001 R09: 
> R10: 1ca1 R11: 1c9d R12: 
> R13: 8800ba713800 R14: 8800bac3fda0 R15: 8800bb77ec00
> FS:  7f3c0cd9b7e0() GS:8800bfb0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 0010 CR3: bb79d000 CR4: 06e0
> Stack:
>  8800bb77ec00  8800bac3fd88 811fbf85
>  8800bac3fd98 8800bb77f080 8800ba713800 8800bb262b40
>    8800bac3fdd8 811f1da0
> Call Trace:
>  [] propagate_mnt+0x105/0x140
>  [] attach_recursive_mnt+0x120/0x1e0
>  [] graft_tree+0x63/0x70
>  [] do_add_mount+0x9b/0x100
>  [] do_mount+0x2aa/0xdf0
>  [] ? strndup_user+0x4e/0x70
>  [] SyS_mount+0x75/0xc0
>  [] do_syscall_64+0x4b/0xa0
>  [] entry_SYSCALL64_slow_path+0x25/0x25
> Code: 00 00 75 ec 48 89 0d 02 22 22 01 8b 89 10 01 00 00 48 89 05 fd 21 22 01 
> 39 8e 10 01 00 00 0f 84 e0 00 00 00 48 8b 80 d8 00 00 00 <48> 8b 50 10 48 89 
> 05 df 21 22 01 48 89 15 d0 21 22 01 8b 53 30
> RIP  [] propagate_one+0xbe/0x1c0
>  RSP 
> CR2: 0010
> ---[ end trace 2725ecd95164f217 ]---

This oops happens with the namespace_sem held and can be triggered by
non-root users.  An all around not pleasant experience.

To avoid this scenario, when finding the appropriate source mount to
copy, stop the walk up the mnt_master chain when the first source mount
is encountered.

Further rewrite the walk up the last_source mnt_master chain so that
it is clear what is going on.

The reason why the first source mount is special is that its
mnt_parent is not a mount in the dest_mnt propagation tree, and as
such the termination conditions based upon the dest_mnt mount propagation
tree do not make sense.

To avoid other kinds of confusion last_dest is not changed when
computing last_source.  last_dest is only used once in propagate_one
and that is above the point of the code being modified, so changing
the global variable is meaningless and confusing.

Cc: sta...@vger.kernel.org
fixes: f2ebb3a921c1ca1e2ddd9242e95a1989a50c4c68 ("smarter propagate_mnt()")
Reported-by: Tycho Andersen 
Reviewed-by: Seth Forshee 
Tested-by: Seth Forshee 
Signed-off-by: "Eric W. Biederman" 
(cherry picked from commit 5ec0811d30378ae104f250bfc9b3640242d81e3f)
Signed-off-by: Vladimir Davydov 
Fixes: CVE-2016-4581

Conflicts:
fs/pnode.c
---
 fs/pnode.c | 28 +++-
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/pnode.c b/fs/pnode.c
index 74f10c0e7e00..cc9ac074ba00 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -198,7 +198,7 @@ static struct mount *next_group(struct mount *m, struct 
mount *origin)
 
 /* all accesses are serialized by namespace_sem */
 static struct user_namespace *user_ns;
-static struct mount *last_dest, *last_source, *dest_master;
+static struct mount *last_dest, *first_source, *last_source, *dest_master;
 static struct mountpoint *mp;
 static struct list_head *list;
 
@@ -216,22 +216,23 @@ static int propagate_one(struct mount *m)
type = CL_MAKE_SHARED;
} else {
struct mount *n, *p;
+   bool done;
for (n = m; ; n = p) {
p = n->mnt_master;
-   if (p == dest_master || IS_MNT_MARKED(p)) {
-   while (last_dest->mnt_master != p) {
-   last_source = last_source->mnt_master;
-   last_dest = last_source->mnt_parent;
-   }
-   if (n->mnt_group_id != last_dest->mnt_group_id 
||
-   (!n->mnt_group_id &&
-!last_dest->mnt_group_id)) {
-   last_source = last_source->mnt_master;
-   last_dest = last_source->mnt_parent;
-   }
+   if (p == dest_master || IS_MNT_MARKED(p))
break;
-   }
}
+   do {
+   struct mount *parent = last_source-
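
For reference, the rewritten walk in the upstream commit 5ec0811d reads
roughly as follows (a sketch from the upstream tree; the vz7 backport may
differ slightly, and first_source is initialized to source_mnt in
propagate_mnt()):

		/* Walk last_source up towards its master, but stop at the
		 * very first copy and when the parent reaches peer group p. */
		do {
			struct mount *parent = last_source->mnt_parent;
			if (last_source == first_source)
				break;
			done = parent->mnt_master == p;
			if (done && peers(n, parent))
				break;
			last_source = last_source->mnt_master;
		} while (!done);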

Re: [Devel] [PATCH rh7] fs: make overlayfs disabled in CT by default

2016-07-06 Thread Vladimir Davydov
On Tue, Jul 05, 2016 at 04:45:10PM -0700, Maxim Patlasov wrote:
> Vova,
> 
> 
> On 07/04/2016 11:03 AM, Maxim Patlasov wrote:
> >On 07/04/2016 08:53 AM, Vladimir Davydov wrote:
> >
> >>On Tue, Jun 28, 2016 at 03:48:54PM -0700, Maxim Patlasov wrote:
> >>...
> >>>@@ -643,6 +643,7 @@ static struct cgroup_subsys_state
> >>>*ve_create(struct cgroup *cg)
> >>>ve->odirect_enable = 2;
> >>>  ve->fsync_enable = 2;
> >>>+ve->experimental_fs_enable = 2;
> >>For odirect_enable and fsync_enable, 2 means follow the host's config, 1
> >>means enable unconditionally, and 0 means disable unconditionally. But
> >>we don't want to allow a user inside a CT to enable this feature, right?
> >
> >I thought it's OK to allow a user inside a CT to enable it if the host
> >sysadmin is OK with it. The same logic as for odirect: by default
> >ve0->experimental_fs_enable = 0, so whatever a user inside a CT writes to
> >this knob, the feature is disabled. If the sysadmin writes '1' to ve0->...,
> >the feature becomes enabled. If a user wants to voluntarily disable it
> >inside the CT, that's OK too.
> >
> >>This is confusing. Maybe we'd better add a new VE_FEATURE for the
> >>purpose?
> >
> >Not sure right now. I'll look at it and let you know later.
> 
> Technically, it is very easy to implement a new VE_FEATURE for overlayfs. But
> this approach is less flexible because we return EPERM from ve_write_64 if the
> CT is running, and we'd need to involve the userspace team to make the feature
> configurable and (possibly) persistent. Do you think it's worth it for
> something we'll get rid of soon anyway (I mean as soon as PSBM-47981 is
> resolved)?

Fair enough, not much point in introducing yet another feature for the
purpose, at least right now; a sysctl should do for the beginning.

Come to think of it, do we really need this sysctl inside containers? I
mean, by enabling this sysctl on the host we open a possible system-wide
security hole, which a CT admin won't be able to mitigate by disabling
overlayfs inside her CT. So what would she need it for? To prevent
non-privileged CT users from mounting overlayfs inside a user ns? But
overlayfs is not permitted to be mounted by a userns root anyway, AFAICS.
Maybe just drop the in-CT sysctl then?


Re: [Devel] [PATCH v2] netfilter: nf_log: fix error on write NONE to logger choice sysctl

2016-07-06 Thread Pablo Neira Ayuso
On Fri, Jul 01, 2016 at 04:53:54PM +0300, Pavel Tikhomirov wrote:
> It is hard to unbind nf-logger:
> 
>   echo NONE > /proc/sys/net/netfilter/nf_log/0
>   bash: echo: write error: No such file or directory
> 
>   sysctl -w net.netfilter.nf_log.0=NONE
>   sysctl: setting key "net.netfilter.nf_log.0": No such file or directory
>   net.netfilter.nf_log.0 = NONE
> 
> You need to explicitly send a '\0', for instance like this:
> 
>   echo -e "NONE\0" > /proc/sys/net/netfilter/nf_log/0
> 
> That seems strange, so fix it using proc_dostring.
> 
> Now it works fine:
>modprobe nfnetlink_log
>echo nfnetlink_log > /proc/sys/net/netfilter/nf_log/0
>cat /proc/sys/net/netfilter/nf_log/0
>nfnetlink_log
>echo NONE > /proc/sys/net/netfilter/nf_log/0
>cat /proc/sys/net/netfilter/nf_log/0
>NONE

Applied, thanks.
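
For context, the generic wiring such a fix relies on: a sysctl entry whose
.proc_handler is proc_dostring() accepts a newline-terminated value, so a
plain "echo NONE" works without an explicit '\0'. A minimal sketch (not the
actual nf_log table, which parses the written logger name in its own
handler):

#include <linux/sysctl.h>

/* Sketch only: shows the proc_dostring wiring that makes a plain
 * "echo NONE > /proc/sys/..." (newline-terminated) parse cleanly. */
static char nf_log_choice[64] = "NONE";

static struct ctl_table nf_log_sketch_table[] = {
	{
		.procname	= "0",
		.data		= nf_log_choice,
		.maxlen		= sizeof(nf_log_choice),
		.mode		= 0644,
		.proc_handler	= proc_dostring,
	},
	{ }
};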