Re: [Devel] [PATCH rh7 0/3] ext4: speedup shrinking non-delay extents

2018-04-13 Thread Dmitry Monakhov
Konstantin Khorenko  writes:

> We faced a situation where all (32) CPUs on a node contend on sbi->s_es_lock
> while shrinking extents on a single superblock, and
> shrinking extents goes very slowly (180 sec on average!).
>
> crash> struct ext4_sb_info 0x882fcb7ca800 -p
>
>   s_es_nr_inode = 3173832,
>   s_es_stats = {
> es_stats_shrunk = 70,
> es_stats_cache_hits = 35182748,
> es_stats_cache_misses = 2622931,
> es_stats_scan_time = 182642303461,
> es_stats_max_scan_time = 276290979674,
>
> This patchset speeds up parallel shrink a bit.
> If we find out this is not enough, the next step is to limit the number of
> shrinkers working on a single superblock in parallel.
>
> https://jira.sw.ru/browse/PSBM-83335
>
> Jan Kara (1):
>   ms/ext4: move handling of list of shrinkable inodes into extent status
> code
>
> Konstantin Khorenko (1):
>   ext4: don't iterate over sbi->s_es_list more than the number of
> elements
>
> Waiman Long (1):
>   ext4: Make cache hits/misses per-cpu counts
ACK.
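If the contention is still a problem after this, a minimal sketch of capping
the number of concurrent shrinkers per superblock could look like the
following (s_es_nr_shrinkers is a hypothetical atomic_t field, illustration
only):

	static unsigned long ext4_es_scan(struct shrinker *shrink,
					  struct shrink_control *sc)
	{
		struct ext4_sb_info *sbi = container_of(shrink,
					struct ext4_sb_info, s_es_shrinker);
		unsigned long nr_shrunk = 0;

		/* Back off if enough shrinkers already work on this sb. */
		if (atomic_inc_return(&sbi->s_es_nr_shrinkers) > 2) {
			atomic_dec(&sbi->s_es_nr_shrinkers);
			return SHRINK_STOP;
		}

		/* ... scan sbi->s_es_list under s_es_lock as before ... */

		atomic_dec(&sbi->s_es_nr_shrinkers);
		return nr_shrunk;
	}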
>
>  fs/ext4/extents.c|  2 --
>  fs/ext4/extents_status.c | 56 +---
>  fs/ext4/extents_status.h |  6 ++
>  fs/ext4/inode.c  |  2 --
>  fs/ext4/ioctl.c  |  2 --
>  fs/ext4/super.c  |  1 -
>  6 files changed, 45 insertions(+), 24 deletions(-)
>
> -- 
> 2.15.1




Re: [Devel] [PATCH] ext4: release leaked posix acl in ext4_acl_chmod

2018-02-07 Thread Dmitry Monakhov
Stanislav Kinsburskiy  writes:

> Note: only rh7-3.10.0-693.17.1.el7-based kernels are affected.
> I.e. starting from rh7-3.10.0-693.17.1.vz7.43.1.
>
> A posix acl object is used to convert an extended attribute, provided by the
> user, into ext4 attributes, in particular into i_mode in case of an
> ACL_TYPE_ACCESS request.
> IOW, this object is allocated, used for the conversion, not stored anywhere,
> and must be freed.
> However, posix_acl_update_mode() can zero the pointer to support the
> ext4_set_acl() logic, and then the object is leaked.
> So fix it by releasing a new temporary pointer with the same value instead
> of the acl pointer.
>
> In scope of https://jira.sw.ru/browse/PSBM-81384
>
> RHEL bug URL: https://bugzilla.redhat.com/show_bug.cgi?id=1543020
ACK.
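For the record, the pattern being fixed boils down to this (simplified, names
as in the patch):

	acl = posix_acl_from_xattr(&init_user_ns, value, size); /* takes a ref */
	real_acl = acl;
	/* posix_acl_update_mode() may set acl to NULL through its third arg: */
	error = posix_acl_update_mode(inode, &inode->i_mode, &acl);
	/* old code: posix_acl_release(acl)      - no-op once acl is NULL, leak */
	/* new code: posix_acl_release(real_acl) - always drops the reference  */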
>
> Signed-off-by: Stanislav Kinsburskiy 
> ---
>  fs/ext4/acl.c |6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
> index f8a38a2..046b338 100644
> --- a/fs/ext4/acl.c
> +++ b/fs/ext4/acl.c
> @@ -297,7 +297,7 @@ ext4_init_acl(handle_t *handle, struct inode *inode, 
> struct inode *dir)
>  int
>  ext4_acl_chmod(struct inode *inode)
>  {
> - struct posix_acl *acl;
> + struct posix_acl *acl, *real_acl;
>   handle_t *handle;
>   int retries = 0;
>   int error;
> @@ -315,6 +315,8 @@ ext4_acl_chmod(struct inode *inode)
>   error = posix_acl_chmod(&acl, GFP_KERNEL, inode->i_mode);
>   if (error)
>   return error;
> +
> + real_acl = acl;
>  retry:
>   handle = ext4_journal_start(inode, EXT4_HT_XATTR,
>   ext4_jbd2_credits_xattr(inode));
> @@ -341,7 +343,7 @@ ext4_acl_chmod(struct inode *inode)
>   ext4_should_retry_alloc(inode->i_sb, &retries))
>   goto retry;
>  out:
> - posix_acl_release(acl);
> + posix_acl_release(real_acl);
>   return error;
>  }
>  




Re: [Devel] [PATCH v2] ext4: release leaked posix acl in ext4_xattr_set_acl

2018-02-07 Thread Dmitry Monakhov
Stanislav Kinsburskiy  writes:

> Note: only rh7-3.10.0-693.17.1.el7-based kernels are affected.
> I.e. starting from rh7-3.10.0-693.17.1.vz7.43.1.
>
> A posix acl object is used to convert an extended attribute, provided by the
> user, into ext4 attributes, in particular into i_mode in case of an
> ACL_TYPE_ACCESS request.
> IOW, this object is allocated, used for the conversion, not stored anywhere, and
> must be freed.
> However, posix_acl_update_mode() can zero the pointer to support the
> ext4_set_acl() logic, and then the object is leaked.
> So fix it by releasing a new temporary pointer with the same value instead of
> the acl pointer.
>
> https://jira.sw.ru/browse/PSBM-81384
>
> RHEL bug URL: https://bugzilla.redhat.com/show_bug.cgi?id=1543020
>
> v2: Added affected kernel version + RHEL bug URL
ACK.
>
> Signed-off-by: Stanislav Kinsburskiy 
> ---
>  fs/ext4/acl.c |8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
> index 917e819..f8a38a2 100644
> --- a/fs/ext4/acl.c
> +++ b/fs/ext4/acl.c
> @@ -403,7 +403,7 @@ ext4_xattr_set_acl(struct dentry *dentry, const char 
> *name, const void *value,
>  {
>   struct inode *inode = dentry->d_inode;
>   handle_t *handle;
> - struct posix_acl *acl;
> + struct posix_acl *acl, *real_acl;
>   int error, retries = 0;
>   int update_mode = 0;
>   umode_t mode = inode->i_mode;
> @@ -416,7 +416,7 @@ ext4_xattr_set_acl(struct dentry *dentry, const char 
> *name, const void *value,
>   return -EPERM;
>  
>   if (value) {
> - acl = posix_acl_from_xattr(&init_user_ns, value, size);
> + acl = real_acl = posix_acl_from_xattr(&init_user_ns, value,
> size);
>   if (IS_ERR(acl))
>   return PTR_ERR(acl);
>   else if (acl) {
> @@ -425,7 +425,7 @@ ext4_xattr_set_acl(struct dentry *dentry, const char 
> *name, const void *value,
>   goto release_and_out;
>   }
>   } else
> - acl = NULL;
> + acl = real_acl = NULL;
>  
>  retry:
>   handle = ext4_journal_start(inode, EXT4_HT_XATTR,
> @@ -452,7 +452,7 @@ ext4_xattr_set_acl(struct dentry *dentry, const char 
> *name, const void *value,
>   goto retry;
>  
>  release_and_out:
> - posix_acl_release(acl);
> + posix_acl_release(real_acl);
>   return error;
>  }
>  




Re: [Devel] [PATCH] ext4: release leaked posix acl in ext4_xattr_set_acl

2018-02-07 Thread Dmitry Monakhov
Stanislav Kinsburskiy  writes:

> A posix acl object is used to convert an extended attribute, provided by the
> user, into ext4 attributes, in particular into i_mode in case of an
> ACL_TYPE_ACCESS request.
> IOW, this object is allocated, used for the conversion, not stored anywhere, and
> must be freed.
> However, posix_acl_update_mode() can zero the pointer to support the
> ext4_set_acl() logic, and then the object is leaked.
> So fix it by releasing a new temporary pointer with the same value instead of
> the acl pointer.
So you are telling me that:
ext4_xattr_set_acl
L1 acl = posix_acl_from_xattr
L2 -> ext4_set_acl(handle, inode, type, acl)
L3 -> posix_acl_update_mode(inode, &inode->i_mode, &acl)
  *acl = NULL;
  You are saying that the instruction above can affect the value at L1?
  HOW? acl is passed to ext4_set_acl() by value, so
  posix_acl_update_mode() can affect the value only at L2 and L3, but not at L1.
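In other words, the plain pass-by-value situation:

	static void callee(struct posix_acl *acl)
	{
		acl = NULL;	/* changes only the local copy of the pointer */
	}

	void caller(struct posix_acl *acl)
	{
		callee(acl);
		/* here 'acl' still holds the original value */
	}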

Stas, have you drunk a lousy beer today?
>
> https://jira.sw.ru/browse/PSBM-81384
>
> Signed-off-by: Stanislav Kinsburskiy 
> ---
>  fs/ext4/acl.c |8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
> index 917e819..2640d7b 100644
> --- a/fs/ext4/acl.c
> +++ b/fs/ext4/acl.c
> @@ -403,7 +403,7 @@ ext4_xattr_set_acl(struct dentry *dentry, const char 
> *name, const void *value,
>  {
>   struct inode *inode = dentry->d_inode;
>   handle_t *handle;
> - struct posix_acl *acl;
> + struct posix_acl *acl, *tmp;
>   int error, retries = 0;
>   int update_mode = 0;
>   umode_t mode = inode->i_mode;
> @@ -416,7 +416,7 @@ ext4_xattr_set_acl(struct dentry *dentry, const char 
> *name, const void *value,
>   return -EPERM;
>  
>   if (value) {
> - acl = posix_acl_from_xattr(&init_user_ns, value, size);
> + acl = tmp = posix_acl_from_xattr(&init_user_ns, value, size);
>   if (IS_ERR(acl))
>   return PTR_ERR(acl);
>   else if (acl) {
> @@ -425,7 +425,7 @@ ext4_xattr_set_acl(struct dentry *dentry, const char 
> *name, const void *value,
>   goto release_and_out;
>   }
>   } else
> - acl = NULL;
> + acl = tmp = NULL;
>  
>  retry:
>   handle = ext4_journal_start(inode, EXT4_HT_XATTR,
> @@ -452,7 +452,7 @@ ext4_xattr_set_acl(struct dentry *dentry, const char 
> *name, const void *value,
>   goto retry;
>  
>  release_and_out:
> - posix_acl_release(acl);
> + posix_acl_release(tmp);
>   return error;
>  }
>  




Re: [Devel] [PATCH 1/2] fuse: add a new async operation to unmap regions

2018-02-07 Thread Dmitry Monakhov
Andrei Vagin  writes:

> On Tue, Feb 06, 2018 at 11:49:30PM +0300, Konstantin Khorenko wrote:
>> Andrey, this seems to be a feature and it should be tested.
>> 
>> Please post here a jira id with the feature description, QA task, etc.
>
> 1. Feature
>
> Add support of discard requests via punch-holes for plain ploops
> https://pmc.acronis.com/browse/VSTOR-6962
>
> 2. Description
>
> When ploop receives a discard request, it calls fallocate() to punch a
> hole in the ploop image file. This allows dropping useless data from the
> storage.
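I.e., roughly the following, for reference (note that FALLOC_FL_PUNCH_HOLE
must be combined with FALLOC_FL_KEEP_SIZE):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <linux/falloc.h>

	/* Deallocate the image blocks backing a discarded region. */
	static int punch_hole(int img_fd, off_t offset, off_t len)
	{
		return fallocate(img_fd,
				 FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
				 offset, len);
	}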
>
> 4. Testing
>
> [root@localhost ploop]# cat test/ploop-fdiscard.sh
> set -e -x
>
> path=$1
> mkdir -p $path
> ploop init $path/root -s 1G -f raw --sparse -t none
> out=$(ploop mount $path/DiskDescriptor.xml)
> echo $out
> dev=$(echo $out | sed "s/.*dev=\(\S*\).*/\1/")
> echo $dev
> filefrag -sv $path/root
> dd if=/dev/urandom of=$dev bs=1M count=1
> dd if=/dev/urandom of=$dev bs=1M count=1 seek=512
> fout1="$(filefrag -sv $path/root | wc -l)"
> filefrag -sv $path/root
> blkdiscard -l 1M -o 512M $dev
> filefrag -sv $path/root
> fout2="$(filefrag -sv $path/root | wc -l)"
> if [ "$fout1" -le "$fout2" ]; then
>   echo FAIL
>   exit 1
> fi
> blkdiscard $dev
> filefrag -sv $path/root
> fout3="$(filefrag -sv $path/root | wc -l)"
> if [ "$fout2" -le "$fout3" ]; then
>   echo FAIL
>   exit 1
> fi
> ploop umount -d $dev
> rm -rf $path
>
> 5. Known issues
>
> Works only for raw images on a fuse file system (vstorage)
>
> 7. Feature owner
> Andrei Vagin (avagin@)
>
>
>> 
>> And whom to review?
>
> Dima, could you review this patch set?
Ack, with a minor request.
This is a good moment to add a stress test for rw-io vs discard
via fio. I can imagine two types of tests:
1) a simple read/write/trim stress test
2) an integrity test via trimwrite, with read verification afterwards
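Something like this fio job file would cover both (a sketch; the device path
is an example):

	[global]
	filename=/dev/ploopXXXXX
	ioengine=libaio
	direct=1

	[stress-rw]
	rw=randrw
	bs=4k
	iodepth=16
	runtime=300
	time_based

	[stress-trim]
	rw=randtrim
	bs=64k
	iodepth=4
	runtime=300
	time_based

	[integrity]
	stonewall
	rw=trimwrite
	bs=64k
	size=512m
	verify=crc32c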
>
>> 
>> --
>> Best regards,
>> 
>> Konstantin Khorenko,
>> Virtuozzo Linux Kernel Team
>> 
>> On 02/06/2018 03:25 AM, Andrei Vagin wrote:
>> > The fuse interface allows running any operation asynchronously, because
>> > the kernel redirects all operations to a user daemon and then waits for
>> > an answer.
>> > 
>> > In ploop, we want to handle discard requests via fallocate, and
>> > the simplest way to do this is to run fallocate(FALLOC_FL_PUNCH_HOLE)
>> > asynchronously like the write command.
>> > 
>> > This patch adds a new async command IOCB_CMD_UNMAP_ITER, which sends
>> > fallocate(FALLOC_FL_PUNCH_HOLE) to a fuse user daemon.
>> > 
>> > Signed-off-by: Andrei Vagin 
>> > ---
>> >  fs/aio.c |  1 +
>> >  fs/fuse/file.c   | 63 ++--
>> >  fs/fuse/fuse_i.h |  3 +++
>> >  include/uapi/linux/aio_abi.h |  1 +
>> >  4 files changed, 60 insertions(+), 8 deletions(-)
>> > 
>> > diff --git a/fs/aio.c b/fs/aio.c
>> > index 3a6a9b0..cdc7558 100644
>> > --- a/fs/aio.c
>> > +++ b/fs/aio.c
>> > @@ -1492,6 +1492,7 @@ rw_common:
>> >ret = aio_read_iter(req);
>> >break;
>> > 
>> > +  case IOCB_CMD_UNMAP_ITER:
>> >case IOCB_CMD_WRITE_ITER:
>> >ret = aio_write_iter(req);
>> >break;
>> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> > index 877c41f..83ea9da 100644
>> > --- a/fs/fuse/file.c
>> > +++ b/fs/fuse/file.c
>> > @@ -920,6 +920,19 @@ static void fuse_aio_complete_req(struct fuse_conn 
>> > *fc, struct fuse_req *req)
>> >if (!req->bvec)
>> >fuse_release_user_pages(req, !io->write);
>> > 
>> > +  if (req->in.h.opcode == FUSE_FALLOCATE) {
>> > +  if (req->out.h.error)
>> > +  printk("fuse_aio_complete_req: request (fallocate 
>> > fh=0x%llx "
>> > + "offset=%lld length=%lld mode=%x) completed with 
>> > err=%d\n",
>> > + req->misc.fallocate.in.fh,
>> > + req->misc.fallocate.in.offset,
>> > + req->misc.fallocate.in.length,
>> > + req->misc.fallocate.in.mode,
>> > + req->out.h.error);
>> > +  fuse_aio_complete(io, req->out.h.error, -1);
>> > +  return;
>> > +  }
>> > +
>> >if (io->write) {
>> >if (req->misc.write.in.size != req->misc.write.out.size)
>> >pos = req->misc.write.in.offset - io->offset +
>> > @@ -1322,6 +1335,33 @@ static void fuse_write_fill(struct fuse_req *req, 
>> > struct fuse_file *ff,
>> >req->out.args[0].value = outarg;
>> >  }
>> > 
>> > +static size_t fuse_send_unmap(struct fuse_req *req, struct fuse_io_priv 
>> > *io,
>> > +loff_t pos, size_t count, fl_owner_t owner)
>> > +{
>> > +  struct file *file = io->file;
>> > +  struct fuse_file *ff = file->private_data;
>> > +  struct fuse_conn *fc = ff->fc;
>> > +  struct fuse_fallocate_in *inarg = &req->misc.fallocate.in;
>> > +
>> > +  inarg->fh = ff->fh;
>> > +  inarg->offset = pos;
>> > +  inarg->length = count;
>> > +  inarg->mode = 

Re: [Devel] [PATCH RFC] mm: Limit number of busy-looped shrinking processes

2017-09-05 Thread Dmitry Monakhov
Kirill Tkhai  writes:

> When a FUSE process performs shrink, it must not wait
> on page writeback. Otherwise, it may meet a page
> that is being written back by itself, and the process will stall.
>
> So, our kernel does not wait for writeback after commit a9707947010d
> "mm: vmscan: never wait on writeback pages".
>
> But in case of a huge number of writeback pages and
> memory pressure, this leads to a busy loop: many processes
> in the system are trying to shrink memory and have
> no success, and the node shows a lot of time spent in the kernel.
>
> This patch reduces the number of processes which may
> busy-loop on shrink. Only one userspace process --
> vstorage -- will be allowed not to sleep on writeback.
> Other processes will sleep up to 5 seconds to wait for
> writeback completion on every page.
>
> The detection of vstorage is very simple: it is based
> on the process name. It seems there is no way to detect
NAK. Detection by name is very, very bad design style.
fused and others should mark themselves as writeback-proof explicitly
via an API similar to ioctl/madvise/ionice/ulimit;
maybe it is reasonable to place such apps into a specific cgroup,
you may pick any recipe you like. But please do not do comm-name
matching.
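E.g., a sketch of an explicit per-task opt-in (the PF_WB_PROOF flag and the
prctl command are hypothetical, illustration only):

	/* userspace: the daemon marks itself before serving writeback */
	prctl(PR_SET_WB_PROOF, 1, 0, 0, 0);

	/* kernel, prctl handler: */
	case PR_SET_WB_PROOF:
		current->flags |= PF_WB_PROOF;
		break;

	/* shrink_page_list() then checks the flag instead of the comm: */
	if ((current->flags & PF_WB_PROOF) ||
	    wait_on_page_bit_killable_timeout(page, PG_writeback,
					      5 * HZ) != 0) {
		nr_immediate++;
		goto keep_locked;
	}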

> all FUSE processes, especially from !ve0, because FUSE
> mount is tricky, and a process doing mount may not be
> a FUSE daemon. So we retain the vanilla kernel behaviour,
> but we don't wait forever, just 5 seconds. This will save
> us from lockup messages from the kernel and will allow
> killing the FUSE daemon if necessary.
>
> https://jira.sw.ru/browse/PSBM-69296
>
> Signed-off-by: Kirill Tkhai 
> ---
>  mm/vmscan.c |   19 ++-
>  1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a5db5940bb1..e72d515c111 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -959,8 +959,16 @@ static unsigned long shrink_page_list(struct list_head 
> *page_list,
>  
>   /* Case 3 above */
>   } else {
> - nr_immediate++;
> - goto keep_locked;
> + /*
> +  * Currently, vstorage is the only fuse process,
> +  * exercising writeback; it mustn't sleep to 
> avoid
> +  * deadlocks.
> +  */
> + if (!strncmp(current->comm, "vstorage", 8) ||
> + wait_on_page_bit_killable_timeout(page, 
> PG_writeback, 5 * HZ) != 0) {
> + nr_immediate++;
> + goto keep_locked;
> + }
>   }
>   }
>  
> @@ -1592,9 +1600,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct 
> lruvec *lruvec,
>   if (nr_writeback && nr_writeback == nr_taken)
>   zone_set_flag(zone, ZONE_WRITEBACK);
>  
> - if (!global_reclaim(sc) && nr_immediate)
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> -
> + /*
> +  * memcg will stall in page writeback so only consider forcibly
> +  * stalling for global reclaim
> +  */
>   if (global_reclaim(sc)) {
>   /*
>* Tag a zone as congested if all the dirty pages scanned were


[Devel] [PATCH] fs-writeback: add endless writeback debug

2017-08-25 Thread Dmitry Monakhov
https://jira.sw.ru/browse/PSBM-69587
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 fs/fs-writeback.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index f34ae6c..9df1573 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -787,11 +787,15 @@ static long __writeback_inodes_wb(struct bdi_writeback 
*wb,
 {
unsigned long start_time = jiffies;
long wrote = 0;
-
+   int trace = 0;
+
while (!list_empty(&wb->b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);
struct super_block *sb = inode->i_sb;
 
+   if (time_is_before_jiffies(start_time + 15 * HZ))
+   trace = 1;
+
if (!grab_super_passive(sb)) {
/*
 * grab_super_passive() may fail consistently due to
@@ -799,6 +803,9 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 * requeue_io() to avoid busy retrying the inode/sb.
 */
redirty_tail(inode, wb);
+   if (trace)
+   printk("%s:%d writeback is taking too long 
ino:%ld sb(%p):%s\n",
+  __FUNCTION__, __LINE__, inode->i_ino, 
sb, sb->s_id);
continue;
}
wrote += writeback_sb_inodes(sb, wb, work);
@@ -890,6 +897,7 @@ static long wb_writeback(struct bdi_writeback *wb,
unsigned long oldest_jif;
struct inode *inode;
long progress;
+   int trace = 0;
 
oldest_jif = jiffies;
work->older_than_this = &oldest_jif;
@@ -902,6 +910,9 @@ static long wb_writeback(struct bdi_writeback *wb,
if (work->nr_pages <= 0)
break;
 
+   if (time_is_before_jiffies(wb_start + 15 * HZ))
+   trace = 1;
+
/*
 * Background writeout and kupdate-style writeback may
 * run forever. Stop them if there is other work to do
@@ -973,6 +984,10 @@ static long wb_writeback(struct bdi_writeback *wb,
inode = wb_inode(wb->b_more_io.prev);
spin_lock(&inode->i_lock);
spin_unlock(&wb->list_lock);
+   if (trace)
+   printk("%s:%d writeback is taking too long ino:%ld st:%ld sb(%p):%s\n",
+  __FUNCTION__, __LINE__, inode->i_ino,
+  inode->i_state, inode->i_sb, inode->i_sb->s_id);
/* This function drops i_lock... */
inode_sleep_on_writeback(inode);
spin_lock(&wb->list_lock);
-- 
1.8.3.1



[Devel] [PATCH] fused: save logrotate option to fstab

2017-08-17 Thread Dmitry Monakhov
Currently one may run 'vstorage-mount -s' with the -L option, but
it will affect only the current mount, without being reflected in the fstab
options. In fact mount.fuse.vstorage already has a parser for the logrotate
option, so this patch makes the feature fully supported.
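For example (cluster name, paths and sizes are illustrative):

	# keep 5 log files of up to 50M each and record the options in fstab:
	vstorage-mount -c mycluster -l /var/log/pcs.log -L 5x52428800 -s /mnt/pcs

	# with this patch the generated fstab entry now carries the option too,
	# in the logrotate=<num>x<size> format used by mount.fuse.vstorage:
	grep logrotate /etc/fstab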

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 pcs/clients/fused/fused.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/pcs/clients/fused/fused.c b/pcs/clients/fused/fused.c
index b73da80..5c4a84e 100644
--- a/pcs/clients/fused/fused.c
+++ b/pcs/clients/fused/fused.c
@@ -176,6 +176,8 @@ static char *make_fstab_options(
int timeout,
char *logfile,
int loglevel,
+   unsigned long rotate_num,
+   unsigned long long rotate_size,
unsigned long mntflags,
char *username,
char *groupname,
@@ -197,6 +199,9 @@ static char *make_fstab_options(
res += fstab_add_option(, "logfile=%s", logfile);
if (loglevel != LOG_LEVEL_SRV_DEFAULT)
res += fstab_add_option(, "loglevel=%u", 
(unsigned)loglevel);
+   if (rotate_num || rotate_size)
+   res += fstab_add_option(, "logrotate=%lux%llu", rotate_num, 
rotate_size);
+
if (g_read_cache.params.pathname)
res += fstab_add_option(, "cache=%s", 
g_read_cache.params.pathname);
if (g_read_cache.params.total_sz_mb > 0)
@@ -501,6 +506,7 @@ int main(int argc, char** argv)
unsigned long mntflags = 0;
int ch, res = -1;
int pipefd[2];
+   int rotate_opt = 0;
unsigned long rotate_num = 10;
unsigned long long rotate_size = 100LL * 1024LL * 1024LL;
int after_exec = 0;
@@ -595,6 +601,7 @@ int main(int argc, char** argv)
case 'L':
if (parse_logrotate_diskspace(optarg, &rotate_num,
&rotate_size) < 0)
usage(NULL);
+   rotate_opt = 1;
break;
case 'd':
pcs_log_level = strtoul(optarg, , 10);
@@ -678,8 +685,11 @@ int main(int argc, char** argv)
usage("Invalid read cache parameters");
 
if (fstab_modify) {
-   fstab_options = make_fstab_options(timeout, logfile, 
pcs_log_level, mntflags,
-   username, groupname, mode, nodef, mntparams);
+   fstab_options = make_fstab_options(timeout, logfile, 
pcs_log_level,
+  rotate_opt ? rotate_num : 0,
+  rotate_opt ? rotate_size : 0,
+  mntflags,  username, 
+  groupname, mode, nodef, 
mntparams);
if (!fstab_options) {
pcs_log(LOG_ERR, PCS_FUSED_MSG_PREFIX"failed to make 
fstab options");
exit(252);
-- 
1.8.3.1



Re: [Devel] [PATCH rh7 v3] ext4: add generic uevent infrastructure

2017-06-16 Thread Dmitry Monakhov
Andrey Ryabinin <aryabi...@virtuozzo.com> writes:

> From: Dmitry Monakhov <dmonak...@openvz.org>
>
> *Purpose:
> It is reasonable to announce fs-related events via the uevent infrastructure.
> This patch implements only the ext4 part, but IMHO this should be useful for
> any generic filesystem.
>
> Example: A runtime fs error is a purely async event. Currently there is no good
> way to handle this situation and inform user space about it.
>
> *Implementation:
>  Add uevent infrastructure similar to dm uevent
>  FS_ACTION = {MOUNT|UMOUNT|REMOUNT|ERROR|FREEZE|UNFREEZE}
>  FS_UUID
>  FS_NAME
>  FS_TYPE
>
> Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
Only one note about mem allocation context, see below. Otherwise looks good.
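As a consumer-side illustration, a udev rule could react to these events,
e.g. (the handler script path is hypothetical):

	# /etc/udev/rules.d/99-ext4-error.rules
	ACTION=="change", ENV{FS_TYPE}=="ext4", ENV{FS_ACTION}=="ERROR", \
	    RUN+="/usr/local/sbin/handle-fs-error.sh $env{FS_NAME}"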
>
> https://jira.sw.ru/browse/PSBM-66618
> Signed-off-by: Andrey Ryabinin <aryabi...@virtuozzo.com>
> ---
> Changes since v2:
>   - Report error event only once per superblock
>
>  fs/ext4/ext4.h  | 11 
>  fs/ext4/super.c | 88 -
>  2 files changed, 98 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 1cd964870da3..ce60718c7143 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1356,6 +1356,8 @@ struct ext4_sb_info {
>   /* Precomputed FS UUID checksum for seeding other checksums */
>   __u32 s_csum_seed;
>  
> + bool s_err_event_sent;
> +
>   /* Reclaim extents from extent status tree */
>   struct shrinker s_es_shrinker;
>   struct list_head s_es_lru;
> @@ -2758,6 +2760,15 @@ extern int ext4_check_blockref(const char *, unsigned 
> int,
>  struct ext4_ext_path;
>  struct ext4_extent;
>  
> +enum ext4_event_type {
> + EXT4_UA_MOUNT,
> + EXT4_UA_UMOUNT,
> + EXT4_UA_REMOUNT,
> + EXT4_UA_ERROR,
> + EXT4_UA_FREEZE,
> + EXT4_UA_UNFREEZE,
> +};
> +
>  /*
>   * Maximum number of logical blocks in a file; ext4_extent's ee_block is
>   * __le32.
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index ee065861b62a..088313b6333f 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -301,6 +301,79 @@ void ext4_itable_unused_set(struct super_block *sb,
>   bg->bg_itable_unused_hi = cpu_to_le16(count >> 16);
>  }
>  
> +static int ext4_uuid_valid(const u8 *uuid)
> +{
> + int i;
> +
> + for (i = 0; i < 16; i++) {
> + if (uuid[i])
> + return 1;
> + }
> + return 0;
> +}
> +
> +/**
> + * ext4_send_uevent - prepare and send uevent
> + *
> + * @sb:  super_block
> + * @action:  action type
> + *
> + */
> +int ext4_send_uevent(struct super_block *sb, enum ext4_event_type action)
> +{
> + int ret;
> + struct kobj_uevent_env *env;
> + const u8 *uuid = sb->s_uuid;
> + enum kobject_action kaction = KOBJ_CHANGE;
> +
> + env = kzalloc(sizeof(struct kobj_uevent_env), GFP_KERNEL);
Please change GFP_KERNEL to GFP_NOFS otherwise it may deadlock.
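I.e.:

	/* may be called from error paths with fs locks held, so GFP_NOFS */
	env = kzalloc(sizeof(struct kobj_uevent_env), GFP_NOFS);

Otherwise a GFP_KERNEL allocation here may recurse into fs reclaim while the
filesystem itself is reporting an error.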
> + if (!env)
> + return -ENOMEM;
> +
> + ret = add_uevent_var(env, "FS_TYPE=%s", sb->s_type->name);
> + if (ret)
> + goto out;
> + ret = add_uevent_var(env, "FS_NAME=%s", sb->s_id);
> + if (ret)
> + goto out;
> +
> + if (ext4_uuid_valid(uuid)) {
> + ret = add_uevent_var(env, "UUID=%pUB", uuid);
> + if (ret)
> + goto out;
> + }
> +
> + switch (action) {
> + case EXT4_UA_MOUNT:
> + kaction = KOBJ_ONLINE;
> + ret = add_uevent_var(env, "FS_ACTION=%s", "MOUNT");
> + break;
> + case EXT4_UA_UMOUNT:
> + kaction = KOBJ_OFFLINE;
> + ret = add_uevent_var(env, "FS_ACTION=%s", "UMOUNT");
> + break;
> + case EXT4_UA_REMOUNT:
> + ret = add_uevent_var(env, "FS_ACTION=%s", "REMOUNT");
> + break;
> + case EXT4_UA_ERROR:
> + ret = add_uevent_var(env, "FS_ACTION=%s", "ERROR");
> + break;
> + case EXT4_UA_FREEZE:
> + ret = add_uevent_var(env, "FS_ACTION=%s", "FREEZE");
> + break;
> + case EXT4_UA_UNFREEZE:
> + ret = add_uevent_var(env, "FS_ACTION=%s", "UNFREEZE");
> + break;
> + default:
> + ret = -EINVAL;
> + }
> + if (ret)
> + goto out;
> + ret = kobject_uevent_env(&(EXT4_SB(sb)->s_kobj)

Re: [Devel] [PATCH rh7 v2 1/3] fs/cleancache: fix data invalidation in the cleancache during direct_io

2017-04-13 Thread Dmitry Monakhov
Andrey Ryabinin  writes:

> Currently some direct_io fs hooks call invalidate_inode_pages2_range()
> conditionally, iff mapping->nrpages is not zero. So if nrpages is zero,
> data in the cleancache wouldn't be invalidated, and the next buffered read
> may get stale data from the cleancache.

>
> Fix this by calling invalidate_inode_pages2_range() regardless of nrpages
> value. And if nrpages is zero, bail out from invalidate_inode_pages2_range()
> only after cleancache_invalidate_inode(), so that we invalidate cleancache
> but still avoid pointless page cache lookups.
BTW, can we please make tcache pluggable, so that those who do not want fancy
caching features can simply disable it, as we do with pfcache.
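Also, for the record, the resulting ordering inside
invalidate_inode_pages2_range() is roughly:

	cleancache_invalidate_inode(mapping);	/* always drop cleancache copies */
	if (mapping->nrpages == 0)
		return 0;			/* but skip pointless lookups */
	/* ... proceed with the usual per-page invalidation loop ... */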

>
> https://jira.sw.ru/browse/PSBM-63908
> Signed-off-by: Andrey Ryabinin 
> ---
>  fs/9p/vfs_file.c  |  4 ++--
>  fs/nfs/direct.c   | 16 ++--
>  fs/nfs/inode.c|  7 ---
>  fs/xfs/xfs_file.c | 30 ++
>  mm/filemap.c  | 28 
>  mm/truncate.c |  4 
>  6 files changed, 42 insertions(+), 47 deletions(-)
>
> diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> index 7da03f8..afe0036 100644
> --- a/fs/9p/vfs_file.c
> +++ b/fs/9p/vfs_file.c
> @@ -482,7 +482,7 @@ v9fs_file_write_internal(struct inode *inode, struct 
> p9_fid *fid,
>   if (invalidate && (total > 0)) {
>   pg_start = origin >> PAGE_CACHE_SHIFT;
>   pg_end = (origin + total - 1) >> PAGE_CACHE_SHIFT;
> - if (inode->i_mapping && inode->i_mapping->nrpages)
> + if (inode->i_mapping)
>   invalidate_inode_pages2_range(inode->i_mapping,
> pg_start, pg_end);
>   *offset += total;
> @@ -688,7 +688,7 @@ v9fs_direct_write(struct file *filp, const char __user * 
> data,
>* about to write.  We do this *before* the write so that if we fail
>* here we fall back to buffered write
>*/
> - if (mapping->nrpages) {
> + {
>   pgoff_t pg_start = offset >> PAGE_CACHE_SHIFT;
>   pgoff_t pg_end   = (offset + count - 1) >> PAGE_CACHE_SHIFT;
>  
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index ab96f01..963 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -1132,12 +1132,10 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, 
> const struct iovec *iov,
>   if (result)
>   goto out_unlock;
>  
> - if (mapping->nrpages) {
> - result = invalidate_inode_pages2_range(mapping,
> - pos >> PAGE_CACHE_SHIFT, end);
> - if (result)
> - goto out_unlock;
> - }
> + result = invalidate_inode_pages2_range(mapping,
> + pos >> PAGE_CACHE_SHIFT, end);
> + if (result)
> + goto out_unlock;
>  
>   task_io_account_write(count);
>  
> @@ -1161,10 +1159,8 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, 
> const struct iovec *iov,
>  
>   result = nfs_direct_write_schedule_iovec(dreq, iov, nr_segs, pos, uio);
>  
> - if (mapping->nrpages) {
> - invalidate_inode_pages2_range(mapping,
> -   pos >> PAGE_CACHE_SHIFT, end);
> - }
> + invalidate_inode_pages2_range(mapping,
> + pos >> PAGE_CACHE_SHIFT, end);
>  
>   mutex_unlock(&inode->i_mutex);
>  
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index 8c06aed..779b05c 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -1065,10 +1065,11 @@ static int nfs_invalidate_mapping(struct inode 
> *inode, struct address_space *map
>   if (ret < 0)
>   return ret;
>   }
> - ret = invalidate_inode_pages2(mapping);
> - if (ret < 0)
> - return ret;
>   }
> + ret = invalidate_inode_pages2(mapping);
> + if (ret < 0)
> + return ret;
> +
>   if (S_ISDIR(inode->i_mode)) {
>   spin_lock(&inode->i_lock);
>   memset(nfsi->cookieverf, 0, sizeof(nfsi->cookieverf));
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 9a2193b..0b7a35b 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -346,7 +346,7 @@ xfs_file_aio_read(
>* serialisation.
>*/
>   xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
> - if ((ioflags & XFS_IO_ISDIRECT) && inode->i_mapping->nrpages) {
> + if ((ioflags & XFS_IO_ISDIRECT)) {
>   xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
>   xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
>  
> @@ -361,22 +361,20 @@ xfs_file_aio_read(
>* flush and reduce the chances of repeated iolock cycles going
>* forward.
>*/
> - if (inode->i_mapping->nrpages) {
> - ret = filemap_write_and_wait(VFS_I(ip)->i_mapping);
> -

[Devel] [PATCH] fs: Prevent massive warn spamming

2017-03-20 Thread Dmitry Monakhov
Even if the detection spots a potential bug, it is not good to bloat kmsg.
WARN_ON_ONCE is enough to capture the exact calltrace.
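For reference, WARN_ON_ONCE() latches after the first hit, roughly like:

	#define WARN_ON_ONCE(condition) ({		\
		static bool __warned;			\
		int __ret = !!(condition);		\
							\
		if (unlikely(__ret && !__warned)) {	\
			__warned = true;		\
			WARN_ON(1);			\
		}					\
		unlikely(__ret);			\
	})

so each call site still reports its first occurrence with a full calltrace,
but repeated hits no longer flood the log.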

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 mm/page_alloc.c | 2 +-
 mm/slab.c   | 4 ++--
 mm/slub.c   | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b799171..d6a04f5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3150,7 +3150,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
lockdep_trace_alloc(gfp_mask);
 
might_sleep_if(gfp_mask & __GFP_WAIT);
-   WARN_ON((gfp_mask & __GFP_FS) && current->journal_info);
+   WARN_ON_ONCE((gfp_mask & __GFP_FS) && current->journal_info);
 
if (should_fail_alloc_page(gfp_mask, order))
return NULL;
diff --git a/mm/slab.c b/mm/slab.c
index f0e4b79..4f0c22e 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3343,7 +3343,7 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, 
int nodeid,
flags &= gfp_allowed_mask;
 
lockdep_trace_alloc(flags);
-   WARN_ON((flags & __GFP_FS) && current->journal_info);
+   WARN_ON_ONCE((flags & __GFP_FS) && current->journal_info);
 
if (slab_should_failslab(cachep, flags))
return NULL;
@@ -3433,7 +3433,7 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, 
unsigned long caller)
flags &= gfp_allowed_mask;
 
lockdep_trace_alloc(flags);
-   WARN_ON((flags & __GFP_FS) && current->journal_info);
+   WARN_ON_ONCE((flags & __GFP_FS) && current->journal_info);
 
if (slab_should_failslab(cachep, flags))
return NULL;
diff --git a/mm/slub.c b/mm/slub.c
index fcebd14..280adf6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1266,7 +1266,7 @@ static inline int slab_pre_alloc_hook(struct kmem_cache 
*s, gfp_t flags)
flags &= gfp_allowed_mask;
lockdep_trace_alloc(flags);
might_sleep_if(flags & __GFP_WAIT);
-   WARN_ON((flags & __GFP_FS) && current->journal_info);
+   WARN_ON_ONCE((flags & __GFP_FS) && current->journal_info);
 
return should_failslab(s->object_size, flags, s->flags);
 }
-- 
1.8.3.1



Re: [Devel] WARNING at mm/slub.c

2017-03-20 Thread Dmitry Monakhov

Denis Kirjanov <dkirja...@cloudlinux.com> writes:

> On 3/16/17, Denis Kirjanov <dkirja...@cloudlinux.com> wrote:
>> Hi guys,
>>
>> with the kernel rh7-3.10.0-327.36.1.vz7.18.7 we're seeing the
>> following WARNING while running LTP test suite:
>>
>> [11796.576981] WARNING: at mm/slub.c:1252
>> slab_pre_alloc_hook.isra.42.part.43+0x15/0x17()
>>
>> [11796.591008] Call Trace:
>> [11796.592065]  [] dump_stack+0x19/0x1b
>> [11796.593076]  [] warn_slowpath_common+0x70/0xb0
>> [11796.594228]  [] warn_slowpath_null+0x1a/0x20
>> [11796.595442]  []
>> slab_pre_alloc_hook.isra.42.part.43+0x15/0x17
>> [11796.596686]  [] kmem_cache_alloc_trace+0x58/0x230
>> [11796.597965]  [] ? kmapset_new+0x1e/0x50
>> [11796.599224]  [] kmapset_new+0x1e/0x50
>> [11796.600433]  [] __sysfs_add_one+0x4a/0xb0
>> [11796.601431]  [] sysfs_add_one+0x1b/0xd0
>> [11796.602451]  [] sysfs_add_file_mode+0xb7/0x100
>> [11796.603449]  [] sysfs_create_file+0x2a/0x30
>> [11796.604461]  [] kobject_add_internal+0x16c/0x2f0
>> [11796.605503]  [] kobject_add+0x75/0xd0
>> [11796.606627]  [] ? kmem_cache_alloc_trace+0x207/0x230
>> [11796.607655]  [] __link_block_group+0xe1/0x120 [btrfs]
>> [11796.608634]  [] btrfs_make_block_group+0x150/0x270
>> [btrfs]
>> [11796.609701]  [] __btrfs_alloc_chunk+0x67f/0x8a0
>> [btrfs]
>> [11796.610756]  [] btrfs_alloc_chunk+0x34/0x40 [btrfs]
>> [11796.611800]  [] do_chunk_alloc+0x23f/0x410 [btrfs]
>> [11796.612954]  []
>> btrfs_check_data_free_space+0xea/0x280 [btrfs]
>> [11796.614008]  [] __btrfs_buffered_write+0x151/0x5c0
>> [btrfs]
>> [11796.615153]  [] btrfs_file_aio_write+0x246/0x560
>> [btrfs]
>> [11796.616141]  [] ?
>> __mem_cgroup_commit_charge+0x152/0x350
>> [11796.617220]  [] do_sync_write+0x90/0xe0
>> [11796.618253]  [] vfs_write+0xbd/0x1e0
>> [11796.619224]  [] SyS_write+0x7f/0xe0
>> [11796.620185]  [] system_call_fastpath+0x16/0x1b
>> [11796.621145] ---[ end trace 1437311f89b9e3c6 ]---
>>
>
> Guys, I've found your commit:
>
> commit 149819fef38230c95f4d6c644061bc8b0dcdd51d
> Author: Vladimir Davydov <vdavy...@parallels.com>
> Date:   Fri Jun 5 13:20:02 2015 +0400
>
> mm/fs: Port diff-mm-debug-memallocation-caused-fs-reentrance
>
> Enable the debug once again, as the issue it found has been fixed:
> https://jira.sw.ru/browse/PSBM-34112
>
> Previous commit: 255427905323ac97a3c9b2d5acb2bf21ea2b31f6.
>
> Author: Dmitry Monakhov
> Email: dmonak...@openvz.org
> Subject: mm: debug memallocation caused fs reentrance
> Date: Sun, 9 Nov 2014 11:53:14 +0400
>
> But I can't open a link to figure out the original reason for the patch.
Originally we found that
 [] dump_stack+0x19/0x1b
 [] warn_slowpath_common+0x61/0x80
 [] warn_slowpath_null+0x1a/0x20
 [] slab_pre_alloc_hook.isra.31.part.32+0x15/0x17
 [] kmem_cache_alloc+0x55/0x210
 [] ? ext4_mb_add_groupinfo+0xe1/0x230 [ext4]
 [] ext4_mb_add_groupinfo+0xe1/0x230 [ext4]
 [] ext4_flex_group_add+0xba6/0x14b0 [ext4]
 [] ? ext4_bg_num_gdb+0x79/0x90 [ext4]
 [] ext4_resize_fs+0x76d/0xe40 [ext4]
 [] ext4_ioctl+0xded/0x1110 [ext4]
 [] ? do_filp_open+0x4b/0xb0
 [] do_vfs_ioctl+0x255/0x4f0
 [] ? __fd_install+0x47/0x60
 [] SyS_ioctl+0x54/0xa0
 [] system_call_fastpath+0x16/0x1b

This is a pure bug which results in a deadlock or fs corruption; I've fixed it
here:
https://github.com/torvalds/linux/commit/4fdb5543183d027a19805b72025b859af73d0863
I've realized that this is a whole class of locking issues which should be
detected at runtime; that is why I've added this warning. I also sent the
patch to mainstream http://www.spinics.net/lists/linux-btrfs/msg39034.html
which notes that btrfs definitely has fs-reentrance issues
http://www.spinics.net/lists/linux-btrfs/msg39035.html

Dave did not like the way I do the detection, so the patch was not
committed, but it exists in our tree. It is reasonable to replace
WARN_ON with WARN_ON_ONCE to prevent spamming. I'll send a patch.


>
>
>
>> Thanks!
>>


[Devel] [PATCH 1/2] mfsync: cleanup

2017-03-15 Thread Dmitry Monakhov
A long time ago this printk was used for debug purposes only and was merged by
accident; it was cleaned up in b4d7159537296b but resurrected after a rebase.
Let's kill it completely.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 fs/ext4/ioctl.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index bb372fa..cd831d5 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -784,10 +784,9 @@ resize_out:
int i, err;
 
if (copy_from_user(&mfsync, (struct ext4_ioc_mfsync_info *)arg,
-  sizeof(mfsync))) {
-   printk("%s:%d", __FUNCTION__, __LINE__);
+  sizeof(mfsync)))
return -EFAULT;
-   }
+
if (mfsync.size == 0)
return 0;
usr_fd = (__u32 __user *) (arg + sizeof(__u32));
-- 
1.8.3.1



[Devel] [PATCH 2/2] ext4/mfsync: Prevent resource abuse

2017-03-15 Thread Dmitry Monakhov
- Mfsync is not a standard interface, so let's hide it from VEs
- Limit the number of files in a single request.


https://jira.sw.ru/browse/PSBM-59965
https://jira.sw.ru/browse/PSBM-59966
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 fs/ext4/ioctl.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index cd831d5..9232330 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -783,12 +783,17 @@ resize_out:
__u32 __user *usr_fd;
int i, err;
 
+   if (!ve_is_super(get_exec_env()))
+   return -ENOTSUPP;
if (copy_from_user(&mfsync, (struct ext4_ioc_mfsync_info *)arg,
   sizeof(mfsync)))
return -EFAULT;
 
if (mfsync.size == 0)
return 0;
+   if (mfsync.size > NR_FILE)
+   return -ENFILE;
+
usr_fd = (__u32 __user *) (arg + sizeof(__u32));
 
filpp = kzalloc(mfsync.size * sizeof(*filp), GFP_KERNEL);
-- 
1.8.3.1



[Devel] [PATCH:vz7] ext4: fix seek_data soft lockup on sparse files

2017-02-27 Thread Dmitry Monakhov
A good fix requires an optimal implementation of next_extent like it was
done in 14516bb or 2d90c160, but that makes the patch huge; let's
just break the loop when necessary.

https://jira.sw.ru/browse/PSBM-55818
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 fs/ext4/file.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index c63d937..167e262 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -612,7 +612,17 @@ static loff_t ext4_seek_data(struct file *file, loff_t 
offset, loff_t maxsize)
if (unwritten)
break;
}
-
+   if (signal_pending(current)) {
+   mutex_unlock(&inode->i_mutex);
+   return -EINTR;
+   }
+   if (need_resched()) {
+   mutex_unlock(&inode->i_mutex);
+   cond_resched();
+   mutex_lock(&inode->i_mutex);
+   isize = inode->i_size;
+   end = isize >> blkbits;
+   }
last++;
dataoff = (loff_t)last << blkbits;
} while (last <= end);
-- 
2.9.3



Re: [Devel] [PATCH] vz6 ext4: Discard preallocated block before swap_extents

2017-02-27 Thread Dmitry Monakhov
Vasily Averin <v...@virtuozzo.com> writes:

> Dima,
> please take a look at the comment below.
>
> On 2017-02-25 18:16, Dmitry Monakhov wrote:
>> Inode preallocation consists of two parts (used and unused) fully controlled
>> by the inode, so it must be discarded before swapping extents.
>> Currently we may skip drop_preallocation if the file is sparse.
>> 
>> This patch does:
>> - Moves ext4_discard_preallocations to ext4_swap_extents.
>>   This makes it more readable and reliable for future changes.
>> - Cleans up the main move_extent loop
>> 
>> https://jira.sw.ru/browse/PSBM-57003
>> xfstests:ext4/024 (pending: 
>> https://github.com/dmonakhov/xfstests/commit/7a4763963f73ea5d5bba45eefa484494aa3df7cf)
>> Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
>> ---
>>  fs/ext4/extents.c |  3 +++
>>  fs/ext4/move_extent.c | 17 +++--
>>  2 files changed, 10 insertions(+), 10 deletions(-)
>> 
>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>> index 85c4d4e..fd49ab0 100644
>> --- a/fs/ext4/extents.c
>> +++ b/fs/ext4/extents.c
>> @@ -4371,6 +4371,9 @@ ext4_swap_extents(handle_t *handle, struct inode 
>> *inode1,
>>  BUG_ON(!mutex_is_locked(&inode1->i_mutex));
>>  BUG_ON(!mutex_is_locked(&inode2->i_mutex));
>>  
>> +ext4_discard_preallocations(inode1);
>> +ext4_discard_preallocations(inode2);
>> +
>>  while (count) {
>>  struct ext4_extent *ex1, *ex2, tmp_ex;
>>  ext4_lblk_t e1_blk, e2_blk;
>> diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
>> index 39eaa8f..97a7db5 100644
>> --- a/fs/ext4/move_extent.c
>> +++ b/fs/ext4/move_extent.c
>> @@ -628,6 +628,7 @@ ext4_move_extents(struct file *o_filp, struct file 
>> *d_filp, __u64 orig_blk,
>>  ext4_lblk_t o_end, o_start = orig_blk;
>>  ext4_lblk_t d_start = donor_blk;
>>  int ret;
>> +__u64 m_len = *moved_len;
>>  
>>  if (orig_inode->i_sb != donor_inode->i_sb) {
>>  ext4_debug("ext4 move extent: The argument files "
>> @@ -696,7 +697,7 @@ ext4_move_extents(struct file *o_filp, struct file 
>> *d_filp, __u64 orig_blk,
>>  if (next_blk == EXT_MAX_BLOCKS) {
>>  o_start = o_end;
>>  ret = -ENODATA;
>> -goto out;
>> +break;
>>  }
>>  d_start += next_blk - o_start;
>>  o_start = next_blk;
>> @@ -708,7 +709,7 @@ ext4_move_extents(struct file *o_filp, struct file 
>> *d_filp, __u64 orig_blk,
>>  o_start = cur_blk;
>>  /* Extent inside requested range ?*/
>>  if (cur_blk >= o_end)
>> -goto out;
>> +break;
>>  } else { /* in_range(o_start, o_blk, o_len) */
>>  cur_len += cur_blk - o_start;
>>  }
>> @@ -743,6 +744,7 @@ ext4_move_extents(struct file *o_filp, struct file 
>> *d_filp, __u64 orig_blk,
>>  break;
>>  o_start += cur_len;
>>  d_start += cur_len;
>> +m_len += cur_len;
>>  repeat:
>>  if (path) {
>>  ext4_ext_drop_refs(path);
>> @@ -755,15 +757,10 @@ ext4_move_extents(struct file *o_filp, struct file 
>> *d_filp, __u64 orig_blk,
>>  *moved_len = len;
>>  
>>  out:
>> -if (*moved_len) {
>> -ext4_discard_preallocations(orig_inode);
>> -ext4_discard_preallocations(donor_inode);
>> -}
>> +WARN_ON(m_len > len);
>> +if (ret == 0)
>> +*moved_len = m_len;
>>  
>> -if (path) {
>> -ext4_ext_drop_refs(path);
>> -kfree(path);
>> -}
>
> I do not understand why kfree for path is dropped here.
> The rest of the places look reasonable to me,
> but this one looks like a mistake.
Yes, this is a copy-paste mistake. Please see the updated version:

From: Dmitry Monakhov <dmonak...@openvz.org>
To: devel@openvz.org
Cc: dmonak...@openvz.org,
v...@virtuozzo.com
Subject: [PATCH] vz6 ext4: Discard preallocated block before swap_extents v2
Date: Mon, 27 Feb 2017 15:33:07 +0400
Message-Id: <1488195187-26606-1-git-send-email-dmonak...@openvz.org>

>
> Take a look -- path was still freed inside the loop,
> why should it not be freed at the end too?
>
>>  up_write(&EXT4_I(orig_inode)->i_data_sem);
>>  up_write(&EXT4_I(donor_inode)->i_data_sem);
>>  up_write(_inode->i_alloc_sem);
>> 


[Devel] [PATCH] vz6 ext4: Discard preallocated block before swap_extents

2017-02-27 Thread Dmitry Monakhov
Inode preallocation consists of two parts (used and unused) fully controlled
by the inode, so it must be discarded before swapping extents.
Currently we may skip drop_preallocation if the file is sparse.

This patch does:
- Moves ext4_discard_preallocations to ext4_swap_extents.
  This makes it more readable and reliable for future changes.
- Cleans up the main move_extent loop

https://jira.sw.ru/browse/PSBM-57003
xfstests:ext4/024 (pending: 
https://github.com/dmonakhov/xfstests/commit/7a4763963f73ea5d5bba45eefa484494aa3df7cf)
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 fs/ext4/extents.c |  3 +++
 fs/ext4/move_extent.c | 17 +++--
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 85c4d4e..fd49ab0 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4371,6 +4371,9 @@ ext4_swap_extents(handle_t *handle, struct inode *inode1,
BUG_ON(!mutex_is_locked(&inode1->i_mutex));
BUG_ON(!mutex_is_locked(&inode2->i_mutex));
 
+   ext4_discard_preallocations(inode1);
+   ext4_discard_preallocations(inode2);
+
while (count) {
struct ext4_extent *ex1, *ex2, tmp_ex;
ext4_lblk_t e1_blk, e2_blk;
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 39eaa8f..97a7db5 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -628,6 +628,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, 
__u64 orig_blk,
ext4_lblk_t o_end, o_start = orig_blk;
ext4_lblk_t d_start = donor_blk;
int ret;
+   __u64 m_len = *moved_len;
 
if (orig_inode->i_sb != donor_inode->i_sb) {
ext4_debug("ext4 move extent: The argument files "
@@ -696,7 +697,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, 
__u64 orig_blk,
if (next_blk == EXT_MAX_BLOCKS) {
o_start = o_end;
ret = -ENODATA;
-   goto out;
+   break;
}
d_start += next_blk - o_start;
o_start = next_blk;
@@ -708,7 +709,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, 
__u64 orig_blk,
o_start = cur_blk;
/* Extent inside requested range ?*/
if (cur_blk >= o_end)
-   goto out;
+   break;
} else { /* in_range(o_start, o_blk, o_len) */
cur_len += cur_blk - o_start;
}
@@ -743,6 +744,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, 
__u64 orig_blk,
break;
o_start += cur_len;
d_start += cur_len;
+   m_len += cur_len;
repeat:
if (path) {
ext4_ext_drop_refs(path);
@@ -755,15 +757,10 @@ ext4_move_extents(struct file *o_filp, struct file 
*d_filp, __u64 orig_blk,
*moved_len = len;
 
 out:
-   if (*moved_len) {
-   ext4_discard_preallocations(orig_inode);
-   ext4_discard_preallocations(donor_inode);
-   }
+   WARN_ON(m_len > len);
+   if (ret == 0)
+   *moved_len = m_len;
 
-   if (path) {
-   ext4_ext_drop_refs(path);
-   kfree(path);
-   }
up_write(&EXT4_I(orig_inode)->i_data_sem);
up_write(&EXT4_I(donor_inode)->i_data_sem);
up_write(_inode->i_alloc_sem);
-- 
2.9.3




[Devel] [PATCH] vz6 ext4: Discard preallocated block before swap_extents v2

2017-02-27 Thread Dmitry Monakhov
Inode preallocation consists of two parts (used and unused) fully controlled
by the inode, so it must be discarded before swapping extents.
Currently we may skip drop_preallocation if the file is sparse.

This patch does:
- Moves ext4_discard_preallocations to ext4_swap_extents.
  This makes it more readable and reliable for future changes.
- Cleans up the main move_extent loop

https://jira.sw.ru/browse/PSBM-57003
xfstests:ext4/024 (pending: 
https://github.com/dmonakhov/xfstests/commit/7a4763963f73ea5d5bba45eefa484494aa3df7cf)
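Once the test lands, it can be run the usual way (assuming TEST_DEV and
SCRATCH_DEV are configured in the xfstests local.config):

	cd xfstests && ./check ext4/024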
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 fs/ext4/extents.c |  3 +++
 fs/ext4/move_extent.c | 17 +++--
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 85c4d4e..fd49ab0 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4371,6 +4371,9 @@ ext4_swap_extents(handle_t *handle, struct inode *inode1,
BUG_ON(!mutex_is_locked(&inode1->i_mutex));
BUG_ON(!mutex_is_locked(&inode2->i_mutex));
 
+   ext4_discard_preallocations(inode1);
+   ext4_discard_preallocations(inode2);
+
while (count) {
struct ext4_extent *ex1, *ex2, tmp_ex;
ext4_lblk_t e1_blk, e2_blk;
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 39eaa8f..df904aa 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -628,6 +628,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, 
__u64 orig_blk,
ext4_lblk_t o_end, o_start = orig_blk;
ext4_lblk_t d_start = donor_blk;
int ret;
+   __u64 m_len = *moved_len;
 
if (orig_inode->i_sb != donor_inode->i_sb) {
ext4_debug("ext4 move extent: The argument files "
@@ -696,7 +697,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, 
__u64 orig_blk,
if (next_blk == EXT_MAX_BLOCKS) {
o_start = o_end;
ret = -ENODATA;
-   goto out;
+   break;
}
d_start += next_blk - o_start;
o_start = next_blk;
@@ -708,7 +709,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, 
__u64 orig_blk,
o_start = cur_blk;
/* Extent inside requested range ?*/
if (cur_blk >= o_end)
-   goto out;
+   break;
} else { /* in_range(o_start, o_blk, o_len) */
cur_len += cur_blk - o_start;
}
@@ -743,6 +744,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, 
__u64 orig_blk,
break;
o_start += cur_len;
d_start += cur_len;
+   m_len += cur_len;
repeat:
if (path) {
ext4_ext_drop_refs(path);
@@ -750,15 +752,10 @@ ext4_move_extents(struct file *o_filp, struct file 
*d_filp, __u64 orig_blk,
path = NULL;
}
}
-   *moved_len = o_start - orig_blk;
-   if (*moved_len > len)
-   *moved_len = len;
-
 out:
-   if (*moved_len) {
-   ext4_discard_preallocations(orig_inode);
-   ext4_discard_preallocations(donor_inode);
-   }
+   WARN_ON(m_len > len);
+   if (ret == 0)
+   *moved_len = m_len;
 
if (path) {
ext4_ext_drop_refs(path);
-- 
2.9.3



Re: [Devel] [vzlin-dev] [PATCH vz7] fuse: fuse_prepare_write() cannot handle page from killed request

2017-02-14 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> After fuse_prepare_write() called __fuse_readpage(file, page, ...),
> the page might be already unlocked by fuse_kill_requests():
>
>>  for (i = 0; i < req->num_pages; i++) {
>>  struct page *page = req->pages[i];
>>  SetPageError(page);
>>  unlock_page(page);
ACK.
>
> so it is incorrect to touch it at all. The problem can be easily
> fixed the same way as it was done in fuse_readpage(): by checking the
> "killed" flag.
>
> Another minor complication is that there are three different use-cases
> for that snippet from fuse_kill_requests() above: fuse_readpages(),
> fuse_readpage() and fuse_prepare_write(). Among them only the latter
> needs an explicit page_cache_release() call. That's why the patch introduces
> the ad-hoc request flag "page_needs_release".
>
> https://jira.sw.ru/browse/PSBM-54547
> Signed-off-by: Maxim Patlasov 
> ---
>  fs/fuse/file.c   |   15 ++-
>  fs/fuse/fuse_i.h |3 +++
>  fs/fuse/inode.c  |2 ++
>  3 files changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a514748..41ed6f0 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1008,7 +1008,7 @@ static void fuse_short_read(struct fuse_req *req, 
> struct inode *inode,
>  
>  static int __fuse_readpage(struct file *file, struct page *page, size_t 
> count,
>  int *err, struct fuse_req **req_pp, u64 *attr_ver_p,
> -bool *killed_p)
> +bool page_needs_release, bool *killed_p)
>  {
>   struct fuse_io_priv io = { .async = 0, .file = file };
>   struct inode *inode = page->mapping->host;
> @@ -1040,6 +1040,7 @@ static int __fuse_readpage(struct file *file, struct 
> page *page, size_t count,
>   req->pages[0] = page;
>   req->page_descs[0].length = count;
>   req->page_cache = 1;
> + req->page_needs_release = page_needs_release;
>  
> + num_read = fuse_send_read(req, &io, page_offset(page), count, NULL);
>   killed = req->killed;
> @@ -1071,7 +1072,7 @@ static int fuse_readpage(struct file *file, struct page 
> *page)
>   goto out;
>  
>   num_read = __fuse_readpage(file, page, count, &err, &req, &attr_ver,
> -&killed);
> +false, &killed);
>   if (!err) {
>   /*
>* Short read means EOF.  If file size is larger, truncate it
> @@ -1153,6 +1154,7 @@ static void fuse_send_readpages(struct fuse_req *req, 
> struct file *file)
>   req->out.page_zeroing = 1;
>   req->out.page_replace = 1;
>   req->page_cache = 1;
> + req->page_needs_release = false;
>   fuse_read_fill(req, file, pos, count, FUSE_READ);
>   fuse_account_request(fc, count);
>   req->misc.read.attr_ver = fuse_get_attr_version(fc);
> @@ -2368,6 +2370,7 @@ static int fuse_prepare_write(struct fuse_conn *fc, 
> struct file *file,
>   unsigned num_read;
>   unsigned page_len;
>   int err;
> + bool killed = false;
>  
>   if (fuse_file_fail_immediately(file)) {
>   unlock_page(page);
> @@ -2385,12 +2388,14 @@ static int fuse_prepare_write(struct fuse_conn *fc, 
> struct file *file,
>   }
>  
>   num_read = __fuse_readpage(file, page, page_len, &err, &req, NULL,
> -NULL);
> +true, &killed);
>   if (req)
>   fuse_put_request(fc, req);
>   if (err) {
> - unlock_page(page);
> - page_cache_release(page);
> + if (!killed) {
> + unlock_page(page);
> + page_cache_release(page);
> + }
>   } else if (num_read != PAGE_CACHE_SIZE) {
>   zero_user_segment(page, num_read, PAGE_CACHE_SIZE);
>   }
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 22eb9c9..fefa8ff 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -330,6 +330,9 @@ struct fuse_req {
>   /** Request contains pages from page-cache */
>   unsigned page_cache:1;
>  
> + /** Request pages need page_cache_release() */
> + unsigned page_needs_release:1;
> +
>   /** Request was killed -- pages were released */
>   unsigned killed:1;
>  
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index b63aae2..ddd858c 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -378,6 +378,8 @@ static void fuse_kill_requests(struct fuse_conn *fc, 
> struct inode *inode,
>   struct page *page = req->pages[i];
>   SetPageError(page);
>   unlock_page(page);
> + if (req->page_needs_release)
> + page_cache_release(page);
>   req->pages[i] = NULL;
>   }
>  

[Devel] [PATCH] ms/xfs: rework buffer dispose list tracking B

2017-01-27 Thread Dmitry Monakhov
Add lost hunks from original a408235726

https://jira.sw.ru/browse/PSBM-58492
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 fs/xfs/xfs_buf.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 8d8c9ce..47a6cb0 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1585,7 +1585,7 @@ xfs_buftarg_wait_rele(
 */
atomic_set(&bp->b_lru_ref, 0);
bp->b_state |= XFS_BSTATE_DISPOSE;
-   list_move(item, dispose);
+   list_lru_isolate_move(lru, item, dispose);
spin_unlock(&bp->b_lock);
return LRU_REMOVED;
 }
@@ -1646,7 +1646,7 @@ xfs_buftarg_isolate(
}
 
bp->b_state |= XFS_BSTATE_DISPOSE;
-   list_move(item, dispose);
+   list_lru_isolate_move(lru, item, dispose);
spin_unlock(&bp->b_lock);
return LRU_REMOVED;
 }
-- 
2.9.3



Re: [Devel] [PATCH vz7] fuse: fuse_writepage_locked must check for FUSE_INVALIDATE_FILES (v2)

2017-01-12 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> The patch fixes another race dealing with fuse_invalidate_files,
> this time when it races with truncate(2):
>
> Thread A: the flusher performs writeback as usual:
>
>   fuse_writepages -->
> fuse_send_writepages -->
>   end_page_writeback
>
> but before fuse_send_writepages acquires fc->lock and calls
> fuse_flush_writepages,
> some innocent user process re-dirties the page.
>
> Thread B: truncate(2) attempts to truncate (shrink) file as usual:
>
>   fuse_do_setattr -->
> invalidate_inode_pages2
>
> (This is possible because Thread A has not incremented fi->writectr yet.) But
> invalidate_inode_pages2 finds that re-dirtied page and gets stuck in:
>
>   invalidate_inode_pages2 -->
> fuse_launder_page -->
>   fuse_writepage_locked -->
>   fuse_wait_on_page_writeback
>
> Thread A: the flusher proceeds with fuse_flush_writepages and sends the write
> request to the userspace fuse daemon, but the daemon is not obliged to
> fulfill it immediately.
> So thread B now waits for thread A, while thread A waits for userspace.
>
> Now fuse_invalidate_files steps in and gets stuck in filemap_write_and_wait
> on the page locked by Thread B (launder_page always works on a locked page).
> Deadlock.
>
> The patch fixes deadlock by waking up fuse_writepage_locked after marking
> files with FAIL_IMMEDIATELY flag.
>
> Changed in v2:
> - instead of flagging "fail_immediately", let fuse_writepage_locked return
> the fuse_file pointer; then the caller (fuse_launder_page) can use it for a
> conditional wait on __fuse_wait_on_page_writeback_or_invalidate. This is
> important because otherwise fuse_invalidate_files may deadlock when
> launder waits for fuse writeback.
ACK-by: dmonak...@openvz.org
>
> Signed-off-by: Maxim Patlasov 
> ---
>  fs/fuse/file.c |   51 +--
>  1 file changed, 45 insertions(+), 6 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 0ffc806..34e75c2 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1963,7 +1963,8 @@ static struct fuse_file *fuse_write_file(struct 
> fuse_conn *fc,
>  }
>  
>  static int fuse_writepage_locked(struct page *page,
> -  struct writeback_control *wbc)
> +  struct writeback_control *wbc,
> +  struct fuse_file **ff_pp)
>  {
>   struct address_space *mapping = page->mapping;
>   struct inode *inode = mapping->host;
> @@ -1971,13 +1972,30 @@ static int fuse_writepage_locked(struct page *page,
>   struct fuse_inode *fi = get_fuse_inode(inode);
>   struct fuse_req *req;
>   struct page *tmp_page;
> + struct fuse_file *ff;
> + int err = 0;
>  
>   if (fuse_page_is_writeback(inode, page->index)) {
>   if (wbc->sync_mode != WB_SYNC_ALL) {
>   redirty_page_for_writepage(wbc, page);
>   return 0;
>   }
> - fuse_wait_on_page_writeback(inode, page->index);
> +
> + /* we can acquire ff here because we do have locked pages here! 
> */
> + ff = fuse_write_file(fc, get_fuse_inode(inode));
> + if (!ff)
> + goto dummy_end_page_wb_err;
> +
> + /* FUSE_NOTIFY_INVAL_FILES must be able to wake us up */
> + __fuse_wait_on_page_writeback_or_invalidate(inode, ff, 
> page->index);
> +
> + if (test_bit(FUSE_S_FAIL_IMMEDIATELY, &ff->ff_state)) {
> + if (ff_pp)
> + *ff_pp = ff;
> + goto dummy_end_page_wb;
> + }
> +
> + fuse_release_ff(inode, ff);
>   }
>  
>   if (test_set_page_writeback(page))
> @@ -1995,6 +2013,8 @@ static int fuse_writepage_locked(struct page *page,
>   req->ff = fuse_write_file(fc, fi);
>   if (!req->ff)
>   goto err_nofile;
> + if (ff_pp)
> + *ff_pp = fuse_file_get(req->ff);
>   fuse_write_fill(req, req->ff, page_offset(page), 0);
>   fuse_account_request(fc, PAGE_CACHE_SIZE);
>  
> @@ -2029,13 +2049,23 @@ err_free:
>  err:
>   end_page_writeback(page);
>   return -ENOMEM;
> +
> +dummy_end_page_wb_err:
> + printk("FUSE: page under fwb dirtied on dead file\n");
> + err = -EIO;
> + /* fall through ... */
> +dummy_end_page_wb:
> + if (test_set_page_writeback(page))
> + BUG();
> + end_page_writeback(page);
> + return err;
>  }
>  
>  static int fuse_writepage(struct page *page, struct writeback_control *wbc)
>  {
>   int err;
>  
> - err = fuse_writepage_locked(page, wbc);
> + err = fuse_writepage_locked(page, wbc, NULL);
>   unlock_page(page);
>  
>   return err;
> @@ -2423,9 +2453,18 @@ static int fuse_launder_page(struct page *page)
>   struct writeback_control wbc = {
>   .sync_mode = WB_SYNC_ALL,
>   
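
The hunk above is truncated in the archive. Per the v2 changelog, the launder
side uses the returned fuse_file for the conditional wait; roughly like this
(a sketch reconstructed from the changelog, not the literal patch text):

static int fuse_launder_page(struct page *page)
{
	struct inode *inode = page->mapping->host;
	struct writeback_control wbc = {
		.sync_mode = WB_SYNC_ALL,
	};
	struct fuse_file *ff = NULL;
	int err;

	err = fuse_writepage_locked(page, &wbc, &ff);
	if (!err && ff) {
		/* FUSE_NOTIFY_INVAL_FILES can wake us up early here */
		__fuse_wait_on_page_writeback_or_invalidate(inode, ff,
							    page->index);
		fuse_release_ff(inode, ff);
	}
	return err;
}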

Re: [Devel] [vzlin-dev] [PATCH vz7] fuse: trust server file size unless opened

2016-12-15 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> Before the patch, the only way to pick up updated file size from server (in a
> scenario when local inode was created earlier, then the file was updated
> from another node) was in fuse_open_common():
>
>>  atomic_inc(&fi->num_openers);
>>
>>  if (atomic_read(&fi->num_openers) == 1) {
>>  err = fuse_getattr_size(inode, file, &size);
>>  ...
>>  spin_lock(&fc->lock);
>>  i_size_write(inode, size);
>>  spin_unlock(&fc->lock);
>>  }
>
> This is correct, but someone may ask about i_size w/o open, e.g.: ls -l foo.
> The patch ensures that every time the server reports us some file size, if no
> open-s happened yet (num_openers=0), fuse stores that server size in local
> inode->i_size. This resolves the following problem:
>
> # pstorage-mount -c test -l /var/log/f1.log /pcs1
> # pstorage-mount -c test -l /var/log/f2.log /pcs2
>
> # date > /pcs1/foo; ls -l /pcs1/foo /pcs2/foo
> -rwx-- 1 root root 29 Dec 14 16:31 /pcs1/foo
> -rwx-- 1 root root 29 Dec 14 16:31 /pcs2/foo
>
> # date >> /pcs1/foo; ls -l /pcs1/foo /pcs2/foo
> -rwx-- 1 root root 58 Dec 14 16:31 /pcs1/foo
> -rwx-- 1 root root 29 Dec 14 16:31 /pcs2/foo
>
> https://jira.sw.ru/browse/PSBM-57047
>
> Signed-off-by: Maxim Patlasov 
OK. But IMHO fi->num_openers is redundant: it protects special metadata,
but there are other cases where we may get client/server metadata out of sync.

> ---
>  fs/fuse/file.c   |   12 +++-
>  fs/fuse/fuse_i.h |3 +++
>  fs/fuse/inode.c  |4 +++-
>  3 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 9cad8c5..62967d2 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -296,12 +296,20 @@ int fuse_open_common(struct inode *inode, struct file 
> *file, bool isdir)
>   u64 size;
>  
>   mutex_lock(&inode->i_mutex);
> +
> + spin_lock(&fc->lock);
>   atomic_inc(&fi->num_openers);
>  
>   if (atomic_read(&fi->num_openers) == 1) {
> + fi->i_size_unstable = 1;
> + spin_unlock(&fc->lock);
>   err = fuse_getattr_size(inode, file, &size);
>   if (err) {
> + spin_lock(&fc->lock);
>   atomic_dec(&fi->num_openers);
> + fi->i_size_unstable = 0;
> + spin_unlock(&fc->lock);
> +
>   mutex_unlock(&inode->i_mutex);
>   fuse_release_common(file, FUSE_RELEASE);
>   return err;
> @@ -309,8 +317,10 @@ int fuse_open_common(struct inode *inode, struct file 
> *file, bool isdir)
>  
>   spin_lock(&fc->lock);
>   i_size_write(inode, size);
> + fi->i_size_unstable = 0;
> + spin_unlock(&fc->lock);
> + } else
>   spin_unlock(&fc->lock);
> - }
>  
>   mutex_unlock(>i_mutex);
>   }
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 1d24bf6..22eb9c9 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -124,6 +124,9 @@ struct fuse_inode {
>  
>   /** Mostly to detect very first open */
>   atomic_t num_openers;
> +
> + /** Even though num_openers>0, trust server i_size */
> + int i_size_unstable;
>  };
>  
>  /** FUSE inode state bits */
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 5ccecae..f606deb 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -97,6 +97,7 @@ static struct inode *fuse_alloc_inode(struct super_block 
> *sb)
>   fi->writectr = 0;
>   fi->orig_ino = 0;
>   fi->state = 0;
> + fi->i_size_unstable = 0;
>   INIT_LIST_HEAD(&fi->write_files);
>   INIT_LIST_HEAD(&fi->rw_files);
>   INIT_LIST_HEAD(&fi->queued_writes);
> @@ -226,7 +227,8 @@ void fuse_change_attributes(struct inode *inode, struct 
> fuse_attr *attr,
>* extend local i_size without keeping userspace server in sync. So,
>* attr->size coming from server can be stale. We cannot trust it.
>*/
> - if (!is_wb || !S_ISREG(inode->i_mode))
> + if (!is_wb || !S_ISREG(inode->i_mode) ||
> + !atomic_read(&fi->num_openers) || fi->i_size_unstable)
>   i_size_write(inode, attr->size);
>   spin_unlock(&fc->lock);
>  
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH 1/4] Revert: [fs] xfs: rework buffer dispose list tracking

2016-12-06 Thread Dmitry Monakhov
From: Dave Chinner <dchin...@redhat.com>

From 35c0abc0c70cfb3b37505ec137beae7fabca6b79 Mon Sep 17 00:00:00 2001
Message-id: <1472129410-4267-1-git-send-email-bfos...@redhat.com>
Patchwork-id: 157287
O-Subject: [RHEL7 PATCH] xfs: rework buffer dispose list tracking
Bugzilla: 1349175
RH-Acked-by: Dave Chinner <dchin...@redhat.com>
RH-Acked-by: Eric Sandeen <sand...@redhat.com>

- Retain the buffer lru helpers as rhel7 does not include built-in
  list_lru infrastructure.
- Some b_lock bits dropped as they were introduced by a previous
  selective backport.
- Backport use of dispose list from upstream list_lru-based
  xfs_wait_buftarg[_rele]() to downstream variant.

commit a408235726aa82c0358c9ec68124b6f4bc0a79df
Author: Dave Chinner <dchin...@redhat.com>
Date:   Wed Aug 28 10:18:06 2013 +1000

xfs: rework buffer dispose list tracking

In converting the buffer lru lists to use the generic code, the locking
for marking the buffers as on the dispose list was lost.  This results in
confusion in LRU buffer tracking and accounting, resulting in reference
counts being mucked up and the filesystem being unmountable.

To fix this, introduce an internal buffer spinlock to protect the state
field that holds the dispose list information.  Because there is now
locking needed around xfs_buf_lru_add/del, and they are used in exactly
one place each two lines apart, get rid of the wrappers and code the logic
directly in place.

Further, the LRU emptying code used on unmount is less than optimal.
Convert it to use a dispose list as per a normal shrinker walk, and repeat
the walk that fills the dispose list until the LRU is empty. This avoids
needing to drop and regain the LRU lock for every item being freed, and
allows the same logic as the shrinker isolate call to be used.  Simpler,
easier to understand.

Signed-off-by: Dave Chinner <dchin...@redhat.com>
Signed-off-by: Glauber Costa <glom...@openvz.org>
Cc: "Theodore Ts'o" <ty...@mit.edu>
Cc: Adrian Hunter <adrian.hun...@intel.com>
Cc: Al Viro <v...@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityuts...@linux.intel.com>
Cc: Arve Hjonnevag <a...@android.com>
Cc: Carlos Maiolino <cmaiol...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Chuck Lever <chuck.le...@oracle.com>
Cc: Daniel Vetter <daniel.vet...@ffwll.ch>
Cc: David Rientjes <rient...@google.com>
Cc: Gleb Natapov <g...@redhat.com>
Cc: Greg Thelen <gthe...@google.com>
Cc: J. Bruce Fields <bfie...@redhat.com>
Cc: Jan Kara <j...@suse.cz>
Cc: Jerome Glisse <jgli...@redhat.com>
Cc: John Stultz <john.stu...@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hir...@jp.fujitsu.com>
Cc: Kent Overstreet <koverstr...@google.com>
Cc: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
Cc: Marcelo Tosatti <mtosa...@redhat.com>
Cc: Mel Gorman <mgor...@suse.de>
Cc: Steven Whitehouse <swhit...@redhat.com>
Cc: Thomas Hellstrom <thellst...@vmware.com>
Cc: Trond Myklebust <trond.mykleb...@netapp.com>
Signed-off-by: Andrew Morton <a...@linux-foundation.org>
Signed-off-by: Al Viro <v...@zeniv.linux.org.uk>

Signed-off-by: Brian Foster <bfos...@redhat.com>
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 fs/xfs/xfs_buf.c | 57 
 fs/xfs/xfs_buf.h |  8 +++-
 2 files changed, 11 insertions(+), 54 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index e380398..c0de0e2 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -96,7 +96,7 @@ xfs_buf_lru_add(
	atomic_inc(&bp->b_hold);
	list_add_tail(&bp->b_lru, &btp->bt_lru);
btp->bt_lru_nr++;
-   bp->b_state &= ~XFS_BSTATE_DISPOSE;
+   bp->b_lru_flags &= ~_XBF_LRU_DISPOSE;
}
	spin_unlock(&btp->bt_lru_lock);
 }
@@ -198,21 +198,19 @@ xfs_buf_stale(
 */
xfs_buf_ioacct_dec(bp);
 
-	spin_lock(&bp->b_lock);
-	atomic_set(&bp->b_lru_ref, 0);
+	atomic_set(&(bp)->b_lru_ref, 0);
	if (!list_empty(&bp->b_lru)) {
struct xfs_buftarg *btp = bp->b_target;
 
	spin_lock(&btp->bt_lru_lock);
	if (!list_empty(&bp->b_lru) &&
-   !(bp->b_state & XFS_BSTATE_DISPOSE)) {
+   !(bp->b_lru_flags & _XBF_LRU_DISPOSE)) {
	list_del_init(&bp->b_lru);
btp->bt_lru_nr--;
	atomic_dec(&bp->b_hold);
}
	spin_unlock(&btp->bt_lru_lock);
}
-	spin_unlock(&bp->b_lock);
	ASSERT(atomic_read(&bp->b_hold) >= 1);
 }
 
@

[Devel] [PATCH 2/4] ms/xfs: convert buftarg LRU to generic code

2016-12-06 Thread Dmitry Monakhov
Convert the buftarg LRU to use the new generic LRU list and take advantage
of the functionality it supplies to make the buffer cache shrinker node
aware.

Signed-off-by: Glauber Costa <glom...@openvz.org>
Signed-off-by: Dave Chinner <dchin...@redhat.com>
Cc: "Theodore Ts'o" <ty...@mit.edu>
Cc: Adrian Hunter <adrian.hun...@intel.com>
Cc: Al Viro <v...@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityuts...@linux.intel.com>
Cc: Arve Hjønnevåg <a...@android.com>
Cc: Carlos Maiolino <cmaiol...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Chuck Lever <chuck.le...@oracle.com>
Cc: Daniel Vetter <daniel.vet...@ffwll.ch>
Cc: David Rientjes <rient...@google.com>
Cc: Gleb Natapov <g...@redhat.com>
Cc: Greg Thelen <gthe...@google.com>
Cc: J. Bruce Fields <bfie...@redhat.com>
Cc: Jan Kara <j...@suse.cz>
Cc: Jerome Glisse <jgli...@redhat.com>
Cc: John Stultz <john.stu...@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hir...@jp.fujitsu.com>
Cc: Kent Overstreet <koverstr...@google.com>
Cc: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
Cc: Marcelo Tosatti <mtosa...@redhat.com>
Cc: Mel Gorman <mgor...@suse.de>
Cc: Steven Whitehouse <swhit...@redhat.com>
Cc: Thomas Hellstrom <thellst...@vmware.com>
Cc: Trond Myklebust <trond.mykleb...@netapp.com>
Signed-off-by: Andrew Morton <a...@linux-foundation.org>
Signed-off-by: Al Viro <v...@zeniv.linux.org.uk>
(cherry picked from commit e80dfa19976b884db1ac2bc5d7d6ca0a4027bd1c)
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 fs/xfs/xfs_buf.c | 170 ++-
 fs/xfs/xfs_buf.h |   5 +-
 2 files changed, 81 insertions(+), 94 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index c0de0e2..87a314a 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -85,20 +85,14 @@ xfs_buf_vmap_len(
  * The LRU takes a new reference to the buffer so that it will only be freed
  * once the shrinker takes the buffer off the LRU.
  */
-STATIC void
+static void
 xfs_buf_lru_add(
struct xfs_buf  *bp)
 {
-   struct xfs_buftarg *btp = bp->b_target;
-
-	spin_lock(&btp->bt_lru_lock);
-	if (list_empty(&bp->b_lru)) {
-		atomic_inc(&bp->b_hold);
-		list_add_tail(&bp->b_lru, &btp->bt_lru);
-   btp->bt_lru_nr++;
+	if (list_lru_add(&bp->b_target->bt_lru, &bp->b_lru)) {
bp->b_lru_flags &= ~_XBF_LRU_DISPOSE;
+		atomic_inc(&bp->b_hold);
}
-	spin_unlock(&btp->bt_lru_lock);
 }
 
 /*
@@ -107,24 +101,13 @@ xfs_buf_lru_add(
  * The unlocked check is safe here because it only occurs when there are not
  * b_lru_ref counts left on the inode under the pag->pag_buf_lock. it is there
  * to optimise the shrinker removing the buffer from the LRU and calling
- * xfs_buf_free(). i.e. it removes an unnecessary round trip on the
- * bt_lru_lock.
+ * xfs_buf_free().
  */
-STATIC void
+static void
 xfs_buf_lru_del(
struct xfs_buf  *bp)
 {
-   struct xfs_buftarg *btp = bp->b_target;
-
-	if (list_empty(&bp->b_lru))
-   return;
-
-	spin_lock(&btp->bt_lru_lock);
-	if (!list_empty(&bp->b_lru)) {
-		list_del_init(&bp->b_lru);
-   btp->bt_lru_nr--;
-   }
-	spin_unlock(&btp->bt_lru_lock);
+	list_lru_del(&bp->b_target->bt_lru, &bp->b_lru);
 }
 
 /*
@@ -199,18 +182,10 @@ xfs_buf_stale(
xfs_buf_ioacct_dec(bp);
 
atomic_set(&(bp)->b_lru_ref, 0);
-	if (!list_empty(&bp->b_lru)) {
-   struct xfs_buftarg *btp = bp->b_target;
-
-		spin_lock(&btp->bt_lru_lock);
-		if (!list_empty(&bp->b_lru) &&
-   !(bp->b_lru_flags & _XBF_LRU_DISPOSE)) {
-			list_del_init(&bp->b_lru);
-   btp->bt_lru_nr--;
-			atomic_dec(&bp->b_hold);
-   }
-		spin_unlock(&btp->bt_lru_lock);
-   }
+   if (!(bp->b_lru_flags & _XBF_LRU_DISPOSE) &&
+	    (list_lru_del(&bp->b_target->bt_lru, &bp->b_lru)))
+		atomic_dec(&bp->b_hold);
+
	ASSERT(atomic_read(&bp->b_hold) >= 1);
 }
 
@@ -1597,11 +1572,14 @@ xfs_buf_iomove(
  * returned. These buffers will have an elevated hold count, so wait on those
  * while freeing all the buffers only held by the LRU.
  */
-void
-xfs_wait_buftarg(
-   struct xfs_buftarg  *btp)
+static enum lru_status
+xfs_buftarg_wait_rele(
+   struct list_head*item,
+   spinlock_t  *lru_lock,
+   void*arg)
+
 {
-   struct xfs_buf  *bp;
+   struct xfs_buf  *bp = container_of(item, struct xfs_buf, b_lru);
 
/*
 * First wait on the buftarg I/O count for all in-flight b

[Devel] [PATCH 3/4] ms/xfs-convert-buftarg-lru-to-generic-code-fix

2016-12-06 Thread Dmitry Monakhov
From: Andrew Morton <a...@linux-foundation.org>

fix warnings

Cc: Dave Chinner <dchin...@redhat.com>
Cc: Glauber Costa <glom...@openvz.org>
Signed-off-by: Andrew Morton <a...@linux-foundation.org>
Signed-off-by: Al Viro <v...@zeniv.linux.org.uk>
(cherry picked from commit addbda40bed47d8942658fca93e14b5f1cbf009a)

Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 fs/xfs/xfs_buf.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 87a314a..bf933d5 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1654,7 +1654,7 @@ xfs_buftarg_isolate(
return LRU_REMOVED;
 }
 
-static long
+static unsigned long
 xfs_buftarg_shrink_scan(
struct shrinker *shrink,
struct shrink_control   *sc)
@@ -1662,7 +1662,7 @@ xfs_buftarg_shrink_scan(
struct xfs_buftarg  *btp = container_of(shrink,
struct xfs_buftarg, bt_shrinker);
LIST_HEAD(dispose);
-   longfreed;
+   unsigned long   freed;
unsigned long   nr_to_scan = sc->nr_to_scan;
 
	freed = list_lru_walk_node(&btp->bt_lru, sc->nid, xfs_buftarg_isolate,
@@ -1678,7 +1678,7 @@ xfs_buftarg_shrink_scan(
return freed;
 }
 
-static long
+static unsigned long
 xfs_buftarg_shrink_count(
struct shrinker *shrink,
struct shrink_control   *sc)
-- 
2.7.4
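
For reference, the warnings come from the return types: with the backported
per-node shrinker API both callbacks are declared to return unsigned long, so
the long-typed locals and prototypes no longer match. The upstream declaration
looks roughly like this (a sketch; the field layout may differ in this tree):

struct shrinker {
	unsigned long (*count_objects)(struct shrinker *,
				       struct shrink_control *sc);
	unsigned long (*scan_objects)(struct shrinker *,
				      struct shrink_control *sc);
	/* ... */
};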

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH 4/4] ms/xfs: rework buffer dispose list tracking

2016-12-06 Thread Dmitry Monakhov
In converting the buffer lru lists to use the generic code, the locking
for marking the buffers as on the dispose list was lost.  This results in
confusion in LRU buffer tracking and accounting, resulting in reference
counts being mucked up and the filesystem being unmountable.

To fix this, introduce an internal buffer spinlock to protect the state
field that holds the dispose list information.  Because there is now
locking needed around xfs_buf_lru_add/del, and they are used in exactly
one place each two lines apart, get rid of the wrappers and code the logic
directly in place.

Further, the LRU emptying code used on unmount is less than optimal.
Convert it to use a dispose list as per a normal shrinker walk, and repeat
the walk that fills the dispose list until the LRU is empty. This avoids
needing to drop and regain the LRU lock for every item being freed, and
allows the same logic as the shrinker isolate call to be used.  Simpler,
easier to understand.

Signed-off-by: Dave Chinner <dchin...@redhat.com>
Signed-off-by: Glauber Costa <glom...@openvz.org>
Cc: "Theodore Ts'o" <ty...@mit.edu>
Cc: Adrian Hunter <adrian.hun...@intel.com>
Cc: Al Viro <v...@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityuts...@linux.intel.com>
Cc: Arve Hjønnevåg <a...@android.com>
Cc: Carlos Maiolino <cmaiol...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Chuck Lever <chuck.le...@oracle.com>
Cc: Daniel Vetter <daniel.vet...@ffwll.ch>
Cc: David Rientjes <rient...@google.com>
Cc: Gleb Natapov <g...@redhat.com>
Cc: Greg Thelen <gthe...@google.com>
Cc: J. Bruce Fields <bfie...@redhat.com>
Cc: Jan Kara <j...@suse.cz>
Cc: Jerome Glisse <jgli...@redhat.com>
Cc: John Stultz <john.stu...@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hir...@jp.fujitsu.com>
Cc: Kent Overstreet <koverstr...@google.com>
Cc: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
Cc: Marcelo Tosatti <mtosa...@redhat.com>
Cc: Mel Gorman <mgor...@suse.de>
Cc: Steven Whitehouse <swhit...@redhat.com>
Cc: Thomas Hellstrom <thellst...@vmware.com>
Cc: Trond Myklebust <trond.mykleb...@netapp.com>
Signed-off-by: Andrew Morton <a...@linux-foundation.org>
Signed-off-by: Al Viro <v...@zeniv.linux.org.uk>
(cherry picked from commit a408235726aa82c0358c9ec68124b6f4bc0a79df)
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 fs/xfs/xfs_buf.c | 147 +++
 fs/xfs/xfs_buf.h |   8 ++-
 2 files changed, 78 insertions(+), 77 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index bf933d5..8d8c9ce 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -80,37 +80,6 @@ xfs_buf_vmap_len(
 }
 
 /*
- * xfs_buf_lru_add - add a buffer to the LRU.
- *
- * The LRU takes a new reference to the buffer so that it will only be freed
- * once the shrinker takes the buffer off the LRU.
- */
-static void
-xfs_buf_lru_add(
-   struct xfs_buf  *bp)
-{
-	if (list_lru_add(&bp->b_target->bt_lru, &bp->b_lru)) {
-		bp->b_lru_flags &= ~_XBF_LRU_DISPOSE;
-		atomic_inc(&bp->b_hold);
-   }
-}
-
-/*
- * xfs_buf_lru_del - remove a buffer from the LRU
- *
- * The unlocked check is safe here because it only occurs when there are not
- * b_lru_ref counts left on the inode under the pag->pag_buf_lock. it is there
- * to optimise the shrinker removing the buffer from the LRU and calling
- * xfs_buf_free().
- */
-static void
-xfs_buf_lru_del(
-   struct xfs_buf  *bp)
-{
-	list_lru_del(&bp->b_target->bt_lru, &bp->b_lru);
-}
-
-/*
  * Bump the I/O in flight count on the buftarg if we haven't yet done so for
  * this buffer. The count is incremented once per buffer (per hold cycle)
  * because the corresponding decrement is deferred to buffer release. Buffers
@@ -181,12 +150,14 @@ xfs_buf_stale(
 */
xfs_buf_ioacct_dec(bp);
 
-   atomic_set(&(bp)->b_lru_ref, 0);
-   if (!(bp->b_lru_flags & _XBF_LRU_DISPOSE) &&
+	spin_lock(&bp->b_lock);
+	atomic_set(&bp->b_lru_ref, 0);
+	if (!(bp->b_state & XFS_BSTATE_DISPOSE) &&
	    (list_lru_del(&bp->b_target->bt_lru, &bp->b_lru)))
		atomic_dec(&bp->b_hold);
 
	ASSERT(atomic_read(&bp->b_hold) >= 1);
+	spin_unlock(&bp->b_lock);
 }
 
 static int
@@ -987,10 +958,28 @@ xfs_buf_rele(
/* the last reference has been dropped ... */
xfs_buf_ioacct_dec(bp);
	if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
-   xfs_buf_lru_add(bp);
+   /*
+* If the buffer is added to the LRU take a new
+* reference to the buffer for the LRU and clear the
+* (now stale) dispose list state flag
+*/
+		if (list_lru_add(&bp->b_target->

[Devel] [PATCH 0/4] [7.3] rebase xfs lru patches

2016-12-06 Thread Dmitry Monakhov
rh7-3.10.0-514 already has 'fs-xfs-rework-buffer-dispose-list-tracking', but
originally it depends on ms/xfs-convert-buftarg-LRU-to-generic, so in order
to preserve the original logic I've reverted the RHEL patch (the 1st one)
and reapplied it later in the natural order:
TOC:
0001-Revert-fs-xfs-rework-buffer-dispose-list-tracking.patch

0002-ms-xfs-convert-buftarg-LRU-to-generic-code.patch
0003-From-c70ded437bb646ace0dcbf3c7989d4edeed17f7e-Mon-Se.patch [not changed]
0004-ms-xfs-rework-buffer-dispose-list-tracking.patch
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH 1/2] fs: constify iov_iter_count/iov_iter_iovec helpers

2016-12-05 Thread Dmitry Monakhov
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 include/linux/fs.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index e30e8a1..a27bd15 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -448,13 +448,13 @@ static inline int iov_iter_has_iovec(const struct 
iov_iter *i)
 {
	return i->ops == &ii_iovec_ops;
 }
-static inline struct iovec *iov_iter_iovec(struct iov_iter *i)
+static inline struct iovec *iov_iter_iovec(const struct iov_iter *i)
 {
BUG_ON(!iov_iter_has_iovec(i));
return (struct iovec *)i->data;
 }
 
-static inline size_t iov_iter_count(struct iov_iter *i)
+static inline size_t iov_iter_count(const struct iov_iter *i)
 {
return i->count;
 }
-- 
2.7.4
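
A small illustration of what the constification buys: read-only helpers can
now be called from code that only holds a const pointer to the iterator.
A hypothetical caller (not from the tree):

/* Checks remaining bytes without modifying the iterator. */
static inline bool iov_iter_has_room(const struct iov_iter *i, size_t need)
{
	return iov_iter_count(i) >= need;
}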

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH 2/2] fs/ceph: honor kernel direct aio changes v2

2016-12-05 Thread Dmitry Monakhov
Base patches:
fs/ceph: honor kernel direct aio changes
fs/ceph: add BUG_ON to iov_iter access

Changes: replace the open-coded iter-to-iovec conversion with the proper helper.
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 fs/ceph/file.c | 30 --
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 82676fa..0b72417 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -40,8 +40,8 @@
  */
 static size_t dio_get_pagev_size(const struct iov_iter *it)
 {
-const struct iovec *iov = it->iov;
-const struct iovec *iovend = iov + it->nr_segs;
+const struct iovec *iov = iov_iter_iovec(it);
+size_t total = iov_iter_count(it);
 size_t size;
 
 size = iov->iov_len - it->iov_offset;
@@ -50,8 +50,10 @@ static size_t dio_get_pagev_size(const struct iov_iter *it)
  * and the next base are page aligned.
  */
 while (PAGE_ALIGNED((iov->iov_base + iov->iov_len)) &&
-	       (++iov < iovend && PAGE_ALIGNED((iov->iov_base)))) {
-size += iov->iov_len;
+   PAGE_ALIGNED(((iov++)->iov_base))) {
+   size_t n =  min(iov->iov_len, total);
+   size += n;
+   total -= n;
 }
 dout("dio_get_pagevlen len = %zu\n", size);
 return size;
@@ -71,7 +73,7 @@ dio_get_pages_alloc(const struct iov_iter *it, size_t nbytes,
struct page **pages;
int ret = 0, idx, npages;
 
-   align = (unsigned long)(it->iov->iov_base + it->iov_offset) &
+   align = (unsigned long)(iov_iter_iovec(it)->iov_base + it->iov_offset) &
(PAGE_SIZE - 1);
npages = calc_pages_for(align, nbytes);
pages = kmalloc(sizeof(*pages) * npages, GFP_KERNEL);
@@ -82,10 +84,11 @@ dio_get_pages_alloc(const struct iov_iter *it, size_t 
nbytes,
}
 
for (idx = 0; idx < npages; ) {
-   void __user *data = tmp_it.iov->iov_base + tmp_it.iov_offset;
+	struct iovec *tmp_iov = iov_iter_iovec(&tmp_it);
+   void __user *data = tmp_iov->iov_base + tmp_it.iov_offset;
size_t off = (unsigned long)data & (PAGE_SIZE - 1);
size_t len = min_t(size_t, nbytes,
-  tmp_it.iov->iov_len - tmp_it.iov_offset);
+  tmp_iov->iov_len - tmp_it.iov_offset);
int n = (len + off + PAGE_SIZE - 1) >> PAGE_SHIFT;
ret = get_user_pages_fast((unsigned long)data, n, write,
   pages + idx);
@@ -522,10 +525,9 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct 
iov_iter *i,
size_t left = len = ret;
 
while (left) {
-   void __user *data = i->iov[0].iov_base +
-   i->iov_offset;
-   l = min(i->iov[0].iov_len - i->iov_offset,
-   left);
+   struct iovec *iov = (struct iovec *)i->data;
+   void __user *data = iov->iov_base + i->iov_offset;
+   l = min(iov->iov_len - i->iov_offset, left);
 
	ret = ceph_copy_page_vector_to_user(&pages[k],
data, off, l);
@@ -1121,7 +1123,7 @@ static ssize_t inline_to_iov(struct kiocb *iocb, struct 
iov_iter *i,
 
while (left) {
struct iovec *iov = iov_iter_iovec(i);
-   void __user *udata = iov->iov_base + i->iov_offset;
+   void __user *udata = iov->iov_base;
size_t n = min(iov->iov_len - i->iov_offset, left);
 
if (__copy_to_user(udata, kdata, n)) {
@@ -1139,8 +1141,8 @@ static ssize_t inline_to_iov(struct kiocb *iocb, struct 
iov_iter *i,
size_t left = min_t(loff_t, iocb->ki_pos + len, i_size) - pos;
 
while (left) {
-   struct iovec *iov = iov_iter_iovec(i);
-   void __user *udata = iov->iov_base + i->iov_offset;
+   struct iovec *iov = (struct iovec *)i->data;
+   void __user *udata = iov->iov_base;
size_t n = min(iov->iov_len - i->iov_offset, left);
 
if (__clear_user(udata, n)) {
-- 
2.7.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH RH7] vfs: add warning in guard_bio_eod() if truncated_bytes > bvec->bv_len

2016-12-03 Thread Dmitry Monakhov

Pavel Tikhomirov  writes:

> https://jira.sw.ru/browse/PSBM-55105
>
> In the bug we crashed in zero_fill_bio when trying to zero (memset) a bio_vec:
>
> struct bio_vec {
>   bv_page = 0xea0004437500,
>   bv_len = 4294948864,
>   bv_offset = 0
> }
>
> which is bigger than its bio->bi_size = 104448. guard_bio_eod might
> lead to this bv_len overflow and is suspicious, as quite recently
> in vz7.19.4 we ported commit 2573b2539875 ("vfs: make guard_bh_eod()
> more generic") which adds the bv_len reduction, and before that there
> was no crash.
>
> Signed-off-by: Pavel Tikhomirov 
> ---
>  fs/buffer.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index c45200d..b820080 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -3009,6 +3009,7 @@ void guard_bio_eod(int rw, struct bio *bio)
>  
>   /* Truncate the bio.. */
>   bio->bi_size -= truncated_bytes;
> + WARN_ON(truncated_bytes > bvec->bv_len);
BUG_ON would be more appropriate here.
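For the record, the bv_len from the crash dump is consistent with an unsigned
wrap: 4294948864 == 2^32 - 18432, exactly what "bvec->bv_len -= truncated_bytes"
produces when truncated_bytes exceeds bv_len by 18432. A standalone
illustration with made-up values (not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned int bv_len = 2048;		/* hypothetical */
	unsigned int truncated_bytes = 20480;	/* larger than bv_len */

	bv_len -= truncated_bytes;		/* wraps around zero */
	printf("%u\n", bv_len);			/* prints 4294948864 */
	return 0;
}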
>   bvec->bv_len -= truncated_bytes;
>  
>   /* ..and clear the end of the buffer for reads */
> -- 
> 2.9.3

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH vz7] fuse: no mtime flush on fdatasync

2016-12-02 Thread Dmitry Monakhov

Maxim Patlasov  writes:

> fuse_fsync_common() may skip fuse_flush_mtime() if datasync=1 because
> mtime is pure metadata and the content of file doesn't depend on it.
>
> https://jira.sw.ru/browse/PSBM-55919
>
> Signed-off-by: Maxim Patlasov 
ACK.
> ---
>  fs/fuse/file.c |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 559dfd9..e5c4778 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -684,8 +684,8 @@ int fuse_fsync_common(struct file *file, loff_t start, 
> loff_t end,
>   if (err)
>   goto out;
>  
> - if (test_bit(FUSE_I_MTIME_UPDATED,
> -  _fuse_inode(inode)->state)) {
> + if (!datasync && test_bit(FUSE_I_MTIME_UPDATED,
> +   _fuse_inode(inode)->state)) {
>   err = fuse_flush_mtime(file, false);
>   if (err)
>   goto out;
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [vzlin-dev] [PATCH vz7] fuse: relax i_mutex coverage in fuse_fsync

2016-12-01 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> Alexey,
>
>
> You're right. And while composing the patch I well understood that it's 
> possible to rework fuse_sync_writes() using a counter instead of 
> negative bias. But the problem with flush_mtime still exists anyway. 
> Think about it: we firstly acquire local mtime from local inode, then 
> fill and submit mtime-update-request. Since then, we don't know when 
> exactly fuse daemon will apply that new mtime to its metadata 
> structures. If another mtime-update is generated in-between (e.g. "touch 
> -d  file", or even simplier -- just a single direct write 
> implicitly updating mtime), we wouldn't know which of those two 
> mtime-update-requests are processed by fused first. That comes from a 
> general FUSE protocol limitation: when kernel fuse queues request A, 
> then request B, it cannot be sure if they will be processed by userspace 
> as  or .
>
>
> The big advantage of the patch I sent is that it's very simple, 
> straightforward and presumably will remove 99% of contention between 
> fsync and io_submit (assuming we spend most of time waiting for 
> userspace ACK for a FUSE_FSYNC request). There are actually three questions 
> to answer:

>
>
> 1) Must we really honor a crazy app that mixes a lot of fsyncs with a
> lot of io_submits? The goal of fsync is to ensure that some state has
> actually gone to the platters. An app that races io_submit-s with fsync-s
> actually doesn't care which state will come to platters. I'm not sure 
> that it's reasonable to work very hard to achieve the best possible 
> performance for such a marginal app.
Obviously any filesystem behaves like this.
Task A (mail-server) may perform write/fsync while task B (mysql) does a lot
of io_submit-s. All that IO may happen in parallel; the fs guarantees only
that metadata will be serialized. So all that concurrent IO flows to the
block device, which has no i_mutex, so the IO indeed happens concurrently.
But when we deal with fs-in-file (loop/ploop/qemu-nbd) we face i_mutex on
the file. For a general filesystem (xfs/ext4) we grab i_mutex only on the
write path; fsync is lockless. But in the case of fuse we artificially
introduce i_mutex inside fsync, which basically kills concurrency for the
upper FS. As a result we get the SMP scalability we had in Linux v2.2, with
a single mutex in the VFS.

BTW: I wonder why we care about mtime at all. For fs-in-file we can relax
that, for example flush mtime only on fsync, and not on fdatasync.
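
A minimal sketch of that relaxation (this is essentially what the
"fuse: no mtime flush on fdatasync" patch above implements):

	/* mtime is pure metadata, so fdatasync may skip flushing it */
	if (!datasync && test_bit(FUSE_I_MTIME_UPDATED,
				  &get_fuse_inode(inode)->state)) {
		err = fuse_flush_mtime(file, false);
		if (err)
			goto out;
	}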

>
>
> 2) Will the patch (in the form I sent it) break something? I think no. 
> If you know some usecase that can be broken, let's discuss it in more 
> details.
>
>
> 3) Should we expect some noticeable (or significant) improvement in 
> performance comparing fuse_fsync with no locking at all vs. the locking 
> we have with that patch applied? I tend to think that the answer is "no" 
> because handling FUSE_FSYNC is notoriously heavy-weight operation. If 
> you disagree, let's firstly measure that difference in performance 
> (simply commenting out lock/unlock(i_mutex) in fuse_fsync) and then 
> start to think if it's really worthy to fully re-work locking scheme to 
> preserve flush_mtime correctness w/o i_mutex.
>
>
> Thanks,
>
> Maxim
>
>
> On 11/30/2016 05:09 AM, Alexey Kuznetsov wrote:
>> Sorry, missed that pair fuse_set_nowrite/fuse_release_writes
>> can be done only under i_mutex.
>>
>> IMHO it is only due to bad implementation.
>> If fuse_set_nowrite would be done with separate
>> count instead of adding negative bias, it would
>> be possible.
>>
>>
>> On Wed, Nov 30, 2016 at 3:47 PM, Alexey Kuznetsov  
>> wrote:
>>> Hello!
>>>
>>> I do not think you got it right.
>>>
>>> i_mutex in fsync is not about some atomicity,
>>> it is about stopping data feed while fsync is executed
>>> to prevent livelock.
>>>
>>> I cannot tell anything about mtime update, it is just some voodoo
>>> magic for me.
>>>
>>> As for fsync semantics, I see two different ways:
>>>
>>> A.
>>>
>>> 1. Remove useless write_inode_now. Its work is done
>>>  by filemap_write_and_wait_range(), there is no need to repeat it
>>> under mutex.
>>> 2. move mutex_lock _after_  fuse_sync_writes(), which is essentially
>>>  fuse continuation for filemap_write_and_wait_range().
>>> 3. i_mutex is preserved only around fsync call.
>>>
>>> B.
>>> 1. Remove  write_inode_now as well.
>>> 2. Remove i_mutex _completely_. (No idea about mtime voodo though)
>>> 2. Replace fuse_sync_writes() with fuse_set_nowrite()
>>>  and add release after call to FSYNC.
>>>
>>> Both prevent livelock. B is obviously optimal.
>>>
>>> But A preserves historic fuse protocol semantics.
>>> F.e. I have no idea would user space survive truncate
>>> racing with fsync. pstorage should survice, though this
>>> path was never tested.
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Nov 30, 2016 at 4:02 AM, Maxim Patlasov  
>>> wrote:
 fuse_fsync_common() does need i_mutex for 

[Devel] drop: ext4: resplit block_page_mkwrite: fix get-host convention

2016-11-18 Thread Dmitry Monakhov

We no longer need the vzfs crutches.
Please drop this patch:
ext4: resplit block_page_mkwrite: fix get-host convention
commit c97eaffbf6c9b909e324c59380962158185639bf
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH] scsi: make scsi errors loud

2016-11-11 Thread Dmitry Monakhov
This patch is not for release; it is for testing purposes only.
We need it in order to investigate #PSBM-54665

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 287045b..7364d86 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -141,12 +141,13 @@ int scsi_host_set_state(struct Scsi_Host *shost, enum 
scsi_host_state state)
return 0;
 
  illegal:
-   SCSI_LOG_ERROR_RECOVERY(1,
-   shost_printk(KERN_ERR, shost,
-"Illegal host state transition"
-"%s->%s\n",
-scsi_host_state_name(oldstate),
-scsi_host_state_name(state)));
+   shost_printk(KERN_ERR, shost,
+"Illegal host state transition"
+"%s->%s\n",
+scsi_host_state_name(oldstate),
+scsi_host_state_name(state));
+   dump_stack();
+
return -EINVAL;
 }
 EXPORT_SYMBOL(scsi_host_set_state);
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 573574b..c2e3307 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -61,6 +61,13 @@ struct virtio_scsi_vq {
struct virtqueue *vq;
 };
 
+#define __check_ret(val) do {  \
+   if (val == FAILED) {\
+   printk("virtscsi_failure"); \
+   dump_stack();   \
+   }   \
+   } while(0)
+
 /*
  * Per-target queue state.
  *
@@ -489,6 +496,7 @@ static int virtscsi_add_cmd(struct virtqueue *vq,
return virtqueue_add_sgs(vq, sgs, out_num, in_num, cmd, GFP_ATOMIC);
 }
 
+
 static int virtscsi_kick_cmd(struct virtio_scsi_vq *vq,
 struct virtio_scsi_cmd *cmd,
 size_t req_size, size_t resp_size)
@@ -633,6 +641,7 @@ static int virtscsi_tmf(struct virtio_scsi *vscsi, struct 
virtio_scsi_cmd *cmd)
virtscsi_poll_requests(vscsi);
 
 out:
+   __check_ret(ret);
mempool_free(cmd, virtscsi_cmd_pool);
return ret;
 }
@@ -644,8 +653,10 @@ static int virtscsi_device_reset(struct scsi_cmnd *sc)
 
sdev_printk(KERN_INFO, sc->device, "device reset\n");
cmd = mempool_alloc(virtscsi_cmd_pool, GFP_NOIO);
-   if (!cmd)
+   if (!cmd) {
+   __check_ret(FAILED);
return FAILED;
+   }
 
memset(cmd, 0, sizeof(*cmd));
cmd->sc = sc;
@@ -666,11 +677,12 @@ static int virtscsi_abort(struct scsi_cmnd *sc)
struct virtio_scsi *vscsi = shost_priv(sc->device->host);
struct virtio_scsi_cmd *cmd;
 
-   scmd_printk(KERN_INFO, sc, "abort\n");
+   scmd_printk(KERN_INFO, sc, "%s abort\n", __FUNCTION__);
cmd = mempool_alloc(virtscsi_cmd_pool, GFP_NOIO);
-   if (!cmd)
+   if (!cmd) {
+   __check_ret(FAILED);
return FAILED;
-
+   }
memset(cmd, 0, sizeof(*cmd));
cmd->sc = sc;
cmd->req.tmf = (struct virtio_scsi_ctrl_tmf_req){
-- 
2.7.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH] scsi-DBG: make scsi errors loud

2016-11-11 Thread Dmitry Monakhov
This patch is not for release; it is for testing purposes only.
We need it in order to investigate #PSBM-54665

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 287045b..7364d86 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -141,12 +141,13 @@ int scsi_host_set_state(struct Scsi_Host *shost, enum 
scsi_host_state state)
return 0;
 
  illegal:
-   SCSI_LOG_ERROR_RECOVERY(1,
-   shost_printk(KERN_ERR, shost,
-"Illegal host state transition"
-"%s->%s\n",
-scsi_host_state_name(oldstate),
-scsi_host_state_name(state)));
+   shost_printk(KERN_ERR, shost,
+"Illegal host state transition"
+"%s->%s\n",
+scsi_host_state_name(oldstate),
+scsi_host_state_name(state));
+   dump_stack();
+
return -EINVAL;
 }
 EXPORT_SYMBOL(scsi_host_set_state);
-- 
2.7.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH 1/2] ms/xfs: convert dquot cache lru to list_lru part2

2016-11-10 Thread Dmitry Monakhov
Modify the patch according to mainstream changeset ff6d6af2351, which
requires that XFS_STATS_XXX() take two arguments.
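
I.e. the call sites change from the old global form to the new per-mount form:

	XFS_STATS_INC(xs_qm_dqwants);			/* old: global stats */
	XFS_STATS_INC(dqp->q_mount, xs_qm_dqwants);	/* new: per-mount */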

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 fs/xfs/xfs_qm.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 1b383f5..a0518a8 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -478,11 +478,11 @@ xfs_qm_dquot_isolate(
 */
if (dqp->q_nrefs) {
xfs_dqunlock(dqp);
-   XFS_STATS_INC(xs_qm_dqwants);
+   XFS_STATS_INC(dqp->q_mount, xs_qm_dqwants);
 
trace_xfs_dqreclaim_want(dqp);
		list_lru_isolate(lru, &dqp->q_lru);
-   XFS_STATS_DEC(xs_qm_dquot_unused);
+   XFS_STATS_DEC(dqp->q_mount, xs_qm_dquot_unused);
return LRU_REMOVED;
}
 
@@ -526,19 +526,19 @@ xfs_qm_dquot_isolate(
 
ASSERT(dqp->q_nrefs == 0);
	list_lru_isolate_move(lru, &dqp->q_lru, &isol->dispose);
-   XFS_STATS_DEC(xs_qm_dquot_unused);
+   XFS_STATS_DEC(dqp->q_mount, xs_qm_dquot_unused);
trace_xfs_dqreclaim_done(dqp);
-   XFS_STATS_INC(xs_qm_dqreclaims);
+   XFS_STATS_INC(dqp->q_mount, xs_qm_dqreclaims);
return LRU_REMOVED;
 
 out_miss_busy:
trace_xfs_dqreclaim_busy(dqp);
-   XFS_STATS_INC(xs_qm_dqreclaim_misses);
+   XFS_STATS_INC(dqp->q_mount, xs_qm_dqreclaim_misses);
return LRU_SKIP;
 
 out_unlock_dirty:
trace_xfs_dqreclaim_busy(dqp);
-   XFS_STATS_INC(xs_qm_dqreclaim_misses);
+   XFS_STATS_INC(dqp->q_mount, xs_qm_dqreclaim_misses);
xfs_dqunlock(dqp);
spin_lock(lru_lock);
return LRU_RETRY;
-- 
2.7.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH 2/2] xfs: compile for 661c0b9b3

2016-11-10 Thread Dmitry Monakhov
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 fs/xfs/xfs_buf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index e379876..28ad0bf 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1582,7 +1582,7 @@ xfs_buftarg_wait_rele(
 
 {
struct xfs_buf  *bp = container_of(item, struct xfs_buf, b_lru);
-
+   struct xfs_buftarg  *btp = bp->b_target;
/*
 * First wait on the buftarg I/O count for all in-flight buffers to be
 * released. This is critical as new buffers do not make the LRU until
-- 
2.7.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH RH7] pfcache: hide trusted.pfcache from listxattr

2016-09-23 Thread Dmitry Monakhov
Pavel Tikhomirov  writes:

> In SyS_listxattr -> listxattr -> ext4_listxattr ->
> ext4_xattr_list_entries we choose list handler for
> each ext4_xattr_entry based on e_name_index, and as
> for trusted.pfcache index is EXT4_XATTR_INDEX_TRUSTED,
> we choose ext4_xattr_trusted_list which prints the xattr
> to the list.
>
> To hide our trusted.pfcache from list change e_name_index
> to new EXT4_XATTR_INDEX_TRUSTED_CSUM and thus use
> ext4_xattr_trusted_csum_list instead which won't put
> xattr to the returned list.
Why do we want to hide it?
>
> Test:
>
> TEST_FILE=/vz/root/101/testfile
> TEST_SHA1=`sha1sum $TEST_FILE | awk '{print $1}'`
> setfattr -n trusted.pfcache -v $TEST_SHA1 $TEST_FILE
> setfattr -n trusted.test -v test $TEST_FILE
> getfattr -d -m trusted $TEST_FILE
>
> before patch it was listed:
>
> trusted.pfcache="da39a3ee5e6b4b0d3255bfef95601890afd80709"
> trusted.test="test"
>
> after - not:
>
> trusted.test="test"
>
> https://jira.sw.ru/browse/PSBM-52180
> Signed-off-by: Pavel Tikhomirov 
> ---
>  fs/ext4/pfcache.c | 28 ++--
>  fs/ext4/xattr.c   |  1 +
>  fs/ext4/xattr.h   |  1 +
>  3 files changed, 16 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ext4/pfcache.c b/fs/ext4/pfcache.c
> index ff2300b..5fc6d9f 100644
> --- a/fs/ext4/pfcache.c
> +++ b/fs/ext4/pfcache.c
> @@ -441,8 +441,8 @@ int ext4_load_data_csum(struct inode *inode)
>  {
>   int ret;
>  
> - ret = ext4_xattr_get(inode, EXT4_XATTR_INDEX_TRUSTED,
> - EXT4_DATA_CSUM_NAME, EXT4_I(inode)->i_data_csum,
> + ret = ext4_xattr_get(inode, EXT4_XATTR_INDEX_TRUSTED_CSUM,
> + "", EXT4_I(inode)->i_data_csum,
>   EXT4_DATA_CSUM_SIZE);
>   if (ret < 0)
>   return ret;
> @@ -482,8 +482,8 @@ static int ext4_save_data_csum(struct inode *inode, u8 
> *csum)
>   if (ret)
>   return ret;
>  
> - return ext4_xattr_set(inode, EXT4_XATTR_INDEX_TRUSTED,
> - EXT4_DATA_CSUM_NAME, EXT4_I(inode)->i_data_csum,
> + return ext4_xattr_set(inode, EXT4_XATTR_INDEX_TRUSTED_CSUM,
> + "", EXT4_I(inode)->i_data_csum,
>   EXT4_DATA_CSUM_SIZE, 0);
>  }
>  
> @@ -492,8 +492,8 @@ void ext4_load_dir_csum(struct inode *inode)
>   char value[EXT4_DIR_CSUM_VALUE_LEN];
>   int ret;
>  
> - ret = ext4_xattr_get(inode, EXT4_XATTR_INDEX_TRUSTED,
> -  EXT4_DATA_CSUM_NAME, value, sizeof(value));
> + ret = ext4_xattr_get(inode, EXT4_XATTR_INDEX_TRUSTED_CSUM,
> +  "", value, sizeof(value));
>   if (ret == EXT4_DIR_CSUM_VALUE_LEN &&
>   !strncmp(value, EXT4_DIR_CSUM_VALUE, sizeof(value)))
>   ext4_set_inode_state(inode, EXT4_STATE_PFCACHE_CSUM);
> @@ -502,8 +502,8 @@ void ext4_load_dir_csum(struct inode *inode)
>  void ext4_save_dir_csum(struct inode *inode)
>  {
>   ext4_set_inode_state(inode, EXT4_STATE_PFCACHE_CSUM);
> - ext4_xattr_set(inode, EXT4_XATTR_INDEX_TRUSTED,
> - EXT4_DATA_CSUM_NAME,
> + ext4_xattr_set(inode, EXT4_XATTR_INDEX_TRUSTED_CSUM,
> + "",
>   EXT4_DIR_CSUM_VALUE,
>   EXT4_DIR_CSUM_VALUE_LEN, 0);
>  }
> @@ -516,8 +516,8 @@ void ext4_truncate_data_csum(struct inode *inode, loff_t 
> pos)
>  
>   if (EXT4_I(inode)->i_data_csum_end < 0) {
>   WARN_ON(journal_current_handle());
> - ext4_xattr_set(inode, EXT4_XATTR_INDEX_TRUSTED,
> - EXT4_DATA_CSUM_NAME, NULL, 0, 0);
> + ext4_xattr_set(inode, EXT4_XATTR_INDEX_TRUSTED_CSUM,
> + "", NULL, 0, 0);
>   ext4_close_pfcache(inode);
>   }
>   spin_lock(&inode->i_lock);
> @@ -658,8 +658,8 @@ static int ext4_xattr_trusted_csum_get(struct dentry 
> *dentry, const char *name,
>   return -EPERM;
>  
>   if (S_ISDIR(inode->i_mode))
> - return ext4_xattr_get(inode, EXT4_XATTR_INDEX_TRUSTED,
> -   EXT4_DATA_CSUM_NAME, buffer, size);
> + return ext4_xattr_get(inode, EXT4_XATTR_INDEX_TRUSTED_CSUM,
> +   "", buffer, size);
>  
>   if (!S_ISREG(inode->i_mode))
>   return -ENODATA;
> @@ -717,8 +717,8 @@ static int ext4_xattr_trusted_csum_set(struct dentry 
> *dentry, const char *name,
>   else
>   return -EINVAL;
>  
> - return ext4_xattr_set(inode, EXT4_XATTR_INDEX_TRUSTED,
> -   EXT4_DATA_CSUM_NAME, value, size, flags);
> + return ext4_xattr_set(inode, EXT4_XATTR_INDEX_TRUSTED_CSUM,
> +   "", value, size, flags);
>   }
>  
>   if (!S_ISREG(inode->i_mode))
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 5dabf58..81b5534 100644
> --- a/fs/ext4/xattr.c
> +++ 

Re: [Devel] [PATCH rh7 3/3] ploop: io_direct: delay f_op->fsync() until index_update for reloc requests (v3)

2016-07-29 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> Dima,
>
>
> One week elapsed, still no feedback from you. Do you have something 
> against this patch?
Sorry for the delay, Max. I was overloaded by the pending crap I had
collected before vacation, and lost your email. Again, sorry.

The whole patch looks good. Thank you for your rede

BTW: we definitely need regression testing for the original bug (broken
barriers and others). I'm working on that.
>
>
> Thanks,
>
> Maxim
>
>
> On 07/20/2016 11:21 PM, Maxim Patlasov wrote:
>> Commit 9f860e606 introduced an engine to delay fsync: doing
>> fallocate(FALLOC_FL_CONVERT_UNWRITTEN) dio_post_submit marks
>> io as PLOOP_IO_FSYNC_DELAYED to ensure that fsync happens
>> later, when incoming FLUSH|FUA comes.
>>
>> That was deemed as important because (PSBM-47026):
>>
>>> This optimization becomes more important due to the fact that customers 
>>> tend to use pcompact heavily => ploop images grow each day.
>> Now, we can easily re-use the engine to delay fsync for reloc
>> requests as well. As explained in the description of commit
>> 5aa3fe09:
>>
>>>  1->read_data_from_old_post
>>>  2->write_to_new_pos
>>>->sumbit_alloc
>>>   ->submit_pad
>>>   ->post_submit->convert_unwritten
>>>  3->update_index ->write_page with FLUSH|FUA
>>>  4->nullify_old_pos
>>> 5->issue_flush
>> by the time of step 3 extent conversion is not yet stable because
>> belongs to uncommitted transaction. But instead of doing fsync
>> inside ->post_submit, we can fsync later, as the very first step
>> of write_page for index_update.
>>
>> Changed in v2:
>>   - process delayed fsync asynchronously, via PLOOP_E_FSYNC_PENDED eng_state
>>
>> Changed in v3:
>>   - use extra arg for ploop_index_wb_proceed_or_delay() instead of ad-hoc 
>> PLOOP_REQ_FSYNC_IF_DELAYED
>>
>> https://jira.sw.ru/browse/PSBM-47026
>>
>> Signed-off-by: Maxim Patlasov 
>> ---
>>   drivers/block/ploop/dev.c   |9 +++--
>>   drivers/block/ploop/map.c   |   32 
>>   include/linux/ploop/ploop.h |1 +
>>   3 files changed, 36 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
>> index df3eec9..ed60b1f 100644
>> --- a/drivers/block/ploop/dev.c
>> +++ b/drivers/block/ploop/dev.c
>> @@ -2720,6 +2720,11 @@ restart:
>>  ploop_index_wb_complete(preq);
>>  break;
>>   
>> +case PLOOP_E_FSYNC_PENDED:
>> +/* fsync done */
>> +ploop_index_wb_proceed(preq);
>> +break;
>> +
>>  default:
>>  BUG();
>>  }
>> @@ -4106,7 +4111,7 @@ static void ploop_relocate(struct ploop_device * plo)
>>  preq->bl.tail = preq->bl.head = NULL;
>>  preq->req_cluster = 0;
>>  preq->req_size = 0;
>> -preq->req_rw = WRITE_SYNC|REQ_FUA;
>> +preq->req_rw = WRITE_SYNC;
>>  preq->eng_state = PLOOP_E_ENTRY;
>>  preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_A);
>>  preq->error = 0;
>> @@ -4410,7 +4415,7 @@ static void ploop_relocblks_process(struct 
>> ploop_device *plo)
>>  preq->bl.tail = preq->bl.head = NULL;
>>  preq->req_cluster = ~0U; /* uninitialized */
>>  preq->req_size = 0;
>> -preq->req_rw = WRITE_SYNC|REQ_FUA;
>> +preq->req_rw = WRITE_SYNC;
>>  preq->eng_state = PLOOP_E_ENTRY;
>>  preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_S);
>>  preq->error = 0;
>> diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
>> index 5f7fd66..715dc15 100644
>> --- a/drivers/block/ploop/map.c
>> +++ b/drivers/block/ploop/map.c
>> @@ -915,6 +915,24 @@ void ploop_index_wb_proceed(struct ploop_request * preq)
>>  put_page(page);
>>   }
>>   
>> +static void ploop_index_wb_proceed_or_delay(struct ploop_request * preq,
>> +int do_fsync_if_delayed)
>> +{
>> +if (do_fsync_if_delayed) {
>> +struct map_node * m = preq->map;
>> +struct ploop_delta * top_delta = map_top_delta(m->parent);
>> +struct ploop_io * top_io = _delta->io;
>> +
>> +if (test_bit(PLOOP_IO_FSYNC_DELAYED, _io->io_state)) {
>> +preq->eng_state = PLOOP_E_FSYNC_PENDED;
>> +ploop_add_req_to_fsync_queue(preq);
>> +return;
>> +}
>> +}
>> +
>> +ploop_index_wb_proceed(preq);
>> +}
>> +
>>   /* Data write is commited. Now we need to update index. */
>>   
>>   void ploop_index_update(struct ploop_request * preq)
>> @@ -927,6 +945,7 @@ void ploop_index_update(struct ploop_request * preq)
>>  int old_level;
>>  struct page * page;
>>  unsigned long state = READ_ONCE(preq->state);
>> +int do_fsync_if_delayed = 0;
>>   
>>  /* No way back, we are going to initiate index write. */
>>   
>> @@ -985,10 +1004,12 @@ void ploop_index_update(struct ploop_request * preq)

Re: [Devel] Bug 124651 - ext4 bugon panic when I mmap a file

2016-07-25 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> Dima,
>
>
> Just in case, does this:
>
>
> https://bugzilla.kernel.org/show_bug.cgi?id=124651
>
>
> affect us?
No. His testcase does not reproduce on 3.10.0-327.18.2.vz7.14.21.
I've tested this like this:


signature.asc
Description: PGP signature
#! /bin/bash

#Testcase for https://bugzilla.kernel.org/show_bug.cgi?id=124651

echo Install debug info
yum install -y systemtap systemtap-runtime
yum install --enablerepo=virtuozzo-updates-debuginfo \
--enablerepo=virtuozzo-os-debuginfo fedora-source -y \
vzkernel-devel-$(uname -r) vzkernel-debuginfo-$(uname -r) \
vzkernel-debuginfo-common-$(uname -m)-$(uname -r) || exit 1

# Fetch source
curl https://bugzilla.kernel.org/attachment.cgi?id=224251 > /tmp/test.c || exit 1

# The original stap file does not detect the sb by default, so I've modified it.
base64 -d >/tmp/fail_ext4.stp <
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH] ext4: fix broken mfsync_ioctl

2016-07-21 Thread Dmitry Monakhov
Fix an obvious user->kernel memcpy typo.

https://jira.sw.ru/browse/PSBM-49885
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 fs/ext4/ioctl.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 4ef2876..7260d99 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -775,6 +775,7 @@ resize_out:
struct ext4_ioc_mfsync_info mfsync;
struct file **filpp;
unsigned int *flags;
+   __u32 __user *usr_fd;
int i, err;
 
	if (copy_from_user(&mfsync, (struct ext4_ioc_mfsync_info *)arg,
@@ -784,6 +785,8 @@ resize_out:
}
if (mfsync.size == 0)
return 0;
+   usr_fd = (__u32 __user *) (arg + sizeof(__u32));
+
filpp = kzalloc(mfsync.size * sizeof(*filp), GFP_KERNEL);
if (!filpp)
return -ENOMEM;
@@ -797,12 +800,9 @@ resize_out:
int ret;
 
err = -EFAULT;
-   ret = get_user(fd, mfsync.fd + i);
-   if (ret) {
-   printk("%s:%d i:%d p:%p", __FUNCTION__, 
__LINE__,
-  i, mfsync.fd + i);
+   ret = get_user(fd, usr_fd + i);
+   if (ret)
goto mfsync_fput;
-   }
 
/* negative fd means fdata_sync */
flags[i] = (fd & (1<< 31)) != 0;
@@ -810,10 +810,8 @@ resize_out:
 
err = -EBADF;
filpp[i] = fget(fd);
-   if (!filpp[i]) {
-   printk("%s:%d", __FUNCTION__, __LINE__);
+   if (!filpp[i])
goto mfsync_fput;
-   }
}
err = ext4_sync_files(filpp, flags, mfsync.size);
 mfsync_fput:
-- 
1.8.3.1
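
For testing, a userspace call would look roughly like the sketch below. The
argument layout (a __u32 count followed by __u32 fds, where bit 31 of an fd
requests fdatasync) is inferred from the handler above, and the
EXT4_IOC_MFSYNC request number is an assumption:

#include <stdint.h>
#include <sys/ioctl.h>

struct mfsync_arg {
	uint32_t size;		/* number of fds that follow */
	uint32_t fd[2];		/* bit 31 set => fdatasync   */
};

static int mfsync_two(int fd_a, int fd_b)
{
	struct mfsync_arg arg = {
		.size = 2,
		.fd = { (uint32_t)fd_a,
			(uint32_t)fd_b | (1u << 31) },
	};
	/* EXT4_IOC_MFSYNC is assumed to be defined elsewhere */
	return ioctl(fd_a, EXT4_IOC_MFSYNC, &arg);
}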

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH] ext4: fix broken fsync for dirs/symlink

2016-07-20 Thread Dmitry Monakhov
bad commit: 6a63db16da84fe

xfstests: generic/321 generic/335 generic/348
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 fs/ext4/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c0e7acd..7e44850 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4919,8 +4919,8 @@ int ext4_force_commit(struct super_block *sb)
smp_rmb();
if (EXT4_SB(sb)->s_mount_flags & EXT4_MF_FS_ABORTED)
return -EROFS;
-   }
return 0;
+   }
 
journal = EXT4_SB(sb)->s_journal;
return ext4_journal_force_commit(journal);
-- 
1.8.3.1

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] ploop: io_direct: delay f_op->fsync() until index_update for reloc requests

2016-07-20 Thread Dmitry Monakhov
Maxim Patlasov <mpatla...@virtuozzo.com> writes:

> Dima,
>
>
> I have not heard from you since 07/06/2016. Do you agree with that 
> reasoning I provided in last email? What's your objection against the 
> patch now?
Max, this patch looks ugly because it mixes many things in one place.
In order to do this the right way, let's introduce an fsync-pended eng-state
where we can queue our requests and let the fsync thread handle them.
There are several places where we need such functionality:

ENTRY: for req with FUA and IO_FSYNC_PENDED
PLOOP_E_DATA_WBI: for reqs with FUA
PLOOP_E_NULLIFY:
PLOOP_E_COMPLETE: for reqs with FUA
Let's implement it once and it will work fine for all cases; see the sketch
below.
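
Something along these lines (an illustrative sketch of the suggested common
helper; it is roughly what the later v2/v3 patch implements via
ploop_add_req_to_fsync_queue):

/* Park a request until the fsync thread has flushed the top io. */
static void ploop_pend_fsync_or_proceed(struct ploop_request *preq)
{
	struct ploop_delta *top_delta = map_top_delta(preq->map->parent);
	struct ploop_io *top_io = &top_delta->io;

	if (test_bit(PLOOP_IO_FSYNC_DELAYED, &top_io->io_state)) {
		preq->eng_state = PLOOP_E_FSYNC_PENDED;
		list_add_tail(&preq->list, &top_io->fsync_queue);
		/* the fsync thread does f_op->fsync() and re-kicks preq */
		return;
	}
	ploop_index_wb_proceed(preq);
}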
>
>
> Thanks,
>
> Maxim
>
>
> On 07/06/2016 11:10 AM, Maxim Patlasov wrote:
>> Dima,
>>
>> On 07/06/2016 04:58 AM, Dmitry Monakhov wrote:
>>
>>> Maxim Patlasov <mpatla...@virtuozzo.com> writes:
>>>
>>>> Commit 9f860e606 introduced an engine to delay fsync: doing
>>>> fallocate(FALLOC_FL_CONVERT_UNWRITTEN) dio_post_submit marks
>>>> io as PLOOP_IO_FSYNC_DELAYED to ensure that fsync happens
>>>> later, when incoming FLUSH|FUA comes.
>>>>
>>>> That was deemed as important because (PSBM-47026):
>>>>
>>>>> This optimization becomes more important due to the fact that 
>>>>> customers tend to use pcompact heavily => ploop images grow each day.
>>>> Now, we can easily re-use the engine to delay fsync for reloc
>>>> requests as well. As explained in the description of commit
>>>> 5aa3fe09:
>>>>
>>>>>  1->read_data_from_old_post
>>>>>  2->write_to_new_pos
>>>>>->sumbit_alloc
>>>>>   ->submit_pad
>>>>>   ->post_submit->convert_unwritten
>>>>>  3->update_index ->write_page with FLUSH|FUA
>>>>>  4->nullify_old_pos
>>>>> 5->issue_flush
>>>> by the time of step 3 extent conversion is not yet stable because
>>>> belongs to uncommitted transaction. But instead of doing fsync
>>>> inside ->post_submit, we can fsync later, as the very first step
>>>> of write_page for index_update.
>>> NAK from me. What is advantage of this patch?
>>
>> The advantage is the following: in case of BAT multi-updates, instead 
>> of doing many fsync-s (one per dio_post_submit), we'll do only one 
>> (when final ->write_page is called).
>>
>>> Does it make the code more optimal? No
>>
>> Yes, it does. In the same sense as 9f860e606: saving some fsync-s.
>>
>>> Does it make main ploop more asynchronous? No.
>>
>> Correct, the patch optimizes ploop in the other way. It's not about 
>> making ploop more asynchronous.
>>
>>
>>>
>>> If you want to make optimization then it is reasonable to
>>> queue preq with PLOOP_IO_FSYNC_DELAYED to top_io->fsync_queue
>>> before processing PLOOP_E_DATA_WBI  state for  preq with FUA
>>> So sequence will looks like follows:
>>> ->sumbit_alloc
>>>->submit_pad
>>>->post_submit->convert_unwritten-> tag PLOOP_IO_FSYNC_DELAYED
>>> ->ploop_req_state_process
>>>case PLOOP_E_DATA_WBI:
>>>if (preq->start & PLOOP_IO_FSYNC_DELAYED_FL) {
>>>preq->start &= ~PLOOP_IO_FSYNC_DELAYED_FL
>>>list_add_tail(&preq->list, &top_io->fsync_queue)
>>>return;
>>> }
>>> ##Let fsync_thread do it's work
>>> ->ploop_req_state_process
>>> case LOOP_E_DATA_WBI:
>>> update_index->write_page with FUA (FLUSH is not required because 
>>> we  already done fsync)
>>
>> That's another type of optimization: making ploop more asynchronous. I 
>> thought about it, but didn't come to conclusion whether it's worthy 
>> w.r.t. adding more complexity to ploop-state-machine and possible bugs 
>> introduced with that.
>>
>> Thanks,
>> Maxim
>>
>>>
>>>> https://jira.sw.ru/browse/PSBM-47026
>>>>
>>>> Signed-off-by: Maxim Patlasov <mpatla...@virtuozzo.com>
>>>> ---
>>>>   drivers/block/ploop/dev.c   |4 ++--
>>>>   drivers/block/ploop/io_direct.c |   25 -
>>>>   drivers/block/ploop/io_kaio.c   |3 ++-
>>>>   drivers/block/ploop/map.c   |   17 -
>>>>   include/linux/ploop/ploop.h |3 ++-
>>>>  

[Devel] [PATCH] ext4: improve ext4lazyinit scalability

2016-07-15 Thread Dmitry Monakhov
ext4lazyinit is a global thread. This thread performs itable initialization
under the li_list_mtx mutex.

It basically does the following:
ext4_lazyinit_thread
  ->mutex_lock(&eli->li_list_mtx);
  ->ext4_run_li_request(elr)
    ->ext4_init_inode_table -> do a lot of IO if the list is large

And when new mounts/umounts arrive they have to block on ->li_list_mtx,
because the lazyinit thread holds it during the full walk procedure.
ext4_fill_super
 ->ext4_register_li_request
   ->mutex_lock(&ext4_li_info->li_list_mtx);
   ->list_add(&elr->lr_request, &ext4_li_info->li_request_list);
In my case mount takes 40 minutes on a server with 36 * 4Tb HDDs.
An ordinary user may face this with a very slow device (/dev/mmcblkXXX).
Even worse: if one of the filesystems is frozen, lazyinit_thread simply
blocks on sb_start_write(), so other mounts/umounts will hang forever.

This patch changes the logic as follows:
- grab the ->s_umount read sem before processing a new li_request; after that
  it is safe to drop li_list_mtx, because all callers of
  ext4_remove_li_request hold ->s_umount for write.
- li_thread skips frozen SBs.

Locking:
The locking order is asserted by the umount path as follows:
s_umount -> li_list_mtx. So the only way to grab ->s_umount inside li_thread
is via down_read_trylock().
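
A sketch of the resulting per-request step (it mirrors the hunk below, with
error handling elided):

	if (down_read_trylock(&elr->lr_super->s_umount)) {
		if (sb_start_write_trylock(elr->lr_super)) {
			/* sb is pinned: safe to drop li_list_mtx and do IO */
			mutex_unlock(&eli->li_list_mtx);
			err = ext4_run_li_request(elr);
			sb_end_write(elr->lr_super);
			mutex_lock(&eli->li_list_mtx);
		}
		up_read(&elr->lr_super->s_umount);
	}
	/* either trylock failed: umount or freeze in progress, skip for now */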

https://jira.sw.ru/browse/PSBM-49658

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 fs/ext4/super.c | 53 -
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3822a5a..0ee193f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2635,7 +2635,6 @@ static int ext4_run_li_request(struct ext4_li_request 
*elr)
sb = elr->lr_super;
ngroups = EXT4_SB(sb)->s_groups_count;
 
-   sb_start_write(sb);
for (group = elr->lr_next_group; group < ngroups; group++) {
gdp = ext4_get_group_desc(sb, group, NULL);
if (!gdp) {
@@ -2662,8 +2661,6 @@ static int ext4_run_li_request(struct ext4_li_request 
*elr)
elr->lr_next_sched = jiffies + elr->lr_timeout;
elr->lr_next_group = group + 1;
}
-   sb_end_write(sb);
-
return ret;
 }
 
@@ -2713,9 +2710,9 @@ static struct task_struct *ext4_lazyinit_task;
 static int ext4_lazyinit_thread(void *arg)
 {
struct ext4_lazy_init *eli = (struct ext4_lazy_init *)arg;
-   struct list_head *pos, *n;
struct ext4_li_request *elr;
unsigned long next_wakeup, cur;
+   LIST_HEAD(request_list);
 
BUG_ON(NULL == eli);
 
@@ -2728,21 +2725,43 @@ cont_thread:
	mutex_unlock(&eli->li_list_mtx);
goto exit_thread;
}
-
-   list_for_each_safe(pos, n, >li_request_list) {
-   elr = list_entry(pos, struct ext4_li_request,
-lr_request);
-
-   if (time_after_eq(jiffies, elr->lr_next_sched)) {
-   if (ext4_run_li_request(elr) != 0) {
-   /* error, remove the lazy_init job */
-   ext4_remove_li_request(elr);
-   continue;
+   list_splice_init(&eli->li_request_list, &request_list);
+   while (!list_empty(_list)) {
+   int err = 0;
+   int progress = 0;
+
+   elr = list_entry(request_list.next,
+struct ext4_li_request, lr_request);
+   list_move(request_list.next, &eli->li_request_list);
+   if (time_before(jiffies, elr->lr_next_sched)) {
+   if (time_before(elr->lr_next_sched, next_wakeup))
+   next_wakeup = elr->lr_next_sched;
+   continue;
+   }
+   if (down_read_trylock(&elr->lr_super->s_umount)) {
+   if (sb_start_write_trylock(elr->lr_super)) {
+   progress = 1;
+   /* We hold sb->s_umount, sb can not
+* be removed from the list, it is
+* now safe to drop li_list_mtx
+*/
+   mutex_unlock(&eli->li_list_mtx);
+   err = ext4_run_li_request(elr);
+   sb_end_write(elr->lr_super);
+   mutex_lock(&eli->li_list_mtx);
}
+   up_read(&elr->lr_super->s_umount);
+   }
+   /* error, remove the lazy_init job */
+ 

[Devel] [PATCH] e2fsprogs: fixup resize issues (PSBM #49322)

2016-07-08 Thread Dmitry Monakhov
Backport mainstream commits:
c82815e resize2fs: disable the meta_bg feature if necessary
7a4352d e2fsck: fix file systems with an overly large s_first_meta_bg

TODO: update changelog
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 ...size2fs-disable-the-meta_bg-feature-if-ne.patch | 63 +++
 ...file-systems-with-an-overly-large-s_first.patch | 70 ++
 e2fsprogs.spec |  6 +-
 3 files changed, 138 insertions(+), 1 deletion(-)
 create mode 100644 
e2fsprogs-1.42.9-backport-resize2fs-disable-the-meta_bg-feature-if-ne.patch
 create mode 100644 
e2fsprogs-1.42.9-e2fsck-fix-file-systems-with-an-overly-large-s_first.patch

diff --git 
a/e2fsprogs-1.42.9-backport-resize2fs-disable-the-meta_bg-feature-if-ne.patch 
b/e2fsprogs-1.42.9-backport-resize2fs-disable-the-meta_bg-feature-if-ne.patch
new file mode 100644
index 000..e1ef136
--- /dev/null
+++ 
b/e2fsprogs-1.42.9-backport-resize2fs-disable-the-meta_bg-feature-if-ne.patch
@@ -0,0 +1,63 @@
+From 21045fee7b031db004aba818cc803e92937dbac0 Mon Sep 17 00:00:00 2001
+From: Theodore Ts'o <ty...@mit.edu>
+Date: Sat, 9 Aug 2014 12:33:11 -0400
+Subject: [PATCH 2/2] backport resize2fs: disable the meta_bg feature if
+ necessary From c82815e5097f130c8b926b3303a1e063a19dcdd0 Mon Sep 17 00:00:00
+ 2001 [PATCH] resize2fs: disable the meta_bg feature if necessary
+
+When shrinking a file system, if the number of block groups drops below
+the point where we started using the meta_bg layout, disable the
+meta_bg feature and set s_first_meta_bg to zero.  This is necessary to
+avoid creating an invalid/corrupted file system after the shrink.
+
+Addresses-Debian-Bug: #756922
+
+Signed-off-by: Theodore Ts'o <ty...@mit.edu>
+Reported-by: Marcin Wolcendorf <antymat+deb...@chelmska.waw.pl>
+Tested-by: Marcin Wolcendorf <antymat+deb...@chelmska.waw.pl>
+Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
+---
+ resize/resize2fs.c | 17 +
+ 1 file changed, 13 insertions(+), 4 deletions(-)
+
+diff --git a/resize/resize2fs.c b/resize/resize2fs.c
+index a8bbd7c..2dc16b8 100644
+--- a/resize/resize2fs.c
++++ b/resize/resize2fs.c
+@@ -462,6 +462,13 @@ retry:
+   fs->super->s_reserved_gdt_blocks = new;
+   }
+ 
++  if ((fs->super->s_feature_incompat & EXT2_FEATURE_INCOMPAT_META_BG) &&
++  (fs->super->s_first_meta_bg > fs->desc_blocks)) {
++  fs->super->s_feature_incompat &=
++  ~EXT2_FEATURE_INCOMPAT_META_BG;
++  fs->super->s_first_meta_bg = 0;
++  }
++
+   /*
+* If we are shrinking the number of block groups, we're done
+* and can exit now.
+@@ -947,13 +954,15 @@ static errcode_t blocks_to_move(ext2_resize_t rfs)
+   ext2fs_mark_block_bitmap2(rfs->reserve_blocks, blk);
+   }
+ 
+-  if (fs->super->s_feature_incompat & EXT2_FEATURE_INCOMPAT_META_BG) {
++  if (old_fs->super->s_feature_incompat & EXT2_FEATURE_INCOMPAT_META_BG)
+   old_blocks = old_fs->super->s_first_meta_bg;
+-  new_blocks = fs->super->s_first_meta_bg;
+-  } else {
++  else
+   old_blocks = old_fs->desc_blocks + 
old_fs->super->s_reserved_gdt_blocks;
++
++  if (fs->super->s_feature_incompat & EXT2_FEATURE_INCOMPAT_META_BG)
++  new_blocks = fs->super->s_first_meta_bg;
++  else
+   new_blocks = fs->desc_blocks + fs->super->s_reserved_gdt_blocks;
+-  }
+ 
+   if (old_blocks == new_blocks) {
+   retval = 0;
+-- 
+1.8.3.1
+
diff --git 
a/e2fsprogs-1.42.9-e2fsck-fix-file-systems-with-an-overly-large-s_first.patch 
b/e2fsprogs-1.42.9-e2fsck-fix-file-systems-with-an-overly-large-s_first.patch
new file mode 100644
index 000..cdf2524
--- /dev/null
+++ 
b/e2fsprogs-1.42.9-e2fsck-fix-file-systems-with-an-overly-large-s_first.patch
@@ -0,0 +1,70 @@
+From 26a16ea9c97460711f1cbaf9e0a7333b8b27884d Mon Sep 17 00:00:00 2001
+From: Theodore Ts'o <ty...@mit.edu>
+Date: Thu, 7 Jul 2016 19:17:49 +0300
+Subject: [PATCH 1/2] e2fsck: fix file systems with an overly large
+ s_first_meta_bg
+
+Signed-off-by: Theodore Ts'o <ty...@mit.edu>
+Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
+---
+ e2fsck/problem.c |  5 +
+ e2fsck/problem.h |  3 +++
+ e2fsck/super.c   | 12 
+ 3 files changed, 20 insertions(+)
+
+diff --git a/e2fsck/problem.c b/e2fsck/problem.c
+index 83584a0..431d7e7 100644
+--- a/e2fsck/problem.c
++++ b/e2fsck/problem.c
+@@ -438,6 +438,11 @@ static struct e2fsck_problem problem_table[] = {
+ N_("@S 64bit filesystems needs extents to access the whole disk.  "),
+ PROMPT_FIX, PR_PREEN_OK | PR_NO_OK},
+ 
++  /* The first_meta_bg is too big */
++  { PR_0_FIRST_META_BG_TOO_BIG,
++N_("First_meta_bg 

Re: [Devel] [PATCH rh7] ploop: io_direct: delay f_op->fsync() until index_update for reloc requests

2016-07-06 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> Commit 9f860e606 introduced an engine to delay fsync: doing
> fallocate(FALLOC_FL_CONVERT_UNWRITTEN) dio_post_submit marks
> io as PLOOP_IO_FSYNC_DELAYED to ensure that fsync happens
> later, when incoming FLUSH|FUA comes.
>
> That was deemed as important because (PSBM-47026):
>
>> This optimization becomes more important due to the fact that customers tend 
>> to use pcompact heavily => ploop images grow each day.
>
> Now, we can easily re-use the engine to delay fsync for reloc
> requests as well. As explained in the description of commit
> 5aa3fe09:
>
>> 1->read_data_from_old_post
>> 2->write_to_new_pos
>>   ->sumbit_alloc
>>  ->submit_pad
>>  ->post_submit->convert_unwritten
>> 3->update_index ->write_page with FLUSH|FUA
>> 4->nullify_old_pos
>>5->issue_flush
>
> by the time of step 3 the extent conversion is not yet stable because
> it belongs to an uncommitted transaction. But instead of doing fsync
> inside ->post_submit, we can fsync later, as the very first step
> of write_page for index_update.
NAK from me. What is advantage of this patch?
Does it makes code more optimal? No
Does it makes main ploop more asynchronous? No.

If you want to make optimization then it is reasonable to
queue preq with PLOOP_IO_FSYNC_DELAYED to top_io->fsync_queue
before processing PLOOP_E_DATA_WBI  state for  preq with FUA
So sequence will looks like follows:
->sumbit_alloc
  ->submit_pad
  ->post_submit->convert_unwritten-> tag PLOOP_IO_FSYNC_DELAYED
->ploop_req_state_process
  case PLOOP_E_DATA_WBI:
  if (preq->state & PLOOP_IO_FSYNC_DELAYED_FL) {
  preq->state &= ~PLOOP_IO_FSYNC_DELAYED_FL
  list_add_tail(&preq->list, &top_io->fsync_queue)
  return;
   }
##Let fsync_thread do its work
->ploop_req_state_process
   case PLOOP_E_DATA_WBI:
   update_index->write_page with FUA (FLUSH is not required because we already
done fsync)
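
A rough sketch of that state-machine hook (assuming a per-request
PLOOP_IO_FSYNC_DELAYED_FL flag and the existing fsync_queue/fsync_thread
machinery; the wakeup call is an assumption, this is illustration only):

	case PLOOP_E_DATA_WBI:
		/* Data hit the image; the BAT update comes next.  If the
		 * extent conversion still owes an fsync, park the request
		 * on the fsync queue and let fsync_thread do the fsync. */
		if (preq->state & PLOOP_IO_FSYNC_DELAYED_FL) {
			preq->state &= ~PLOOP_IO_FSYNC_DELAYED_FL;
			list_add_tail(&preq->list, &top_io->fsync_queue);
			wake_up_interruptible(&top_io->fsync_waitq); /* assumed */
			return;	/* re-enters this state after the fsync */
		}
		/* fsync already done: FUA alone suffices for the index write */
		ploop_index_update(preq);
		break;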

>
> https://jira.sw.ru/browse/PSBM-47026
>
> Signed-off-by: Maxim Patlasov 
> ---
>  drivers/block/ploop/dev.c   |4 ++--
>  drivers/block/ploop/io_direct.c |   25 -
>  drivers/block/ploop/io_kaio.c   |3 ++-
>  drivers/block/ploop/map.c   |   17 -
>  include/linux/ploop/ploop.h |3 ++-
>  5 files changed, 42 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
> index e5f010b..40768b6 100644
> --- a/drivers/block/ploop/dev.c
> +++ b/drivers/block/ploop/dev.c
> @@ -4097,7 +4097,7 @@ static void ploop_relocate(struct ploop_device * plo)
>   preq->bl.tail = preq->bl.head = NULL;
>   preq->req_cluster = 0;
>   preq->req_size = 0;
> - preq->req_rw = WRITE_SYNC|REQ_FUA;
> + preq->req_rw = WRITE_SYNC;
>   preq->eng_state = PLOOP_E_ENTRY;
>   preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_A);
>   preq->error = 0;
> @@ -4401,7 +4401,7 @@ static void ploop_relocblks_process(struct ploop_device 
> *plo)
>   preq->bl.tail = preq->bl.head = NULL;
>   preq->req_cluster = ~0U; /* uninitialized */
>   preq->req_size = 0;
> - preq->req_rw = WRITE_SYNC|REQ_FUA;
> + preq->req_rw = WRITE_SYNC;
>   preq->eng_state = PLOOP_E_ENTRY;
>   preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_S);
>   preq->error = 0;
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index 1086850..0a5fb15 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -1494,13 +1494,36 @@ dio_read_page(struct ploop_io * io, struct 
> ploop_request * preq,
>  
>  static void
>  dio_write_page(struct ploop_io * io, struct ploop_request * preq,
> -struct page * page, sector_t sec, unsigned long rw)
> +struct page * page, sector_t sec, unsigned long rw,
> +int do_fsync_if_delayed)
>  {
>   if (!(io->files.file->f_mode & FMODE_WRITE)) {
>   PLOOP_FAIL_REQUEST(preq, -EBADF);
>   return;
>   }
>  
> + if (do_fsync_if_delayed &&
> + test_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state)) {
> + struct ploop_device * plo = io->plo;
> + u64 io_count;
> + int err;
> +
> + spin_lock_irq(&plo->lock);
> + io_count = io->io_count;
> + spin_unlock_irq(&plo->lock);
> +
> + err = io->ops->sync(io);
> + if (err) {
> + PLOOP_FAIL_REQUEST(preq, -EBADF);
> + return;
> + }
> +
> + spin_lock_irq(&plo->lock);
> + if (io_count == io->io_count && !(io_count & 1))
> + clear_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
> + spin_unlock_irq(&plo->lock);
> + }
> +
>   dio_io_page(io, rw | WRITE | REQ_SYNC, preq, page, sec);
>  }
>  
> diff --git a/drivers/block/ploop/io_kaio.c 

[Devel] [RH7 PATCH] ploop: reloc vs extent_conversion race fix

2016-06-30 Thread Dmitry Monakhov
We have fixed most relocation bugs while working on
https://jira.sw.ru/browse/PSBM-47107

Currently reloc_a looks as follows:

 1->read_data_from_old_post
 2->write_to_new_pos
->sumbit_alloc
  ->submit_pad
  ->post_submit->convert_unwritten
 3->update_index ->write_page with FLUSH|FUA
 4->nullify_old_pos
 5->issue_flush

But at step 3 the extent conversion is not yet stable because it belongs to an
uncommitted transaction. We MUST call ->fsync inside ->post_submit as we do for
REQ_FUA requests. Let's tag relocation requests as FUA from the very beginning
in order to enforce sync semantics.

https://jira.sw.ru/browse/PSBM-49143
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 drivers/block/ploop/dev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 40768b6..e5f010b 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4097,7 +4097,7 @@ static void ploop_relocate(struct ploop_device * plo)
preq->bl.tail = preq->bl.head = NULL;
preq->req_cluster = 0;
preq->req_size = 0;
-   preq->req_rw = WRITE_SYNC;
+   preq->req_rw = WRITE_SYNC|REQ_FUA;
preq->eng_state = PLOOP_E_ENTRY;
preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_A);
preq->error = 0;
@@ -4401,7 +4401,7 @@ static void ploop_relocblks_process(struct ploop_device 
*plo)
preq->bl.tail = preq->bl.head = NULL;
preq->req_cluster = ~0U; /* uninitialized */
preq->req_size = 0;
-   preq->req_rw = WRITE_SYNC;
+   preq->req_rw = WRITE_SYNC|REQ_FUA;
preq->eng_state = PLOOP_E_ENTRY;
preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_S);
preq->error = 0;
-- 
1.8.3.1

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7 9/9] ploop: fixup barrier handling during relocation

2016-06-24 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> Rebase Dima's patch on top of rh7-3.10.0-327.18.2.vz7.14.19,
> but without help of delayed_flush engine:
>
> To ensure consistency on crash/power outage/hard reboot
> events, ploop must implement the following barrier logic
> for RELOC_A|S requests:
>
> 1) After we store data to new place, but before updating
> BAT on disk, we have to FLUSH everything (in fact, flushing
> those data would be enough, but it is simpler to flush
> everything).
>
> 2) We should not proceed handling RELOC_A|S until we
> 100% sure new BAT value went to disk platters. So far as
> new BAT is only one page, it's OK to mark corresponding
> bio with FUA flag for io_direct case. For io_kaio, not
> having FUA api, we have to post_fsync BAT update.
>
> PLOOP_REQ_FORCE_FLUSH/PLOOP_REQ_FORCE_FUA introduced
> long time ago probably were intended to ensure the
> logic above, but they actually didn't.
>
> The patch removes PLOOP_REQ_FORCE_FLUSH/PLOOP_REQ_FORCE_FUA,
> and implements barriers in a straightforward and simple way:
> check for RELOC_A|S explicitly and make FLUSH/FUA where
> needed.
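
For reference, the kaio side of (2) boils down to the check below (a sketch
distilled from the kaio_complete_io_state() hunk later in this mail):

	/* Sketch: kaio cannot issue FUA, so when a BAT update demands it
	 * (and delaying is impossible), finish the write with an fsync. */
	int need_fua = !!(preq->req_rw & REQ_FUA);

	if (need_fua && !ploop_req_delay_fua_possible(preq)) {
		post_fsync = 1;			/* fsync after the aio write */
		preq->req_rw &= ~REQ_FUA;	/* the fsync subsumes FUA */
	}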
>
> Signed-off-by: Maxim Patlasov 
> ---
>  drivers/block/ploop/dev.c   |4 ++--
>  drivers/block/ploop/io_direct.c |7 ---
>  drivers/block/ploop/io_kaio.c   |8 +---
>  drivers/block/ploop/map.c   |   22 ++
>  include/linux/ploop/ploop.h |1 -
>  5 files changed, 17 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
> index 2b60dfa..40768b6 100644
> --- a/drivers/block/ploop/dev.c
> +++ b/drivers/block/ploop/dev.c
> @@ -2610,8 +2610,8 @@ restart:
>   top_delta = ploop_top_delta(plo);
>   sbl.head = sbl.tail = preq->aux_bio;
>  
> - /* Relocated data write required sync before BAT updatee */
> - set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
> + /* Relocated data write required sync before BAT update
> +  * this will happen inside index_update */
>  
> if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
>   preq->eng_state = PLOOP_E_DATA_WBI;
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index c4d0f63..266f041 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -89,15 +89,11 @@ dio_submit(struct ploop_io *io, struct ploop_request * 
> preq,
>   sector_t sec, nsec;
>   int err;
>   struct bio_list_walk bw;
> - int postfua = 0;
>   int write = !!(rw & REQ_WRITE);
>   int delayed_fua = 0;
>  
>   trace_submit(preq);
>  
> - if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state))
> - postfua = 1;
> -
>   if ((rw & REQ_FUA) && ploop_req_delay_fua_possible(preq)) {
>   /* Mark req that delayed flush required */
>   preq->req_rw |= (REQ_FLUSH | REQ_FUA);
> @@ -233,9 +229,6 @@ flush_bio:
>   b->bi_private = preq;
>   b->bi_end_io = dio_endio_async;
>  
> - if (unlikely(postfua && !bl.head))
> - rw2 |= REQ_FUA;
> -
>   ploop_acc_ff_out(preq->plo, rw2 | b->bi_rw);
>   submit_bio(rw2, b);
>   }
> diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
> index ed550f4..85863df 100644
> --- a/drivers/block/ploop/io_kaio.c
> +++ b/drivers/block/ploop/io_kaio.c
> @@ -69,6 +69,8 @@ static void kaio_complete_io_state(struct ploop_request * 
> preq)
>   unsigned long flags;
>   int post_fsync = 0;
>   int need_fua = !!(preq->req_rw & REQ_FUA);
> + unsigned long state = READ_ONCE(preq->state);
> + int reloc = !!(state & (PLOOP_REQ_RELOC_A_FL|PLOOP_REQ_RELOC_S_FL));
>  
>   if (preq->error || !(preq->req_rw & REQ_FUA) ||
>   preq->eng_state == PLOOP_E_INDEX_READ ||
> @@ -80,9 +82,9 @@ static void kaio_complete_io_state(struct ploop_request * 
> preq)
>   }
>  
>   /* Convert requested fua to fsync */
> - if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state) ||
> - test_and_clear_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state) ||
> - (need_fua && !ploop_req_delay_fua_possible(preq))) {
This is the change I dislike the most. io_XXX should not care whether it is
a reloc or not. The caller should decide whether PREFLUSH/POSTFLUSH should
happen before the preq completes. So IMHO this is a crutch, but a correct one.

> + if (test_and_clear_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state) ||
> + (need_fua && !ploop_req_delay_fua_possible(preq)) ||
> + (reloc && ploop_req_delay_fua_possible(preq))) {
>   post_fsync = 1;
>   preq->req_rw &= ~REQ_FUA;
>   }
> diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
> index 915a216..1883674 100644
> --- a/drivers/block/ploop/map.c
> +++ b/drivers/block/ploop/map.c
> @@ -909,6 +909,7 @@ void ploop_index_update(struct ploop_request * preq)
>   struct page * 

Re: [Devel] [PATCH rh7 6/9] ploop: remove preflush from dio_submit

2016-06-24 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> After commit c2247f3745 fixing barriers for ordinary
> requests and previous patch fixing delay_fua,
> that legacy code in dio_submit processing
> (preq->req_rw & REQ_FLUSH) by setting REQ_FLUSH in
> the first outgoing bio must die: it is incorrect
> anyway (we don't wait for completion of the first
> bio before sending others).
Wow. This is so true. BTW: a reasonable way to handle FLUSH
is to queue such a preq to a preflush_queue, similar to fsync_queue in the
fsync_thread infrastructure.

>
> Signed-off-by: Maxim Patlasov 
> ---
>  drivers/block/ploop/io_direct.c |7 ---
>  1 file changed, 7 deletions(-)
>
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index 1ea2008..ee3cd5c 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -89,15 +89,12 @@ dio_submit(struct ploop_io *io, struct ploop_request * 
> preq,
>   sector_t sec, nsec;
>   int err;
>   struct bio_list_walk bw;
> - int preflush;
>   int postfua = 0;
>   int write = !!(rw & REQ_WRITE);
>   int delayed_fua = 0;
>  
>   trace_submit(preq);
>  
> - preflush = !!(rw & REQ_FLUSH);
> -
> - if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state))
>   postfua = 1;
>  
> @@ -236,10 +233,6 @@ flush_bio:
>   b->bi_private = preq;
>   b->bi_end_io = dio_endio_async;
>  
> - if (unlikely(preflush)) {
> - rw2 |= REQ_FLUSH;
> - preflush = 0;
> - }
>   if (unlikely(postfua && !bl.head))
>   rw2 |= REQ_FUA;
>  


signature.asc
Description: PGP signature
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7 0/9] ploop: fix barriers for reloc requests

2016-06-24 Thread Dmitry Monakhov
Maxim Patlasov <mpatla...@virtuozzo.com> writes:

> The series firstly fixes a few issues in handling
> barriers in ordinary requests (what was overlooked
> in previous patch -- see commit c2247f3745).
>
> Then there are a few minor rework w/o functional
> changes that alleviate main patches (last two ones).
>
> And finally the series fixes handling barriers
> for RELOC_A|S requests.
>
> The main complexity comes from the following bug:
> for direct_io it's not enough to send FUA to flush
> all nullified cluster block. See details in
> "fix barriers for PLOOP_E_RELOC_NULLIFY" patch.
>
Ok. Max, I can not fully agree with the way you organize the fix for the RELOC
bug (especially for kaio), but it does all the major things:
1) Removes the _FORCE_XXX crap
2) Cleans up the barrier stuff
3) Fixes the RELOC_XXX code flow.

Let's keep style things aside for now, and commit that fix.
So ACK for the whole series. And let's optimize/fix stylistic stuff later.
> ---
>
> Dmitry Monakhov (3):
>   ploop: deadcode cleanup
>   ploop: minor rework of ->write_page() io method
>   ploop: generalize issue_flush
>
> Maxim Patlasov (6):
>   ploop: minor rework of ploop_req_delay_fua_possible
>   ploop: resurrect delayed_fua for io_kaio
>   ploop: resurrect delay_fua for io_direct
>   ploop: remove preflush from dio_submit
>   ploop: fix barriers for PLOOP_E_RELOC_NULLIFY
>   ploop: fixup barrier handling during relocation
>
>
>  drivers/block/ploop/dev.c   |   16 ++--
>  drivers/block/ploop/io_direct.c |   48 -
>  drivers/block/ploop/io_kaio.c   |   26 ++--
>  drivers/block/ploop/map.c   |   50 
> ---
>  include/linux/ploop/ploop.h |   20 +++-
>  5 files changed, 71 insertions(+), 89 deletions(-)
>
> --
> Signature


signature.asc
Description: PGP signature
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [RH7 PATCH 4/6] ploop: io_kaio support PLOOP_REQ_DEL_FLUSH

2016-06-23 Thread Dmitry Monakhov
Currently no one tags preqs with this bit, but let it be here for symmetry.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 drivers/block/ploop/io_kaio.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
index bee2cee..5341fd5 100644
--- a/drivers/block/ploop/io_kaio.c
+++ b/drivers/block/ploop/io_kaio.c
@@ -73,6 +73,7 @@ static void kaio_complete_io_state(struct ploop_request * 
preq)
 
/* Convert requested fua to fsync */
	if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state) ||
+   test_and_clear_bit(PLOOP_REQ_DEL_FLUSH, &preq->state) ||
	test_and_clear_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state))
post_fsync = 1;
 
-- 
1.8.3.1

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [RH7 PATCH 3/6] ploop: add delayed flush support

2016-06-23 Thread Dmitry Monakhov
dio_submit and dio_submit_pad may produce several bios. This makes
processing of REQ_FUA complicated because, in order to preserve correctness,
we have to tag each bio with the FUA flag, which is suboptimal.
Obviously there is room for optimization here: once all bios have been
acknowledged by the lower layer, we may issue an empty barrier aka
->issue_flush(). The post_submit callback is the place where all bios have
already completed.

b1:FUA, b2:FUA, b3:FUA  =>  b1,b2,b3,wait_for_bios,bX:FLUSH

This allows us to remove all the REQ_FORCE_{FLUSH,FUA} crap and unify FUA
handling across both engines.
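
In code the idea reduces to the sketch below (condensed from the dio_submit
and dio_post_submit hunks of this patch; PLOOP_REQ_DEL_FLUSH marks a preq
whose FUA was stripped from its data bios):

	/* submit path: send plain WRITE bios, remember the flush debt */
	if (preq->req_rw & REQ_FUA) {
		set_bit(PLOOP_REQ_DEL_FLUSH, &preq->state);
		ploop_add_post_submit(io, preq);
	}

	/* post-submit path: all bios acked, pay the debt with one barrier */
	static int dio_post_submit(struct ploop_io *io, struct ploop_request *preq)
	{
		if (test_and_clear_bit(PLOOP_REQ_DEL_FLUSH, &preq->state)) {
			io->ops->issue_flush(io, preq);	/* empty FLUSH bio */
			return 1;	/* completion continues in the flush */
		}
		return 0;
	}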

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 drivers/block/ploop/io_direct.c | 48 +
 include/linux/ploop/ploop.h |  2 ++
 2 files changed, 22 insertions(+), 28 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 195d318..752a9c3e 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -82,31 +82,13 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
sector_t sec, nsec;
int err;
struct bio_list_walk bw;
-   int preflush;
-   int postfua = 0;
+   int preflush = !!(rw & REQ_FLUSH);
+   int postflush = !!(rw & REQ_FUA);
int write = !!(rw & REQ_WRITE);
 
trace_submit(preq);
 
-   preflush = !!(rw & REQ_FLUSH);
-
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state))
-   preflush = 1;
-
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state))
-   postfua = 1;
-
-   if (!postfua && ploop_req_delay_fua_possible(rw, preq)) {
-
-   /* Mark req that delayed flush required */
-   set_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
-   } else if (rw & REQ_FUA) {
-   postfua = 1;
-   }
-
rw &= ~(REQ_FLUSH | REQ_FUA);
-
-
	bio_list_init(&bl);
 
if (iblk == PLOOP_ZERO_INDEX)
@@ -237,13 +219,14 @@ flush_bio:
rw2 |= REQ_FLUSH;
preflush = 0;
}
-   if (unlikely(postfua && !bl.head))
-   rw2 |= REQ_FUA;
-
ploop_acc_ff_out(preq->plo, rw2 | b->bi_rw);
submit_bio(rw2, b);
}
-
+   /* TODO: minor optimization is possible for single bio case */
+   if (postflush) {
+   set_bit(PLOOP_REQ_DEL_FLUSH, &preq->state);
+   ploop_add_post_submit(io, preq);
+   }
ploop_complete_io_request(preq);
return;
 
@@ -523,9 +506,10 @@ dio_convert_extent(struct ploop_io *io, struct 
ploop_request * preq)
  (loff_t)sec << 9, clu_siz);
 
/* highly unlikely case: FUA coming to a block not provisioned yet */
-   if (!err && force_sync)
+   if (!err && force_sync) {
+   clear_bit(PLOOP_REQ_DEL_FLUSH, &preq->state);
err = io->ops->sync(io);
-
+   }
if (!force_sync) {
	spin_lock_irq(&plo->lock);
io->io_count++;
@@ -546,7 +530,12 @@ dio_post_submit(struct ploop_io *io, struct ploop_request 
* preq)
	if (test_and_clear_bit(PLOOP_REQ_DEL_CONV, &preq->state))
dio_convert_extent(io, preq);
 
+   if (test_and_clear_bit(PLOOP_REQ_DEL_FLUSH, &preq->state)) {
+   io->ops->issue_flush(io, preq);
+   return 1;
+   }
return 0;
+
 }
 
 /* Submit the whole cluster. If preq contains only partial data
@@ -562,7 +551,6 @@ dio_submit_pad(struct ploop_io *io, struct ploop_request * 
preq,
sector_t sec, end_sec, nsec, start, end;
struct bio_list_walk bw;
int err;
-
	bio_list_init(&bl);
 
/* sec..end_sec is the range which we are going to write */
@@ -694,7 +682,11 @@ flush_bio:
ploop_acc_ff_out(preq->plo, rw | b->bi_rw);
submit_bio(rw, b);
}
-
+   /* TODO: minor optimization is possible for single bio case */
+   if (preq->req_rw &  REQ_FUA) {
+   set_bit(PLOOP_REQ_DEL_FLUSH, &preq->state);
+   ploop_add_post_submit(io, preq);
+   }
ploop_complete_io_request(preq);
return;
 
diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h
index 4c52a40..5076f16 100644
--- a/include/linux/ploop/ploop.h
+++ b/include/linux/ploop/ploop.h
@@ -472,6 +472,7 @@ enum
PLOOP_REQ_PUSH_BACKUP, /* preq was ACKed by userspace push_backup */
PLOOP_REQ_ALLOW_READS, /* READs are allowed for given req_cluster */
PLOOP_REQ_DEL_CONV,/* post_submit: conversion required */
+   PLOOP_REQ_DEL_FLUSH,   /* post_submit: REQ_FLUSH required */
PLOOP_REQ_FSYNC_DONE,  /* fsync_thread() performed f_op->fsync() */
 };
 
@@ -482,6 +483,7 @@ enum
 #define PLOOP_REQ_ZERO_FL (1 << PLOOP_REQ_ZERO)
 #define PLOOP_REQ_POST_SUBMIT_FL

[Devel] [RH7 PATCH 5/6] ploop: fixup barrier handling during relocation

2016-06-23 Thread Dmitry Monakhov
barrier code is broken in many ways:
Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly.
But a request can also go through ->dio_submit_alloc()->dio_submit_pad and
write_page (for indexes).
So in case of grow_dev we have the following sequence:

E_RELOC_DATA_READ:
 ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
  ->delta->allocate
 ->io->submit_alloc: dio_submit_alloc
   ->dio_submit_pad
E_DATA_WBI : data written, time to update index
  ->delta->allocate_complete:ploop_index_update
->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
->write_page
->ploop_map_wb_complete
  ->ploop_wb_complete_post_process
->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
E_RELOC_NULLIFY:

   ->submit()

Once we have delayed_flush engine it is easy to implement correct scheme for
both engines.

E_RELOC_DATA_READ ->submit_allloc => wait->post_submit->issue_flush
E_DATA_WBI ->ploop_index_update with FUA
E_RELOC_NULLIFY ->submit: => wait->post_submit->issue_flush

This makes reloc sequence optimal:
RELOC_S: R1, W2,WAIT,FLUSH, WBI:FUA
RELOC_A: R1, W2,WAIT,FLUSH, WBI:FUA, W1:NULLIFY,WAIT, FLUSH
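
For the io_direct engine the "WAIT, then FLUSH" step is just an empty barrier
bio, roughly as below (simplified from dio_issue_flush(); the bdev field name
is an assumption here):

	/* Sketch: issue an empty FLUSH bio on behalf of preq once all of
	 * its data bios have completed. */
	static void dio_issue_flush(struct ploop_io *io, struct ploop_request *preq)
	{
		struct bio *bio = bio_alloc(GFP_NOIO, 0);	/* no payload */

		bio->bi_bdev    = io->files.bdev;		/* assumed field */
		bio->bi_private = preq;
		bio->bi_end_io  = dio_endio_async;

		atomic_inc(&preq->io_count);
		submit_bio(WRITE_FLUSH, bio);			/* empty barrier */
		ploop_complete_io_request(preq);
	}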

https://jira.sw.ru/browse/PSBM-47107
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 drivers/block/ploop/dev.c |  2 +-
 drivers/block/ploop/io_kaio.c |  3 +--
 drivers/block/ploop/map.c | 28 ++--
 3 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 95e3067..090cd2d 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -2533,7 +2533,7 @@ restart:
sbl.head = sbl.tail = preq->aux_bio;
 
/* Relocated data write required sync before BAT updatee */
-   set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
+   preq->req_rw |= REQ_FUA;
 
if (test_bit(PLOOP_REQ_RELOC_S, >state)) {
preq->eng_state = PLOOP_E_DATA_WBI;
diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
index 5341fd5..5217ab4 100644
--- a/drivers/block/ploop/io_kaio.c
+++ b/drivers/block/ploop/io_kaio.c
@@ -72,8 +72,7 @@ static void kaio_complete_io_state(struct ploop_request * 
preq)
}
 
/* Convert requested fua to fsync */
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state) ||
-   test_and_clear_bit(PLOOP_REQ_DEL_FLUSH, &preq->state) ||
+   if (test_and_clear_bit(PLOOP_REQ_DEL_FLUSH, &preq->state) ||
	test_and_clear_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state))
post_fsync = 1;
 
diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
index 3a6365d..ef351fb 100644
--- a/drivers/block/ploop/map.c
+++ b/drivers/block/ploop/map.c
@@ -901,6 +901,8 @@ void ploop_index_update(struct ploop_request * preq)
int old_level;
struct page * page;
sector_t sec;
+   int fua = !!(preq->req_rw & REQ_FUA);
+   unsigned long state = READ_ONCE(preq->state);
 
/* No way back, we are going to initiate index write. */
 
@@ -954,12 +956,11 @@ void ploop_index_update(struct ploop_request * preq)
plo->st.map_single_writes++;
top_delta->ops->map_index(top_delta, m->mn_start, );
/* Relocate requires consistent writes, mark such reqs appropriately */
-   if (test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
-   test_bit(PLOOP_REQ_RELOC_S, &preq->state))
-   set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
-
-   top_delta->io.ops->write_page(&top_delta->io, preq, page, sec,
- !!(preq->req_rw & REQ_FUA));
+   if (state & (PLOOP_REQ_RELOC_A_FL | PLOOP_REQ_RELOC_S_FL)) {
+   WARN_ON(state & PLOOP_REQ_DEL_FLUSH_FL);
+   fua = 1;
+   }
+   top_delta->io.ops->write_page(&top_delta->io, preq, page, sec, fua);
put_page(page);
return;
 
@@ -1063,7 +1064,7 @@ static void map_wb_complete_post_process(struct ploop_map 
*map,
 * (see dio_submit()). So fsync of EXT4 image doesnt help us.
 * We need to force sync of nullified blocks.
 */
-   set_bit(PLOOP_REQ_FORCE_FUA, >state);
+   preq->req_rw |= REQ_FUA;
	top_delta->io.ops->submit(&top_delta->io, preq, preq->req_rw,
	  &sbl, preq->iblock, 1<<plo->cluster_log);
 }
@@ -1153,8 +1154,10 @@ static void map_wb_complete(struct map_node * m, int err)
 
	list_for_each_safe(cursor, tmp, &m->io_queue) {
struct ploop_request * preq;
+   unsigned long state;
 
preq = list_entry(cursor, struct ploop_request, list);
+   state = READ_ONCE(preq->state);
 
  

[Devel] [RH7 PATCH 2/6] ploop: generalize issue_flush

2016-06-23 Thread Dmitry Monakhov
Currently io->ops->issue_flush is called from only a single place,
but it has the potential to be generic. The patch does not change actual
logic, but allows calling ->issue_flush from various places.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 drivers/block/ploop/dev.c   | 1 +
 drivers/block/ploop/io_direct.c | 1 -
 drivers/block/ploop/io_kaio.c   | 1 -
 3 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index e8b0304..95e3067 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -1989,6 +1989,7 @@ ploop_entry_request(struct ploop_request * preq)
if (preq->req_size == 0) {
if (preq->req_rw & REQ_FLUSH &&
	    !test_bit(PLOOP_REQ_FSYNC_DONE, &preq->state)) {
+   preq->eng_state = PLOOP_E_COMPLETE;
if (top_io->ops->issue_flush) {
top_io->ops->issue_flush(top_io, preq);
return;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index ec905b4..195d318 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -1836,7 +1836,6 @@ static void dio_issue_flush(struct ploop_io * io, struct 
ploop_request *preq)
bio->bi_private = preq;
 
	atomic_inc(&preq->io_count);
-   preq->eng_state = PLOOP_E_COMPLETE;
ploop_acc_ff_out(io->plo, preq->req_rw | bio->bi_rw);
submit_bio(preq->req_rw, bio);
ploop_complete_io_request(preq);
diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
index de26319..bee2cee 100644
--- a/drivers/block/ploop/io_kaio.c
+++ b/drivers/block/ploop/io_kaio.c
@@ -951,7 +951,6 @@ static void kaio_issue_flush(struct ploop_io * io, struct 
ploop_request *preq)
 {
struct ploop_delta *delta = container_of(io, struct ploop_delta, io);
 
-   preq->eng_state = PLOOP_E_COMPLETE;
preq->req_rw &= ~REQ_FLUSH;
 
	spin_lock_irq(&io->plo->lock);
-- 
1.8.3.1

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [RH7 PATCH 1/6] ploop: generalize post_submit stage

2016-06-23 Thread Dmitry Monakhov
Currently post_submit() is used only for convert_unwritten_extents.
But post_submit() is a good transition point, where all submitted
data has been completed by the lower layer and a new state is about to be
processed. It is an ideal point to perform transition actions, for example
(a sketch follows below):
 io_direct: convert unwritten extents
 io_direct: issue an empty barrier bio in order to simulate a postflush
 io_direct,io_kaio: queue to the fsync queue
 Etc.

This patch does not change anything by itself, but prepares post_submit for
more logic which will be added later.
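
The caller side in ploop_req_state_process() then becomes a generic dispatch
point, roughly (a sketch of the hunk below; the "goto out" path is taken when
post_submit queued follow-up work that will complete the request):

	/* Sketch: hand the request to the engine's post_submit hook; if it
	 * consumed the request (e.g. queued an empty flush), stop here. */
	if (test_and_clear_bit(PLOOP_REQ_POST_SUBMIT, &preq->state)) {
		struct ploop_io *io = preq->eng_io;

		preq->eng_io = NULL;
		if (io->ops->post_submit(io, preq))
			goto out;	/* follow-up work completes the preq */
	}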

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 drivers/block/ploop/dev.c   | 10 ++
 drivers/block/ploop/io_direct.c | 15 ---
 include/linux/ploop/ploop.h | 12 +++-
 3 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index e405232..e8b0304 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -2351,10 +2351,12 @@ static void ploop_req_state_process(struct 
ploop_request * preq)
preq->prealloc_size = 0; /* only for sanity */
}
 
-   if (test_bit(PLOOP_REQ_POST_SUBMIT, &preq->state)) {
-   preq->eng_io->ops->post_submit(preq->eng_io, preq);
-   clear_bit(PLOOP_REQ_POST_SUBMIT, >state);
+   if (test_and_clear_bit(PLOOP_REQ_POST_SUBMIT, &preq->state)) {
+   struct ploop_io *io = preq->eng_io;
+
preq->eng_io = NULL;
+   if (io->ops->post_submit(io, preq))
+   goto out;
}
 
 restart:
@@ -2633,7 +2635,7 @@ restart:
default:
BUG();
}
-
+out:
if (release_ioc) {
struct io_context * ioc = current->io_context;
current->io_context = saved_ioc;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index f1812fe..ec905b4 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -416,8 +416,8 @@ try_again:
}
 
preq->iblock = iblk;
-   preq->eng_io = io;
-   set_bit(PLOOP_REQ_POST_SUBMIT, &preq->state);
+   set_bit(PLOOP_REQ_DEL_CONV, &preq->state);
+   ploop_add_post_submit(io, preq);
dio_submit_pad(io, preq, sbl, size, em);
err = 0;
goto end_write;
@@ -501,7 +501,7 @@ end_write:
 }
 
 static void
-dio_post_submit(struct ploop_io *io, struct ploop_request * preq)
+dio_convert_extent(struct ploop_io *io, struct ploop_request * preq)
 {
struct ploop_device *plo = preq->plo;
sector_t sec = (sector_t)preq->iblock << preq->plo->cluster_log;
@@ -540,6 +540,15 @@ dio_post_submit(struct ploop_io *io, struct ploop_request 
* preq)
}
 }
 
+static int
+dio_post_submit(struct ploop_io *io, struct ploop_request * preq)
+{
+   if (test_and_clear_bit(PLOOP_REQ_DEL_CONV, &preq->state))
+   dio_convert_extent(io, preq);
+
+   return 0;
+}
+
 /* Submit the whole cluster. If preq contains only partial data
  * within the cluster, pad the rest of cluster with zeros.
  */
diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h
index 0fba25e..4c52a40 100644
--- a/include/linux/ploop/ploop.h
+++ b/include/linux/ploop/ploop.h
@@ -148,7 +148,7 @@ struct ploop_io_ops
  struct bio_list *sbl, iblock_t iblk, unsigned int 
size);
void(*submit_alloc)(struct ploop_io *, struct ploop_request *,
struct bio_list *sbl, unsigned int size);
-   void(*post_submit)(struct ploop_io *, struct ploop_request *);
+   int (*post_submit)(struct ploop_io *, struct ploop_request *);
 
int (*disable_merge)(struct ploop_io * io, sector_t isector, 
unsigned int len);
int (*fastmap)(struct ploop_io * io, struct bio *orig_bio,
@@ -471,6 +471,7 @@ enum
PLOOP_REQ_POST_SUBMIT, /* preq needs post_submit processing */
PLOOP_REQ_PUSH_BACKUP, /* preq was ACKed by userspace push_backup */
PLOOP_REQ_ALLOW_READS, /* READs are allowed for given req_cluster */
+   PLOOP_REQ_DEL_CONV,/* post_submit: conversion required */
PLOOP_REQ_FSYNC_DONE,  /* fsync_thread() performed f_op->fsync() */
 };
 
@@ -479,6 +480,8 @@ enum
 #define PLOOP_REQ_RELOC_S_FL (1 << PLOOP_REQ_RELOC_S)
 #define PLOOP_REQ_DISCARD_FL (1 << PLOOP_REQ_DISCARD)
 #define PLOOP_REQ_ZERO_FL (1 << PLOOP_REQ_ZERO)
+#define PLOOP_REQ_POST_SUBMIT_FL (1 << PLOOP_REQ_POST_SUBMIT)
+#define PLOOP_REQ_DEL_CONV_FL (1 << PLOOP_REQ_DEL_CONV)
 
 enum
 {
@@ -767,6 +770,13 @@ static inline void ploop_entry_qlen_dec(struct 
ploop_request * preq)
preq->plo->read_sync_reqs--;
}
 }
+static inline
+void ploop_add_post_submit(struct ploop_io *io, struct ploop_request * preq)
+{
+   BUG_ON(preq->

[Devel] [RH7 PATCH 6/6] patch ploop_state_debugging.patch

2016-06-23 Thread Dmitry Monakhov
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 drivers/block/ploop/dev.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 090cd2d..9bf8592 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -1232,6 +1232,12 @@ static void ploop_complete_request(struct ploop_request 
* preq)
}
preq->bl.tail = NULL;
 
+   if (!preq->error) {
+   unsigned long state = READ_ONCE(preq->state);
+   WARN_ON(state & (PLOOP_REQ_POST_SUBMIT_FL|
+PLOOP_REQ_DEL_CONV_FL |
+PLOOP_REQ_DEL_FLUSH_FL ));
+   }
	if (test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
	    test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
if (preq->error)
-- 
1.8.3.1

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [RH7 PATCH 0/6] RFC ploop: Barrier fix patch set v3

2016-06-23 Thread Dmitry Monakhov

Here is the 3rd version of the barrier fix patches, based on recent fixes.
This is an RFC version. I do not have time to test it before tomorrow;
Max, please review it briefly and tell me your opinion about the general idea.
The basic idea is to use the post_submit state to issue an empty FLUSH barrier
in order to complete FUA requests. This allows us to unify all engines (direct
and kaio).

This makes FUA processing optimal:
SUBMIT:FUA   :W1{b1,b2,b3,b4..},WAIT,post_submit:FLUSH
SUBMIT_ALLOC:FUA :W1{b1,b2,b3,b4..},WAIT,post_submit:FLUSH, WBI:FUA
RELOC_S: R1, W2,WAIT,post_submit:FLUSH, WBI:FUA
RELOC_A: R1, W2,WAIT,post_submit:FLUSH, WBI:FUA, 
W1:NULLIFY,WAIT,post_submit:FLUSH


#POST_SUBMIT CHANGES:
ploop-generalize-post_submit-stage.patch
ploop-generalize-issue_flush.patch
ploop-add-delayed-flush-support.patch
ploop-io_kaio-support-PLOOP_REQ_DEL_FLUSH.patch
#RELOC_XXX FIXES
ploop-fixup-barrier-handling-during-relocation.patch
patch-ploop_state_debugging.patch.patch


___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] ploop: fix barriers for ordinary requests

2016-06-22 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> The way how io_direct.c handles FLUSH|FUA: b1:FLUSH,b2,b3,b4,b5:FLUSH|FUA
> is completely wrong: to make sure that b1:FLUSH made effect we have to
> wait for its completion. Similarly, even if we're sure that FUA will be
> processed as post-FLUSH (also dubious!), we have to wait for completion
> b1..b4 to make sure that that flush will cover them.
>
> The patch fixes all these issues pretty simply: let's mark outgoing
> bio-s with FLUSH|FUA based on those flags in the *corresponding* incoming
> bio-s.
One more thing, please see below.
>
> Signed-off-by: Maxim Patlasov 
> ---
>  drivers/block/ploop/dev.c   |1 -
>  drivers/block/ploop/io_direct.c |   47 
> ---
>  2 files changed, 15 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
> index 2ef1449..6b5702f 100644
> --- a/drivers/block/ploop/dev.c
> +++ b/drivers/block/ploop/dev.c
> @@ -498,7 +498,6 @@ ploop_bio_queue(struct ploop_device * plo, struct bio * 
> bio,
>   preq->req_sector = bio->bi_sector;
>   preq->req_size = bio->bi_size >> 9;
>   preq->req_rw = bio->bi_rw;
> - bio->bi_rw &= ~(REQ_FLUSH | REQ_FUA);
>   preq->eng_state = PLOOP_E_ENTRY;
>   preq->state = 0;
>   preq->error = 0;
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index 6ef9cd8..84c9a48 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -92,7 +92,6 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
>   int preflush;
>   int postfua = 0;
>   int write = !!(rw & REQ_WRITE);
> - int bio_num;
>  
>   trace_submit(preq);
>  
> @@ -233,13 +232,13 @@ flush_bio:
>   goto flush_bio;
>   }
>  
> + bio->bi_rw |= bw.cur->bi_rw & (REQ_FLUSH | REQ_FUA);
>   bw.bv_off += copy;
>   size -= copy >> 9;
>   sec += copy >> 9;
>   }
>   ploop_extent_put(em);
>  
> - bio_num = 0;
>   while (bl.head) {
>   struct bio * b = bl.head;
>   unsigned long rw2 = rw;
> @@ -255,11 +254,10 @@ flush_bio:
>   preflush = 0;
>   }
>   if (unlikely(postfua && !bl.head))
> - rw2 |= (REQ_FUA | ((bio_num) ? REQ_FLUSH : 0));
> + rw2 |= REQ_FUA;
>  
>   ploop_acc_ff_out(preq->plo, rw2 | b->bi_rw);
> - submit_bio(rw2, b);
> - bio_num++;
> + submit_bio(rw2 | b->bi_rw, b);
>   }
>  
>   ploop_complete_io_request(preq);
> @@ -567,7 +565,6 @@ dio_submit_pad(struct ploop_io *io, struct ploop_request 
> * preq,
>   sector_t sec, end_sec, nsec, start, end;
>   struct bio_list_walk bw;
>   int err;
> - int preflush = !!(preq->req_rw & REQ_FLUSH);
>  
>   bio_list_init(&bl);
>  
> @@ -598,14 +595,17 @@ dio_submit_pad(struct ploop_io *io, struct 
> ploop_request * preq,
>   while (sec < end_sec) {
>   struct page * page;
>   unsigned int poff, plen;
> + bool zero_page;
>  
>   if (sec < start) {
> + zero_page = true;
>   page = ZERO_PAGE(0);
>   poff = 0;
>   plen = start - sec;
>   if (plen > (PAGE_SIZE>>9))
>   plen = (PAGE_SIZE>>9);
>   } else if (sec >= end) {
> + zero_page = true;
>   page = ZERO_PAGE(0);
>   poff = 0;
>   plen = end_sec - sec;
> @@ -614,6 +614,7 @@ dio_submit_pad(struct ploop_io *io, struct ploop_request 
> * preq,
>   } else {
>   /* sec >= start && sec < end */
>   struct bio_vec * bv;
> + zero_page = false;
>  
>   if (sec == start) {
>   bw.cur = sbl->head;
> @@ -672,6 +673,10 @@ flush_bio:
>   goto flush_bio;
>   }
>  
> + /* Handle FLUSH here, dio_post_submit will handle FUA */

submit_pad may be called w/o post_submit flag from here:
->dio_submit_alloc
  if (io->files.em_tree->_get_extent) {
   ->dio_fallocate
   ->dio_submit_pad
  ..
 }
> + if (!zero_page)
> + bio->bi_rw |= bw.cur->bi_rw & REQ_FLUSH;
> +
>   bw.bv_off += (plen<<9);
>   BUG_ON(plen == 0);
>   sec += plen;
> @@ -688,13 +693,9 @@ flush_bio:
>   b->bi_private = preq;
>   b->bi_end_io = dio_endio_async;
>  
> - rw = sbl->head->bi_rw | WRITE;
> - if (unlikely(preflush)) {
> - rw |= REQ_FLUSH;
> - preflush = 0;
> - }
> + rw = preq->req_rw & ~(REQ_FLUSH | REQ_FUA);
>   

Re: [Devel] [PATCH rh7] ploop: fix barriers for ordinary requests

2016-06-22 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> The way how io_direct.c handles FLUSH|FUA: b1:FLUSH,b2,b3,b4,b5:FLUSH|FUA
> is completely wrong: to make sure that b1:FLUSH made effect we have to
> wait for its completion. Similarly, even if we're sure that FUA will be
> processed as post-FLUSH (also dubious!), we have to wait for completion
> b1..b4 to make sure that that flush will cover them.
>
> The patch fixes all these issues pretty simply: let's mark outgoing
> bio-s with FLUSH|FUA based on those flags in the *corresponding* incoming
> bio-s.
>
> Signed-off-by: Maxim Patlasov 
> ---
>  drivers/block/ploop/dev.c   |1 -
>  drivers/block/ploop/io_direct.c |   47 
> ---
>  2 files changed, 15 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
> index 2ef1449..6b5702f 100644
> --- a/drivers/block/ploop/dev.c
> +++ b/drivers/block/ploop/dev.c
> @@ -498,7 +498,6 @@ ploop_bio_queue(struct ploop_device * plo, struct bio * 
> bio,
>   preq->req_sector = bio->bi_sector;
>   preq->req_size = bio->bi_size >> 9;
>   preq->req_rw = bio->bi_rw;
> - bio->bi_rw &= ~(REQ_FLUSH | REQ_FUA);
Wow. I couldn't even imagine that we clear barrier flags from the original bios.
>   preq->eng_state = PLOOP_E_ENTRY;
>   preq->state = 0;
>   preq->error = 0;
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index 6ef9cd8..84c9a48 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -92,7 +92,6 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
>   int preflush;
>   int postfua = 0;
>   int write = !!(rw & REQ_WRITE);
> - int bio_num;
>  
>   trace_submit(preq);
>  
> @@ -233,13 +232,13 @@ flush_bio:
>   goto flush_bio;
>   }
>  
> + bio->bi_rw |= bw.cur->bi_rw & (REQ_FLUSH | REQ_FUA);
>   bw.bv_off += copy;
>   size -= copy >> 9;
>   sec += copy >> 9;
>   }
>   ploop_extent_put(em);
>  
> - bio_num = 0;
>   while (bl.head) {
>   struct bio * b = bl.head;
>   unsigned long rw2 = rw;
> @@ -255,11 +254,10 @@ flush_bio:
>   preflush = 0;
>   }
>   if (unlikely(postfua && !bl.head))
> - rw2 |= (REQ_FUA | ((bio_num) ? REQ_FLUSH : 0));
> + rw2 |= REQ_FUA;
>  
>   ploop_acc_ff_out(preq->plo, rw2 | b->bi_rw);
> - submit_bio(rw2, b);
> - bio_num++;
> + submit_bio(rw2 | b->bi_rw, b);
>   }
>  
>   ploop_complete_io_request(preq);
> @@ -567,7 +565,6 @@ dio_submit_pad(struct ploop_io *io, struct ploop_request 
> * preq,
>   sector_t sec, end_sec, nsec, start, end;
>   struct bio_list_walk bw;
>   int err;
> - int preflush = !!(preq->req_rw & REQ_FLUSH);
>  
>   bio_list_init(&bl);
>  
> @@ -598,14 +595,17 @@ dio_submit_pad(struct ploop_io *io, struct 
> ploop_request * preq,
>   while (sec < end_sec) {
>   struct page * page;
>   unsigned int poff, plen;
> + bool zero_page;
>  
>   if (sec < start) {
> + zero_page = true;
>   page = ZERO_PAGE(0);
>   poff = 0;
>   plen = start - sec;
>   if (plen > (PAGE_SIZE>>9))
>   plen = (PAGE_SIZE>>9);
>   } else if (sec >= end) {
> + zero_page = true;
>   page = ZERO_PAGE(0);
>   poff = 0;
>   plen = end_sec - sec;
> @@ -614,6 +614,7 @@ dio_submit_pad(struct ploop_io *io, struct ploop_request 
> * preq,
>   } else {
>   /* sec >= start && sec < end */
>   struct bio_vec * bv;
> + zero_page = false;
>  
>   if (sec == start) {
>   bw.cur = sbl->head;
> @@ -672,6 +673,10 @@ flush_bio:
>   goto flush_bio;
>   }
>  
> + /* Handle FLUSH here, dio_post_submit will handle FUA */
> + if (!zero_page)
> + bio->bi_rw |= bw.cur->bi_rw & REQ_FLUSH;
> +
>   bw.bv_off += (plen<<9);
>   BUG_ON(plen == 0);
>   sec += plen;
> @@ -688,13 +693,9 @@ flush_bio:
>   b->bi_private = preq;
>   b->bi_end_io = dio_endio_async;
>  
> - rw = sbl->head->bi_rw | WRITE;
> - if (unlikely(preflush)) {
> - rw |= REQ_FLUSH;
> - preflush = 0;
> - }
> + rw = preq->req_rw & ~(REQ_FLUSH | REQ_FUA);
>   ploop_acc_ff_out(preq->plo, rw | b->bi_rw);
> - submit_bio(rw, b);
> + submit_bio(rw | 

[Devel] [PATCH 1/3] ploop: skip redundant fsync for REQ_FUA in post_submit

2016-06-21 Thread Dmitry Monakhov
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 drivers/block/ploop/io_direct.c | 24 ++--
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index b844a80..58d7580 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -517,27 +517,31 @@ dio_post_submit(struct ploop_io *io, struct ploop_request 
* preq)
struct ploop_device *plo = preq->plo;
sector_t sec = (sector_t)preq->iblock << preq->plo->cluster_log;
loff_t clu_siz = 1 << (preq->plo->cluster_log + 9);
+   int force_sync = preq->req_rw & REQ_FUA;
int err;
 
file_start_write(io->files.file);
 
-   /* Here io->io_count is even ... */
-   spin_lock_irq(&plo->lock);
-   io->io_count++;
-   set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
-   spin_unlock_irq(&plo->lock);
-
+   if (!force_sync) {
+   /* Here io->io_count is even ... */
+   spin_lock_irq(&plo->lock);
+   io->io_count++;
+   set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
+   spin_unlock_irq(&plo->lock);
+   }
err = io->files.file->f_op->fallocate(io->files.file,
  FALLOC_FL_CONVERT_UNWRITTEN,
  (loff_t)sec << 9, clu_siz);
 
/* highly unlikely case: FUA coming to a block not provisioned yet */
-   if (!err && (preq->req_rw & REQ_FUA))
+   if (!err && force_sync)
err = io->ops->sync(io);
 
-   spin_lock_irq(&plo->lock);
-   io->io_count++;
-   spin_unlock_irq(&plo->lock);
+   if (!force_sync) {
+   spin_lock_irq(&plo->lock);
+   io->io_count++;
+   spin_unlock_irq(&plo->lock);
+   }
/* and here io->io_count is even (+2) again. */
 
file_end_write(io->files.file);
-- 
1.8.3.1

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling v3

2016-06-21 Thread Dmitry Monakhov
barrier code is broken in many ways:
Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly.
But a request can also go through ->dio_submit_alloc()->dio_submit_pad and
write_page (for indexes).
So in case of grow_dev we have the following sequence:

E_RELOC_DATA_READ:
 ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
  ->delta->allocate
 ->io->submit_alloc: dio_submit_alloc
   ->dio_submit_pad
E_DATA_WBI : data written, time to update index
  ->delta->allocate_complete:ploop_index_update
->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
->write_page
->ploop_map_wb_complete
  ->ploop_wb_complete_post_process
->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
E_RELOC_NULLIFY:

   ->submit()

BUG#2: currently kaio write_page silently ignores REQ_FLUSH
BUG#3: io_direct:dio_submit: if fua delay is not possible we MUST tag all bios
   via REQ_FUA, not just the latest one.
This patch unifies barrier handling as follows:
- Get rid of FORCE_{FLUSH,FUA}
- Introduce DELAYED_FLUSH
- fix fua handling for dio_submit
- BUG_ON for REQ_FLUSH in kaio_page_write

This makes reloc sequence optimal:
io_direct
RELOC_S: R1, W2, WBI:FLUSH|FUA
RELOC_A: R1, W2, WBI:FLUSH|FUA, W1:NULLIFY|FUA
io_kaio
RELOC_S: R1, W2:FUA, WBI:FUA
RELOC_A: R1, W2:FUA, WBI:FUA, W1:NULLIFY|FUA
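
On the kaio side the rule is enforced in kaio_write_page(): FUA becomes a
deferred fsync, and FLUSH must never reach page writes (sketch; the first
comment is from the existing code, the I/O helper name is an assumption):

	static void
	kaio_write_page(struct ploop_io *io, struct ploop_request *preq,
			struct page *page, sector_t sec, int fua)
	{
		/* No FUA in kaio, convert it to fsync */
		if (fua)
			set_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state);

		/* callers must have stripped REQ_FLUSH by now */
		BUG_ON(preq->req_rw & REQ_FLUSH);

		kaio_io_page(io, preq, page, sec);	/* assumed helper */
	}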

https://jira.sw.ru/browse/PSBM-47107
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 drivers/block/ploop/dev.c   |  8 +---
 drivers/block/ploop/io_direct.c | 30 ++-
 drivers/block/ploop/io_kaio.c   | 23 +
 drivers/block/ploop/map.c   | 45 ++---
 include/linux/ploop/ploop.h | 19 +
 5 files changed, 60 insertions(+), 65 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 96f7850..fbc5f2f 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -1224,6 +1224,9 @@ static void ploop_complete_request(struct ploop_request * 
preq)
 
__TRACE("Z %p %u\n", preq, preq->req_cluster);
 
+   if (!preq->error) {
+   WARN_ON(test_bit(PLOOP_REQ_DELAYED_FLUSH, &preq->state));
+   }
while (preq->bl.head) {
struct bio * bio = preq->bl.head;
preq->bl.head = bio->bi_next;
@@ -2530,9 +2533,8 @@ restart:
top_delta = ploop_top_delta(plo);
sbl.head = sbl.tail = preq->aux_bio;
 
-   /* Relocated data write required sync before BAT updatee */
-   set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
-
+   /* Relocated data write required sync before BAT updatee
+* this will happen inside index_update */
if (test_bit(PLOOP_REQ_RELOC_S, >state)) {
preq->eng_state = PLOOP_E_DATA_WBI;
plo->st.bio_out++;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index a6d83fe..303eb70 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -83,28 +83,19 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
int err;
struct bio_list_walk bw;
int preflush;
-   int postfua = 0;
+   int fua = 0;
int write = !!(rw & REQ_WRITE);
int bio_num;
 
trace_submit(preq);
 
preflush = !!(rw & REQ_FLUSH);
-
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state))
-   preflush = 1;
-
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state))
-   postfua = 1;
-
-   if (!postfua && ploop_req_delay_fua_possible(rw, preq)) {
-
+   fua = !!(rw & REQ_FUA);
+   if (fua && ploop_req_delay_fua_possible(rw, preq)) {
/* Mark req that delayed flush required */
-   set_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
-   } else if (rw & REQ_FUA) {
-   postfua = 1;
+   set_bit(PLOOP_REQ_DELAYED_FLUSH, &preq->state);
+   fua = 0;
}
-
rw &= ~(REQ_FLUSH | REQ_FUA);
 
 
@@ -238,8 +229,10 @@ flush_bio:
rw2 |= REQ_FLUSH;
preflush = 0;
}
-   if (unlikely(postfua && !bl.head))
-   rw2 |= (REQ_FUA | ((bio_num) ? REQ_FLUSH : 0));
+   /* Very unlikely, but correct.
+* TODO: Optimize postfua via DELAY_FLUSH for any req state */
+   if (unlikely(fua))
+   rw2 |= REQ_FUA;
 
ploop_acc_ff_out(preq->plo, rw2 | b->bi_rw);
submit_bio(rw2, b);
@@ -1520,15 +1513,14 @@ dio_read_page(struct ploop_io * io, struct 
ploop_request * preq,
 
 static voi

[Devel] [PATCH 2/3] ploop: deadcode cleanup

2016-06-21 Thread Dmitry Monakhov
The (rw & REQ_FUA) branch is impossible because REQ_FUA was cleared on the line
above. The logic was moved to ploop_req_delay_fua_possible() a long time ago.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 drivers/block/ploop/io_direct.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 58d7580..a6d83fe 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -108,15 +108,6 @@ dio_submit(struct ploop_io *io, struct ploop_request * 
preq,
rw &= ~(REQ_FLUSH | REQ_FUA);
 
 
-   /* In case of eng_state != COMPLETE, we'll do FUA in
-* ploop_index_update(). Otherwise, we should mark
-* last bio as FUA here. */
-   if (rw & REQ_FUA) {
-   rw &= ~REQ_FUA;
-   if (preq->eng_state == PLOOP_E_COMPLETE)
-   postfua = 1;
-   }
-
	bio_list_init(&bl);
 
if (iblk == PLOOP_ZERO_INDEX)
-- 
1.8.3.1

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling v2

2016-06-21 Thread Dmitry Monakhov
Maxim Patlasov <mpatla...@virtuozzo.com> writes:

> Dima,
>
> I agree with the general approach of this patch, but there are some
> (easy-to-fix) issues. Please see the inline comments below...
>
> On 06/20/2016 11:58 AM, Dmitry Monakhov wrote:
>> barrier code is broken in many ways:
>> Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly.
>> But request also can goes though ->dio_submit_alloc()->dio_submit_pad and 
>> write_page (for indexes)
>> So in case of grow_dev we have following sequance:
>>
>> E_RELOC_DATA_READ:
>>   ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>>->delta->allocate
>>   ->io->submit_alloc: dio_submit_alloc
>> ->dio_submit_pad
>> E_DATA_WBI : data written, time to update index
>>->delta->allocate_complete:ploop_index_update
>>  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>>  ->write_page
>>  ->ploop_map_wb_complete
>>->ploop_wb_complete_post_process
>>  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>> E_RELOC_NULLIFY:
>>
>> ->submit()
>>
>> BUG#2: currently kaio write_page silently ignores REQ_FUA
>
> Sorry, I can't agree, it actually does not ignore:
I mistyped. I meant to say REQ_FLUSH.
>
>> static void
>> kaio_write_page(struct ploop_io * io, struct ploop_request * preq,
>>  struct page * page, sector_t sec, int fua)
>> {
>> /* No FUA in kaio, convert it to fsync */
>> if (fua)
>> set_bit(PLOOP_REQ_KAIO_FSYNC, >state);
>
>
>> BUG#3: io_direct:dio_submit: if fua delay is not possible we MUST tag all
>> bios via REQ_FUA, not just the latest one.
>
> No need to tag *all*. See inline comments below.
>
>> This patch unifies barrier handling as follows:
>> - Get rid of FORCE_{FLUSH,FUA}
>> - Introduce DELAYED_FLUSH; currently it is supported only by io_direct
>> - fix up fua handling for dio_submit
>>
>> This makes reloc sequence optimal:
>> io_direct
>> RELOC_S: R1, W2, WBI:FLUSH|FUA
>> RELOC_A: R1, W2, WBI:FLUSH|FUA, W1:NULLIFY|FUA
>> io_kaio
>> RELOC_S: R1, W2:FUA, WBI:FUA
>> RELOC_A: R1, W2:FUA, WBI:FUA, W1:NULLIFY|FUA
>>
>> https://jira.sw.ru/browse/PSBM-47107
>> Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
>> ---
>>   drivers/block/ploop/dev.c   |  8 +---
>>   drivers/block/ploop/io_direct.c | 29 +-
>>   drivers/block/ploop/io_kaio.c   | 17 ++--
>>   drivers/block/ploop/map.c   | 45 
>> ++---
>>   include/linux/ploop/ploop.h |  8 
>>   5 files changed, 54 insertions(+), 53 deletions(-)
>>
>> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
>> index 96f7850..fbc5f2f 100644
>> --- a/drivers/block/ploop/dev.c
>> +++ b/drivers/block/ploop/dev.c
>> @@ -1224,6 +1224,9 @@ static void ploop_complete_request(struct 
>> ploop_request * preq)
>>   
>>  __TRACE("Z %p %u\n", preq, preq->req_cluster);
>>   
>> +if (!preq->error) {
>> +WARN_ON(test_bit(PLOOP_REQ_DELAYED_FLUSH, &preq->state));
>> +}
>>  while (preq->bl.head) {
>>  struct bio * bio = preq->bl.head;
>>  preq->bl.head = bio->bi_next;
>> @@ -2530,9 +2533,8 @@ restart:
>>  top_delta = ploop_top_delta(plo);
>>  sbl.head = sbl.tail = preq->aux_bio;
>>   
>> -/* Relocated data write requires sync before BAT update */
>> -set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>> -
>> +/* Relocated data write requires sync before BAT update;
>> + * this will happen inside index_update */
>>  if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
>>  preq->eng_state = PLOOP_E_DATA_WBI;
>>  plo->st.bio_out++;
>> diff --git a/drivers/block/ploop/io_direct.c 
>> b/drivers/block/ploop/io_direct.c
>> index a6d83fe..d7ecd4a 100644
>> --- a/drivers/block/ploop/io_direct.c
>> +++ b/drivers/block/ploop/io_direct.c
>> @@ -90,21 +90,12 @@ dio_submit(struct ploop_io *io, struct ploop_request * 
>> preq,
>>  trace_submit(preq);
>>   
>>  preflush = !!(rw & REQ_FLUSH);
>> -
>> -if (test_and_clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->stat

[Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling v2

2016-06-20 Thread Dmitry Monakhov
The barrier code is broken in many ways:
Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly,
but a request can also go through ->dio_submit_alloc()->dio_submit_pad and
write_page (for indexes).
So in case of grow_dev we have the following sequence:

E_RELOC_DATA_READ:
  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
  ->delta->allocate
    ->io->submit_alloc: dio_submit_alloc
      ->dio_submit_pad
E_DATA_WBI : data written, time to update index
  ->delta->allocate_complete: ploop_index_update
    ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
    ->write_page
    ->ploop_map_wb_complete
      ->ploop_wb_complete_post_process
        ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
E_RELOC_NULLIFY:

  ->submit()

BUG#2: currently kaio write_page silently ignores REQ_FUA
BUG#3: in io_direct:dio_submit, if FUA delay is not possible we MUST tag all
bios with REQ_FUA, not just the latest one.
This patch unifies barrier handling as follows:
- Get rid of FORCE_{FLUSH,FUA}
- Introduce DELAYED_FLUSH; currently it is supported only by io_direct
- Fix up FUA handling in dio_submit

This makes reloc sequence optimal:
io_direct
RELOC_S: R1, W2, WBI:FLUSH|FUA
RELOC_A: R1, W2, WBI:FLUSH|FUA, W1:NULLIFY|FUA
io_kaio
RELOC_S: R1, W2:FUA, WBI:FUA
RELOC_A: R1, W2:FUA, WBI:FUA, W1:NULLIFY|FUA

https://jira.sw.ru/browse/PSBM-47107
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 drivers/block/ploop/dev.c   |  8 +---
 drivers/block/ploop/io_direct.c | 29 +-
 drivers/block/ploop/io_kaio.c   | 17 ++--
 drivers/block/ploop/map.c   | 45 ++---
 include/linux/ploop/ploop.h |  8 
 5 files changed, 54 insertions(+), 53 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 96f7850..fbc5f2f 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -1224,6 +1224,9 @@ static void ploop_complete_request(struct ploop_request * 
preq)
 
__TRACE("Z %p %u\n", preq, preq->req_cluster);
 
+   if (!preq->error) {
+   WARN_ON(test_bit(PLOOP_REQ_DELAYED_FLUSH, &preq->state));
+   }
while (preq->bl.head) {
struct bio * bio = preq->bl.head;
preq->bl.head = bio->bi_next;
@@ -2530,9 +2533,8 @@ restart:
top_delta = ploop_top_delta(plo);
sbl.head = sbl.tail = preq->aux_bio;
 
-   /* Relocated data write requires sync before BAT update */
-   set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
-
+   /* Relocated data write requires sync before BAT update;
+* this will happen inside index_update */
	if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
preq->eng_state = PLOOP_E_DATA_WBI;
plo->st.bio_out++;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index a6d83fe..d7ecd4a 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -90,21 +90,12 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
trace_submit(preq);
 
preflush = !!(rw & REQ_FLUSH);
-
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state))
-   preflush = 1;
-
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state))
-   postfua = 1;
-
-   if (!postfua && ploop_req_delay_fua_possible(rw, preq)) {
-
+   postfua = !!(rw & REQ_FUA);
+   if (ploop_req_delay_fua_possible(rw, preq)) {
/* Mark req that delayed flush required */
-   set_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
-   } else if (rw & REQ_FUA) {
-   postfua = 1;
+   set_bit(PLOOP_REQ_DELAYED_FLUSH, &preq->state);
+   postfua = 0;
}
-
rw &= ~(REQ_FLUSH | REQ_FUA);
 
 
@@ -238,14 +229,15 @@ flush_bio:
rw2 |= REQ_FLUSH;
preflush = 0;
}
-   if (unlikely(postfua && !bl.head))
-   rw2 |= (REQ_FUA | ((bio_num) ? REQ_FLUSH : 0));
+   /* Very unlikely, but correct.
+* TODO: Optimize postfua via DELAY_FLUSH for any req state */
+   if (unlikely(!postfua))
+   rw2 |= REQ_FUA;
 
ploop_acc_ff_out(preq->plo, rw2 | b->bi_rw);
submit_bio(rw2, b);
bio_num++;
}
-
ploop_complete_io_request(preq);
return;
 
@@ -1520,15 +1512,14 @@ dio_read_page(struct ploop_io * io, struct 
ploop_request * preq,
 
 static void
 dio_write_page(struct ploop_io * io, struct ploop_request * preq,
-  s

[Devel] [PATCH 1/3] ploop: skip redundant fsync for REQ_FUA in post_submit v2

2016-06-20 Thread Dmitry Monakhov
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 drivers/block/ploop/io_direct.c | 24 ++--
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index b844a80..58d7580 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -517,27 +517,31 @@ dio_post_submit(struct ploop_io *io, struct ploop_request 
* preq)
struct ploop_device *plo = preq->plo;
sector_t sec = (sector_t)preq->iblock << preq->plo->cluster_log;
loff_t clu_siz = 1 << (preq->plo->cluster_log + 9);
+   int force_sync = preq->req_rw & REQ_FUA;
int err;
 
file_start_write(io->files.file);
 
-   /* Here io->io_count is even ... */
-   spin_lock_irq(&plo->lock);
-   io->io_count++;
-   set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
-   spin_unlock_irq(&plo->lock);
-
+   if (!force_sync) {
+   /* Here io->io_count is even ... */
+   spin_lock_irq(&plo->lock);
+   io->io_count++;
+   set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
+   spin_unlock_irq(&plo->lock);
+   }
err = io->files.file->f_op->fallocate(io->files.file,
  FALLOC_FL_CONVERT_UNWRITTEN,
  (loff_t)sec << 9, clu_siz);
 
/* highly unlikely case: FUA coming to a block not provisioned yet */
-   if (!err && (preq->req_rw & REQ_FUA))
+   if (!err && force_sync)
err = io->ops->sync(io);
 
-   spin_lock_irq(&plo->lock);
-   io->io_count++;
-   spin_unlock_irq(&plo->lock);
+   if (!force_sync) {
+   spin_lock_irq(&plo->lock);
+   io->io_count++;
+   spin_unlock_irq(&plo->lock);
+   }
/* and here io->io_count is even (+2) again. */
 
file_end_write(io->files.file);
-- 
1.8.3.1

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH 2/3] ploop: deadcode cleanup

2016-06-20 Thread Dmitry Monakhov
The (rw & REQ_FUA) branch is impossible because REQ_FUA was cleared on the line above.
The logic was moved to ploop_req_delay_fua_possible() a long time ago.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 drivers/block/ploop/io_direct.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 58d7580..a6d83fe 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -108,15 +108,6 @@ dio_submit(struct ploop_io *io, struct ploop_request * 
preq,
rw &= ~(REQ_FLUSH | REQ_FUA);
 
 
-   /* In case of eng_state != COMPLETE, we'll do FUA in
-* ploop_index_update(). Otherwise, we should mark
-* last bio as FUA here. */
-   if (rw & REQ_FUA) {
-   rw &= ~REQ_FUA;
-   if (preq->eng_state == PLOOP_E_COMPLETE)
-   postfua = 1;
-   }
-
bio_list_init();
 
if (iblk == PLOOP_ZERO_INDEX)
-- 
1.8.3.1

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling

2016-06-19 Thread Dmitry Monakhov
Maxim Patlasov <mpatla...@virtuozzo.com> writes:

> On 06/16/2016 09:30 AM, Dmitry Monakhov wrote:
>> Dmitry Monakhov <dmonak...@openvz.org> writes:
>>
>>> Maxim Patlasov <mpatla...@virtuozzo.com> writes:
>>>
>>>> Dima,
>>>>
>>>> I agree that the ploop barrier code is broken in many ways, but I don't
>>>> think the patch actually fixes it. I hope you would agree that
>>>> completion of REQ_FUA guarantees only landing that particular bio to the
>>>> disk; it says nothing about flushing previously submitted (and
>>>> completed) bio-s and it is also possible that power outage may catch us
>>>> when this REQ_FUA is already landed to the disk, but previous bio-s are
>>>> not yet.
>>> Actually it does (but implicitly): Linux handles FUA as FLUSH,W,FLUSH.
>>> So yes, it would be more correct to tag WBI with FLUSH_FUA.
>>>> Hence, for RELOC_{A|S} requests we actually need something like that:
>>>>
>>>>RELOC_S: R1, W2, FLUSH:WB, WBI:FUA
>>>>RELOC_A: R1, W2, FLUSH:WB, WBI:FUA, W1:NULLIFY:FUA
>>>>
>>>> (i.e. we do need to flush all previously submitted data before starting
>>>> to update BAT on disk)
>>>>
>>> Correct sequence:
>>> RELOC_S: R1, W2, WBI:FLUSH_FUA
>>> RELOC_A: R1, W2, WBI:FLUSH_FUA, W1:NULLIFY:FUA
>>>
>>>> not simply:
>>>>
>>>>> RELOC_S: R1, W2, WBI:FUA
>>>>> RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA
>>>> Also, the patch makes the meaning of PLOOP_REQ_FORCE_FUA and
>>>> PLOOP_REQ_FORCE_FLUSH even more obscure than it used to be. I think we
>>>> could remove them completely (along with that optimization delaying
>>>> incoming FUA) and re-implement all this stuff from scratch:
>>>>
>>>> 1) The final "NULLIFY:FUA" is a piece of cake -- it's enough to set
>>>> REQ_FUA in preq->req_rw before calling ->submit(preq)
>>>>
>>>> 2) For "FLUSH:WB, WBI:FUA" it is actually enough to send bio updating
>>>> BAT on disk as REQ_FLUSH|REQ_FUA -- we can specify it explicitly for
>>>> RELOC_A|S in ploop_index_update and map_wb_complete
>>>>
>>>> 3) For that optimization delaying incoming FUA (what we do now if
>>>> ploop_req_delay_fua_possible() returns true) we could introduce new
>>>> ad-hoc PLOOP_IO_FLUSH_DELAYED enforcing REQ_FLUSH in ploop_index_update
>>>> and map_wb_complete (the same thing as 2) above). And, yes, let's
>>>> WARN_ON if we somehow missed its processing.
>>> Yes. This was one of my ideas.
>>> 1) FORCE_FLUSH and FORCE_FUA are redundant states which simply mirror
>>> RELOC_{A,S} semantics. Let's get rid of that crap and simply introduce
>>> PLOOP_IO_FLUSH_DELAYED.
>>> 2) Fix ->write_page to handle FLUSH as it does FUA.
>>>> The only complication I foresee is about how to teach kaio to pre-flush
>>>> in kaio_write_page -- it's doable, but involves kaio_resubmit that's
>>>> already pretty convoluted.
>>>>
>>> Yes. kaio_submit is correct, but kaio_write_page does not care about REQ_FLUSH.
>> Crap. Currently kaio can handle fsync only via kaio_queue_fsync_req,
>> which is async and not suitable for page_write.
>
> I think it's doable to process page_write via kaio_fsync_thread, but 
> it's tricky.
>
>> Max, let's agree on terminology.
>> The reason I wrote this is that Linux internally interprets FUA as
>> preflush,write,postflush, which is wrong from an academic point of view, but
>> it is the world we live in with Linux.
>
> Are you sure that this  (FUA == preflush,write,postflush) is universally 
> true (i.e. no exceptions)? What about bio-based block-device drivers?
>
>> This is the reason I read the code
>> differently from the way it was designed.
>> Let's state that ploop is an ideal world where:
>> FLUSH ==> preflush
>> FUA   ==> WRITE,postflush
>
> In an ideal world FUA is not obliged to be handled by a postflush: it's enough 
> to guarantee that *this* particular request went to the platter; other 
> requests may remain not-flushed-yet. 
> Documentation/block/writeback_cache_control.txt is absolutely clear 
> about it:
>
>> The REQ_FUA flag can be OR'd into the r/w flags of a bio submitted 
>> from the
>> filesystem and will make sure that I/O completion for this request is only
>> signaled after the data has been committed to non-volatile storage.
>> ...
>> 

Re: [Devel] [vzlin-dev] [PATCH rh7] ploop: io_kaio: fix silly bug in kaio_complete_io_state()

2016-06-17 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> It's useless to check for preq->req_rw & REQ_FUA after:
> preq->req_rw &= ~REQ_FUA;
ACK :) But to make it clear for others, let's post the original code
here:
...
	preq->req_rw &= ~REQ_FUA;

	/* Convert requested fua to fsync */
	if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state) ||
	    test_and_clear_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state))
		post_fsync = 1;

	if (!post_fsync &&
	    !ploop_req_delay_fua_possible(preq->req_rw, preq) &&
	    (preq->req_rw & REQ_FUA))
		post_fsync = 1;

	preq->req_rw &= ~REQ_FUA;
...


>
> Signed-off-by: Maxim Patlasov 
> ---
>  drivers/block/ploop/io_kaio.c |2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
> index 79aa9af..de26319 100644
> --- a/drivers/block/ploop/io_kaio.c
> +++ b/drivers/block/ploop/io_kaio.c
> @@ -71,8 +71,6 @@ static void kaio_complete_io_state(struct ploop_request * 
> preq)
>   return;
>   }
>  
> - preq->req_rw &= ~REQ_FUA;
> -
>   /* Convert requested fua to fsync */
>   if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state) ||
>   test_and_clear_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state))


signature.asc
Description: PGP signature
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [vzlin-dev] [PATCH rh7] ploop: fix counting bio_qlen

2016-06-17 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> The commit ec1eeb868 (May 22 2015) ported "separate queue for discard bio"
> patch from RHEL6-based kernel incorrectly. Original patch stated clearly
> that if we want to decrement bio_discard_qlen, bio_qlen must not change:
>
> @@ -500,7 +502,7 @@ ploop_bio_queue(struct ploop_device * pl
> (err = ploop_discard_add_bio(plo->fbd, bio))) {
> BIO_ENDIO(bio, err);
> list_add(&preq->list, &plo->free_list);
> -   plo->bio_qlen--;
> +   plo->bio_discard_qlen--;
> plo->bio_total--;
> return;
> }
>
> but that port did the opposite:
>
> @@ -521,6 +523,7 @@ ploop_bio_queue(struct ploop_device * plo, struct bio * 
> bio,
> BIO_ENDIO(plo->queue, bio, err);
> list_add(&preq->list, &plo->free_list);
> plo->bio_qlen--;
> +   plo->bio_discard_qlen--;
> plo->bio_total--;
> return;
> }
>
> Signed-off-by: Maxim Patlasov 
> ---
>  drivers/block/ploop/dev.c |1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
> index db55be3..e1fbfcf 100644
> --- a/drivers/block/ploop/dev.c
> +++ b/drivers/block/ploop/dev.c
> @@ -523,7 +523,6 @@ ploop_bio_queue(struct ploop_device * plo, struct bio * 
> bio,
>   }
>   BIO_ENDIO(plo->queue, bio, err);
>   list_add(&preq->list, &plo->free_list);
> - plo->bio_qlen--;
>   plo->bio_discard_qlen--;
>   plo->bio_total--;
>   return;
ACK


signature.asc
Description: PGP signature
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling

2016-06-16 Thread Dmitry Monakhov
Dmitry Monakhov <dmonak...@openvz.org> writes:

> Maxim Patlasov <mpatla...@virtuozzo.com> writes:
>
>> Dima,
>>
>> I agree that the ploop barrier code is broken in many ways, but I don't 
>> think the patch actually fixes it. I hope you would agree that 
>> completion of REQ_FUA guarantees only landing that particular bio to the 
>> disk; it says nothing about flushing previously submitted (and 
>> completed) bio-s and it is also possible that power outage may catch us 
>> when this REQ_FUA is already landed to the disk, but previous bio-s are 
>> not yet.
> Actually it does (but implicitly): Linux handles FUA as FLUSH,W,FLUSH.
> So yes, it would be more correct to tag WBI with FLUSH_FUA.
>> Hence, for RELOC_{A|S} requests we actually need something like that:
>>
>>   RELOC_S: R1, W2, FLUSH:WB, WBI:FUA
>>   RELOC_A: R1, W2, FLUSH:WB, WBI:FUA, W1:NULLIFY:FUA
>>
>> (i.e. we do need to flush all previously submitted data before starting 
>> to update BAT on disk)
>>
> Correct sequence:
> RELOC_S: R1, W2, WBI:FLUSH_FUA
> RELOC_A: R1, W2, WBI:FLUSH_FUA, W1:NULLIFY:FUA
>
>> not simply:
>>
>>> RELOC_S: R1, W2, WBI:FUA
>>> RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA
>>
>> Also, the patch makes the meaning of PLOOP_REQ_FORCE_FUA and 
>> PLOOP_REQ_FORCE_FLUSH even more obscure than it used to be. I think we 
>> could remove them completely (along with that optimization delaying 
>> incoming FUA) and re-implement all this stuff from scratch:
>>
>> 1) The final "NULLIFY:FUA" is a piece of cake -- it's enough to set 
>> REQ_FUA in preq->req_rw before calling ->submit(preq)
>>
>> 2) For "FLUSH:WB, WBI:FUA" it is actually enough to send bio updating 
>> BAT on disk as REQ_FLUSH|REQ_FUA -- we can specify it explicitly for 
>> RELOC_A|S in ploop_index_update and map_wb_complete
>>
>> 3) For that optimization delaying incoming FUA (what we do now if 
>> ploop_req_delay_fua_possible() returns true) we could introduce new 
>> ad-hoc PLOOP_IO_FLUSH_DELAYED enforcing REQ_FLUSH in ploop_index_update 
>> and map_wb_complete (the same thing as 2) above). And, yes, let's 
>> WARN_ON if we somehow missed its processing.
> Yes. This was one of my ideas.
> 1) FORCE_FLUSH and FORCE_FUA are redundant states which simply mirror
> RELOC_{A,S} semantics. Let's get rid of that crap and simply introduce
> PLOOP_IO_FLUSH_DELAYED.
> 2) Fix ->write_page to handle FLUSH as it does FUA.
>>
>> The only complication I foresee is about how to teach kaio to pre-flush 
>> in kaio_write_page -- it's doable, but involves kaio_resubmit that's 
>> already pretty convoluted.
>>
> Yes. kaio_submit is correct, but kaio_write_page does not care about REQ_FLUSH.
Crap. Currently kaio can handle fsync only via kaio_queue_fsync_req,
which is async and not suitable for page_write.
Max, let's agree on terminology.
The reason I wrote this is that Linux internally interprets FUA as
preflush,write,postflush, which is wrong from an academic point of view, but
it is the world we live in with Linux. This is why I read the code
differently from the way it was designed.
Let's state that ploop is an ideal world where:
FLUSH ==> preflush
FUA   ==> WRITE,postflush
For that reason we can perform the reloc scheme as:

RELOC_A: R1,W2:FUA,WBI:FUA,W1:NULLIFY|FUA
RELOC_S: R1,W2:FUA,WBI:FUA

This allows us to handle FUA effectively and convert it to DELAYED_FLUSH where
possible. Also, let's pin the may_fua_delay semantics to the exact eng_state:

static int may_fua_delay(struct ploop_request * preq)
{
	int may_delay = 1;

	/* Effectively this is equivalent to
	 * preq->eng_state != PLOOP_E_COMPLETE,
	 * but it is more readable and less error prone in the future. */
	if (preq->eng_state != PLOOP_E_DATA_WBI)
		may_delay = 0;

	if (test_bit(PLOOP_REQ_RELOC_S, &preq->state) ||
	    test_bit(PLOOP_REQ_RELOC_A, &preq->state))
		may_delay = 0;

	return may_delay;
}
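
And a sketch of how dio_submit could consume it (PLOOP_REQ_DELAYED_FLUSH as in
the v2 patch; this shows the intended usage, it is not tested code):

	if ((rw & REQ_FUA) && may_fua_delay(preq)) {
		/* postpone the sync until ploop_index_update() */
		set_bit(PLOOP_REQ_DELAYED_FLUSH, &preq->state);
		rw &= ~REQ_FUA;
	}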





>> Btw, I accidentally noticed awful silly bug in kaio_complete_io_state(): 
>> we check for REQ_FUA after clearing it! This makes all FUA-s on 
>> ordinary kaio_submit path silently lost...
>>
>> Thanks,
>> Maxim
>>
>>
>> On 06/15/2016 07:49 AM, Dmitry Monakhov wrote:
>>> The barrier code is broken in many ways:
>>> Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly,
>>> but a request can also go through ->dio_submit_alloc()->dio_submit_pad and
>>> write_page (for indexes).
>>> So in case of grow_dev we have the following sequence:
>>>
>>> E_RELOC_DATA_READ:
>>>   ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>>>   

Re: [Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling

2016-06-16 Thread Dmitry Monakhov
Maxim Patlasov <mpatla...@virtuozzo.com> writes:

> Dima,
>
> I agree that the ploop barrier code is broken in many ways, but I don't 
> think the patch actually fixes it. I hope you would agree that 
> completion of REQ_FUA guarantees only landing that particular bio to the 
> disk; it says nothing about flushing previously submitted (and 
> completed) bio-s and it is also possible that power outage may catch us 
> when this REQ_FUA is already landed to the disk, but previous bio-s are 
> not yet.
Actually it does (but implicitly): Linux handles FUA as FLUSH,W,FLUSH.
So yes, it would be more correct to tag WBI with FLUSH_FUA.
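(To spell out what that reading implies: on a stack without native FUA support,
a FLUSH|FUA write effectively expands into the sequence below. This is only an
illustration of the claim, not ploop code; blkdev_issue_flush() is the rh7
three-argument variant.)

	blkdev_issue_flush(bdev, GFP_NOIO, NULL);	/* preflush	  */
	submit_bio(WRITE, bio);				/* the data write */
	/* ... wait for bio completion ... */
	blkdev_issue_flush(bdev, GFP_NOIO, NULL);	/* postflush	  */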
> Hence, for RELOC_{A|S} requests we actually need something like that:
>
>   RELOC_S: R1, W2, FLUSH:WB, WBI:FUA
>   RELOC_A: R1, W2, FLUSH:WB, WBI:FUA, W1:NULLIFY:FUA
>
> (i.e. we do need to flush all previously submitted data before starting 
> to update BAT on disk)
>
Correct sequence:
RELOC_S: R1, W2, WBI:FLUSH_FUA
RELOC_A: R1, W2, WBI:FLUSH_FUA, W1:NULLIFY:FUA

> not simply:
>
>> RELOC_S: R1, W2, WBI:FUA
>> RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA
>
> Also, the patch makes the meaning of PLOOP_REQ_FORCE_FUA and 
> PLOOP_REQ_FORCE_FLUSH even more obscure than it used to be. I think we 
> could remove them completely (along with that optimization delaying 
> incoming FUA) and re-implement all this stuff from scratch:
>
> 1) The final "NULLIFY:FUA" is a piece of cake -- it's enough to set 
> REQ_FUA in preq->req_rw before calling ->submit(preq)
>
> 2) For "FLUSH:WB, WBI:FUA" it is actually enough to send bio updating 
> BAT on disk as REQ_FLUSH|REQ_FUA -- we can specify it explicitly for 
> RELOC_A|S in ploop_index_update and map_wb_complete
>
> 3) For that optimization delaying incoming FUA (what we do now if 
> ploop_req_delay_fua_possible() returns true) we could introduce new 
> ad-hoc PLOOP_IO_FLUSH_DELAYED enforcing REQ_FLUSH in ploop_index_update 
> and map_wb_complete (the same thing as 2) above). And, yes, let's 
> WARN_ON if we somehow missed its processing.
Yes. This was one of my ideas.
1) FORCE_FLUSH and FORCE_FUA are redundant states which simply mirror
RELOC_{A,S} semantics. Let's get rid of that crap and simply introduce
PLOOP_IO_FLUSH_DELAYED.
2) Fix ->write_page to handle FLUSH as it does FUA.
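For 2), a minimal sketch of what kaio's write_page could do, assuming the
pre-flush reaches it as a flag next to fua (the preflush parameter is an
assumption, not the current signature):

	/* No FLUSH/FUA in kaio: convert both to fsync */
	if (fua || preflush)
		set_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state);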
>
> The only complication I foresee is about how to teach kaio to pre-flush 
> in kaio_write_page -- it's doable, but involves kaio_resubmit that's 
> already pretty convoluted.
>
Yes. kaio_submit is correct, but kaio_write_page does not care about REQ_FLUSH.
> Btw, I accidentally noticed awful silly bug in kaio_complete_io_state(): 
> we check for REQ_FUA after clearing it! This makes all FUA-s on 
> ordinary kaio_submit path silently lost...
>
> Thanks,
> Maxim
>
>
> On 06/15/2016 07:49 AM, Dmitry Monakhov wrote:
>> The barrier code is broken in many ways:
>> Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly,
>> but a request can also go through ->dio_submit_alloc()->dio_submit_pad and
>> write_page (for indexes).
>> So in case of grow_dev we have the following sequence:
>>
>> E_RELOC_DATA_READ:
>>   ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>>   ->delta->allocate
>>     ->io->submit_alloc: dio_submit_alloc
>>       ->dio_submit_pad
>> E_DATA_WBI : data written, time to update index
>>   ->delta->allocate_complete: ploop_index_update
>>     ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>>     ->write_page
>>     ->ploop_map_wb_complete
>>       ->ploop_wb_complete_post_process
>>         ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>> E_RELOC_NULLIFY:
>>
>>     ->submit()
>>
>> This patch unify barrier handling like follows:
>> - Add assertation to ploop_complete_request for FORCE_{FLUSH,FUA} state
>> - Perform explicit FUA inside index_update for RELOC requests.
>>
>> This makes reloc sequence optimal:
>> RELOC_S: R1, W2, WBI:FUA
>> RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA
>>
>> https://jira.sw.ru/browse/PSBM-47107
>> Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
>> ---
>>   drivers/block/ploop/dev.c | 10 +++---
>>   drivers/block/ploop/map.c | 29 -
>>   2 files changed, 19 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
>> index 96f7850..998fe71 100644
>> --- a/drivers/block/ploop/dev.c
>> +++ b/drivers/block/ploop/dev.c

[Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling

2016-06-15 Thread Dmitry Monakhov
The barrier code is broken in many ways:
Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly,
but a request can also go through ->dio_submit_alloc()->dio_submit_pad and
write_page (for indexes).
So in case of grow_dev we have the following sequence:

E_RELOC_DATA_READ:
  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
  ->delta->allocate
    ->io->submit_alloc: dio_submit_alloc
      ->dio_submit_pad
E_DATA_WBI : data written, time to update index
  ->delta->allocate_complete: ploop_index_update
    ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
    ->write_page
    ->ploop_map_wb_complete
      ->ploop_wb_complete_post_process
        ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
E_RELOC_NULLIFY:

  ->submit()

This patch unifies barrier handling as follows:
- Add an assertion to ploop_complete_request for the FORCE_{FLUSH,FUA} state
- Perform an explicit FUA inside index_update for RELOC requests.

This makes reloc sequence optimal:
RELOC_S: R1, W2, WBI:FUA
RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA

https://jira.sw.ru/browse/PSBM-47107
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 drivers/block/ploop/dev.c | 10 +++---
 drivers/block/ploop/map.c | 29 -
 2 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 96f7850..998fe71 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -1224,6 +1224,11 @@ static void ploop_complete_request(struct ploop_request 
* preq)
 
__TRACE("Z %p %u\n", preq, preq->req_cluster);
 
+   if (!preq->error) {
+   unsigned long state = READ_ONCE(preq->state);
+   WARN_ON(state & (1 << PLOOP_REQ_FORCE_FUA));
+   WARN_ON(state & (1 << PLOOP_REQ_FORCE_FLUSH));
+   }
while (preq->bl.head) {
struct bio * bio = preq->bl.head;
preq->bl.head = bio->bi_next;
@@ -2530,9 +2535,8 @@ restart:
top_delta = ploop_top_delta(plo);
sbl.head = sbl.tail = preq->aux_bio;
 
-   /* Relocated data write requires sync before BAT update */
-   set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
-
+   /* Relocated data write requires sync before BAT update;
+* this will happen inside index_update */
	if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
preq->eng_state = PLOOP_E_DATA_WBI;
plo->st.bio_out++;
diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
index 3a6365d..c17e598 100644
--- a/drivers/block/ploop/map.c
+++ b/drivers/block/ploop/map.c
@@ -896,6 +896,7 @@ void ploop_index_update(struct ploop_request * preq)
struct ploop_device * plo = preq->plo;
struct map_node * m = preq->map;
struct ploop_delta * top_delta = map_top_delta(m->parent);
+   int fua = !!(preq->req_rw & REQ_FUA);
u32 idx;
map_index_t blk;
int old_level;
@@ -953,13 +954,13 @@ void ploop_index_update(struct ploop_request * preq)
__TRACE("wbi %p %u %p\n", preq, preq->req_cluster, m);
plo->st.map_single_writes++;
top_delta->ops->map_index(top_delta, m->mn_start, );
-   /* Relocate requires consistent writes, mark such reqs appropriately */
+   /* Relocate requires consistent index update */
if (test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
test_bit(PLOOP_REQ_RELOC_S, &preq->state))
-   set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
-
-   top_delta->io.ops->write_page(&top_delta->io, preq, page, sec,
- !!(preq->req_rw & REQ_FUA));
+   fua = 1;
+   if (fua)
+   clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
+   top_delta->io.ops->write_page(&top_delta->io, preq, page, sec, fua);
put_page(page);
return;
 
@@ -1078,7 +1079,7 @@ static void map_wb_complete(struct map_node * m, int err)
int delayed = 0;
unsigned int idx;
sector_t sec;
-   int fua, force_fua;
+   int fua;
 
/* First, complete processing of written back indices,
 * finally instantiate indices in mapping cache.
@@ -1149,7 +1150,6 @@ static void map_wb_complete(struct map_node * m, int err)
 
main_preq = NULL;
fua = 0;
-   force_fua = 0;
 
list_for_each_safe(cursor, tmp, &m->io_queue) {
struct ploop_request * preq;
@@ -1168,13 +1168,12 @@ static void map_wb_complete(struct map_node * m, int 
err)
break;
}
 
-   if (preq->req_rw & REQ_FUA)
+

[Devel] [PATCH 2/3] ploop: deadcode cleanup

2016-06-15 Thread Dmitry Monakhov
The (rw & REQ_FUA) branch is impossible because REQ_FUA was cleared on the line above.
The logic was moved to ploop_req_delay_fua_possible() a long time ago.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 drivers/block/ploop/io_direct.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 74a554a..10d2314 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -108,15 +108,6 @@ dio_submit(struct ploop_io *io, struct ploop_request * 
preq,
rw &= ~(REQ_FLUSH | REQ_FUA);
 
 
-   /* In case of eng_state != COMPLETE, we'll do FUA in
-* ploop_index_update(). Otherwise, we should mark
-* last bio as FUA here. */
-   if (rw & REQ_FUA) {
-   rw &= ~REQ_FUA;
-   if (preq->eng_state == PLOOP_E_COMPLETE)
-   postfua = 1;
-   }
-
bio_list_init();
 
if (iblk == PLOOP_ZERO_INDEX)
-- 
1.8.3.1

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH 1/3] ploop: skip redundant fsync for REQ_FUA in post_submit

2016-06-15 Thread Dmitry Monakhov
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 drivers/block/ploop/io_direct.c | 22 +-
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index b844a80..74a554a 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -517,16 +517,18 @@ dio_post_submit(struct ploop_io *io, struct ploop_request 
* preq)
struct ploop_device *plo = preq->plo;
sector_t sec = (sector_t)preq->iblock << preq->plo->cluster_log;
loff_t clu_siz = 1 << (preq->plo->cluster_log + 9);
+   int force_sync = preq->req_rw & REQ_FUA;
int err;
 
file_start_write(io->files.file);
 
-   /* Here io->io_count is even ... */
-   spin_lock_irq(&plo->lock);
-   io->io_count++;
-   set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
-   spin_unlock_irq(&plo->lock);
-
+   if (!force_sync) {
+   /* Here io->io_count is even ... */
+   spin_lock_irq(&plo->lock);
+   io->io_count++;
+   set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
+   spin_unlock_irq(&plo->lock);
+   }
err = io->files.file->f_op->fallocate(io->files.file,
  FALLOC_FL_CONVERT_UNWRITTEN,
  (loff_t)sec << 9, clu_siz);
@@ -535,9 +537,11 @@ dio_post_submit(struct ploop_io *io, struct ploop_request 
* preq)
if (!err && (preq->req_rw & REQ_FUA))
err = io->ops->sync(io);
 
-   spin_lock_irq(&plo->lock);
-   io->io_count++;
-   spin_unlock_irq(&plo->lock);
+   if (!force_sync) {
+   spin_lock_irq(&plo->lock);
+   io->io_count++;
+   spin_unlock_irq(&plo->lock);
+   }
/* and here io->io_count is even (+2) again. */
 
file_end_write(io->files.file);
-- 
1.8.3.1

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [vzlin-dev] [PATCH rh7] ploop: push_backup: PLOOP_PEEK mode of ioctl(PLOOP_IOC_PUSH_BACKUP_IO)

2016-06-09 Thread Dmitry Monakhov

> Now, before stopping push_backup, the userspace backup tool can ask ploop
> about blocks that need backup but were not reported as backed up:
>
>   ctl->direction = PLOOP_PEEK;
>   ctl->n_extents = n; /* where n >= 1*/
>   ret = ioctl(pfd, PLOOP_IOC_PUSH_BACKUP_IO, ctl);
>
> If push_backup was really done completely (i.e. all blocks in main bitmask
> were reported as backed up), ret will be zero, and ctl->n_extents too.
>
> Otherwise, if some blocks were missed by the backup, the ioctl will fill
> ctl->extents[] with info about such "missed" blocks.
>
> https://jira.sw.ru/browse/PSBM-47764
ACK
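For illustration, a hypothetical userspace drain loop built on this mode could
look like the following (N and back_up() are placeholders; the extents array is
assumed to follow the ctl header, matching the kernel's copy_to_user above):

	struct ploop_push_backup_io_ctl *ctl;
	struct ploop_push_backup_ctl_extent *e;

	ctl = calloc(1, sizeof(*ctl) + N * sizeof(*e));
	e = (void *)(ctl + 1);

	ctl->direction = PLOOP_PEEK;
	ctl->n_extents = N;
	if (ioctl(pfd, PLOOP_IOC_PUSH_BACKUP_IO, ctl))
		err(1, "PLOOP_PEEK");

	if (ctl->n_extents == 0)
		puts("push_backup is really complete");
	else
		for (unsigned i = 0; i < ctl->n_extents; i++)
			back_up(e[i].clu, e[i].len);	/* missed blocks */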
>
> Signed-off-by: Maxim Patlasov 
> ---
>  drivers/block/ploop/dev.c |   47 +
>  drivers/block/ploop/push_backup.c |   84 
> +
>  drivers/block/ploop/push_backup.h |2 +
>  include/linux/ploop/ploop_if.h|1 
>  4 files changed, 125 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
> index 27827a8..db55be3 100644
> --- a/drivers/block/ploop/dev.c
> +++ b/drivers/block/ploop/dev.c
> @@ -4643,24 +4643,27 @@ pb_init_done:
>   return rc;
>  }
>  
> -static int ploop_push_backup_io_read(struct ploop_device *plo, unsigned long 
> arg,
> -  struct ploop_push_backup_io_ctl *ctl)
> +static int ploop_push_backup_io_get(struct ploop_device *plo,
> + unsigned long arg, struct ploop_push_backup_io_ctl *ctl,
> + int (*get)(struct ploop_pushbackup_desc *, cluster_t *,
> +cluster_t *, unsigned))
>  {
>   struct ploop_push_backup_ctl_extent *e;
>   unsigned n_extents = 0;
>   int rc = 0;
> + cluster_t clu = 0;
> + cluster_t len = 0;
>  
>   e = kmalloc(sizeof(*e) * ctl->n_extents, GFP_KERNEL);
>   if (!e)
>   return -ENOMEM;
>  
>   while (n_extents < ctl->n_extents) {
> - cluster_t clu, len;
> - rc = ploop_pb_get_pending(plo->pbd, &clu, &len, n_extents);
> + rc = get(plo->pbd, &clu, &len, n_extents);
>   if (rc == -ENOENT && n_extents)
>   break;
>   else if (rc)
> - goto io_read_done;
> + goto io_get_done;
>  
>   e[n_extents].clu = clu;
>   e[n_extents].len = len;
> @@ -4670,18 +4673,44 @@ static int ploop_push_backup_io_read(struct 
> ploop_device *plo, unsigned long arg
>   rc = -EFAULT;
>   ctl->n_extents = n_extents;
>   if (copy_to_user((void*)arg, ctl, sizeof(*ctl)))
> - goto io_read_done;
> + goto io_get_done;
>   if (n_extents &&
>   copy_to_user((void*)(arg + sizeof(*ctl)), e,
>n_extents * sizeof(*e)))
> - goto io_read_done;
> + goto io_get_done;
>   rc = 0;
>  
> -io_read_done:
> +io_get_done:
>   kfree(e);
>   return rc;
>  }
>  
> +static int ploop_push_backup_io_read(struct ploop_device *plo,
> + unsigned long arg, struct ploop_push_backup_io_ctl *ctl)
> +{
> + return ploop_push_backup_io_get(plo, arg, ctl, ploop_pb_get_pending);
> +}
> +
> +static int ploop_push_backup_io_peek(struct ploop_device *plo,
> + unsigned long arg, struct ploop_push_backup_io_ctl *ctl)
> +{
> + int rc;
> +
> + ploop_quiesce(plo);
> + rc = ploop_push_backup_io_get(plo, arg, ctl, ploop_pb_peek);
> + ploop_relax(plo);
> +
> + if (rc == -ENOENT) {
> + ctl->n_extents = 0;
> + if (copy_to_user((void*)arg, ctl, sizeof(*ctl)))
> + rc = -EFAULT;
> + else
> + rc = 0;
> + }
> +
> + return rc;
> +}
> +
>  static int ploop_push_backup_io_write(struct ploop_device *plo, unsigned 
> long arg,
> struct ploop_push_backup_io_ctl *ctl)
>  {
> @@ -4737,6 +4766,8 @@ static int ploop_push_backup_io(struct ploop_device 
> *plo, unsigned long arg)
>   return ploop_push_backup_io_read(plo, arg, );
>   case PLOOP_WRITE:
>   return ploop_push_backup_io_write(plo, arg, );
> + case PLOOP_PEEK:
> + return ploop_push_backup_io_peek(plo, arg, );
>   }
>  
>   return -EINVAL;
> diff --git a/drivers/block/ploop/push_backup.c 
> b/drivers/block/ploop/push_backup.c
> index 376052d..e8fa88d 100644
> --- a/drivers/block/ploop/push_backup.c
> +++ b/drivers/block/ploop/push_backup.c
> @@ -303,9 +303,17 @@ int ploop_pb_init(struct ploop_pushbackup_desc *pbd, 
> __u8 *uuid, bool full)
>   memcpy(pbd->cbt_uuid, uuid, sizeof(pbd->cbt_uuid));
>  
>   if (full) {
> - int i;
> + int i, off;
>   for (i = 0; i < NR_PAGES(pbd->ppb_block_max); i++)
>   memset(page_address(pbd->ppb_map[i]), 0xff, PAGE_SIZE);
> +
> + /* nullify bits beyond [0, pbd->ppb_block_max) range */
> + 

Re: [Devel] [PATCH rh7] cbt: blk_cbt_update_size() must return if cbt->block_max not changed

2016-06-09 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> It's useless to recreate the CBT every time we are called for the same
> block-device size. Actually, it's worthwhile only if cbt->block_max
> increases.
>
> Since commit b8e560a299 (fix cbt->block_max calculation), we calculate
> cbt->block_max precisely:
>
>>   cbt->block_max  = (size + blocksize - 1) >> cbt->block_bits;
>
> Hence, the following check:
>
>   if ((new_sz + bsz) >> cbt->block_bits <= cbt->block_max)
>   goto err_mtx;
>
> must be corrected accordingly.
>
ACK
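A worked example of the off-by-one (numbers are illustrative):

	/* bsz = 4096, block_bits = 12, new_sz = 8192 (exactly two blocks),
	 * so cbt->block_max = (8192 + 4096 - 1) >> 12 = 2.
	 *
	 * old check: (8192 + 4096) >> 12 = 3; 3 <= 2 is false, so the CBT
	 *	      is recreated although the size did not change;
	 * new check: (8192 + 4096 - 1) >> 12 = 2; 2 <= 2 is true, we skip. */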
> Signed-off-by: Maxim Patlasov 
> ---
>  block/blk-cbt.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/blk-cbt.c b/block/blk-cbt.c
> index 3a2b197..4f2ce26 100644
> --- a/block/blk-cbt.c
> +++ b/block/blk-cbt.c
> @@ -440,7 +440,7 @@ void blk_cbt_update_size(struct block_device *bdev)
>   return;
>   }
>   bsz = 1 << cbt->block_bits;
> - if ((new_sz + bsz) >> cbt->block_bits <= cbt->block_max)
> + if ((new_sz + bsz - 1) >> cbt->block_bits <= cbt->block_max)
>   goto err_mtx;
>  
>   new = do_cbt_alloc(q, cbt->uuid, new_sz, bsz);


signature.asc
Description: PGP signature
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] cbt: blk_cbt_update_size() should not copy uninitialized data

2016-06-09 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> to_cpy is the number of page pointers to copy from current cbt to new.
> The following check:
>
>>  if ((new_sz + bsz) >> cbt->block_bits <= cbt->block_max)
>>  goto err_mtx;
>
> ensures that the copy will be done only for a new cbt bigger than the current one. So,
> we have to calculate to_cpy based on the current (smaller) cbt. The rest of the
> new cbt is OK because it was nullified by do_cbt_alloc().
>
> The bug existed since the very first version of CBT (commit ad7ba3dfe).
>
> https://jira.sw.ru/browse/PSBM-48120
>
ACK
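A concrete reading of the over-copy, with illustrative sizes:

	/* Suppose the old map spans 2 pages and the new one 4.
	 * Before the fix:
	 *	to_cpy = NR_PAGES(new->block_max);	== 4
	 * copies cbt->map[0..3] and so reads past the old 2-entry map.
	 * After the fix:
	 *	to_cpy = NR_PAGES(cbt->block_max);	== 2
	 * copies only the valid pointers; new->map[2..3] stay NULL
	 * courtesy of do_cbt_alloc(). */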
> Signed-off-by: Maxim Patlasov 
> ---
>  block/blk-cbt.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/blk-cbt.c b/block/blk-cbt.c
> index 001dbfd..3a2b197 100644
> --- a/block/blk-cbt.c
> +++ b/block/blk-cbt.c
> @@ -448,7 +448,7 @@ void blk_cbt_update_size(struct block_device *bdev)
>   set_bit(CBT_ERROR, >flags);
>   goto err_mtx;
>   }
> - to_cpy = NR_PAGES(new->block_max);
> + to_cpy = NR_PAGES(cbt->block_max);
>   set_bit(CBT_NOCACHE, >flags);
>   cbt_flush_cache(cbt);
>   spin_lock_irq(>lock);


signature.asc
Description: PGP signature
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [vzlin-dev] [PATCH rh7 2/2] ploop: push_backup: rework lockout machinery

2016-06-08 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> It was not a very nice idea to reuse plo->lockout_tree for push_backup, because
> by design only one preq (for any given req_cluster) can sit in the lockout
> tree, but while we're reusing the tree for a WRITE request, a READ from the
> backup tool may come. Such a READ may want to use the tree: see how
> map_index_fault calls add_lockout for a snapshot configuration.
>
> The patch introduces an ad-hoc separate push_backup lockout tree. This fixes the
> issue (PSBM-47680) and makes the code much easier to understand.
>
> https://jira.sw.ru/browse/PSBM-47680
ACK
>
> Signed-off-by: Maxim Patlasov 
> ---
>  drivers/block/ploop/dev.c|  111 
> ++
>  drivers/block/ploop/events.h |1 
>  include/linux/ploop/ploop.h  |3 +
>  3 files changed, 95 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
> index d3f0ec0..27827a8 100644
> --- a/drivers/block/ploop/dev.c
> +++ b/drivers/block/ploop/dev.c
> @@ -1117,20 +1117,25 @@ static int ploop_congested(void *data, int bits)
>   return ret;
>  }
>  
> -static int check_lockout(struct ploop_request *preq)
> +static int __check_lockout(struct ploop_request *preq, bool pb)
>  {
>   struct ploop_device * plo = preq->plo;
> - struct rb_node * n = plo->lockout_tree.rb_node;
> + struct rb_node * n = pb ? plo->lockout_pb_tree.rb_node :
> +   plo->lockout_tree.rb_node;
>   struct ploop_request * p;
> + int lockout_bit = pb ? PLOOP_REQ_PB_LOCKOUT : PLOOP_REQ_LOCKOUT;
>  
>   if (n == NULL)
>   return 0;
>  
> - if (test_bit(PLOOP_REQ_LOCKOUT, &preq->state))
> + if (test_bit(lockout_bit, &preq->state))
>   return 0;
>  
>   while (n) {
> - p = rb_entry(n, struct ploop_request, lockout_link);
> + if (pb)
> + p = rb_entry(n, struct ploop_request, lockout_pb_link);
> + else
> + p = rb_entry(n, struct ploop_request, lockout_link);
>  
>   if (preq->req_cluster < p->req_cluster)
>   n = n->rb_left;
> @@ -1146,19 +1151,51 @@ static int check_lockout(struct ploop_request *preq)
>   return 0;
>  }
>  
> -int ploop_add_lockout(struct ploop_request *preq, int try)
> +static int check_lockout(struct ploop_request *preq)
> +{
> + if (__check_lockout(preq, false))
> + return 1;
> +
> + /* push_backup passes READs intact */
> + if (!(preq->req_rw & REQ_WRITE))
> + return 0;
> +
> + if (__check_lockout(preq, true))
> + return 1;
> +
> + return 0;
> +}
> +
> +static int __ploop_add_lockout(struct ploop_request *preq, int try, bool pb)
>  {
>   struct ploop_device * plo = preq->plo;
> - struct rb_node ** p = &plo->lockout_tree.rb_node;
> + struct rb_node ** p;
>   struct rb_node *parent = NULL;
>   struct ploop_request * pr;
> + struct rb_node *link;
> + struct rb_root *tree;
> + int lockout_bit;
> +
> + if (pb) {
> + link = &preq->lockout_pb_link;
> + tree = &plo->lockout_pb_tree;
> + lockout_bit = PLOOP_REQ_PB_LOCKOUT;
> + } else {
> + link = &preq->lockout_link;
> + tree = &plo->lockout_tree;
> + lockout_bit = PLOOP_REQ_LOCKOUT;
> + }
>  
> - if (test_bit(PLOOP_REQ_LOCKOUT, &preq->state))
> + if (test_bit(lockout_bit, &preq->state))
>   return 0;
>  
> + p = &tree->rb_node;
>   while (*p) {
>   parent = *p;
> - pr = rb_entry(parent, struct ploop_request, lockout_link);
> + if (pb)
> + pr = rb_entry(parent, struct ploop_request, 
> lockout_pb_link);
> + else
> + pr = rb_entry(parent, struct ploop_request, 
> lockout_link);
>  
>   if (preq->req_cluster == pr->req_cluster) {
>   if (try)
> @@ -1174,23 +1211,56 @@ int ploop_add_lockout(struct ploop_request *preq, int 
> try)
>  
>   trace_add_lockout(preq);
>  
> - rb_link_node(&preq->lockout_link, parent, p);
> - rb_insert_color(&preq->lockout_link, &plo->lockout_tree);
> - __set_bit(PLOOP_REQ_LOCKOUT, &preq->state);
> + rb_link_node(link, parent, p);
> + rb_insert_color(link, tree);
> + __set_bit(lockout_bit, &preq->state);
>   return 0;
>  }
> +
> +int ploop_add_lockout(struct ploop_request *preq, int try)
> +{
> + return __ploop_add_lockout(preq, try, false);
> +}
>  EXPORT_SYMBOL(ploop_add_lockout);
>  
> -void del_lockout(struct ploop_request *preq)
> +static void ploop_add_pb_lockout(struct ploop_request *preq)
> +{
> + __ploop_add_lockout(preq, 0, true);
> +}
> +
> +static void __del_lockout(struct ploop_request *preq, bool pb)
>  {
>   struct ploop_device * plo = preq->plo;
> + struct rb_node *link;
> + struct rb_root *tree;
> + int lockout_bit;
> +
> + if (pb) {
> + link = 

Re: [Devel] [vzlin-dev] [PATCH rh7 1/2] ploop: push_backup: roll back ALLOW_READS patch

2016-06-08 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> The patch reverts:
>
> Subject: [PATCH rh7] ploop: push_backup must pass READs intact
>
> If push_backup is in progress (doesn't matter "full" or "incremental") and
> ploop state-machine detects incoming WRITE request to the cluster-block that
> was not push_backup-ed yet, it suspends the request until userspace reports it
> as "processed".
>
> The above is fine, but while such a WRITE request is suspended, only
> subsequent WRITEs (to the given cluster-block) must be suspended too. READs must
> not. Otherwise the userspace backup tool will be blocked infinitely trying
> to push_backup the given cluster-block.
>
> Passing READs while blocking WRITEs must be OK because: 1) ploop has not
> finalized that first WRITE yet; 2) given cluster-block will be kept
> intact (non-modified) while the WRITE is suspended.
>
> https://jira.sw.ru/browse/PSBM-46775
ACK
>
> Signed-off-by: Maxim Patlasov 
> ---
>  drivers/block/ploop/dev.c   |7 ---
>  include/linux/ploop/ploop.h |1 -
>  2 files changed, 8 deletions(-)
>
> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
> index 96f7850..d3f0ec0 100644
> --- a/drivers/block/ploop/dev.c
> +++ b/drivers/block/ploop/dev.c
> @@ -1137,11 +1137,6 @@ static int check_lockout(struct ploop_request *preq)
>   else if (preq->req_cluster > p->req_cluster)
>   n = n->rb_right;
>   else {
> - /* do not block backup tool READs from /dev/ploop */
> - if (!(preq->req_rw & REQ_WRITE) &&
> - test_bit(PLOOP_REQ_ALLOW_READS, &p->state))
> - return 0;
> -
>   list_add_tail(&preq->list, &p->delay_list);
>   plo->st.bio_lockouts++;
>   trace_preq_lockout(preq, p);
> @@ -2053,7 +2048,6 @@ restart:
>   ploop_pb_clear_bit(plo->pbd, preq->req_cluster);
>   } else {
>   spin_lock_irq(&plo->lock);
> - __set_bit(PLOOP_REQ_ALLOW_READS, &preq->state);
>   ploop_add_lockout(preq, 0);
>   spin_unlock_irq(>lock);
>   /*
> @@ -2072,7 +2066,6 @@ restart:
>  
>   spin_lock_irq(&plo->lock);
>   del_lockout(preq);
> - __clear_bit(PLOOP_REQ_ALLOW_READS, &preq->state);
>   if (!list_empty(&preq->delay_list))
>   list_splice_init(&preq->delay_list, 
> plo->ready_queue.prev);
>   spin_unlock_irq(&plo->lock);
> diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h
> index 0fba25e..77fd833 100644
> --- a/include/linux/ploop/ploop.h
> +++ b/include/linux/ploop/ploop.h
> @@ -470,7 +470,6 @@ enum
>   PLOOP_REQ_KAIO_FSYNC,   /*force image fsync by KAIO module */
>   PLOOP_REQ_POST_SUBMIT, /* preq needs post_submit processing */
>   PLOOP_REQ_PUSH_BACKUP, /* preq was ACKed by userspace push_backup */
> - PLOOP_REQ_ALLOW_READS, /* READs are allowed for given req_cluster */
>   PLOOP_REQ_FSYNC_DONE,  /* fsync_thread() performed f_op->fsync() */
>  };
>  


signature.asc
Description: PGP signature
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] ploop: io_direct: delay f_op->fsync() until FLUSH|FUA

2016-05-26 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> Once we have converted an extent to initialized, it can be part of an uncompleted
> journal transaction, so we have to force a transaction commit at some point.
>
> Instead of forcing transaction commit immediately, the patch delays it
> until an incoming bio with FLUSH|FUA arrives. Then, as the very first
> step of processing such a bio, we sends corresponding preq to fsync_thread
> to perform f_op->fsync().
>
> As a very unlikely case, it is also possible that processing a FLUSH|FUA
> bio itself results in converting extents. Then, the patch calls f_op->fsync()
> immediately after conversion to preserve FUA semantics.
ACK. With minor comments. See below:
>
> https://jira.sw.ru/browse/PSBM-47026
>
> Signed-off-by: Maxim Patlasov 
> ---
>  drivers/block/ploop/dev.c   |   70 
> ---
>  drivers/block/ploop/io_direct.c |   28 +++-
>  include/linux/ploop/ploop.h |6 +++
>  3 files changed, 76 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
> index 654b60b..03fc289 100644
> --- a/drivers/block/ploop/dev.c
> +++ b/drivers/block/ploop/dev.c
> @@ -1942,46 +1942,62 @@ err:
>  
>  /* Main preq state machine */
>  
> +static inline bool preq_is_special(struct ploop_request * preq)
> +{
> + return test_bit(PLOOP_REQ_MERGE, &preq->state) ||
> + test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
> + test_bit(PLOOP_REQ_RELOC_S, &preq->state) ||
> + test_bit(PLOOP_REQ_DISCARD, &preq->state) ||
> + test_bit(PLOOP_REQ_ZERO, &preq->state);
Oh. It looks awful. Please use one atomic read here and in the other places:
#define PLOOP_REQ_RELOC_A_FL (1 << PLOOP_REQ_RELOC_A)
#define PLOOP_REQ_RELOC_S_FL (1 << PLOOP_REQ_RELOC_S)

unsigned long state = READ_ONCE(preq->state);
...
if (state & (PLOOP_REQ_RELOC_A_FL|PLOOP_REQ_RELOC_S_FL)) ...
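A full sketch of the suggestion (flag macros for the remaining bits are
assumptions following the same pattern):

#define PLOOP_REQ_MERGE_FL   (1 << PLOOP_REQ_MERGE)
#define PLOOP_REQ_DISCARD_FL (1 << PLOOP_REQ_DISCARD)
#define PLOOP_REQ_ZERO_FL    (1 << PLOOP_REQ_ZERO)

static inline bool preq_is_special(struct ploop_request * preq)
{
	unsigned long state = READ_ONCE(preq->state);

	return state & (PLOOP_REQ_MERGE_FL | PLOOP_REQ_RELOC_A_FL |
			PLOOP_REQ_RELOC_S_FL | PLOOP_REQ_DISCARD_FL |
			PLOOP_REQ_ZERO_FL);
}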
> +}
> +
>  static void
>  ploop_entry_request(struct ploop_request * preq)
>  {
>   struct ploop_device * plo   = preq->plo;
>   struct ploop_delta  * top_delta = ploop_top_delta(plo);
> + struct ploop_io * top_io = &top_delta->io;
>   struct ploop_delta  * delta;
>   int level;
>   int err;
>   iblock_t iblk;
>  
> - /* Control request. */
> - if (unlikely(preq->bl.head == NULL &&
> -  !test_bit(PLOOP_REQ_MERGE, &preq->state) &&
> -  !test_bit(PLOOP_REQ_RELOC_A, &preq->state) &&
> -  !test_bit(PLOOP_REQ_RELOC_S, &preq->state) &&
> -  !test_bit(PLOOP_REQ_DISCARD, &preq->state) &&
> -  !test_bit(PLOOP_REQ_ZERO, &preq->state))) {
> - complete(plo->quiesce_comp);
> - wait_for_completion(>relax_comp);
> - ploop_complete_request(preq);
> - complete(>relaxed_comp);
> - return;
> - }
> + if (!preq_is_special(preq)) {
> + /* Control request */
> + if (unlikely(preq->bl.head == NULL)) {
> + complete(plo->quiesce_comp);
> + wait_for_completion(>relax_comp);
> + ploop_complete_request(preq);
> + complete(>relaxed_comp);
> + return;
> + }
>  
> - /* Empty flush. */
> - if (unlikely(preq->req_size == 0 &&
> -  !test_bit(PLOOP_REQ_MERGE, &preq->state) &&
> -  !test_bit(PLOOP_REQ_RELOC_A, &preq->state) &&
> -  !test_bit(PLOOP_REQ_RELOC_S, &preq->state) &&
> -  !test_bit(PLOOP_REQ_ZERO, &preq->state))) {
> - if (preq->req_rw & REQ_FLUSH) {
> - if (top_delta->io.ops->issue_flush) {
> - top_delta->io.ops->issue_flush(_delta->io, 
> preq);
> - return;
> - }
> + /* Need to fsync before start handling FLUSH */
> + if ((preq->req_rw & REQ_FLUSH) &&
> + test_bit(PLOOP_IO_FSYNC_DELAYED, &top_io->io_state) &&
> + !test_bit(PLOOP_REQ_FSYNC_DONE, &preq->state)) {
> + spin_lock_irq(&plo->lock);
> + list_add_tail(&preq->list, &top_io->fsync_queue);
> + if (waitqueue_active(&top_io->fsync_waitq))
> + wake_up_interruptible(&top_io->fsync_waitq);
> + spin_unlock_irq(&plo->lock);
> + return;
>   }
>  
> - preq->eng_state = PLOOP_E_COMPLETE;
> - ploop_complete_request(preq);
> - return;
> + /* Empty flush or unknown zero-size request */
Do you know of any zero-size requests other than FLUSH?
> + if (preq->req_size == 0) {
> + if (preq->req_rw & REQ_FLUSH &&
> + !test_bit(PLOOP_REQ_FSYNC_DONE, &preq->state)) {

> + if (top_io->ops->issue_flush) {
> + top_io->ops->issue_flush(top_io, preq);
> 

Re: [Devel] [PATCH rh7] cbt: fix possible race on alloc_page()

2016-05-25 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> cbt_page_alloc() drops cbt->lock before calling alloc_page(),
> then re-acquires it. It's safer to re-check that cbt->map[idx]
> is still NULL after re-acquiring the lock.
>
> Signed-off-by: Maxim Patlasov 
Indeed. Ack.
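The resulting pattern, condensed (GFP flags are illustrative; error handling
elided):

	spin_lock_irq(&cbt->lock);
	if (!CBT_PAGE(cbt, idx)) {
		spin_unlock_irq(&cbt->lock);
		page = alloc_page(GFP_NOIO);	/* may sleep: lock dropped */
		spin_lock_irq(&cbt->lock);
		/* re-check: another CPU may have raced us meanwhile */
		if (likely(CBT_PAGE(cbt, idx) == NULL))
			cbt->map[idx] = page;
		else
			__free_page(page);
	}
	spin_unlock_irq(&cbt->lock);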
> ---
>  block/blk-cbt.c |7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-cbt.c b/block/blk-cbt.c
> index 8ba52fb..14ad1a2 100644
> --- a/block/blk-cbt.c
> +++ b/block/blk-cbt.c
> @@ -128,7 +128,12 @@ static int cbt_page_alloc(struct cbt_info  **cbt_pp, 
> unsigned long idx,
>   spin_unlock_irq(&cbt->lock);
>   return -ENOMEM;
>   }
> - cbt->map[idx] = page;
> +
> + if (likely(CBT_PAGE(cbt, idx) == NULL))
> + cbt->map[idx] = page;
> + else
> + __free_page(page);
> +
>   page = NULL;
>   spin_unlock_irq(>lock);
>  


signature.asc
Description: PGP signature
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [vzlin-dev] [PATCH rh7 4/4] ploop: get rid of direct calls to file->f_op->fsync()

2016-05-20 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> The patch hides file->f_op->fsync() in dio_sync. The only exception is
> dio_truncate, where "file" may come from a userspace fd.
>
> Signed-off-by: Maxim Patlasov 
Acked-by:dmonak...@openvz.org
> ---
>  drivers/block/ploop/io_direct.c |   13 +
>  1 file changed, 5 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index 1ff848c..8096110 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -405,8 +405,7 @@ try_again:
>   }
>  
>   /* flush new i_size to disk */
> - err = io->files.file->f_op->fsync(io->files.file, 0,
> -   LLONG_MAX, 0);
> + err = io->ops->sync(io);
>   if (err)
>   goto end_write;
>  
> @@ -524,8 +523,8 @@ dio_post_submit(struct ploop_io *io, struct ploop_request 
> * preq)
> FALLOC_FL_CONVERT_UNWRITTEN,
> (loff_t)sec << 9, clu_siz);
>   if (!err)
> - err = io->files.file->f_op->fsync(io->files.file, 0,
> -   LLONG_MAX, 0);
> + err = io->ops->sync(io);
> +
>   file_end_write(io->files.file);
>   if (err) {
>   PLOOP_REQ_SET_ERROR(preq, err);
> @@ -814,8 +813,7 @@ static int dio_fsync_thread(void * data)
>   /* filemap_fdatawrite() has been made already */
>   filemap_fdatawait(io->files.mapping);
>  
> - err = io->files.file->f_op->fsync(io->files.file, 0,
> -   LLONG_MAX, 0);
> + err = io->ops->sync(io);
>  
>   /* Do we need to invalidate page cache? Not really,
>* because we use it only to create full new pages,
> @@ -1367,8 +1365,7 @@ static int dio_alloc_sync(struct ploop_io * io, loff_t 
> pos, loff_t len)
>   if (err)
>   goto fail;
>  
> - err = io->files.file->f_op->fsync(io->files.file, 0,
> -   LLONG_MAX, 0);
> + err = io->ops->sync(io);
>   if (err)
>   goto fail;
>  


signature.asc
Description: PGP signature
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [vzlin-dev] [PATCH rh7 1/4] ploop: get rid of FOP_FSYNC

2016-05-20 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> We keep the ploop sources as an in-tree module of the RHEL7-based kernel, so we
> know for sure what the fsync fop prototype looks like.
>
> Signed-off-by: Maxim Patlasov 
Acked-by:dmonak...@openvz.org
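For reference, the rh7 prototype in struct file_operations is:

	int (*fsync)(struct file *, loff_t start, loff_t end, int datasync);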
> ---
>  drivers/block/ploop/io_direct.c |   15 +--
>  include/linux/ploop/compat.h|6 --
>  2 files changed, 9 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index 5a2e12a..583b110 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -406,7 +406,8 @@ try_again:
>   }
>  
>   /* flush new i_size to disk */
> - err = io->files.file->f_op->FOP_FSYNC(io->files.file, 
> 0);
> + err = io->files.file->f_op->fsync(io->files.file, 0,
> +   LLONG_MAX, 0);
>   if (err)
>   goto end_write;
>  
> @@ -524,7 +525,8 @@ dio_post_submit(struct ploop_io *io, struct ploop_request 
> * preq)
> FALLOC_FL_CONVERT_UNWRITTEN,
> (loff_t)sec << 9, clu_siz);
>   if (!err)
> - err = io->files.file->f_op->FOP_FSYNC(io->files.file, 0);
> + err = io->files.file->f_op->fsync(io->files.file, 0,
> +   LLONG_MAX, 0);
>   file_end_write(io->files.file);
>   if (err) {
>   PLOOP_REQ_SET_ERROR(preq, err);
> @@ -815,8 +817,8 @@ static int dio_fsync_thread(void * data)
>  
>   err = 0;
>   if (io->files.file->f_op->fsync)
> - err = io->files.file->f_op->FOP_FSYNC(io->files.file,
> -   0);
> + err = io->files.file->f_op->fsync(io->files.file, 0,
> +   LLONG_MAX, 0);
>  
>   /* Do we need to invalidate page cache? Not really,
>* because we use it only to create full new pages,
> @@ -853,7 +855,7 @@ static int dio_fsync(struct file * file)
>   ret = filemap_write_and_wait(mapping);
>   err = 0;
>   if (file->f_op && file->f_op->fsync) {
> - err = file->f_op->FOP_FSYNC(file, 0);
> + err = file->f_op->fsync(file, 0, LLONG_MAX, 0);
>   if (!ret)
>   ret = err;
>   }
> @@ -1385,7 +1387,8 @@ static int dio_alloc_sync(struct ploop_io * io, loff_t 
> pos, loff_t len)
>   goto fail;
>  
>   if (io->files.file->f_op && io->files.file->f_op->fsync) {
> - err = io->files.file->f_op->FOP_FSYNC(io->files.file, 0);
> + err = io->files.file->f_op->fsync(io->files.file, 0,
> +   LLONG_MAX, 0);
>   if (err)
>   goto fail;
>   }
> diff --git a/include/linux/ploop/compat.h b/include/linux/ploop/compat.h
> index 03c3ae3..8a36d81 100644
> --- a/include/linux/ploop/compat.h
> +++ b/include/linux/ploop/compat.h
> @@ -58,10 +58,4 @@ static void func(struct bio *bio, int err) {
>  
>  #endif
>  
> -#if LINUX_VERSION_CODE > KERNEL_VERSION(2,6,32)
> -#define FOP_FSYNC(file, datasync) fsync(file, 0, LLONG_MAX, datasync)
> -#else
> -#define FOP_FSYNC(file, datasync) fsync(file, F_DENTRY(file), datasync)
> -#endif
> -
>  #endif


signature.asc
Description: PGP signature
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [vzlin-dev] [PATCH rh7 3/4] ploop: get rid of dio_fsync()

2016-05-20 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> For ext4, dio_fsync() is actually equivalent to a direct call to the fsync fop:
>
> 1) file->f_op cannot be NULL;
> 2) file->f_op->fsync is always equal to ext4_sync_file;
> 3) ext4_sync_file() does filemap_write_and_wait() internally,
>no need to call it explicitly.
>
> The patch also fixes a potential problem: if fsync() fails, it's better
> to pass the error code up the stack of callers.
>
> Signed-off-by: Maxim Patlasov 
Acked-by:dmonak...@openvz.org
> ---
>  drivers/block/ploop/io_direct.c |   52 
> ++-
>  1 file changed, 24 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index a37f296..1ff848c 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -844,21 +844,6 @@ static int dio_fsync_thread(void * data)
>   return 0;
>  }
>  
> -static int dio_fsync(struct file * file)
> -{
> - int err, ret;
> - struct address_space *mapping = file->f_mapping;
> -
> - ret = filemap_write_and_wait(mapping);
> - err = 0;
> - if (file->f_op && file->f_op->fsync) {
> - err = file->f_op->fsync(file, 0, LLONG_MAX, 0);
> - if (!ret)
> - ret = err;
> - }
> - return ret;
> -}
> -
>  /* Invalidate page cache. It is called with inode mutex taken
>   * and mapping mapping must be synced. If some dirty pages remained,
>   * it will fail.
> @@ -949,20 +934,17 @@ static void dio_destroy(struct ploop_io * io)
>  static int dio_sync(struct ploop_io * io)
>  {
>   struct file * file = io->files.file;
> + int err = 0;
>  
>   if (file)
> - dio_fsync(file);
> - return 0;
> + err = file->f_op->fsync(file, 0, LLONG_MAX, 0);
> +
> + return err;
>  }
>  
>  static int dio_stop(struct ploop_io * io)
>  {
> - struct file * file = io->files.file;
> -
> - if (file) {
> - dio_fsync(file);
> - }
> - return 0;
> + return io->ops->sync(io);
>  }
>  
>  static int dio_open(struct ploop_io * io)
> @@ -979,7 +961,9 @@ static int dio_open(struct ploop_io * io)
>   io->files.inode = io->files.mapping->host;
>   io->files.bdev = io->files.inode->i_sb->s_bdev;
>  
> - dio_fsync(file);
> + err = io->ops->sync(io);
> + if (err)
> + return err;
>  
>   mutex_lock(>files.inode->i_mutex);
>   em_tree = ploop_dio_open(io, (delta->flags & PLOOP_FMT_RDONLY));
> @@ -1646,7 +1630,11 @@ static int dio_prepare_snapshot(struct ploop_io * io, 
> struct ploop_snapdata *sd)
>   return -EINVAL;
>   }
>  
> - dio_fsync(file);
> + err = io->ops->sync(io);
> + if (err) {
> + fput(file);
> + return err;
> + }
>  
>   mutex_lock(>files.inode->i_mutex);
>   err = dio_invalidate_cache(io->files.mapping, io->files.bdev);
> @@ -1713,7 +1701,11 @@ static int dio_prepare_merge(struct ploop_io * io, 
> struct ploop_snapdata *sd)
>   return -EINVAL;
>   }
>  
> - dio_fsync(file);
> + err = io->ops->sync(io);
> + if (err) {
> + fput(file);
> + return err;
> + }
>  
>   mutex_lock(>files.inode->i_mutex);
>  
> @@ -1772,8 +1764,12 @@ static int dio_truncate(struct ploop_io * io, struct 
> file * file,
>   atomic_long_sub(*io->size_ptr - new_size, _io_images_size);
>   *io->size_ptr = new_size;
>  
> - if (!err)
> - err = dio_fsync(file);
> + if (!err) {
> + if (io->files.file == file)
> + err = io->ops->sync(io);
> + else
> + err = file->f_op->fsync(file, 0, LLONG_MAX, 0);
> + }
>  
>   return err;
>  }




Re: [Devel] [vzlin-dev] [PATCH rh7 2/4] ploop: io_direct: check for fsync fop on startup

2016-05-20 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> We don't support host file systems without an fsync fop. The patch refuses
> to start ploop if fsync is absent.
>
> Signed-off-by: Maxim Patlasov 
Acked-by:dmonak...@openvz.org
> ---
>  drivers/block/ploop/io_direct.c |   23 ---
>  1 file changed, 12 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index 583b110..a37f296 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -376,7 +376,6 @@ cached_submit(struct ploop_io *io, iblock_t iblk, struct 
> ploop_request * preq,
>   loff_t new_size;
>   loff_t used_pos;
>   bool may_fallocate = io->files.file->f_op->fallocate &&
> - io->files.file->f_op->fsync &&
>   io->files.flags & EXT4_EXTENTS_FL;
>  
>   trace_cached_submit(preq);
> @@ -815,10 +814,8 @@ static int dio_fsync_thread(void * data)
>   /* filemap_fdatawrite() has been made already */
>   filemap_fdatawait(io->files.mapping);
>  
> - err = 0;
> - if (io->files.file->f_op->fsync)
> - err = io->files.file->f_op->fsync(io->files.file, 0,
> -   LLONG_MAX, 0);
> + err = io->files.file->f_op->fsync(io->files.file, 0,
> +   LLONG_MAX, 0);
>  
>   /* Do we need to invalidate page cache? Not really,
>* because we use it only to create full new pages,
> @@ -1386,12 +1383,11 @@ static int dio_alloc_sync(struct ploop_io * io, 
> loff_t pos, loff_t len)
>   if (err)
>   goto fail;
>  
> - if (io->files.file->f_op && io->files.file->f_op->fsync) {
> - err = io->files.file->f_op->fsync(io->files.file, 0,
> -   LLONG_MAX, 0);
> - if (err)
> - goto fail;
> - }
> + err = io->files.file->f_op->fsync(io->files.file, 0,
> +   LLONG_MAX, 0);
> + if (err)
> + goto fail;
> +
>   err = filemap_fdatawait(io->files.mapping);
>  
>  fail:
> @@ -1878,6 +1874,11 @@ static int dio_autodetect(struct ploop_io * io)
>   return -1;
>   }
>  
> + if (!file->f_op->fsync) {
> + printk("Cannot run on EXT4(%s): no fsync\n", s_id);
> + return -1;
> + }
> +
>   fs = get_fs();
>   set_fs(KERNEL_DS);
>   flags = 0;




[Devel] [PATCH 2/6] e4defrag2: [TP case] force defrag for very low populated clusters

2016-05-16 Thread Dmitry Monakhov
If a cluster has only a small number of used blocks, it is reasonable to
relocate those blocks regardless of the inode's quality and free the whole cluster.


https://jira.sw.ru/browse/PSBM-46563
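
A worked example of the new threshold (the 25% cut-off is my reading of the
"tp_blocks * 4 > size_blk" check in the diff below):

	/* A file of size_blk = 100 blocks with tp_blocks = 30 of them sitting
	 * in low-populated TP clusters: 30 * 4 = 120 > 100, so the whole file
	 * is forced into the IEF/TP relocation set. */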

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 misc/e4defrag2.c |   54 +-
 1 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/misc/e4defrag2.c b/misc/e4defrag2.c
index 797a342..9206c89 100644
--- a/misc/e4defrag2.c
+++ b/misc/e4defrag2.c
@@ -279,6 +279,7 @@ enum spext_flags
SP_FL_DIRLOCAL = 0x20,
SP_FL_CSUM = 0x40,
SP_FL_FMAP = 0x80,
+   SP_FL_TP_RELOC = 0x100,
 };
 
 struct rb_fhandle
@@ -383,6 +384,7 @@ struct defrag_context
unsignedcluster_size;
unsignedief_reloc_cluster;
unsignedweight_scale;
+   unsignedtp_weight_scale;
unsignedextents_quality;
 };
 
@@ -1098,6 +1100,7 @@ static int scan_inode_pass3(struct defrag_context *dfx, 
int fd,
int is_old = 0;
int is_rdonly = 0;
__u64 ief_blocks = 0;
+   __u64 tp_blocks = 0;
__u32 ino_flags = 0;
__u64 size_blk = dfx_sz2b(dfx, stat->st_size);
__u64 used_blk = dfx_sz2b(dfx, stat->st_blocks << 9);
@@ -1158,13 +1161,16 @@ static int scan_inode_pass3(struct defrag_context *dfx, 
int fd,
}
if (se->flags & SP_FL_IEF_RELOC)
ief_blocks += fec->fec_map[i].len;
+   if (se->flags & SP_FL_TP_RELOC)
+   tp_blocks += fec->fec_map[i].len;
+
fmap_csum_ext(fec->fec_map + i, );
}
 
if (fest.local_ex == fec->fec_extents)
ino_flags |= SP_FL_LOCAL;
 
-   if (ief_blocks) {
+   if (ief_blocks || tp_blocks) {
/*
 * Even if some extents belong to IEF cluster, it is not a good
 * idea to relocate the whole file. From other point of view,
@@ -1182,6 +1188,13 @@ static int scan_inode_pass3(struct defrag_context *dfx, 
int fd,
   "size_blk:%lld used_blk:%lld\n",
   __func__, stat->st_ino, ief_blocks,
   size_blk, used_blk);
+   } else if (tp_blocks * 4 > size_blk) {
+   ino_flags |= SP_FL_IEF_RELOC | SP_FL_TP_RELOC;
+   if (debug_flag & DBG_SCAN && ief_blocks != size_blk)
+   printf("%s Force add %lu to IEF/TP set ief:%lld 
"
+  "size_blk:%lld used_blk:%lld\n",
+  __func__, stat->st_ino, ief_blocks,
+  size_blk, used_blk);
} else if (debug_flag & DBG_SCAN) {
printf("%s Reject %lu from IEF set ief:%lld "
   "size_blk:%lld used_blk:%lld\n",
@@ -1592,6 +1605,7 @@ static void pass3_prep(struct defrag_context *dfx)
unsigned good = 0;
unsigned count = 0;
unsigned ief_ok = 0;
+   unsigned force_reloc = 0;
 
if (verbose)
printf("Pass3_prep:  Scan and rate cached extents\n");
@@ -1610,18 +1624,29 @@ static void pass3_prep(struct defrag_context *dfx)
print_spex("\t\t\t", ex);
 
if (prev_cluster != cluster) {
-   ief_ok = 0;
+   force_reloc = ief_ok = 0;
+   /* Is cluster has enough RO(good) data blocks ?*/
if (dfx->cluster_size  >= used * dfx->weight_scale &&
-   good * 1000 >= count * dfx->extents_quality &&
-   cluster_node) {
+   good * 1000 >= count * dfx->extents_quality)
+   ief_ok = 1;
+
+   /* Thin provision corner case: If cluster has low number
+* of data blocks it should be relocated regardless to
+* block's quality in order to improve space efficency 
*/
+   if (dfx->cluster_size  >= used * dfx->tp_weight_scale) {
+   ief_ok = 1;
+   force_reloc = 1;
+   }
+
+   if (ief_ok && cluster_node) {
while (cluster_node != node) {
struct spextent *se =
node_to_spextent(cluster_node);
-   ief_ok = 1;
se->flags |= SP_FL_IEF_RELOC;
+  

[Devel] [PATCH 1/6] e4defrag2: improve debugging

2016-05-16 Thread Dmitry Monakhov
Dump donor rejection reason.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 misc/e4defrag2.c |   14 ++
 1 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/misc/e4defrag2.c b/misc/e4defrag2.c
index 8ecae16..797a342 100644
--- a/misc/e4defrag2.c
+++ b/misc/e4defrag2.c
@@ -217,6 +217,7 @@ enum debug_flags {
DBG_FS = 0x10,
DBG_FIEMAP = 0x20,
DBG_BITMAP = 0x40,
+   DBG_ERR = 0x80,
 };
 
 /* The following macro is used for ioctl FS_IOC_FIEMAP
@@ -1740,10 +1741,14 @@ static int do_alloc_donor_space(struct defrag_context 
*dfx, dgrp_t group,
goto err;
}
TODO:  Checks are sufficient for good donor?
-   if (force_local && donor->fest.local_ex != fec->fec_extents)
+   if (force_local && donor->fest.local_ex != fec->fec_extents) {
+   ret = -2;
goto err;
-   if (donor->fest.frag > max_frag)
+   }
+   if (donor->fest.frag > max_frag) {
+   ret = -3;
goto err;
+   }
 
if (debug_flag & DBG_FS)
printf("%s: Create donor file is_local:%d blocks:%lld\n", 
__func__,
@@ -1754,11 +1759,12 @@ static int do_alloc_donor_space(struct defrag_context 
*dfx, dgrp_t group,
donor->fec = fec;
return 0;
 err:
-   if (debug_flag & DBG_RT)
+   if (debug_flag & DBG_ERR)
printf("%s:%d REJECT donor grp:%u donor_fd:%d blocks:%llu 
local:%d frag:%u ret:%d\n",
-  __func__, __LINE__,  group, donor->fd, blocks, 
force_local, max_frag, -1);
+  __func__, __LINE__,  group, donor->fd, blocks, 
force_local, max_frag, ret);
 
free(fec);
+
return -1;
 }
 
-- 
1.7.1



[Devel] [PATCH 4/6] ext4defrag2: add on/off forcelocal option

2016-05-16 Thread Dmitry Monakhov

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 misc/e4defrag2.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/misc/e4defrag2.c b/misc/e4defrag2.c
index 0ca7a63..771ee51 100644
--- a/misc/e4defrag2.c
+++ b/misc/e4defrag2.c
@@ -2516,7 +2516,7 @@ int main(int argc, char *argv[])
add_error_table(_ext2_error_table);
gettimeofday(_start, 0);
 
-   while ((c = getopt(argc, argv, "a:C:c:d:fF:hlmnt:s:S:T:vq:")) != EOF) {
+   while ((c = getopt(argc, argv, "a:C:c:d:fF:hl:mnt:s:S:T:vq:")) != EOF) {
switch (c) {
case 'a':
min_frag_size = strtoul(optarg, , 0);
@@ -2572,7 +2572,7 @@ int main(int argc, char *argv[])
usage();
break;
case 'l':
-   dfx.ief_force_local = 1;
+   dfx.ief_force_local = !!strtoul(optarg, , 0);
break;
 
case 'n':
-- 
1.7.1



[Devel] [PATCH 3/6] ext4defrag2: improve statistics configuration

2016-05-16 Thread Dmitry Monakhov

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 misc/e4defrag2.c |   85 +
 1 files changed, 65 insertions(+), 20 deletions(-)

diff --git a/misc/e4defrag2.c b/misc/e4defrag2.c
index 9206c89..0ca7a63 100644
--- a/misc/e4defrag2.c
+++ b/misc/e4defrag2.c
@@ -218,6 +218,10 @@ enum debug_flags {
DBG_FIEMAP = 0x20,
DBG_BITMAP = 0x40,
DBG_ERR = 0x80,
+   DBG_CLUSTER = 0x100,
+   DBG_TAG = 0x200,
+   DBG_IAF = 0x400,
+   DBG_IEF = 0x800,
 };
 
 /* The following macro is used for ioctl FS_IOC_FIEMAP
@@ -903,7 +907,7 @@ static int group_add_ief_candidate(struct defrag_context 
*dfx, int dirfd, const
fhp->handle_bytes = dfx->root_fhp->handle_bytes;
ret = name_to_handle_at(dirfd, name, fhp, , 0);
if (ret) {
-   if (debug_flag & DBG_SCAN)
+   if (debug_flag & (DBG_SCAN|DBG_IEF))
fprintf(stderr, "Unexpected result from 
name_to_handle_at()\n");
goto free_fh;
}
@@ -916,7 +920,7 @@ static int group_add_ief_candidate(struct defrag_context 
*dfx, int dirfd, const
 
if (insert_fhandle(>group[group]->fh_root, >node)) {
/* Inode is already in the list, likely nlink > 1 */
-   if (debug_flag & DBG_SCAN)
+   if (debug_flag & (DBG_SCAN|DBG_IEF))
fprintf(stderr, "File is already in the list, nlink > 
1,"
" Not an error\n");
ext2fs_free_mem();
@@ -1127,8 +1131,10 @@ static int scan_inode_pass3(struct defrag_context *dfx, 
int fd,
goto out;
 
group_add_dircache(dfx, dirfd, , ".");
-   do_iaf_defrag_one(dfx, dirfd, name, stat, fec, );
-   goto out;
+   ret = do_iaf_defrag_one(dfx, dirfd, name, stat, fec, );
+   if (!ret)
+   goto out;
+   
}
 
if (stat->st_mtime  < older_than)
@@ -1171,6 +1177,12 @@ static int scan_inode_pass3(struct defrag_context *dfx, 
int fd,
ino_flags |= SP_FL_LOCAL;
 
if (ief_blocks || tp_blocks) {
+   if (debug_flag & DBG_SCAN && ief_blocks != size_blk)
+   printf("%s ENTER %lu to IEF set ief:%lld "
+  "size_blk:%lld used_blk:%lld\n",
+  __func__, stat->st_ino, ief_blocks,
+  size_blk, used_blk);
+
/*
 * Even if some extents belong to IEF cluster, it is not a good
 * idea to relocate the whole file. From other point of view,
@@ -1201,11 +1213,17 @@ static int scan_inode_pass3(struct defrag_context *dfx, 
int fd,
   __func__, stat->st_ino, ief_blocks,
   size_blk, used_blk);
}
+   if (debug_flag & DBG_SCAN && ief_blocks != size_blk)
+   printf("%s ENTER %lu to IEF set ief:%lld "
+  "size_blk:%lld used_blk:%lld fl:%lx\n",
+  __func__, stat->st_ino, ief_blocks,
+  size_blk, used_blk, ino_flags);
+
}
 
if (ino_flags & SP_FL_IEF_RELOC) {
struct stat dst;
-   struct rb_fhandle *rbfh;
+   struct rb_fhandle *rbfh = NULL;
/* FIXME: Is it any better way to find directory inode num? */
ret = fstat(dirfd, );
if (!ret && ino_grp ==  e4d_group_of_ino(dfx, dst.st_ino))
@@ -1456,7 +1474,7 @@ static int ief_defrag_prep_one(struct defrag_context 
*dfx, dgrp_t group,
if (fhandle->flags & SP_FL_LOCAL)
dfx->group[group]->ief_local++;
 
-   if (debug_flag & DBG_SCAN)
+   if (debug_flag & (DBG_SCAN | DBG_IEF))
printf("%s Check inode %lu flags:%x, OK...\n",
   __func__, stat->st_ino, fhandle->flags);
 
@@ -1603,7 +1621,9 @@ static void pass3_prep(struct defrag_context *dfx)
__u64 clusters_to_move = 0;
unsigned used = 0;
unsigned good = 0;
+   unsigned mdata = 0;
unsigned count = 0;
+   unsigned found = 0;
unsigned ief_ok = 0;
unsigned force_reloc = 0;
 
@@ -1620,7 +1640,7 @@ static void pass3_prep(struct defrag_context *dfx)
ex->flags |= SP_FL_FULL;
cluster = (ex->start + ex->count) & cluster_mask;
 
-   if (debug_flag & DBG_TREE)
+   if (debug_flag & DBG_CLUSTER)
print_spex("\t\t\t", ex);
 
if (prev_cluster != cluster) {
@@ -1645,7 +

[Devel] [PATCH 6/6] e4defrag2: fix collapse inode index tree issue

2016-05-16 Thread Dmitry Monakhov

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 misc/e4defrag2.c |   68 +++--
 1 files changed, 55 insertions(+), 13 deletions(-)

diff --git a/misc/e4defrag2.c b/misc/e4defrag2.c
index 7aab2b4..d351965 100644
--- a/misc/e4defrag2.c
+++ b/misc/e4defrag2.c
@@ -242,6 +242,7 @@ struct fmap_extent_cache
 {
unsigned fec_size;  /* map array size */
unsigned fec_extents;   /* number of valid entries */
+   struct fmap_extent *fec_xattr;
struct fmap_extent fec_map[];
 };
 
@@ -252,6 +253,9 @@ struct fmap_extent_stat
unsigned group; /* Number of groups, counter is speculative */
unsigned local_ex; /* Number of extents from  the same group as inode */
unsigned local_sz; /* Total len of local extents */
+   unsigned nr_idx; /* Number of index blocks */
+   __u64xattr; /* xattr phys block */
+
 };
 
 /* Used space and integral inode usage stats */
@@ -750,9 +754,10 @@ static int __get_inode_fiemap(struct defrag_context *dfx, 
int fd,
(*fec)->fec_size = DEFAULT_FMAP_CACHE_SZ;
(*fec)->fec_extents = 0;
}
-   if (fest)
+   if (fest) {
memset(fest, 0 , sizeof(*fest));
-
+   fest->nr_idx = st->st_blocks >> (blksz_log - 9);
+   }
ext_buf = fiemap_buf->fm_extents;
memset(fiemap_buf, 0, fie_buf_size);
fiemap_buf->fm_length = FIEMAP_MAX_OFFSET;
@@ -791,6 +796,12 @@ static int __get_inode_fiemap(struct defrag_context *dfx, 
int fd,
fest->group++;
prev_blk_grp = blk_grp;
}
+   /* We are work on livefs so race is possible */
+   if (fest->nr_idx < len) {
+   ret = -1;
+   goto out;
+   }
+   fest->nr_idx -= len;
}
 
if ((*fec)->fec_extents && lblk == lblk_last && pblk == 
pblk_last) {
@@ -834,12 +845,36 @@ static int __get_inode_fiemap(struct defrag_context *dfx, 
int fd,
 */
} while (fiemap_buf->fm_mapped_extents == EXTENT_MAX_COUNT &&
 !(ext_buf[EXTENT_MAX_COUNT-1].fe_flags & FIEMAP_EXTENT_LAST));
+
+   /* get xattr block */
+   fiemap_buf->fm_flags |= FIEMAP_FLAG_XATTR;
+   fiemap_buf->fm_start = 0;
+   memset(ext_buf, 0, ext_buf_size);
+   ret = ioctl(fd, FS_IOC_FIEMAP, fiemap_buf);
+   if (ret < 0 || fiemap_buf->fm_mapped_extents == 0) {
+   if (debug_flag & DBG_FIEMAP) {
+   fprintf(stderr, "%s: Can't get xattr info for"
+   " inode:%ld ret:%d mapped:%d\n",
+   __func__, st->st_ino, ret,
+   fiemap_buf->fm_mapped_extents);
+   }
+   goto out;
+   }
+   if (!(ext_buf[0].fe_flags & FIEMAP_EXTENT_DATA_INLINE)) {
+   fest->xattr = ext_buf[i].fe_physical >> blksz_log;
+   if (fest->nr_idx)
+   ret = -1;
+
+   fest->nr_idx--;
+   }
 out:
/FIXME:DEBUG
-   if (debug_flag & DBG_FIEMAP && fest)
-   printf("%s fmap stat ino:%ld hole:%d frag:%d local_ex:%d 
local_sz:%d group:%d\n",
+   if ((debug_flag & DBG_FIEMAP) && fest)
+   printf("%s fmap stat ino:%ld hole:%d frag:%d local_ex:%d "
+  "local_sz:%d group:%d nr_idx:%u xattr:%lld ret:%d\n",
   __func__, st->st_ino, fest->hole, fest->frag,
-  fest->local_ex, fest->local_sz, fest->group);
+  fest->local_ex, fest->local_sz, fest->group, 
fest->nr_idx,
+  fest->xattr, ret);
 
free(fiemap_buf);
 
@@ -1134,7 +1169,6 @@ static int scan_inode_pass3(struct defrag_context *dfx, 
int fd,
ret = do_iaf_defrag_one(dfx, dirfd, name, stat, fec, );
if (!ret)
goto out;
-   
}
 
if (stat->st_mtime  < older_than)
@@ -1916,7 +1950,7 @@ static int prepare_donor(struct defrag_context *dfx, 
dgrp_t group,
printf("%s grp:%u donor_fd:%d blocks:%llu frag:%u\n",
   __func__, group, donor->fd, blocks, max_frag);
}
-   assert(blocks);
+   assert(blocks && max_frag);
 
/* First try to reuse existing donor if available */
if (donor->fd != -1) {
@@ -1954,23 +1988,28 @@ static int check_iaf(struct defrag_context *dfx

[Devel] [PATCH 5/6] e4defrag2: prevent agressive donor lookup

2016-05-16 Thread Dmitry Monakhov
It was a bad idea to try all dirs from all groups for a donor, especially for
big filesystems. Let's scan only the local ones.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 misc/e4defrag2.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/misc/e4defrag2.c b/misc/e4defrag2.c
index 771ee51..7aab2b4 100644
--- a/misc/e4defrag2.c
+++ b/misc/e4defrag2.c
@@ -1830,6 +1830,7 @@ static int do_find_donor(struct defrag_context *dfx, 
dgrp_t group,
int dir, i, ret = 0;
struct stat64 st;
dgrp_t donor_grp;
+   int dir_retries = 3;
unsigned char *raw_fh = dfx->group[group]->dir_rawh;
const char *dfname = ".e4defrag2_donor.tmp";
 
@@ -1896,7 +1897,7 @@ static int do_find_donor(struct defrag_context *dfx, 
dgrp_t group,
try_next:
close(dir);
close_donor(donor);
-   if (ret)
+   if (ret || !dir_retries--)
return -1;
}
 
@@ -1934,7 +1935,7 @@ static int prepare_donor(struct defrag_context *dfx, 
dgrp_t group,
return -1;
 
/* Sequentially search groups and create first available */
-   for (i = 0; i < nr_groups; i++) {
+   for (i = 1; i < 16; i++) {
if (dfx->group[(group + i) % nr_groups]) {
ret = do_find_donor(dfx, (group + i) % nr_groups,
donor, blocks, 0, max_frag);
-- 
1.7.1



[Devel] [RH6 PATCH] [MS] ext4: collapse a single extent tree block into the inode if possible

2016-05-16 Thread Dmitry Monakhov

Backport ecb94f5fdf4b72547fca022421a9dca1672bddd4
This patch is required for a sane defragmentation procedure.
https://jira.sw.ru/browse/PSBM-46563
#ORIG_MSG:
[PATCH] ext4: collapse a single extent tree block into the inode if possible

If an inode has more than 4 extents, but then later some of the
extents are merged together, we can optimize the file system by moving
the extents up into the inode, and discarding the extent tree block.
This is important, because if there are a large number of inodes with
an external extent tree blocks where the contents could fit in the
inode, this can significantly increase the fsck time of the file
system.

Google-Bug-Id: 6801242

Signed-off-by: "Theodore Ts'o" <ty...@mit.edu>
Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 85c4d4e..5eba717 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1668,10 +1668,54 @@ static int ext4_ext_try_to_merge_right(struct inode 
*inode,
 }
 
 /*
+ * This function does a very simple check to see if we can collapse
+ * an extent tree with a single extent tree leaf block into the inode.
+ */
+static void ext4_ext_try_to_merge_up(handle_t *handle,
+struct inode *inode,
+struct ext4_ext_path *path)
+{
+   size_t s;
+   unsigned max_root = ext4_ext_space_root(inode, 0);
+   ext4_fsblk_t blk;
+
+   if ((path[0].p_depth != 1) ||
+   (le16_to_cpu(path[0].p_hdr->eh_entries) != 1) ||
+   (le16_to_cpu(path[1].p_hdr->eh_entries) > max_root))
+   return;
+
+   /*
+* We need to modify the block allocation bitmap and the block
+* group descriptor to release the extent tree block.  If we
+* can't get the journal credits, give up.
+*/
+   if (ext4_journal_extend(handle, 2))
+   return;
+
+   /*
+* Copy the extent data up to the inode
+*/
+   blk = ext4_idx_pblock(path[0].p_idx);
+   s = le16_to_cpu(path[1].p_hdr->eh_entries) *
+   sizeof(struct ext4_extent_idx);
+   s += sizeof(struct ext4_extent_header);
+
+   memcpy(path[0].p_hdr, path[1].p_hdr, s);
+   path[0].p_depth = 0;
+   path[0].p_ext = EXT_FIRST_EXTENT(path[0].p_hdr) +
+   (path[1].p_ext - EXT_FIRST_EXTENT(path[1].p_hdr));
+   path[0].p_hdr->eh_max = cpu_to_le16(max_root);
+
+   brelse(path[1].p_bh);
+   ext4_free_blocks(handle, inode, blk, 1, EXT4_FREE_BLOCKS_METADATA);
+}
+
+/*
  * This function tries to merge the @ex extent to neighbours in the tree.
  * return 1 if merge left else 0.
  */
-static int ext4_ext_try_to_merge(struct inode *inode,
+static int ext4_ext_try_to_merge(handle_t *handle,
+ struct inode *inode,
  struct ext4_ext_path *path,
  struct ext4_extent *ex) {
struct ext4_extent_header *eh;
@@ -1687,8 +1731,9 @@ static int ext4_ext_try_to_merge(struct inode *inode,
merge_done = ext4_ext_try_to_merge_right(inode, path, ex - 1);
 
if (!merge_done)
-   ret = ext4_ext_try_to_merge_right(inode, path, ex);
+   ret =  ext4_ext_try_to_merge_right(inode, path, ex);
 
+   ext4_ext_try_to_merge_up(handle, inode, path);
return ret;
 }
 
@@ -1897,7 +1942,7 @@ has_space:
 merge:
/* try to merge extents to the right */
if (!(flag & EXT4_GET_BLOCKS_DIO))
-   ext4_ext_try_to_merge(inode, path, nearex);
+   ext4_ext_try_to_merge(handle, inode, path, nearex);
 
/* try to merge extents to the left */
 
@@ -1906,7 +1951,7 @@ merge:
if (err)
goto cleanup;
 
-   err = ext4_ext_dirty(handle, inode, path + depth);
+   err = ext4_ext_dirty(handle, inode, path + path->p_depth);
 
 cleanup:
if (npath) {
@@ -2878,9 +2923,9 @@ static int ext4_split_extent_at(handle_t *handle,
ext4_ext_mark_initialized(ex);
 
if (!(flags & EXT4_GET_BLOCKS_DIO))
-   ext4_ext_try_to_merge(inode, path, ex);
+   ext4_ext_try_to_merge(handle, inode, path, ex);
 
-   err = ext4_ext_dirty(handle, inode, path + depth);
+   err = ext4_ext_dirty(handle, inode, path + path->p_depth);
goto out;
}
 
@@ -2894,7 +2939,7 @@ static int ext4_split_extent_at(handle_t *handle,
 * path may lead to new leaf, not to original leaf any more
 * after ext4_ext_insert_extent() returns,
 */
-   err = ext4_ext_dirty(handle, inode, path + depth);
+   err = ext4_ext_dirty(handle, inode, path + path->p_depth);
if (err)
goto fix_extent_len;
 
@@ -2912,8 +2957,8 @@ static int ext4_split_extent_at(handle_t *handle,
goto fix_extent_len

Re: [Devel] [PATCH rh7] cbt: fix cbt->block_max calculation

2016-05-10 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> When the size of the block device is a multiple of the CBT blocksize, the following:
>
>> cbt->block_max  = (size + blocksize) >> cbt->block_bits;
Pure typo fix. ACK.
>
> is incorrect. This may end up allocating one extra page in cbt->map and
> also makes various checks against cbt->block_max error-prone.
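A quick worked example of the off-by-one (my numbers; the fixed expression is
just the usual round-up-division idiom):

	/* blocksize = 4096 (block_bits = 12), size = 4 * 4096:           */
	/*   (size + blocksize)     >> block_bits = 5  -- one block extra */
	/*   (size + blocksize - 1) >> block_bits = 4  -- correct ceiling */
	/* i.e. the fixed line computes DIV_ROUND_UP(size, blocksize).    */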
>
> Signed-off-by: Maxim Patlasov 
> ---
>  block/blk-cbt.c   |2 +-
>  drivers/block/ploop/push_backup.c |2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-cbt.c b/block/blk-cbt.c
> index 8c52bd8..8cdf1d6 100644
> --- a/block/blk-cbt.c
> +++ b/block/blk-cbt.c
> @@ -252,7 +252,7 @@ static struct cbt_info* do_cbt_alloc(struct request_queue 
> *q, __u8 *uuid,
>   return ERR_PTR(-ENOMEM);
>  
>   cbt->block_bits = ilog2(blocksize);
> - cbt->block_max  = (size + blocksize) >> cbt->block_bits;
> + cbt->block_max  = (size + blocksize - 1) >> cbt->block_bits;
>   spin_lock_init(>lock);
>   memcpy(cbt->uuid, uuid, sizeof(cbt->uuid));
>   cbt->cache = alloc_percpu(struct cbt_extent);
> diff --git a/drivers/block/ploop/push_backup.c 
> b/drivers/block/ploop/push_backup.c
> index 05af67c..4d671a5 100644
> --- a/drivers/block/ploop/push_backup.c
> +++ b/drivers/block/ploop/push_backup.c
> @@ -175,7 +175,7 @@ bool ploop_pb_check_bit(struct ploop_pushbackup_desc 
> *pbd, cluster_t clu)
>  static int convert_map_to_map(struct ploop_pushbackup_desc *pbd)
>  {
>   struct page **from_map = pbd->cbt_map;
> - blkcnt_t from_max = pbd->cbt_block_max - 1;
> + blkcnt_t from_max = pbd->cbt_block_max;
>   blkcnt_t from_bits = pbd->cbt_block_bits;
>  
>   struct page **to_map = pbd->ppb_map;




[Devel] [PATCH rh7] ploop: push_backup: fix reentrance in ploop_pb_get_pending()

2016-05-09 Thread Dmitry Monakhov
The patch implements what Dima Monakhov suggested:

>  AFAIU you have a re-entrance issue if several tasks want to perform ioctls:
>   task1:ioctl->wait
>   task2:ioctl->wait
>
>   Just change the wait sequence like this and you are safe:
>  /* blocking case */
> if (unlikely(pbd->ppb_waiting)) {
>  /* Other task is already waiting for event */
>  err = -EBUSY;
>  goto get_pending_unlock;
> }
> pbd->ppb_waiting = true;
> spin_unlock(>ppb_lock);
> mutex_unlock(>ctl_mutex);

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/push_backup.c |5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/block/ploop/push_backup.c 
b/drivers/block/ploop/push_backup.c
index 4d671a5..10fd55a 100644
--- a/drivers/block/ploop/push_backup.c
+++ b/drivers/block/ploop/push_backup.c
@@ -466,6 +466,11 @@ int ploop_pb_get_pending(struct ploop_pushbackup_desc *pbd,
}
 
 /* blocking case */
+   if (unlikely(pbd->ppb_waiting)) {
+   /* Other task is already waiting for event */
+   err = -EBUSY;
+   goto get_pending_unlock;
+   }
pbd->ppb_waiting = true;
spin_unlock(>ppb_lock);
 



Re: [Devel] [PATCH rh7 1/4] ploop: introduce pbd

2016-04-30 Thread Dmitry Monakhov
Maxim Patlasov <mpatla...@virtuozzo.com> writes:

> The patch introduces a push_backup descriptor ("pbd") and a few simple
> functions to create and release it.
>
> Userspace can control it via new ioctls: PLOOP_IOC_PUSH_BACKUP_INIT and
> PLOOP_IOC_PUSH_BACKUP_STOP.
Acked-by: Dmitry Monakhov <dmonak...@openvz.org>
>
> Signed-off-by: Maxim Patlasov <mpatla...@virtuozzo.com>
> ---
>  drivers/block/ploop/Makefile  |2 
>  drivers/block/ploop/dev.c |   89 
>  drivers/block/ploop/push_backup.c |  271 
> +
>  drivers/block/ploop/push_backup.h |8 +
>  include/linux/ploop/ploop.h   |3 
>  include/linux/ploop/ploop_if.h|   19 +++
>  6 files changed, 391 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/block/ploop/push_backup.c
>  create mode 100644 drivers/block/ploop/push_backup.h
>
> diff --git a/drivers/block/ploop/Makefile b/drivers/block/ploop/Makefile
> index e36a027..0fecf16 100644
> --- a/drivers/block/ploop/Makefile
> +++ b/drivers/block/ploop/Makefile
> @@ -5,7 +5,7 @@ CFLAGS_io_direct.o = -I$(src)
>  CFLAGS_ploop_events.o = -I$(src)
>  
>  obj-$(CONFIG_BLK_DEV_PLOOP)  += ploop.o
> -ploop-objs := dev.o map.o io.o sysfs.o tracker.o freeblks.o ploop_events.o 
> discard.o
> +ploop-objs := dev.o map.o io.o sysfs.o tracker.o freeblks.o ploop_events.o 
> discard.o push_backup.o
>  
>  obj-$(CONFIG_BLK_DEV_PLOOP)  += pfmt_ploop1.o
>  pfmt_ploop1-objs := fmt_ploop1.o
> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
> index 1da073c..23da9f5 100644
> --- a/drivers/block/ploop/dev.c
> +++ b/drivers/block/ploop/dev.c
> @@ -19,6 +19,7 @@
>  #include "ploop_events.h"
>  #include "freeblks.h"
>  #include "discard.h"
> +#include "push_backup.h"
>  
>  /* Structures and terms:
>   *
> @@ -3766,6 +3767,9 @@ static int ploop_stop(struct ploop_device * plo, struct 
> block_device *bdev)
>   return -EBUSY;
>   }
>  
> + clear_bit(PLOOP_S_PUSH_BACKUP, >state);
> + ploop_pb_stop(plo->pbd);
> +
>   for (p = plo->disk->minors - 1; p > 0; p--)
>   invalidate_partition(plo->disk, p);
>   invalidate_partition(plo->disk, 0);
> @@ -3892,6 +3896,7 @@ static int ploop_clear(struct ploop_device * plo, 
> struct block_device * bdev)
>   }
>  
>   ploop_fb_fini(plo->fbd, 0);
> + ploop_pb_fini(plo->pbd);
>  
>   plo->maintenance_type = PLOOP_MNTN_OFF;
>   plo->bd_size = 0;
> @@ -4477,6 +4482,84 @@ static int ploop_getdevice_ioc(unsigned long arg)
>   return err;
>  }
>  
> +static int ploop_push_backup_init(struct ploop_device *plo, unsigned long 
> arg)
> +{
> + struct ploop_push_backup_init_ctl ctl;
> + struct ploop_pushbackup_desc *pbd = NULL;
> + int rc = 0;
> +
> + if (list_empty(>map.delta_list))
> + return -ENOENT;
> +
> + if (plo->maintenance_type != PLOOP_MNTN_OFF)
> + return -EINVAL;
> +
> + BUG_ON(plo->pbd);
> +
> + if (copy_from_user(, (void*)arg, sizeof(ctl)))
> + return -EFAULT;
> +
> + pbd = ploop_pb_alloc(plo);
> + if (!pbd) {
> + rc = -ENOMEM;
> + goto pb_init_done;
> + }
> +
> + ploop_quiesce(plo);
> +
> + rc = ploop_pb_init(pbd, ctl.cbt_uuid, !ctl.cbt_mask_addr);
> + if (rc) {
> + ploop_relax(plo);
> + goto pb_init_done;
> + }
> +
> + plo->pbd = pbd;
> +
> + atomic_set(>maintenance_cnt, 0);
> + plo->maintenance_type = PLOOP_MNTN_PUSH_BACKUP;
> + set_bit(PLOOP_S_PUSH_BACKUP, >state);
> +
> + ploop_relax(plo);
> +
> + if (ctl.cbt_mask_addr)
> + rc = ploop_pb_copy_cbt_to_user(pbd, (char *)ctl.cbt_mask_addr);
> +pb_init_done:
> + if (rc)
> + ploop_pb_fini(pbd);
> + return rc;
> +}
> +
> +static int ploop_push_backup_stop(struct ploop_device *plo, unsigned long 
> arg)
> +{
> + struct ploop_pushbackup_desc *pbd = plo->pbd;
> + struct ploop_push_backup_stop_ctl ctl;
> +
> + if (plo->maintenance_type != PLOOP_MNTN_PUSH_BACKUP)
> + return -EINVAL;
> +
> + if (copy_from_user(, (void*)arg, sizeof(ctl)))
> + return -EFAULT;
> +
> + if (pbd && ploop_pb_check_uuid(pbd, ctl.cbt_uuid)) {
> + printk("ploop(%d): PUSH_BACKUP_STOP uuid mismatch\n",
> +plo->index);
> + return -EINVAL;
> + }
> +
> + if (!test_and_clear_b

Re: [Devel] [PATCH rh7 3/4] ploop: wire push_backup into state-machine

2016-04-30 Thread Dmitry Monakhov
Maxim Patlasov  writes:

I cannot shake the suspicion that this request juggling completely breaks the
FS barrier assumptions.

For example, the fs does:
    submit_bio(data_b1)
    submit_bio(data_b2)
    submit_bio(commit_b3, FLUSH|FUA)   /* journal commit record */
    wait_for_bio(commit_b3)
But there is no guarantee that data_b1 and data_b2 have already completed;
they can still sit in the pending list. After a power loss we would have a valid
commit record referencing b1 and b2 while b1 and b2 were never flushed,
which would expose uninitialized data.
In fact ext4/jbd2 waits for b1 and b2 first and only then submits b3, so
ext4 will work fine.
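
In sketch form, the ordering the journal layer actually relies on (illustrative
pseudocode, not jbd2 source):

	submit_bio(data_b1);
	submit_bio(data_b2);
	wait_for_bio(data_b1);             /* data must be stable...      */
	wait_for_bio(data_b2);             /* ...before the commit record */
	submit_bio(commit_b3, FLUSH|FUA);  /* commit record goes last     */
	wait_for_bio(commit_b3);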

Otherwise looks good.

> When the ploop state-machine looks at a preq for the first time, it suspends
> the preq if its cluster-block matches pbd->ppb_map -- initially a copy of the
> CBT mask. To suspend a preq we simply put it into pbd->pending_tree and
> plo->lockout_tree.
>
> Later, when userspace reports that out-of-band processing is done, we set the
> PLOOP_REQ_PUSH_BACKUP bit in preq->state, re-schedule the preq and wake up the
> ploop state-machine. This PLOOP_REQ_PUSH_BACKUP bit lets the state-machine
> know that the given preq is OK and that we shouldn't suspend further preq-s
> for that cluster-block anymore.
>
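Restating the lifecycle to make sure I read it right (a summary of the patch
below, not new code):

	/* WRITE preq arrives with its bit set in ppb_map:
	 *   -> ploop_pb_preq_add_pending() + ploop_add_lockout(), preq waits
	 * userspace finishes the out-of-band copy of that cluster:
	 *   -> PLOOP_REQ_PUSH_BACKUP set, preq re-scheduled, state-machine woken
	 * state-machine sees LOCKOUT + PUSH_BACKUP on the preq:
	 *   -> ploop_pb_clear_bit() + del_lockout(), delay_list spliced, preq runs
	 */
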
> Signed-off-by: Maxim Patlasov 
> ---
>  drivers/block/ploop/dev.c |   32 +++
>  drivers/block/ploop/push_backup.c |   62 
> +
>  drivers/block/ploop/push_backup.h |6 
>  include/linux/ploop/ploop.h   |1 +
>  4 files changed, 101 insertions(+)
>
> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
> index 2a77d2e..c7cc385 100644
> --- a/drivers/block/ploop/dev.c
> +++ b/drivers/block/ploop/dev.c
> @@ -2021,6 +2021,38 @@ restart:
>   return;
>   }
>  
> + /* push_backup special processing */
> + if (!test_bit(PLOOP_REQ_LOCKOUT, >state) &&
> + (preq->req_rw & REQ_WRITE) && preq->req_size &&
> + ploop_pb_check_bit(plo->pbd, preq->req_cluster)) {
> + if (ploop_pb_preq_add_pending(plo->pbd, preq)) {
> + /* already reported by userspace push_backup */
> + ploop_pb_clear_bit(plo->pbd, preq->req_cluster);
> + } else {
> + spin_lock_irq(>lock);
> + ploop_add_lockout(preq, 0);
> + spin_unlock_irq(>lock);
> + /*
> +  * preq IN: preq is in ppb_pending tree waiting for
> +  * out-of-band push_backup processing by userspace ...
> +  */
> + return;
> + }
> + } else if (test_bit(PLOOP_REQ_LOCKOUT, >state) &&
> +test_and_clear_bit(PLOOP_REQ_PUSH_BACKUP, >state)) {
> + /*
> +  * preq OUT: out-of-band push_backup processing by
> +  * userspace done; preq was re-scheduled
> +  */
> + ploop_pb_clear_bit(plo->pbd, preq->req_cluster);
> +
> + spin_lock_irq(>lock);
> + del_lockout(preq);
> + if (!list_empty(>delay_list))
> + list_splice_init(>delay_list, 
> plo->ready_queue.prev);
> + spin_unlock_irq(>lock);
> + }
> +
>   if (plo->trans_map) {
>   err = ploop_find_trans_map(plo->trans_map, preq);
>   if (err) {
> diff --git a/drivers/block/ploop/push_backup.c 
> b/drivers/block/ploop/push_backup.c
> index 477caf7..488b8fb 100644
> --- a/drivers/block/ploop/push_backup.c
> +++ b/drivers/block/ploop/push_backup.c
> @@ -146,6 +146,32 @@ static void set_bit_in_map(struct page **map, u64 
> map_max, u64 blk)
>   do_bit_in_map(map, map_max, blk, SET_BIT);
>  }
>  
> +static void clear_bit_in_map(struct page **map, u64 map_max, u64 blk)
> +{
> + do_bit_in_map(map, map_max, blk, CLEAR_BIT);
> +}
> +
> +static bool check_bit_in_map(struct page **map, u64 map_max, u64 blk)
> +{
> + return do_bit_in_map(map, map_max, blk, CHECK_BIT);
> +}
> +
> +/* intentionally lockless */
> +void ploop_pb_clear_bit(struct ploop_pushbackup_desc *pbd, cluster_t clu)
> +{
> + BUG_ON(!pbd);
> + clear_bit_in_map(pbd->ppb_map, pbd->ppb_block_max, clu);
> +}
> +
> +/* intentionally lockless */
> +bool ploop_pb_check_bit(struct ploop_pushbackup_desc *pbd, cluster_t clu)
> +{
> + if (!pbd)
> + return false;
> +
> + return check_bit_in_map(pbd->ppb_map, pbd->ppb_block_max, clu);
> +}
> +
>  static int convert_map_to_map(struct ploop_pushbackup_desc *pbd)
>  {
>   struct page **from_map = pbd->cbt_map;
> @@ -278,6 +304,12 @@ static void ploop_pb_add_req_to_tree(struct 
> ploop_request *preq,
>   rb_insert_color(>reloc_link, tree);
>  }
>  
> +static void ploop_pb_add_req_to_pending(struct ploop_pushbackup_desc *pbd,
> + struct ploop_request *preq)
> +{
> + ploop_pb_add_req_to_tree(preq, >pending_tree);
> +}
> +
>  

Re: [Devel] [PATCH rh7 4/4] ploop: push_backup cleanup

2016-04-30 Thread Dmitry Monakhov
Maxim Patlasov <mpatla...@virtuozzo.com> writes:

> ploop_pb_stop() is called either explicitly, when userspace makes
> ioctl(PLOOP_IOC_PUSH_BACKUP_STOP), or implicitly on ploop shutdown
> when userspace stops the ploop device via ioctl(PLOOP_IOC_STOP).
>
> In both cases, it's useful to re-schedule all suspended preq-s. Otherwise,
> we won't be able to destroy ploop because some preq-s are still not
> completed.
>
Acked-by: Dmitry Monakhov <dmonak...@openvz.org>
> Signed-off-by: Maxim Patlasov <mpatla...@virtuozzo.com>
> ---
>  drivers/block/ploop/push_backup.c |   36 +++-
>  1 file changed, 35 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/block/ploop/push_backup.c 
> b/drivers/block/ploop/push_backup.c
> index 488b8fb..05af67c 100644
> --- a/drivers/block/ploop/push_backup.c
> +++ b/drivers/block/ploop/push_backup.c
> @@ -358,6 +358,12 @@ ploop_pb_get_first_req_from_pending(struct 
> ploop_pushbackup_desc *pbd)
>  }
>  
>  static struct ploop_request *
> +ploop_pb_get_first_req_from_reported(struct ploop_pushbackup_desc *pbd)
> +{
> + return ploop_pb_get_first_req_from_tree(>reported_tree);
> +}
> +
> +static struct ploop_request *
>  ploop_pb_get_req_from_pending(struct ploop_pushbackup_desc *pbd,
> cluster_t clu)
>  {
> @@ -400,16 +406,44 @@ int ploop_pb_preq_add_pending(struct 
> ploop_pushbackup_desc *pbd,
>  
>  unsigned long ploop_pb_stop(struct ploop_pushbackup_desc *pbd)
>  {
> + unsigned long ret = 0;
> + LIST_HEAD(drop_list);
> +
>   if (pbd == NULL)
>   return 0;
>  
>   spin_lock(>ppb_lock);
>  
> + while (!RB_EMPTY_ROOT(>pending_tree)) {
> + struct ploop_request *preq =
> + ploop_pb_get_first_req_from_pending(pbd);
> + list_add(>list, _list);
> + ret++;
> + }
> +
> + while (!RB_EMPTY_ROOT(>reported_tree)) {
> + struct ploop_request *preq =
> + ploop_pb_get_first_req_from_reported(pbd);
> + list_add(>list, _list);
> + ret++;
> + }
> +
>   if (pbd->ppb_waiting)
>   complete(>ppb_comp);
>   spin_unlock(>ppb_lock);
>  
> - return 0;
> + if (!list_empty(_list)) {
> + struct ploop_device *plo = pbd->plo;
> +
> + BUG_ON(!plo);
> + spin_lock_irq(>lock);
> + list_splice_init(_list, plo->ready_queue.prev);
> + if (test_bit(PLOOP_S_WAIT_PROCESS, >state))
> + wake_up_interruptible(>waitq);
> + spin_unlock_irq(>lock);
> + }
> +
> + return ret;
>  }
>  
>  int ploop_pb_get_pending(struct ploop_pushbackup_desc *pbd,




Re: [Devel] [PATCH] ploop: force journal commit after dio_post_submit

2016-04-29 Thread Dmitry Monakhov
Maxim Patlasov <mpatla...@virtuozzo.com> writes:

> Dima,
>
> Just to let me understand the patch better, can you please give a 
> call-path for "forcing transaction commit" in ordinary ext4 life-cycle 
> (without ploop) when it handles O_DIRECT write(2) to an uninitialized 
> extent?
According to POSIX, regardless of whether you perform the write via the
buffered or the direct path, you must call fsync(2) to guarantee that the data
reaches non-volatile storage. This is mandatory because even with O_DIRECT the
written data may still sit in a volatile disk cache.

An O_DIRECT write goes through ext4_ext_direct_IO, which calls the generic
__blockdev_direct_IO with an ->end_io callback that fires once all bios have
completed; for ext4 that callback is ext4_end_io_dio. On the allocation or
unwritten path it converts the unwritten extent to written and finally calls
aio_complete() to signal aio completion to the user. At that moment all extent
modifications are already in the journal, so once the user calls fsync it ends
up in jbd2_complete_transaction(), which simply guarantees that the transaction
becomes stable on disk.
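
A minimal userspace sketch of that contract (illustrative only; error handling
is omitted and the path is made up):

	/* build: gcc -O2 odirect_fsync.c */
	#define _GNU_SOURCE               /* for O_DIRECT */
	#include <fcntl.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		void *buf;
		int fd = open("/tmp/img.tmp", O_WRONLY | O_CREAT | O_DIRECT, 0600);

		posix_memalign(&buf, 4096, 4096); /* O_DIRECT needs aligned buffers */
		memset(buf, 0xab, 4096);
		pwrite(fd, buf, 4096, 0); /* returns with data possibly in disk cache */
		fsync(fd);                /* makes both data and metadata durable */
		return close(fd);
	}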


>
> Thanks,
> Maxim
>
> On 04/27/2016 07:42 AM, Dmitry Monakhov wrote:
>> Once we have converted an extent to initialized, it can still be part of an
>> uncommitted journal transaction, so we have to force a transaction commit at
>> some point. The easiest way to do that is to perform an unconditional fsync.
>> https://jira.sw.ru/browse/PSBM-45326
>>
>> TODO: This case and others can be optimized by deferring the fsync. But that
>>is the subject of another patch.
>>
>> Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
>> ---
>>   drivers/block/ploop/io_direct.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/block/ploop/io_direct.c 
>> b/drivers/block/ploop/io_direct.c
>> index 8032999..5a2e12a 100644
>> --- a/drivers/block/ploop/io_direct.c
>> +++ b/drivers/block/ploop/io_direct.c
>> @@ -523,6 +523,8 @@ dio_post_submit(struct ploop_io *io, struct 
>> ploop_request * preq)
>>  err = io->files.file->f_op->fallocate(io->files.file,
>>FALLOC_FL_CONVERT_UNWRITTEN,
>>(loff_t)sec << 9, clu_siz);
>> +if (!err)
>> +err = io->files.file->f_op->FOP_FSYNC(io->files.file, 0);
In fact we may delay the fsync here until a FLUSH or FUA preq arrives. But that
is a subject for later optimization patches.
>>  file_end_write(io->files.file);
>>  if (err) {
>>  PLOOP_REQ_SET_ERROR(preq, err);




[Devel] [PATCH] ploop: force journal commit after dio_post_submit

2016-04-27 Thread Dmitry Monakhov
Once we have converted an extent to initialized, it can still be part of an
uncommitted journal transaction, so we have to force a transaction commit at
some point. The easiest way to do that is to perform an unconditional fsync.
https://jira.sw.ru/browse/PSBM-45326

TODO: This case and others can be optimized by deferring the fsync. But that is
      the subject of another patch.

Signed-off-by: Dmitry Monakhov <dmonak...@openvz.org>
---
 drivers/block/ploop/io_direct.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 8032999..5a2e12a 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -523,6 +523,8 @@ dio_post_submit(struct ploop_io *io, struct ploop_request * 
preq)
err = io->files.file->f_op->fallocate(io->files.file,
  FALLOC_FL_CONVERT_UNWRITTEN,
  (loff_t)sec << 9, clu_siz);
+   if (!err)
+   err = io->files.file->f_op->FOP_FSYNC(io->files.file, 0);
file_end_write(io->files.file);
if (err) {
PLOOP_REQ_SET_ERROR(preq, err);
-- 
1.8.3.1


