Re: [Cluster-devel] [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-21 Thread Christoph Hellwig
On Tue, Nov 20, 2018 at 09:35:07PM -0800, Sagi Grimberg wrote:
>> Given it is over TCP, I guess it should be doable for you to preallocate one
>> 256-bvec table in one page for each request, then set the max segment size as
>> (unsigned int)-1 and the max segment number as 256; the preallocated table
>> should work anytime.
>
> A 256-bvec table is really a lot to preallocate, especially when it's not
> needed; I can easily initialize the bvec_iter on the bio bvec. If this
> involves preallocating for the worst case, then I don't consider this to
> be an improvement.

Ok, I took a look at the nvme-tcp code and it seems you care about
bios because you want a contiguous bio chunk for sending it down
the networking code.  Yes, in that case we sort of need to iterate
over bios.  But you already have a special case for discard, so you
don't really need any of the magic in the bio_bvecs() helper either
and can just count bi_vcnt in the bio.



[Cluster-devel] [PATCH v1 2/4] ext4: fix race between llseek SEEK_END and write

2018-11-21 Thread Eiichi Tsukata
Implement an individual lock for SEEK_END for ext4, which directly calls
generic_file_llseek_size().

Signed-off-by: Eiichi Tsukata 
---
 fs/ext4/file.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 69d65d49837b..6479f3066043 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -477,6 +477,16 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int 
whence)
default:
return generic_file_llseek_size(file, offset, whence,
maxbytes, i_size_read(inode));
+   case SEEK_END:
+   /*
+* protects against inode size race with write so that llseek
+* doesn't see inode size being updated in generic_perform_write
+*/
+   inode_lock_shared(inode);
+   offset = generic_file_llseek_size(file, offset, whence,
+ maxbytes, i_size_read(inode));
+   inode_unlock_shared(inode);
+   return offset;
case SEEK_HOLE:
inode_lock_shared(inode);
offset = iomap_seek_hole(inode, offset, &ext4_iomap_ops);
-- 
2.19.1



[Cluster-devel] [PATCH v1 4/4] overlayfs: fix race between llseek SEEK_END and write

2018-11-21 Thread Eiichi Tsukata
Implement an individual lock for SEEK_END for overlayfs, which directly calls
generic_file_llseek_size().

Signed-off-by: Eiichi Tsukata 
---
 fs/overlayfs/file.c | 23 ---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 84dd957efa24..57bc6538eea8 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -146,10 +146,27 @@ static int ovl_release(struct inode *inode, struct file 
*file)
 static loff_t ovl_llseek(struct file *file, loff_t offset, int whence)
 {
struct inode *realinode = ovl_inode_real(file_inode(file));
+   loff_t ret;
 
-   return generic_file_llseek_size(file, offset, whence,
-   realinode->i_sb->s_maxbytes,
-   i_size_read(realinode));
+   switch (whence) {
+   default:
+   return generic_file_llseek_size(file, offset, whence,
+   realinode->i_sb->s_maxbytes,
+   i_size_read(realinode));
+   case SEEK_END:
+   case SEEK_DATA:
+   case SEEK_HOLE:
+   /*
+* protects against inode size race with write so that llseek
+* doesn't see inode size being updated in individual fs write
+*/
+   inode_lock(realinode);
+   ret = generic_file_llseek_size(file, offset, whence,
+  realinode->i_sb->s_maxbytes,
+  i_size_read(realinode));
+   inode_unlock(realinode);
+   return ret;
+   }
 }
 
 static void ovl_file_accessed(struct file *file)
-- 
2.19.1



[Cluster-devel] [PATCH v1 1/4] vfs: fix race between llseek SEEK_END and write

2018-11-21 Thread Eiichi Tsukata
The commit ef3d0fd27e90 ("vfs: do (nearly) lockless generic_file_llseek")
removed almost all locks in llseek(), including SEEK_END. It was based on
the idea that write() updates the size atomically. But in fact, write()
can be divided into two or more parts in generic_perform_write() when pos
straddles a PAGE_SIZE boundary, which results in the size being updated
multiple times in one write(). It means that llseek() can see the size
being updated during write().

This race changes the behavior of some applications. 'tail' is one of
them: it reads the range [pos, pos_end], where pos_end is obtained via
llseek() SEEK_END. Sometimes, a read line could be broken.

reproducer:

  $ while true; do echo 123456 >> out; done
  $ while true; do tail out | grep -v 123456 ; done

example output (takes about 30 secs):

  12345
  1
  1234
  1
  12
  1234

This patch re-introduces generic_file_llseek_unlocked() and implements a
lock for SEEK_END/DATA/HOLE in generic_file_llseek(). I replaced all
generic_file_llseek() callers with the _unlocked() variant where they are
called with the inode lock already held.

All file systems which call generic_file_llseek_size() directly are
fixed in later commits.

Fixes: ef3d0fd27e90 ("vfs: do (nearly) lockless generic_file_llseek")
Signed-off-by: Eiichi Tsukata 
---
 fs/btrfs/file.c|  2 +-
 fs/fuse/file.c |  5 +++--
 fs/gfs2/file.c |  3 ++-
 fs/read_write.c| 37 ++---
 include/linux/fs.h |  2 ++
 5 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index a3c22e16509b..ec932fa0f8a9 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3256,7 +3256,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t 
offset, int whence)
switch (whence) {
case SEEK_END:
case SEEK_CUR:
-   offset = generic_file_llseek(file, offset, whence);
+   offset = generic_file_llseek_unlocked(file, offset, whence);
goto out;
case SEEK_DATA:
case SEEK_HOLE:
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index b52f9baaa3e7..e220b848929b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2336,13 +2336,14 @@ static loff_t fuse_file_llseek(struct file *file, 
loff_t offset, int whence)
case SEEK_SET:
case SEEK_CUR:
 /* No i_mutex protection necessary for SEEK_CUR and SEEK_SET */
-   retval = generic_file_llseek(file, offset, whence);
+   retval = generic_file_llseek_unlocked(file, offset, whence);
break;
case SEEK_END:
inode_lock(inode);
retval = fuse_update_attributes(inode, file);
if (!retval)
-   retval = generic_file_llseek(file, offset, whence);
+   retval = generic_file_llseek_unlocked(file, offset,
+ whence);
inode_unlock(inode);
break;
case SEEK_HOLE:
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 45a17b770d97..171df9550c27 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -66,7 +66,8 @@ static loff_t gfs2_llseek(struct file *file, loff_t offset, 
int whence)
error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_ANY,
   &i_gh);
if (!error) {
-   error = generic_file_llseek(file, offset, whence);
+   error = generic_file_llseek_unlocked(file, offset,
+whence);
gfs2_glock_dq_uninit(&i_gh);
}
break;
diff --git a/fs/read_write.c b/fs/read_write.c
index bfcb4ced5664..859dbac5b2f6 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -131,6 +131,24 @@ generic_file_llseek_size(struct file *file, loff_t offset, 
int whence,
 }
 EXPORT_SYMBOL(generic_file_llseek_size);
 
+/**
+ * generic_file_llseek_unlocked - lockless generic llseek implementation
+ * @file:  file structure to seek on
+ * @offset:file offset to seek to
+ * @whence:type of seek
+ *
+ */
+loff_t generic_file_llseek_unlocked(struct file *file, loff_t offset,
+   int whence)
+{
+   struct inode *inode = file->f_mapping->host;
+
+   return generic_file_llseek_size(file, offset, whence,
+   inode->i_sb->s_maxbytes,
+   i_size_read(inode));
+}
+EXPORT_SYMBOL(generic_file_llseek_unlocked);
+
 /**
  * generic_file_llseek - generic llseek implementation for regular files
  * @file:  file structure to seek on
@@ -144,10 +162,23 @@ EXPORT_SYMBOL(generic_file_llseek_size);
 loff_t generic_file_llseek(struct file *file, loff_t offset, int whence)
 {
struct inode *inode = file->f_mapping->host;
+   loff_t retval;
 
-   return generic_file_llseek_size(file, offset, whence,
-
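
A minimal sketch of the locked generic_file_llseek() described in the commit
message — SEEK_END/SEEK_DATA/SEEK_HOLE taken under the inode lock (assuming
the shared lock, as in the ext4 patch of this series; illustrative only, not
the original hunk):

	switch (whence) {
	case SEEK_SET:
	case SEEK_CUR:
		/* no i_size involved, keep these lockless */
		return generic_file_llseek_unlocked(file, offset, whence);
	default:
		/* SEEK_END/SEEK_DATA/SEEK_HOLE read i_size: serialize vs write */
		inode_lock_shared(inode);
		retval = generic_file_llseek_unlocked(file, offset, whence);
		inode_unlock_shared(inode);
		return retval;
	}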

[Cluster-devel] [PATCH v1 0/4] fs: fix race between llseek SEEK_END and write

2018-11-21 Thread Eiichi Tsukata
Some file systems (including ext4, xfs, ramfs ...) have the following
problem, as I've described in the commit message of the 1/4 patch.

  The commit ef3d0fd27e90 ("vfs: do (nearly) lockless generic_file_llseek")
  removed almost all locks in llseek(), including SEEK_END. It was based on
  the idea that write() updates the size atomically. But in fact, write()
  can be divided into two or more parts in generic_perform_write() when pos
  straddles a PAGE_SIZE boundary, which results in the size being updated
  multiple times in one write(). It means that llseek() can see the size
  being updated during write().

  This race changes the behavior of some applications. 'tail' is one of
  them: it reads the range [pos, pos_end], where pos_end is obtained via
  llseek() SEEK_END. Sometimes, a read line could be broken.

  reproducer:

$ while true; do echo 123456 >> out; done
$ while true; do tail out | grep -v 123456 ; done

  example output (takes about 30 secs):

12345
1
1234
1
12
1234

Note: Some file systems which individually implement llseek() and hold
  the inode mutex in it are not affected, e.g. btrfs, ocfs2.
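
The splitting happens in the copy loop of generic_perform_write(), which looks
roughly like the sketch below (simplified, error handling elided); i_size is
published once per page-sized chunk via ->write_end(), so an unlocked SEEK_END
can observe any of the intermediate sizes:

	/* simplified sketch of mm/filemap.c:generic_perform_write() */
	do {
		unsigned long bytes = min_t(unsigned long,
					    PAGE_SIZE - offset_in_page(pos),
					    iov_iter_count(i));

		a_ops->write_begin(file, mapping, pos, bytes, flags,
				   &page, &fsdata);
		copied = iov_iter_copy_from_user_atomic(page, i,
					offset_in_page(pos), bytes);
		a_ops->write_end(file, mapping, pos, bytes, copied,
				 page, fsdata);
		/* generic_write_end() -> i_size_write() when pos + copied
		 * crosses the old i_size: this is the window SEEK_END sees */
		pos += copied;
		written += copied;
	} while (iov_iter_count(i));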

Patch 1: re-implements the inode lock for SEEK_END in generic_file_llseek()
Patches 2 to 4: implement the inode lock for SEEK_END in each file system

Eiichi Tsukata (4):
  vfs: fix race between llseek SEEK_END and write
  ext4: fix race between llseek SEEK_END and write
  f2fs: fix race between llseek SEEK_END and write
  overlayfs: fix race between llseek SEEK_END and write

 fs/btrfs/file.c |  2 +-
 fs/ext4/file.c  | 10 ++
 fs/f2fs/file.c  |  6 +-
 fs/fuse/file.c  |  5 +++--
 fs/gfs2/file.c  |  3 ++-
 fs/overlayfs/file.c | 23 ---
 fs/read_write.c | 37 ++---
 include/linux/fs.h  |  2 ++
 8 files changed, 73 insertions(+), 15 deletions(-)

-- 
2.19.1



[Cluster-devel] [PATCH v1 3/4] f2fs: fix race between llseek SEEK_END and write

2018-11-21 Thread Eiichi Tsukata
This patch itself seems to be just a cleanup, but together with the
commit b25bd1d9fd87 ("vfs: fix race between llseek SEEK_END and write")
it fixes a race.

Signed-off-by: Eiichi Tsukata 
---
 fs/f2fs/file.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 88b124677189..14a923607eff 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -447,15 +447,11 @@ static loff_t f2fs_seek_block(struct file *file, loff_t 
offset, int whence)
 
 static loff_t f2fs_llseek(struct file *file, loff_t offset, int whence)
 {
-   struct inode *inode = file->f_mapping->host;
-   loff_t maxbytes = inode->i_sb->s_maxbytes;
-
switch (whence) {
case SEEK_SET:
case SEEK_CUR:
case SEEK_END:
-   return generic_file_llseek_size(file, offset, whence,
-   maxbytes, i_size_read(inode));
+   return generic_file_llseek(file, offset, whence);
case SEEK_DATA:
case SEEK_HOLE:
if (offset < 0)
-- 
2.19.1



Re: [Cluster-devel] [PATCH V11 17/19] block: document usage of bio iterator helpers

2018-11-21 Thread Nikolay Borisov



On 21.11.18 at 5:23, Ming Lei wrote:
> Now that multi-page bvec is supported, some helpers may return data page
> by page, while others may return it segment by segment; this patch
> documents the usage.
> 
> Signed-off-by: Ming Lei 
> ---
>  Documentation/block/biovecs.txt | 24 
>  1 file changed, 24 insertions(+)
> 
> diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
> index 25689584e6e0..bb008f7afb05 100644
> --- a/Documentation/block/biovecs.txt
> +++ b/Documentation/block/biovecs.txt
> @@ -117,3 +117,27 @@ Other implications:
> size limitations and the limitations of the underlying devices. Thus
> there's no need to define ->merge_bvec_fn() callbacks for individual block
> drivers.
> +
> +Usage of helpers:
> +=
> +
> +* The following helpers whose names have the suffix of "_all" can only be 
> used
> +on non-BIO_CLONED bio. They are usually used by filesystem code. Drivers
> +shouldn't use them because the bio may have been split before it reached the
> +driver.
> +
> + bio_for_each_segment_all()
> + bio_first_bvec_all()
> + bio_first_page_all()
> + bio_last_bvec_all()
> +
> +* The following helpers iterate over single-page bvecs. The passed 'struct
> +bio_vec' will contain a single-page IO vector during the iteration
> +
> + bio_for_each_segment()
> + bio_for_each_segment_all()
> +
> +* The following helpers iterate over single-page bvecs. The passed 'struct
> +bio_vec' will contain a single-page IO vector during the iteration
> +
> + bio_for_each_bvec()

Just put this helper right below the above 2, no need to repeat the
explanation. Also I'd suggest introducing another catch-all sentence
"All other helpers are assumed to iterate multipage bio vecs" and
perhaps give an example with 1-2 helpers.
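
One possible wording along those lines (just a sketch, not the patch):

	* All other helpers iterate over multi-page bvecs: the passed 'struct
	bio_vec' may cover more than one page, e.g.

		bio_for_each_bvec()
		rq_for_each_bvec()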

> 



Re: [Cluster-devel] [PATCH v1 3/4] f2fs: fix race between llseek SEEK_END and write

2018-11-21 Thread Christoph Hellwig
On Wed, Nov 21, 2018 at 11:43:59AM +0900, Eiichi Tsukata wrote:
> This patch itself seems to be just a cleanup but with the
> commit b25bd1d9fd87 ("vfs: fix race between llseek SEEK_END and write")
> it fixes race.

Please move this patch to the beginning of the series and replace
the commit log with something like the one below.  Note that your
commit id is different from the one that will appear once applied
upstream, so the above isn't too helpful.

---
f2fs: use generic_file_llseek

f2fs always passes inode->i_sb->s_maxbytes to generic_file_llseek_size,
and thus should simply use generic_file_llseek.  For now this is just
a cleanup, but it will allow f2fs to pick up a race fix in
generic_file_llseek for free.



Re: [Cluster-devel] [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-21 Thread Ming Lei
On Tue, Nov 20, 2018 at 09:35:07PM -0800, Sagi Grimberg wrote:
> 
> > > Wait, I see that the bvec is still a single array per bio. When you said
> > > a table I thought you meant a 2-dimentional array...
> > 
> > I mean a new 1-d table A has to be created for multiple bios in one rq,
> > and build it in the following way
> > 
> > rq_for_each_bvec(tmp, rq, rq_iter)
> >  *A = tmp;
> > 
> > Then you can pass A to iov_iter_bvec() & send().
> > 
> > Given it is over TCP, I guess it should be doable for you to preallocate one
> > 256-bvec table in one page for each request, then set the max segment size as
> > (unsigned int)-1 and the max segment number as 256; the preallocated table
> > should work anytime.
> 
> A 256-bvec table is really a lot to preallocate, especially when it's not
> needed; I can easily initialize the bvec_iter on the bio bvec. If this
> involves preallocating for the worst case, then I don't consider this to
> be an improvement.

If you don't provide one single bvec table, I understand you may not be
able to send this req via one send().

The bvec_iter initialization is easy to do:

bvec_iter = bio->bi_iter

When you move to a new bio, please refer to __bio_for_each_bvec() or
__bio_for_each_segment().
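
A rough sketch of the preallocated-table approach quoted above, given a
struct request *rq (setup_send_iter() and the tbl argument are hypothetical;
rq_for_each_bvec() is from this series, iov_iter_bvec()/blk_rq_bytes() are the
stock helpers):

	/* build one iov_iter covering the whole request from a table
	 * preallocated next to the request (256 entries per the quote) */
	static void setup_send_iter(struct request *rq, struct bio_vec *tbl,
				    struct iov_iter *iter)
	{
		struct req_iterator rq_iter;
		struct bio_vec bv;
		unsigned int nr = 0;

		rq_for_each_bvec(bv, rq, rq_iter)
			tbl[nr++] = bv;	/* fits if max segments is capped at 256 */

		iov_iter_bvec(iter, WRITE, tbl, nr, blk_rq_bytes(rq));
		/* ... then hand *iter to sock_sendmsg()/kernel_sendmsg() */
	}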

Thanks,
Ming



Re: [Cluster-devel] [PATCH V11 02/19] block: introduce multi-page bvec helpers

2018-11-21 Thread Christoph Hellwig
On Wed, Nov 21, 2018 at 11:23:10AM +0800, Ming Lei wrote:
> This patch introduces helpers of 'segment_iter_*' for multipage
> bvec support.
> 
> The introduced helpers treat one bvec as a real multi-page segment,
> which may include more than one page.

Unless I'm missing something these bvec vs segment names are exactly
inverted vs how we use them elsewhere.

In the iterators we use segment for single-page bvecs, and bvec for
multi-page ones, and here it is the inverse.  Please switch it around.



Re: [Cluster-devel] [PATCH V11 03/19] block: introduce bio_for_each_bvec()

2018-11-21 Thread Christoph Hellwig
> +#define bio_iter_mp_iovec(bio, iter) \
> + segment_iter_bvec((bio)->bi_io_vec, (iter))

Besides the mp naming, which we'd like to get rid of, there also is just
a single user of this macro, please just expand it there.

> +#define segment_iter_bvec(bvec, iter)\
> +((struct bio_vec) {  \
> + .bv_page= segment_iter_page((bvec), (iter)),\
> + .bv_len = segment_iter_len((bvec), (iter)), \
> + .bv_offset  = segment_iter_offset((bvec), (iter)),  \
> +})

And for this one, please keep the segment vs bvec versions of these
macros close together in the file; right now it follows the
bvec_iter_bvec variant closely.

> +static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter 
> *iter,
> +   unsigned bytes, unsigned max_seg_len)
>  {
>   iter->bi_sector += bytes >> 9;
>  
>   if (bio_no_advance_iter(bio))
>   iter->bi_size -= bytes;
>   else
> - bvec_iter_advance(bio->bi_io_vec, iter, bytes);
> + __bvec_iter_advance(bio->bi_io_vec, iter, bytes, max_seg_len);
>   /* TODO: It is reasonable to complete bio with error here. */
>  }
>  
> +static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
> + unsigned bytes)
> +{
> + __bio_advance_iter(bio, iter, bytes, PAGE_SIZE);
> +}

Btw, I think the remaining users of bio_advance_iter() in bio.h
should probably switch to using __bio_advance_iter to make them a little
more clear to read.

> +/* returns one real segment(multi-page bvec) each time */

space before the brace, please.

> +#define BVEC_MAX_LEN  ((unsigned int)-1)

>   while (bytes) {
> + unsigned segment_len = segment_iter_len(bv, *iter);
>  
> - iter->bi_bvec_done += len;
> + if (max_seg_len < BVEC_MAX_LEN)
> + segment_len = min_t(unsigned, segment_len,
> + max_seg_len -
> + bvec_iter_offset(bv, *iter));
> +
> + segment_len = min(bytes, segment_len);

Please stick to passing the magic zero here, as that can often generate
more efficient code.

Talking about efficient code - I wonder how much code size we'd save
by moving this function out of line...

But while looking over this I wonder why we even need the max_seg_len
here.  The only thing __bvec_iter_advance does is to move bi_bvec_done
and bi_idx forward, with corresponding decrements of bi_size.  As far
as I can tell the only thing that max_seg_len does is that we need
more iterations of the while loop to achieve the same thing.

And actual bvec used by the caller will be obtained using
bvec_iter_bvec or segment_iter_bvec depending on if they want multi-page
or single-page variants.



Re: [Cluster-devel] [PATCH V11 10/19] block: loop: pass multi-page bvec to iov_iter

2018-11-21 Thread Christoph Hellwig
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 1ad6eafc43f2..a281b6737b61 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -805,6 +805,10 @@ struct req_iterator {
>   __rq_for_each_bio(_iter.bio, _rq)   \
>   bio_for_each_segment(bvl, _iter.bio, _iter.iter)
>  
> +#define rq_for_each_bvec(bvl, _rq, _iter)\
> + __rq_for_each_bio(_iter.bio, _rq)   \
> + bio_for_each_bvec(bvl, _iter.bio, _iter.iter)
> +

I think this should go into the patch adding bio_for_each_bvec and
friends.

Otherwise looks fine:

Reviewed-by: Christoph Hellwig 



Re: [Cluster-devel] [PATCH V11 11/19] bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()

2018-11-21 Thread Christoph Hellwig
On Wed, Nov 21, 2018 at 11:23:19AM +0800, Ming Lei wrote:
> bch_bio_alloc_pages() is always called on one new bio, so it is safe
> to access the bvec table directly. Given it is the only case of this
> kind, open code the bvec table access, since bio_for_each_segment_all()
> will be changed to support iterating over multi-page bvecs.

Looks good,

Reviewed-by: Christoph Hellwig 



Re: [Cluster-devel] [PATCH V11 12/19] block: allow bio_for_each_segment_all() to iterate over multi-page bvec

2018-11-21 Thread Christoph Hellwig
On Wed, Nov 21, 2018 at 11:23:20AM +0800, Ming Lei wrote:
> This patch introduces one extra iterator variable to bio_for_each_segment_all(),
> then we can allow bio_for_each_segment_all() to iterate over multi-page bvecs.
> 
> Given it is just one mechanical & simple change on all bio_for_each_segment_all()
> users, this patch does the tree-wide change in one single patch, so that we can
> avoid using a temporary helper for this conversion.
> 
> Signed-off-by: Ming Lei 

Looks good,

Reviewed-by: Christoph Hellwig 



Re: [Cluster-devel] [PATCH V11 14/19] block: handle non-cluster bio out of blk_bio_segment_split

2018-11-21 Thread Christoph Hellwig
> + non-cluster.o

Do we really need a new source file for these few functions?

>   default:
> + if (!blk_queue_cluster(q)) {
> + blk_queue_non_cluster_bio(q, bio);
> + return;

I'd name this blk_bio_segment_split_singlepage or similar.

> +static __init int init_non_cluster_bioset(void)
> +{
> + WARN_ON(bioset_init(&non_cluster_bio_set, BIO_POOL_SIZE, 0,
> +BIOSET_NEED_BVECS));
> + WARN_ON(bioset_integrity_create(&non_cluster_bio_set, BIO_POOL_SIZE));
> + WARN_ON(bioset_init(&non_cluster_bio_split, BIO_POOL_SIZE, 0, 0));

Please only allocate the resources once a queue without the cluster
flag is registered; there are only very few modern drivers that do that.

> +static void non_cluster_end_io(struct bio *bio)
> +{
> + struct bio *bio_orig = bio->bi_private;
> +
> + bio_orig->bi_status = bio->bi_status;
> + bio_endio(bio_orig);
> + bio_put(bio);
> +}

Why can't we use bio_chain for the split bios?

> + bio_for_each_segment(from, *bio_orig, iter) {
> + if (i++ < max_segs)
> + sectors += from.bv_len >> 9;
> + else
> + break;
> + }

The easy-to-read way would be:

bio_for_each_segment(from, *bio_orig, iter) {
if (i++ == max_segs)
break;
sectors += from.bv_len >> 9;
}

> + if (sectors < bio_sectors(*bio_orig)) {
> + bio = bio_split(*bio_orig, sectors, GFP_NOIO,
> + &non_cluster_bio_split);
> + bio_chain(bio, *bio_orig);
> + generic_make_request(*bio_orig);
> + *bio_orig = bio;

I don't think this is very efficient, as this means we now
clone the bio twice, first to split it at the sector boundary,
and then again when converting it to single-page bio_vec.

I think this could be something like this (totally untested):

diff --git a/block/non-cluster.c b/block/non-cluster.c
index 9c2910be9404..60389f275c43 100644
--- a/block/non-cluster.c
+++ b/block/non-cluster.c
@@ -13,58 +13,59 @@
 
 #include "blk.h"
 
-static struct bio_set non_cluster_bio_set, non_cluster_bio_split;
+static struct bio_set non_cluster_bio_set;
 
 static __init int init_non_cluster_bioset(void)
 {
WARN_ON(bioset_init(&non_cluster_bio_set, BIO_POOL_SIZE, 0,
   BIOSET_NEED_BVECS));
WARN_ON(bioset_integrity_create(&non_cluster_bio_set, BIO_POOL_SIZE));
-   WARN_ON(bioset_init(&non_cluster_bio_split, BIO_POOL_SIZE, 0, 0));
 
return 0;
 }
 __initcall(init_non_cluster_bioset);
 
-static void non_cluster_end_io(struct bio *bio)
-{
-   struct bio *bio_orig = bio->bi_private;
-
-   bio_orig->bi_status = bio->bi_status;
-   bio_endio(bio_orig);
-   bio_put(bio);
-}
-
 void blk_queue_non_cluster_bio(struct request_queue *q, struct bio **bio_orig)
 {
-   struct bio *bio;
struct bvec_iter iter;
-   struct bio_vec from;
-   unsigned i = 0;
-   unsigned sectors = 0;
-   unsigned short max_segs = min_t(unsigned short, BIO_MAX_PAGES,
-   queue_max_segments(q));
+   struct bio *bio;
+   struct bio_vec bv;
+   unsigned short max_segs, segs = 0;
+
+   bio = bio_alloc_bioset(GFP_NOIO, bio_segments(*bio_orig),
+   &non_cluster_bio_set);
+   bio->bi_disk= (*bio_orig)->bi_disk;
+   bio->bi_partno  = (*bio_orig)->bi_partno;
+   bio_set_flag(bio, BIO_CLONED);
+   if (bio_flagged(*bio_orig, BIO_THROTTLED))
+   bio_set_flag(bio, BIO_THROTTLED);
+   bio->bi_opf = (*bio_orig)->bi_opf;
+   bio->bi_ioprio  = (*bio_orig)->bi_ioprio;
+   bio->bi_write_hint  = (*bio_orig)->bi_write_hint;
+   bio->bi_iter.bi_sector  = (*bio_orig)->bi_iter.bi_sector;
+   bio->bi_iter.bi_size= (*bio_orig)->bi_iter.bi_size;
+
+   if (bio_integrity(*bio_orig))
+   bio_integrity_clone(bio, *bio_orig, GFP_NOIO);
 
-   bio_for_each_segment(from, *bio_orig, iter) {
-   if (i++ < max_segs)
-   sectors += from.bv_len >> 9;
-   else
+   bio_clone_blkcg_association(bio, *bio_orig);
+
+   max_segs = min_t(unsigned short, queue_max_segments(q), BIO_MAX_PAGES);
+   bio_for_each_segment(bv, *bio_orig, iter) {
+   bio->bi_io_vec[segs++] = bv;
+   if (segs++ == max_segs)
break;
}
 
-   if (sectors < bio_sectors(*bio_orig)) {
-   bio = bio_split(*bio_orig, sectors, GFP_NOIO,
-   &non_cluster_bio_split);
-   bio_chain(bio, *bio_orig);
-   generic_make_request(*bio_orig);
-   *bio_orig = bio;
-   }
-   bio = bio_clone_bioset(*bio_orig, GFP_NOIO, &non_cluster_bio_set);
+   bio->bi_vcnt = se

Re: [Cluster-devel] [PATCH V11 17/19] block: document usage of bio iterator helpers

2018-11-21 Thread Christoph Hellwig
On Wed, Nov 21, 2018 at 09:45:25AM +0200, Nikolay Borisov wrote:
> > +   bio_for_each_segment_all()
> > +   bio_first_bvec_all()
> > +   bio_first_page_all()
> > +   bio_last_bvec_all()
> > +
> > +* The following helpers iterate over single-page bvecs. The passed 'struct
> > +bio_vec' will contain a single-page IO vector during the iteration
> > +
> > +   bio_for_each_segment()
> > +   bio_for_each_segment_all()
> > +
> > +* The following helpers iterate over single-page bvecs. The passed 'struct
> > +bio_vec' will contain a single-page IO vector during the iteration
> > +
> > +   bio_for_each_bvec()
> 
> Just put this helper right below the above 2, no need to repeat the
> explanation. Also I'd suggest introducing another catch-all sentence
> "All other helpers are assumed to iterate multipage bio vecs" and
> perhaps give an example with 1-2 helpers.

Well, I think the second explanation is wrong - bio_for_each_bvec
iterates over whole bvecs, not just single pages.



Re: [Cluster-devel] [PATCH V11 15/19] block: enable multipage bvecs

2018-11-21 Thread Christoph Hellwig
On Wed, Nov 21, 2018 at 11:23:23AM +0800, Ming Lei wrote:
>   if (bio->bi_vcnt > 0) {
> - struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
> + struct bio_vec bv;
> + struct bio_vec *seg = &bio->bi_io_vec[bio->bi_vcnt - 1];
>  
> - if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
> - bv->bv_len += len;
> + bvec_last_segment(seg, &bv);
> +
> + if (page == bv.bv_page && off == bv.bv_offset + bv.bv_len) {

I think we can simplify the try-to-merge-into-bio case a bit,
and also document it better with something like this:

diff --git a/block/bio.c b/block/bio.c
index 854676edc438..cc913281a723 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -822,54 +822,40 @@ EXPORT_SYMBOL(bio_add_pc_page);
  * @page: page to add
  * @len: length of the data to add
  * @off: offset of the data in @page
+ * @same_page: if %true only merge if the new data is in the same physical
+ * page as the last segment of the bio.
  *
- * Try to add the data at @page + @off to the last page of @bio.  This is a
+ * Try to add the data at @page + @off to the last bvec of @bio.  This is a
  * a useful optimisation for file systems with a block size smaller than the
  * page size.
  *
  * Return %true on success or %false on failure.
  */
 bool __bio_try_merge_page(struct bio *bio, struct page *page,
-   unsigned int len, unsigned int off)
+   unsigned int len, unsigned int off, bool same_page)
 {
if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
return false;
 
if (bio->bi_vcnt > 0) {
-   struct bio_vec bv;
-   struct bio_vec *seg = &bio->bi_io_vec[bio->bi_vcnt - 1];
-
-   bvec_last_segment(seg, &bv);
-
-   if (page == bv.bv_page && off == bv.bv_offset + bv.bv_len) {
-   seg->bv_len += len;
-   bio->bi_iter.bi_size += len;
-   return true;
-   }
+   struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+   phys_addr_t vec_addr = page_to_phys(bv->bv_page);
+   phys_addr_t page_addr = page_to_phys(page);
+
+   if (vec_addr + bv->bv_offset + bv->bv_len != page_addr + off)
+   return false;
+   if (same_page &&
+   (vec_addr & PAGE_SIZE) != (page_addr & PAGE_SIZE))
+   return false;
+
+   bv->bv_len += len;
+   bio->bi_iter.bi_size += len;
+   return true;
}
return false;
 }
 EXPORT_SYMBOL_GPL(__bio_try_merge_page);
 
-static bool bio_try_merge_segment(struct bio *bio, struct page *page,
- unsigned int len, unsigned int off)
-{
-   if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
-   return false;
-
-   if (bio->bi_vcnt > 0) {
-   struct bio_vec *seg = &bio->bi_io_vec[bio->bi_vcnt - 1];
-
-   if (page_to_phys(seg->bv_page) + seg->bv_offset + seg->bv_len ==
-   page_to_phys(page) + off) {
-   seg->bv_len += len;
-   bio->bi_iter.bi_size += len;
-   return true;
-   }
-   }
-   return false;
-}
-
 /**
  * __bio_add_page - add page to a bio in a new segment
  * @bio: destination bio
@@ -910,7 +896,7 @@ EXPORT_SYMBOL_GPL(__bio_add_page);
 int bio_add_page(struct bio *bio, struct page *page,
 unsigned int len, unsigned int offset)
 {
-   if (!bio_try_merge_segment(bio, page, len, offset)) {
+   if (!__bio_try_merge_page(bio, page, len, offset, false)) {
if (bio_full(bio))
return 0;
__bio_add_page(bio, page, len, offset);
diff --git a/fs/iomap.c b/fs/iomap.c
index ccc2ba115f4d..d918acb9bfc9 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -313,7 +313,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
loff_t length, void *data,
 */
sector = iomap_sector(iomap, pos);
if (ctx->bio && bio_end_sector(ctx->bio) == sector) {
-   if (__bio_try_merge_page(ctx->bio, page, plen, poff))
+   if (__bio_try_merge_page(ctx->bio, page, plen, poff, true))
goto done;
is_contig = true;
}
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 5c2190216614..b9fd44168f61 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -616,7 +616,7 @@ xfs_add_to_ioend(
bdev, sector);
}
 
-   if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff)) {
+   if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff, true)) {
if (iop)
atomic_inc(&iop->write_count);
if (bio_full(wpc->ioend->io_bio))
diff --git a/include/linux/bio.h b/include/linux/bio.h
index e5b97

Re: [Cluster-devel] [PATCH V11 02/19] block: introduce multi-page bvec helpers

2018-11-21 Thread Ming Lei
On Wed, Nov 21, 2018 at 02:19:28PM +0100, Christoph Hellwig wrote:
> On Wed, Nov 21, 2018 at 11:23:10AM +0800, Ming Lei wrote:
> > This patch introduces helpers of 'segment_iter_*' for multipage
> > bvec support.
> > 
> > The introduced helpers treat one bvec as a real multi-page segment,
> > which may include more than one page.
> 
> Unless I'm missing something these bvec vs segment names are exactly
> inverted vs how we use it elsewhere.
> 
> In the iterators we use segment for single-page bvec, and bvec for multi
> page ones, and here it is inverse.  Please switch it around.

bvec_iter_* is used for single-page bvecs in the current Linus tree, and there
are lots of users now:

[linux]$ git grep -n "bvec_iter_*" ./ | wc
191 995   13242

If we have to switch it first, it would be a big change; just wondering if Jens
is happy with that?

Thanks,
Ming



Re: [Cluster-devel] [PATCH V11 03/19] block: introduce bio_for_each_bvec()

2018-11-21 Thread Ming Lei
On Wed, Nov 21, 2018 at 02:32:44PM +0100, Christoph Hellwig wrote:
> > +#define bio_iter_mp_iovec(bio, iter)   \
> > +   segment_iter_bvec((bio)->bi_io_vec, (iter))
> 
> Besides the mp naming we'd like to get rid off there also is just
> a single user of this macro, please just expand it there.

OK.

> 
> > +#define segment_iter_bvec(bvec, iter)  \
> > +((struct bio_vec) {
> > \
> > +   .bv_page= segment_iter_page((bvec), (iter)),\
> > +   .bv_len = segment_iter_len((bvec), (iter)), \
> > +   .bv_offset  = segment_iter_offset((bvec), (iter)),  \
> > +})
> 
> And for this one please keep the segment vs bvec versions of these
> macros close together in the file please, right now it follow the
> bvec_iter_bvec variant closely.

OK.

> 
> > +static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter 
> > *iter,
> > + unsigned bytes, unsigned max_seg_len)
> >  {
> > iter->bi_sector += bytes >> 9;
> >  
> > if (bio_no_advance_iter(bio))
> > iter->bi_size -= bytes;
> > else
> > -   bvec_iter_advance(bio->bi_io_vec, iter, bytes);
> > +   __bvec_iter_advance(bio->bi_io_vec, iter, bytes, max_seg_len);
> > /* TODO: It is reasonable to complete bio with error here. */
> >  }
> >  
> > +static inline void bio_advance_iter(struct bio *bio, struct bvec_iter 
> > *iter,
> > +   unsigned bytes)
> > +{
> > +   __bio_advance_iter(bio, iter, bytes, PAGE_SIZE);
> > +}
> 
> Btw, I think the remaining users of bio_advance_iter() in bio.h
> should probably switch to using __bio_advance_iter to make them a little
> more clear to read.

Good point.

> 
> > +/* returns one real segment(multi-page bvec) each time */
> 
> space before the brace, please.

OK.

> 
> > +#define BVEC_MAX_LEN  ((unsigned int)-1)
> 
> > while (bytes) {
> > +   unsigned segment_len = segment_iter_len(bv, *iter);
> >  
> > -   iter->bi_bvec_done += len;
> > +   if (max_seg_len < BVEC_MAX_LEN)
> > +   segment_len = min_t(unsigned, segment_len,
> > +   max_seg_len -
> > +   bvec_iter_offset(bv, *iter));
> > +
> > +   segment_len = min(bytes, segment_len);
> 
> Please stick to passing the magic zero here, as that can often generate more
> efficient code.

But zero may decrease the code readability. Actually the passed
'max_seg_len' is just a constant, and the compiler should generate
equally efficient code for any constant, whether 0 or something else.

> 
> Talking about efficient code - I wonder how much code size we'd save
> by moving this function out of line...

That is a good point, see the following diff:

[mingl@hp kernel]$ diff -u inline.size non_inline.size
--- inline.size 2018-11-21 23:24:52.305312076 +0800
+++ non_inline.size 2018-11-21 23:24:59.908393010 +0800
@@ -1,2 +1,2 @@
text   data bss dec hex filename
-13429213   6893922 4292692 246158271779b93 vmlinux.inline
+13429153   6893346 4292692 246151911779917 vmlinux.non_inline

vmlinux (non_inline) is built by just moving/exporting __bvec_iter_advance()
into block/bio.c.

The difference is about 276 bytes.

> 
> But while looking over this I wonder why we even need the max_seg_len
> here.  The only thing __bvec_iter_advance does is to move bi_bvec_done
> and bi_idx forward, with corresponding decrements of bi_size.  As far
> as I can tell the only thing that max_seg_len does is that we need
> more iterations of the while loop to achieve the same thing.
> 
> And actual bvec used by the caller will be obtained using
> bvec_iter_bvec or segment_iter_bvec depending on if they want multi-page
> or single-page variants.

Right, we let __bvec_iter_advance() serve both the multi-page and single-page
cases, so we have to tell it which one way or another; right now we use the
'max_seg_len' constant.

Or do you suggest implementing two versions of __bvec_iter_advance()?

Thanks,
Ming



Re: [Cluster-devel] [PATCH V11 14/19] block: handle non-cluster bio out of blk_bio_segment_split

2018-11-21 Thread Ming Lei
On Wed, Nov 21, 2018 at 03:33:55PM +0100, Christoph Hellwig wrote:
> > +   non-cluster.o
> 
> Do we really need a new source file for these few functions?
> 
> > default:
> > +   if (!blk_queue_cluster(q)) {
> > +   blk_queue_non_cluster_bio(q, bio);
> > +   return;
> 
> I'd name this blk_bio_segment_split_singlepage or similar.

OK.

> 
> > +static __init int init_non_cluster_bioset(void)
> > +{
> > +   WARN_ON(bioset_init(&non_cluster_bio_set, BIO_POOL_SIZE, 0,
> > +  BIOSET_NEED_BVECS));
> > +   WARN_ON(bioset_integrity_create(&non_cluster_bio_set, BIO_POOL_SIZE));
> > +   WARN_ON(bioset_init(&non_cluster_bio_split, BIO_POOL_SIZE, 0, 0));
> 
> Please only allocate the resources once a queue without the cluster
> flag is registered, there are only very few modern drivers that do that.

OK.

> 
> > +static void non_cluster_end_io(struct bio *bio)
> > +{
> > +   struct bio *bio_orig = bio->bi_private;
> > +
> > +   bio_orig->bi_status = bio->bi_status;
> > +   bio_endio(bio_orig);
> > +   bio_put(bio);
> > +}
> 
> Why can't we use bio_chain for the split bios?

The parent bio is a multi-page bvec bio; we can't submit it for the non-cluster case.

> 
> > +   bio_for_each_segment(from, *bio_orig, iter) {
> > +   if (i++ < max_segs)
> > +   sectors += from.bv_len >> 9;
> > +   else
> > +   break;
> > +   }
> 
> The easy to read way would be:
> 
>   bio_for_each_segment(from, *bio_orig, iter) {
>   if (i++ == max_segs)
>   break;
>   sectors += from.bv_len >> 9;
>   }

OK.

> 
> > +   if (sectors < bio_sectors(*bio_orig)) {
> > +   bio = bio_split(*bio_orig, sectors, GFP_NOIO,
> > +   &non_cluster_bio_split);
> > +   bio_chain(bio, *bio_orig);
> > +   generic_make_request(*bio_orig);
> > +   *bio_orig = bio;
> 
> I don't think this is very efficient, as this means we now
> clone the bio twice, first to split it at the sector boundary,
> and then again when converting it to single-page bio_vec.

That is exactly what the bounce code does. The problem for both bounce
and non-cluster is actually the same, because the bvec table itself has
to be changed.

> 
> I think this could be something like this (totally untested):
> 
> diff --git a/block/non-cluster.c b/block/non-cluster.c
> index 9c2910be9404..60389f275c43 100644
> --- a/block/non-cluster.c
> +++ b/block/non-cluster.c
> @@ -13,58 +13,59 @@
>  
>  #include "blk.h"
>  
> -static struct bio_set non_cluster_bio_set, non_cluster_bio_split;
> +static struct bio_set non_cluster_bio_set;
>  
>  static __init int init_non_cluster_bioset(void)
>  {
>   WARN_ON(bioset_init(&non_cluster_bio_set, BIO_POOL_SIZE, 0,
>  BIOSET_NEED_BVECS));
>   WARN_ON(bioset_integrity_create(&non_cluster_bio_set, BIO_POOL_SIZE));
> - WARN_ON(bioset_init(&non_cluster_bio_split, BIO_POOL_SIZE, 0, 0));
>  
>   return 0;
>  }
>  __initcall(init_non_cluster_bioset);
>  
> -static void non_cluster_end_io(struct bio *bio)
> -{
> - struct bio *bio_orig = bio->bi_private;
> -
> - bio_orig->bi_status = bio->bi_status;
> - bio_endio(bio_orig);
> - bio_put(bio);
> -}
> -
>  void blk_queue_non_cluster_bio(struct request_queue *q, struct bio 
> **bio_orig)
>  {
> - struct bio *bio;
>   struct bvec_iter iter;
> - struct bio_vec from;
> - unsigned i = 0;
> - unsigned sectors = 0;
> - unsigned short max_segs = min_t(unsigned short, BIO_MAX_PAGES,
> - queue_max_segments(q));
> + struct bio *bio;
> + struct bio_vec bv;
> + unsigned short max_segs, segs = 0;
> +
> + bio = bio_alloc_bioset(GFP_NOIO, bio_segments(*bio_orig),
> + &non_cluster_bio_set);

bio_segments(*bio_orig) may be > 256, so bio_alloc_bioset() may fail.

Thanks,
Ming



Re: [Cluster-devel] [PATCH V11 15/19] block: enable multipage bvecs

2018-11-21 Thread Ming Lei
On Wed, Nov 21, 2018 at 03:55:02PM +0100, Christoph Hellwig wrote:
> On Wed, Nov 21, 2018 at 11:23:23AM +0800, Ming Lei wrote:
> > if (bio->bi_vcnt > 0) {
> > -   struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
> > +   struct bio_vec bv;
> > +   struct bio_vec *seg = &bio->bi_io_vec[bio->bi_vcnt - 1];
> >  
> > -   if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
> > -   bv->bv_len += len;
> > +   bvec_last_segment(seg, &bv);
> > +
> > +   if (page == bv.bv_page && off == bv.bv_offset + bv.bv_len) {
> 
> I think we can simplify the try-to-merge-into-bio case a bit,
> and also document it better with something like this:
> 
> diff --git a/block/bio.c b/block/bio.c
> index 854676edc438..cc913281a723 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -822,54 +822,40 @@ EXPORT_SYMBOL(bio_add_pc_page);
>   * @page: page to add
>   * @len: length of the data to add
>   * @off: offset of the data in @page
> + * @same_page: if %true only merge if the new data is in the same physical
> + *   page as the last segment of the bio.
>   *
> - * Try to add the data at @page + @off to the last page of @bio.  This is a
> + * Try to add the data at @page + @off to the last bvec of @bio.  This is a
>   * a useful optimisation for file systems with a block size smaller than the
>   * page size.
>   *
>   * Return %true on success or %false on failure.
>   */
>  bool __bio_try_merge_page(struct bio *bio, struct page *page,
> - unsigned int len, unsigned int off)
> + unsigned int len, unsigned int off, bool same_page)
>  {
>   if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
>   return false;
>  
>   if (bio->bi_vcnt > 0) {
> - struct bio_vec bv;
> - struct bio_vec *seg = &bio->bi_io_vec[bio->bi_vcnt - 1];
> -
> - bvec_last_segment(seg, &bv);
> -
> - if (page == bv.bv_page && off == bv.bv_offset + bv.bv_len) {
> - seg->bv_len += len;
> - bio->bi_iter.bi_size += len;
> - return true;
> - }
> + struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
> + phys_addr_t vec_addr = page_to_phys(bv->bv_page);
> + phys_addr_t page_addr = page_to_phys(page);
> +
> + if (vec_addr + bv->bv_offset + bv->bv_len != page_addr + off)
> + return false;
> + if (same_page &&
> + (vec_addr & PAGE_SIZE) != (page_addr & PAGE_SIZE))
> + return false;

I guess the correct check should be:

end_addr = vec_addr + bv->bv_offset + bv->bv_len;
if (same_page &&
(end_addr & PAGE_MASK) != (page_addr & PAGE_MASK))
return false;

And this approach is good, will take it in V12.

Thanks,
Ming



Re: [Cluster-devel] [PATCH V11 02/19] block: introduce multi-page bvec helpers

2018-11-21 Thread Christoph Hellwig
On Wed, Nov 21, 2018 at 11:06:11PM +0800, Ming Lei wrote:
> bvec_iter_* is used for single-page bvec in current linus tree, and there are
> lots of users now:
> 
> [linux]$ git grep -n "bvec_iter_*" ./ | wc
> 191 995   13242
> 
> If we have to switch it first, it can be a big change, just wondering if Jens
> is happy with that?

Your above grep statement seems to catch every use of struct bvec_iter,
due to the *.

Most uses of bvec_iter_ are either in the block headers, or are
ceph wrappers that match the above and can easily be redefined.



Re: [Cluster-devel] [PATCH V11 03/19] block: introduce bio_for_each_bvec()

2018-11-21 Thread Christoph Hellwig
On Wed, Nov 21, 2018 at 11:31:36PM +0800, Ming Lei wrote:
> > But while looking over this I wonder why we even need the max_seg_len
> > here.  The only thing __bvec_iter_advance does is to move bi_bvec_done
> > and bi_idx forward, with corresponding decrements of bi_size.  As far
> > as I can tell the only thing that max_seg_len does is that we need
> > more iterations of the while loop to achieve the same thing.
> > 
> > And actual bvec used by the caller will be obtained using
> > bvec_iter_bvec or segment_iter_bvec depending on if they want multi-page
> > or single-page variants.
> 
> Right, we let __bvec_iter_advance() serve both the multi-page and single-page
> cases, so we have to tell it which one way or another; right now we use the
> 'max_seg_len' constant.
> 
> Or do you suggest implementing two versions of __bvec_iter_advance()?

No - I think we can always use the code without any segment-length handling
in bvec_iter_advance.  Because bvec_iter_advance only operates on the
iterator, the generation of an actual single-page or multi-page
bvec is left to the caller using the bvec_iter_bvec or segment_iter_bvec
helpers.  The only difference is how many bytes you can move the
iterator forward in a single loop iteration - so if you pass in
PAGE_SIZE as the max_seg_len you will just have to loop more often
for a large enough byte count, but not actually do anything different.
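(For example, advancing 8 KiB through a single 16 KiB bvec takes one pass of
the while loop with max_seg_len == BVEC_MAX_LEN but two page-sized passes with
max_seg_len == PAGE_SIZE on 4 KiB pages, and ends with identical
bi_idx/bi_bvec_done/bi_size either way.)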



Re: [Cluster-devel] [PATCH V11 14/19] block: handle non-cluster bio out of blk_bio_segment_split

2018-11-21 Thread Christoph Hellwig
On Wed, Nov 21, 2018 at 11:37:27PM +0800, Ming Lei wrote:
> > +   bio = bio_alloc_bioset(GFP_NOIO, bio_segments(*bio_orig),
> > +   &non_cluster_bio_set);
> 
> bio_segments(*bio_orig) may be > 256, so bio_alloc_bioset() may fail.

Nothing a little min with BIO_MAX_PAGES couldn't fix.  This patch
was just intended as an idea of how I think this code could work.



Re: [Cluster-devel] [PATCH V11 15/19] block: enable multipage bvecs

2018-11-21 Thread Christoph Hellwig
On Wed, Nov 21, 2018 at 11:48:13PM +0800, Ming Lei wrote:
> I guess the correct check should be:
> 
>   end_addr = vec_addr + bv->bv_offset + bv->bv_len;
>   if (same_page &&
>   (end_addr & PAGE_MASK) != (page_addr & PAGE_MASK))
>   return false;

Indeed.



Re: [Cluster-devel] [PATCH V11 03/19] block: introduce bio_for_each_bvec()

2018-11-21 Thread Christoph Hellwig
On Wed, Nov 21, 2018 at 05:10:25PM +0100, Christoph Hellwig wrote:
> No - I think we can always use the code without any segment-length handling
> in bvec_iter_advance.  Because bvec_iter_advance only operates on the
> iterator, the generation of an actual single-page or multi-page
> bvec is left to the caller using the bvec_iter_bvec or segment_iter_bvec
> helpers.  The only difference is how many bytes you can move the
> iterator forward in a single loop iteration - so if you pass in
> PAGE_SIZE as the max_seg_len you will just have to loop more often
> for a large enough byte count, but not actually do anything different.

FYI, this patch reverts the max_seg_len related changes back to where
we are in mainline, and as expected everything works fine for me:

diff --git a/include/linux/bio.h b/include/linux/bio.h
index e5b975fa0558..926550ce2d21 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -137,24 +137,18 @@ static inline bool bio_full(struct bio *bio)
for (i = 0, iter_all.idx = 0; iter_all.idx < (bio)->bi_vcnt; 
iter_all.idx++)\
bvec_for_each_segment(bvl, &((bio)->bi_io_vec[iter_all.idx]), 
i, iter_all)
 
-static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
- unsigned bytes, unsigned max_seg_len)
+static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
+   unsigned bytes)
 {
iter->bi_sector += bytes >> 9;
 
if (bio_no_advance_iter(bio))
iter->bi_size -= bytes;
else
-   __bvec_iter_advance(bio->bi_io_vec, iter, bytes, max_seg_len);
+   bvec_iter_advance(bio->bi_io_vec, iter, bytes);
/* TODO: It is reasonable to complete bio with error here. */
 }
 
-static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
-   unsigned bytes)
-{
-   __bio_advance_iter(bio, iter, bytes, PAGE_SIZE);
-}
-
 #define __bio_for_each_segment(bvl, bio, iter, start)  \
for (iter = (start);\
 (iter).bi_size &&  \
@@ -168,7 +162,7 @@ static inline void bio_advance_iter(struct bio *bio, struct 
bvec_iter *iter,
for (iter = (start);\
 (iter).bi_size &&  \
((bvl = bio_iter_mp_iovec((bio), (iter))), 1);  \
-__bio_advance_iter((bio), &(iter), (bvl).bv_len, BVEC_MAX_LEN))
+bio_advance_iter((bio), &(iter), (bvl).bv_len))
 
 /* returns one real segment(multi-page bvec) each time */
 #define bio_for_each_bvec(bvl, bio, iter)  \
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index cab36d838ed0..138b4007b8f2 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -25,8 +25,6 @@
 #include 
 #include 
 
-#define BVEC_MAX_LEN  ((unsigned int)-1)
-
 /*
  * was unsigned short, but we might as well be ready for > 64kB I/O pages
  */
@@ -102,8 +100,8 @@ struct bvec_iter_all {
.bv_offset  = segment_iter_offset((bvec), (iter)),  \
 })
 
-static inline bool __bvec_iter_advance(const struct bio_vec *bv,
-   struct bvec_iter *iter, unsigned bytes, unsigned max_seg_len)
+static inline bool bvec_iter_advance(const struct bio_vec *bv,
+   struct bvec_iter *iter, unsigned bytes)
 {
if (WARN_ONCE(bytes > iter->bi_size,
 "Attempted to advance past end of bvec iter\n")) {
@@ -112,18 +110,12 @@ static inline bool __bvec_iter_advance(const struct 
bio_vec *bv,
}
 
while (bytes) {
-   unsigned segment_len = segment_iter_len(bv, *iter);
-
-   if (max_seg_len < BVEC_MAX_LEN)
-   segment_len = min_t(unsigned, segment_len,
-   max_seg_len -
-   bvec_iter_offset(bv, *iter));
+   unsigned iter_len = bvec_iter_len(bv, *iter);
+   unsigned len = min(bytes, iter_len);
 
-   segment_len = min(bytes, segment_len);
-
-   bytes -= segment_len;
-   iter->bi_size -= segment_len;
-   iter->bi_bvec_done += segment_len;
+   bytes -= len;
+   iter->bi_size -= len;
+   iter->bi_bvec_done += len;
 
if (iter->bi_bvec_done == __bvec_iter_bvec(bv, *iter)->bv_len) {
iter->bi_bvec_done = 0;
@@ -157,13 +149,6 @@ static inline bool bvec_iter_rewind(const struct bio_vec 
*bv,
return true;
 }
 
-static inline bool bvec_iter_advance(const struct bio_vec *bv,
-struct bvec_iter *iter,
-unsigned bytes)
-{
-   return __bvec_iter_advance(bv, iter, bytes, PAGE_SIZE);
-}
-
 #define for_each_bvec(bvl, bio_vec,

Re: [Cluster-devel] [PATCH V11 14/19] block: handle non-cluster bio out of blk_bio_segment_split

2018-11-21 Thread Christoph Hellwig
Actually...

I think we can kill this code entirely.  If we look at what the
clustering setting is really about, it is to avoid ever merging a
segment that spans a page boundary.  And we should be able to do
that with something like this before your series:

---
>From 0d46fa76c376493a74ea0dbe77305bd5fa2cf011 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig 
Date: Wed, 21 Nov 2018 18:39:47 +0100
Subject: block: remove the "cluster" flag

The cluster flag implements some very old SCSI behavior.  As far as I
can tell the original intent was to enable or disable any kind of
segment merging.  But the effect actually visible to the LLDD is that
it limits each segment to be inside a single page, which we can
also achieve by setting the maximum segment size and the virt
boundary.
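
As a rough illustration of that alternative, a driver that needs the old
!cluster behaviour could express it purely through the queue limits (a sketch
using the standard helpers, not part of this patch):

	/* keep every segment within a single page */
	blk_queue_max_segment_size(q, PAGE_SIZE);
	blk_queue_virt_boundary(q, PAGE_SIZE - 1);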

Signed-off-by: Christoph Hellwig 
---
 block/blk-merge.c   | 20 
 block/blk-settings.c|  3 ---
 block/blk-sysfs.c   |  5 +
 drivers/scsi/scsi_lib.c | 16 +---
 include/linux/blkdev.h  |  6 --
 5 files changed, 22 insertions(+), 28 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 6be04ef8da5b..e69d8f8ba819 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -195,7 +195,7 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
goto split;
}
 
-   if (bvprvp && blk_queue_cluster(q)) {
+   if (bvprvp) {
if (seg_size + bv.bv_len > queue_max_segment_size(q))
goto new_segment;
if (!biovec_phys_mergeable(q, bvprvp, &bv))
@@ -295,10 +295,10 @@ static unsigned int __blk_recalc_rq_segments(struct 
request_queue *q,
 bool no_sg_merge)
 {
struct bio_vec bv, bvprv = { NULL };
-   int cluster, prev = 0;
unsigned int seg_size, nr_phys_segs;
struct bio *fbio, *bbio;
struct bvec_iter iter;
+   bool prev = false;
 
if (!bio)
return 0;
@@ -313,7 +313,6 @@ static unsigned int __blk_recalc_rq_segments(struct 
request_queue *q,
}
 
fbio = bio;
-   cluster = blk_queue_cluster(q);
seg_size = 0;
nr_phys_segs = 0;
for_each_bio(bio) {
@@ -325,7 +324,7 @@ static unsigned int __blk_recalc_rq_segments(struct 
request_queue *q,
if (no_sg_merge)
goto new_segment;
 
-   if (prev && cluster) {
+   if (prev) {
if (seg_size + bv.bv_len
> queue_max_segment_size(q))
goto new_segment;
@@ -343,7 +342,7 @@ static unsigned int __blk_recalc_rq_segments(struct 
request_queue *q,
 
nr_phys_segs++;
bvprv = bv;
-   prev = 1;
+   prev = true;
seg_size = bv.bv_len;
}
bbio = bio;
@@ -396,9 +395,6 @@ static int blk_phys_contig_segment(struct request_queue *q, 
struct bio *bio,
 {
struct bio_vec end_bv = { NULL }, nxt_bv;
 
-   if (!blk_queue_cluster(q))
-   return 0;
-
if (bio->bi_seg_back_size + nxt->bi_seg_front_size >
queue_max_segment_size(q))
return 0;
@@ -415,12 +411,12 @@ static int blk_phys_contig_segment(struct request_queue 
*q, struct bio *bio,
 static inline void
 __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
 struct scatterlist *sglist, struct bio_vec *bvprv,
-struct scatterlist **sg, int *nsegs, int *cluster)
+struct scatterlist **sg, int *nsegs)
 {
 
int nbytes = bvec->bv_len;
 
-   if (*sg && *cluster) {
+   if (*sg) {
if ((*sg)->length + nbytes > queue_max_segment_size(q))
goto new_segment;
if (!biovec_phys_mergeable(q, bvprv, bvec))
@@ -466,12 +462,12 @@ static int __blk_bios_map_sg(struct request_queue *q, 
struct bio *bio,
 {
struct bio_vec bvec, bvprv = { NULL };
struct bvec_iter iter;
-   int cluster = blk_queue_cluster(q), nsegs = 0;
+   int nsegs = 0;
 
for_each_bio(bio)
bio_for_each_segment(bvec, bio, iter)
__blk_segment_map_sg(q, &bvec, sglist, &bvprv, sg,
-&nsegs, &cluster);
+&nsegs);
 
return nsegs;
 }
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 3abe831e92c8..3e7038e475ee 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -56,7 +56,6 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->alignment_offset = 0;
lim->io_opt = 0;
lim->misaligned = 0;
-   lim->cluster = 1;
lim->zoned = BLK_ZONED_

[Cluster-devel] [PATCH 0/2] gfs2: improvements to recovery and withdraw process (v2)

2018-11-21 Thread Bob Peterson
Hi,

This is a second draft of a two-patch set to fix some of the nasty
journal recovery problems I've found lately.

The original post from 08 November had horribly bad and inaccurate
comments, as Steve Whitehouse and Andreas Gruenbacher pointed out.
This version is hopefully better and more accurately describes what
the patches do and how they work. Also, I fixed a superblock flag
that was improperly declared as a glock flag.

Other than the renamed and re-valued superblock flag, the code
remains unchanged from the previous version. It probably needs a bit
more testing, but it seems to work well.
---
The problems have to do with file system corruption caused when recovery
replays a journal after the resource group blocks have been unlocked
by the recovery process. In other words, when no cluster node takes
responsibility to replay the journal of a withdrawing node, then it
gets replayed later on, after the blocks' contents have been changed.

The first patch prevents gfs2 from attempting recovery if the file system
is withdrawn or has journal IO errors. Trying to recover your own journal
from either of these unstable conditions is dangerous and likely to corrupt
the file system.

The second patch is more extensive. When a node withdraws from a file system
it signals all other nodes with the file system mounted to perform recovery
on its journal, since it cannot safely recover its own journal. This is
accomplished by a new non-disk callback glop used exclusively by the
"live" glock, which sets up an lvb in the glock to indicate which
journal(s) need to be recovered.

Regards,

Bob Peterson
---
Bob Peterson (2):
  gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn
  gfs2: initiate journal recovery as soon as a node withdraws

 fs/gfs2/glock.c|  5 ++-
 fs/gfs2/glops.c| 47 +++
 fs/gfs2/incore.h   |  3 ++
 fs/gfs2/lock_dlm.c | 95 ++
 fs/gfs2/log.c  | 62 --
 fs/gfs2/super.c|  5 ++-
 fs/gfs2/super.h|  1 +
 fs/gfs2/util.c | 84 
 fs/gfs2/util.h | 13 +++
 9 files changed, 282 insertions(+), 33 deletions(-)

-- 
2.19.1



[Cluster-devel] [PATCH 2/2] gfs2: initiate journal recovery as soon as a node withdraws

2018-11-21 Thread Bob Peterson
This patch uses the "live" glock and some new lvbs to signal when
a node has withdrawn from a file system. Nodes that see this try to
initiate journal recovery.

Definitions:

1. The "withdrawer" is a node that has withdrawn from a gfs2
   file system due to some inconsistency.
2. The "recoverer" is a node who sees the need to replay the
   withdrawers's journal.

The way it works is as follows:

When things are running normally, all nodes with the file system
mounted have the "live" (non-disk) glock enqueued in SH (shared)
mode.

When a node withdraws from the file system, the withdrawer
dequeues the "live" glock and re-enqueues it in EX mode. This
forces all other nodes to see a demote request for the glock.
In the act of demoting and promoting the live glock, a new lvb
is used to inform the other nodes that it has withdrawn and its
journal needs recovery.

Although unlikely, there may be multiple withdrawers and dlm
keeps the lvb consistent between them, so many of these lvb bits
may be set at any given time, indicating more than one node has
withdrawn and all their journals need recovery.

Once the withdrawer has the live glock in EX, it knows all the
other nodes have the live glock in UN (unlocked) and have gone
through their callback, so they know it has withdrawn.

The withdrawer then dequeues the live glock, demoting it back
to UN, then pauses to give the other nodes a chance to
reacquire it. This pause will hopefully prevent gfs2's glock
caching from simplifying the EX->UN->SH transition into a DLM
lock convert directly from EX->SH, which would keep the other
nodes from doing recovery by making them wait forever for the
glock (because once it's back in SH, nothing else is going to
demote it, short of an unmount).

After the pause, it re-enqueues the live glock in shared (SH)
mode once again, thus enabling everyone to resume working
normally after recovery has taken place.
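
A highly simplified sketch of that withdrawer-side sequence (illustrative
only: the live-glock pointer, holders, lvb update, and the pause length are
stand-ins, not the actual patch):

	struct gfs2_holder gh;

	/* drop our long-held SH hold on the live glock */
	gfs2_glock_dq_uninit(live_gh);
	/* grab it in EX so every other node sees a demote request */
	gfs2_glock_nq_init(live_gl, LM_ST_EXCLUSIVE, 0, &gh);
	/* ... set this node's "journal needs recovery" bit in the lvb ... */
	gfs2_glock_dq_uninit(&gh);	/* back to UN */
	msleep(5000);			/* let the other nodes re-enqueue first */
	/* rejoin in SH once recovery has had a chance to run */
	gfs2_glock_nq_init(live_gl, LM_ST_SHARED, 0, &gh);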

When the other nodes see the demote of the live lock, they
set a flag in the superblock indicating recovery is in progress.

Unfortunately, lvbs are only passed by dlm on lock conversions,
so the nodes need to re-enqueue the live glock. They all need
it back in SH mode eventually, but only after journal recovery.
They also need to prevent other nodes from doing recovery at
the same time for the failed node's journal. For node failures,
DLM takes care of this problem for us by setting its BLOCK_LOCKS
flag and going through its recovery state machine. However, in
the case of a withdraw, we can't use the same mechanism.

So the recoverer acquires the live glock in EX, then performs
journal recovery for all journals indicated in the lvb. Taking
the live glock in EX at this time may cause a race with the
original withdrawer, who enqueued in EX, then UN, then back to
SH. There's a new superblock flag GLF_IN_RECOVERY that is set
when the node withdraws and isn't cleared until recovery is
complete. That prevents the recoverer's new demote callback
for the live lock in EX from mistaking it for another withdraw
of the recoverer.

The recoverer's promote to EX does not cause a glock demote on
any other potential recoverer, since by definition they must
have already demoted the glock to UN before the recoverer got it.

The first recoverer is granted the live glock in EX and performs
journal recovery, clearing the recovery bits in the live glock
lvb as it does. The recoverer then demotes the glock from EX
back to SH.

All subsequent nodes also acquire the glock in EX, but
the lvb bits are then clear, so no additional recovery is done.
They skip the bulk of it and transition the live glock back to
SH too.
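
Again purely as an illustration (not the patch): gfs2_jdesc_find() and
gfs2_recover_journal() are existing gfs2 interfaces, lvb_test_recover_bit()
is the hypothetical helper from the earlier sketch, and the assumption that
the live glock carries an lvb is something the patch has to arrange.

/* Sketch of a recoverer: take the live glock in EX, replay every
 * journal whose recovery bit is set in the lvb, then drop back to SH.
 */
static void recover_withdrawn_journals_sketch(struct gfs2_sbd *sdp)
{
	struct gfs2_glock *live_gl = sdp->sd_live_gh.gh_gl;
	unsigned char *lvb = (unsigned char *)live_gl->gl_lksb.sb_lvbptr;
	struct gfs2_holder gh;
	unsigned int jid;

	gfs2_glock_nq_init(live_gl, LM_ST_EXCLUSIVE, 0, &gh);

	for (jid = 0; jid < sdp->sd_journals; jid++) {
		struct gfs2_jdesc *jd;

		if (!lvb_test_recover_bit(lvb, jid))
			continue;	/* nothing to do for this journal */
		jd = gfs2_jdesc_find(sdp, jid);
		if (jd)
			gfs2_recover_journal(jd, true);
		/* ... clear the bit so later nodes skip this journal ... */
	}

	gfs2_glock_dq_uninit(&gh);	/* back to UN, then SH as usual */
}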

In order to accommodate this process, I had to make several
additional changes:

(1) I had to allow non-disk glocks (like the live glock) to be
enqueued and promoted in do_xmote, even when the file system
is withdrawn. This should never cause a problem, since they
are "non-disk" and therefore have no tangible resources to
corrupt.
(2) The glock demote callback for the live lock cannot call
journal recovery directly. That's because it's a callback
made from dlm, which means dlm resources are still locked;
calling recovery from there would create a circular
dependency that could never be resolved, with dlm's callback
trying to lock dlm resources it already holds. Therefore
I had to separate the actual recovery from the callback:
the callback queues work to the workqueue so that
gfs2_control_func runs in a different context after the
callback is complete. To that end, I changed function
gfs2_control_func (the node failure recovery state machine)
to perform a new step 0 (as Dave Teigland suggested) that
recovers the withdrawer's journals with the new function
remote_withdraw rather than running its normal "dead node"
recovery sequence. (A rough sketch of this deferral follows
the list below.)
(3) GFS2 now needs to know the difference between a with
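
A rough sketch of the deferral described in item (2). The workqueue
(gfs2_control_wq) and the sd_control_work item already exist for
gfs2_control_func; the flag name DFL_REMOTE_WITHDRAW is made up for the
illustration and is not necessarily what the patch uses:

/* Called from the live glock's demote callback: dlm still holds its
 * own locks here, so only record the request and defer the real work
 * to gfs2_control_func, which runs from the control workqueue.
 */
static void signal_remote_withdraw_sketch(struct gfs2_sbd *sdp)
{
	struct lm_lockstruct *ls = &sdp->sd_lockstruct;

	set_bit(DFL_REMOTE_WITHDRAW, &ls->ls_recover_flags); /* hypothetical flag */
	queue_delayed_work(gfs2_control_wq, &sdp->sd_control_work, 0);
}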

[Cluster-devel] [PATCH 1/2] gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn

2018-11-21 Thread Bob Peterson
This patch addresses various problems with gfs2/dlm recovery.

For example, suppose a node with a bunch of gfs2 mounts suddenly
reboots due to a kernel panic, and dlm determines it should perform
recovery. DLM does so from a pseudo-state machine, calling various
callbacks into lock_dlm to perform a sequence of steps. It uses
generation numbers and recover bits in the dlm "control" lock lvbs.

Now suppose another node tries to recover the failed node's
journal but, in so doing, encounters an I/O error or withdraws
due to unforeseen circumstances, such as an HBA driver failure.
In these cases, the recovery would eventually bail out, but it
would still update its generation number in the lvb. The other
nodes would all see the newer generation number and think they
don't need to do recovery because the generation number is newer
than the last one they saw, and therefore someone else has already
taken care of it.

If the file system has an I/O error or is withdrawn, it cannot
safely replay any journals (its own or another node's), but
someone else still needs to do it. Therefore we don't want it
messing with the journal recovery generation numbers: the local
generation numbers eventually get put into the lvb generation
numbers seen by all nodes.

This patch adds checks to many of the callbacks used by dlm
in its recovery state machine so that those functions are
skipped if an I/O error has occurred or if the file system
has been withdrawn.
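
(As an aside, not part of this patch: the repeated checks below could in
principle be folded into one small helper, e.g. the sketch here; the patch
open-codes them in each callback instead.)

static bool gdlm_skip_recovery(struct gfs2_sbd *sdp, const char *caller)
{
	if (test_bit(SDF_AIL1_IO_ERROR, &sdp->sd_flags)) {
		fs_err(sdp, "%s ignored due to io error.\n", caller);
		return true;
	}
	if (test_bit(SDF_SHUTDOWN, &sdp->sd_flags)) {
		fs_err(sdp, "%s ignored due to withdraw.\n", caller);
		return true;
	}
	return false;
}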

Signed-off-by: Bob Peterson 
---
 fs/gfs2/lock_dlm.c | 36 
 1 file changed, 36 insertions(+)

diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index 31df26ed7854..68ca648cf918 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -1081,6 +1081,14 @@ static void gdlm_recover_prep(void *arg)
struct gfs2_sbd *sdp = arg;
struct lm_lockstruct *ls = &sdp->sd_lockstruct;
 
+   if (test_bit(SDF_AIL1_IO_ERROR, &sdp->sd_flags)) {
+   fs_err(sdp, "recover_prep ignored due to io error.\n");
+   return;
+   }
+   if (test_bit(SDF_SHUTDOWN, &sdp->sd_flags)) {
+   fs_err(sdp, "recover_prep ignored due to withdraw.\n");
+   return;
+   }
spin_lock(&ls->ls_recover_spin);
ls->ls_recover_block = ls->ls_recover_start;
set_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags);
@@ -1103,6 +1111,16 @@ static void gdlm_recover_slot(void *arg, struct dlm_slot *slot)
struct lm_lockstruct *ls = &sdp->sd_lockstruct;
int jid = slot->slot - 1;
 
+   if (test_bit(SDF_AIL1_IO_ERROR, &sdp->sd_flags)) {
+   fs_err(sdp, "recover_slot jid %d ignored due to io error.\n",
+  jid);
+   return;
+   }
+   if (test_bit(SDF_SHUTDOWN, &sdp->sd_flags)) {
+   fs_err(sdp, "recover_slot jid %d ignored due to withdraw.\n",
+  jid);
+   return;
+   }
spin_lock(&ls->ls_recover_spin);
if (ls->ls_recover_size < jid + 1) {
fs_err(sdp, "recover_slot jid %d gen %u short size %d\n",
@@ -1127,6 +1145,14 @@ static void gdlm_recover_done(void *arg, struct dlm_slot *slots, int num_slots,
struct gfs2_sbd *sdp = arg;
struct lm_lockstruct *ls = &sdp->sd_lockstruct;
 
+   if (test_bit(SDF_AIL1_IO_ERROR, &sdp->sd_flags)) {
+   fs_err(sdp, "recover_done ignored due to io error.\n");
+   return;
+   }
+   if (test_bit(SDF_SHUTDOWN, &sdp->sd_flags)) {
+   fs_err(sdp, "recover_done ignored due to withdraw.\n");
+   return;
+   }
/* ensure the ls jid arrays are large enough */
set_recover_size(sdp, slots, num_slots);
 
@@ -1154,6 +1180,16 @@ static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,
 {
struct lm_lockstruct *ls = &sdp->sd_lockstruct;
 
+   if (test_bit(SDF_AIL1_IO_ERROR, &sdp->sd_flags)) {
+   fs_err(sdp, "recovery_result jid %d ignored due to io error.\n",
+  jid);
+   return;
+   }
+   if (test_bit(SDF_SHUTDOWN, &sdp->sd_flags)) {
+   fs_err(sdp, "recovery_result jid %d ignored due to withdraw.\n",
+  jid);
+   return;
+   }
if (test_bit(DFL_NO_DLM_OPS, &ls->ls_recover_flags))
return;
 
-- 
2.19.1



Re: [Cluster-devel] [PATCH V11 02/19] block: introduce multi-page bvec helpers

2018-11-21 Thread Ming Lei
On Wed, Nov 21, 2018 at 05:08:11PM +0100, Christoph Hellwig wrote:
> On Wed, Nov 21, 2018 at 11:06:11PM +0800, Ming Lei wrote:
> > bvec_iter_* is used for single-page bvec in current linus tree, and there 
> > are
> > lots of users now:
> > 
> > [linux]$ git grep -n "bvec_iter_*" ./ | wc
> > 191 995   13242
> > 
> > If we have to switch it first, it can be a big change, just wondering if 
> > Jens
> > is happy with that?
> 
> Your above grep statement seems to catch every use of struct bvec_iter,
> due to the *.
> 
> Most uses of bvec_iter_ are either in the block headers, or are
> ceph wrappers that match the above and can easily be redefined.

OK, looks like you are right; it seems it's not so widely used:

$ git grep -n -w -E "bvec_iter_len|bvec_iter_bvec|bvec_iter_advance|bvec_iter_page|bvec_iter_offset" ./ | wc
     36     194    2907

I will switch to that, given the affected drivers are only dm, nvdimm and ceph.
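
For reference, the single-page semantics of those helpers are essentially the
following (paraphrased, not copied verbatim, from include/linux/bvec.h):

/* Pick the bvec the iterator currently points at, then clamp its length
 * and shift its offset by how far the iterator has advanced into it.
 */
#define __bvec_iter_bvec(bvec, iter)	(&(bvec)[(iter).bi_idx])

#define bvec_iter_page(bvec, iter) \
	(__bvec_iter_bvec((bvec), (iter))->bv_page)

#define bvec_iter_len(bvec, iter) \
	min((iter).bi_size, \
	    __bvec_iter_bvec((bvec), (iter))->bv_len - (iter).bi_bvec_done)

#define bvec_iter_offset(bvec, iter) \
	(__bvec_iter_bvec((bvec), (iter))->bv_offset + (iter).bi_bvec_done)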

Thanks,
Ming



Re: [Cluster-devel] [PATCH V11 03/19] block: introduce bio_for_each_bvec()

2018-11-21 Thread Ming Lei
On Wed, Nov 21, 2018 at 05:10:25PM +0100, Christoph Hellwig wrote:
> On Wed, Nov 21, 2018 at 11:31:36PM +0800, Ming Lei wrote:
> > > But while looking over this I wonder why we even need the max_seg_len
> > > here.  The only thing __bvec_iter_advance does is to move bi_bvec_done
> > > and bi_idx forward, with corresponding decrements of bi_size.  As far
> > > as I can tell the only thing that max_seg_len does is that we need
> > > more iterations of the while loop to achieve the same thing.
> > > 
> > > And actual bvec used by the caller will be obtained using
> > > bvec_iter_bvec or segment_iter_bvec depending on if they want multi-page
> > > or single-page variants.
> > 
> > Right, we let __bvec_iter_advance() serve for both multi-page and 
> > single-page
> > case, then we have to tell it via one way or another, now we use the 
> > constant
> > of 'max_seg_len'.
> > 
> > Or you suggest to implement two versions of __bvec_iter_advance()?
> 
> No - I think we can always use the code without any segment in
> bvec_iter_advance.  Because bvec_iter_advance only operates on the
> iteractor, the generation of an actual single-page or multi-page
> bvec is left to the caller using the bvec_iter_bvec or segment_iter_bvec
> helpers.  The only difference is how many bytes you can move the
> iterator forward in a single loop iteration - so if you pass in
> PAGE_SIZE as the max_seg_len you will just have to loop more often
> for a large enough byte count, but not actually do anything different.

Yeah, I see that.

The difference is made by bio_iter_iovec()/bio_iter_mp_iovec() in
__bio_for_each_segment()/__bio_for_each_bvec().
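
For illustration only, a stripped-down version of the advance loop (a sketch,
not the actual kernel code) shows why max_seg_len only changes how many loop
iterations are needed, not the final state of the iterator:

/* Simplified sketch: advance 'iter' over 'bv' by 'bytes', consuming at
 * most 'max_seg_len' per loop iteration.  With max_seg_len equal to
 * (unsigned int)-1 or PAGE_SIZE, the final bi_idx/bi_bvec_done/bi_size
 * are the same; only the number of iterations differs.
 */
static void iter_advance_sketch(const struct bio_vec *bv,
				struct bvec_iter *iter,
				unsigned int bytes,
				unsigned int max_seg_len)
{
	iter->bi_size -= bytes;

	while (bytes) {
		unsigned int len = min3(bytes, max_seg_len,
					bv[iter->bi_idx].bv_len -
						iter->bi_bvec_done);

		bytes -= len;
		iter->bi_bvec_done += len;

		if (iter->bi_bvec_done == bv[iter->bi_idx].bv_len) {
			iter->bi_bvec_done = 0;
			iter->bi_idx++;
		}
	}
}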


Thanks,
Ming



Re: [Cluster-devel] [PATCH v1 0/4] fs: fix race between llseek SEEK_END and write

2018-11-21 Thread Eiichi Tsukata
On Wed, Nov 21, 2018 at 13:54 Al Viro :
>
> On Wed, Nov 21, 2018 at 11:43:56AM +0900, Eiichi Tsukata wrote:
> > Some file systems (including ext4, xfs, ramfs ...) have the following
> > problem as I've described in the commit message of the 1/4 patch.
> >
> >   The commit ef3d0fd27e90 ("vfs: do (nearly) lockless generic_file_llseek")
> >   removed almost all locks in llseek() including SEEK_END. It was based on the
> >   idea that write() updates size atomically. But in fact, write() can be
> >   divided into two or more parts in generic_perform_write() when pos
> >   straddles over the PAGE_SIZE, which results in updating size multiple
> >   times in one write(). It means that llseek() can see the size being
> >   updated during write().
>
> And?  Who has ever promised anything that insane?  write(2) can take an 
> arbitrary
> amount of time; another process doing lseek() on independently opened 
> descriptor
> is *not* going to wait for that (e.g. page-in of the buffer being written, 
> which
> just happens to be mmapped from a file on NFS over RFC1149 link).

Thanks.

The lock I added in NFS did nothing but slow down lseek(), because there the
file size is updated atomically. Even `spin_lock(&inode->i_lock)` is unnecessary.

I'll fix the commit message so that it only refers to the specific local file
systems that use generic_perform_write(), and I'll remove the unnecessary locks
in some distributed file systems (e.g. nfs, cifs, or more) by replacing
generic_file_llseek() with generic_file_llseek_unlocked(), so that `tail`
doesn't have to wait for avian carriers.



Re: [Cluster-devel] [PATCH v1 3/4] f2fs: fix race between llseek SEEK_END and write

2018-11-21 Thread Eiichi Tsukata
On Wed, Nov 21, 2018 at 18:23 Christoph Hellwig :
>
> On Wed, Nov 21, 2018 at 11:43:59AM +0900, Eiichi Tsukata wrote:
> > This patch itself seems to be just a cleanup but with the
> > commit b25bd1d9fd87 ("vfs: fix race between llseek SEEK_END and write")
> > it fixes the race.
>
> Please move this patch to the beginning of the series and replace
> the commit log with something like the one below.  Note that your
> commit id is different from the one that will appear once applied
> upstream, so the above isn't too helpful.
>
> ---
> f2fs: use generic_file_llseek
>
> f2fs always passes inode->i_sb->s_maxbytes to generic_file_llseek_size,
> and thus should simply use generic_file_llseek.  For now this is a just
> a cleanup, but it will allow f2fs to pick up a race fix in
> generic_file_llseek for free.

Thanks, I'll fix it in v2.



Re: [Cluster-devel] [PATCH v1 0/4] fs: fix race between llseek SEEK_END and write

2018-11-21 Thread Al Viro
On Thu, Nov 22, 2018 at 02:40:50PM +0900, Eiichi Tsukata wrote:
> On Wed, Nov 21, 2018 at 13:54 Al Viro :
> >
> > On Wed, Nov 21, 2018 at 11:43:56AM +0900, Eiichi Tsukata wrote:
> > > Some file systems (including ext4, xfs, ramfs ...) have the following
> > > problem as I've described in the commit message of the 1/4 patch.
> > >
> > >   The commit ef3d0fd27e90 ("vfs: do (nearly) lockless 
> > > generic_file_llseek")
> > >   removed almost all locks in llseek() including SEEK_END. It was based on the
> > >   idea that write() updates size atomically. But in fact, write() can be
> > >   divided into two or more parts in generic_perform_write() when pos
> > >   straddles over the PAGE_SIZE, which results in updating size multiple
> > >   times in one write(). It means that llseek() can see the size being
> > >   updated during write().
> >
> > And?  Who has ever promised anything that insane?  write(2) can take an 
> > arbitrary
> > amount of time; another process doing lseek() on independently opened 
> > descriptor
> > is *not* going to wait for that (e.g. page-in of the buffer being written, 
> > which
> > just happens to be mmapped from a file on NFS over RFC1149 link).
> 
> Thanks.
> 
> The lock I added in NFS did nothing but slow down lseek(), because there the
> file size is updated atomically. Even `spin_lock(&inode->i_lock)` is
> unnecessary.
> 
> I'll fix the commit message so that it only refers to the specific local file
> systems that use generic_perform_write(), and I'll remove the unnecessary
> locks in some distributed file systems (e.g. nfs, cifs, or more) by replacing
> generic_file_llseek() with generic_file_llseek_unlocked(), so that `tail`
> doesn't have to wait for avian carriers.

int fd = open("/mnt/slw/foo", O_RDONLY);   /* file on a slow (e.g. NFS) mount */
void *p = mmap(NULL, 8192, PROT_READ, MAP_SHARED, fd, 0);
int local_fd = open("/tmp/foo", O_RDWR);
write(local_fd, p, 8192);   /* faulting in p can block on the slow filesystem */

and there you go - an extremely slow write(2).  To a file on a local filesystem.

Can you show me where POSIX/SuS/whatever it's called these days promises
that kind of atomicity?  TBH, I would be very surprised if any Unix promised
to have the file size updated no more than once per write(2).  Any userland
code that relies on such a property is, at minimum, non-portable and I would
argue that it is outright broken.

Note, BTW, that your example depends upon rather non-obvious details of the
echo implementation, starting with the buffering logic used by the particular
shell.  And AFAICS ash(1) ends up with the possibility of more than one write(2)
anyway - get SIGSTOP delivered, followed by SIGCONT, and you've got just that,
transparently to echo(1).  I'm fairly sure that the same holds for anything
that isn't outright broken - write(2) *IS* interruptible (it must be, for
obvious reasons) and that pair of signals will lead to a short write if it
arrives at the right time.  The only thing userland can do (and does) is to
issue a further write(2) for anything that hadn't been written the last time
around.
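
For reference, the usual userland pattern for that is something like the
minimal sketch below (plain POSIX, nothing kernel-specific assumed):

#include <errno.h>
#include <unistd.h>

/* Keep calling write(2) until everything is written or a real error
 * occurs; short writes and EINTR are simply retried.
 */
static ssize_t write_all(int fd, const void *buf, size_t count)
{
	const char *p = buf;
	size_t left = count;

	while (left > 0) {
		ssize_t n = write(fd, p, left);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted before writing anything */
			return -1;
		}
		p += n;
		left -= n;
	}
	return count;
}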