Re: [RFC PATCH v2 00/12] get rid of GFP_ZONE_TABLE/BAD

2018-05-23 Thread Matthew Wilcox
On Tue, May 22, 2018 at 08:37:28PM +0200, Michal Hocko wrote:
> So why is this any better than the current code. Sure I am not a great
> fan of GFP_ZONE_TABLE because of how it is incomprehensible but this
> doesn't look too much better, yet we are losing a check for incompatible
> gfp flags. The diffstat looks really sound but then you just look and
> see that the large part is the comment that at least explained the gfp
> zone modifiers somehow and the debugging code. So what is the selling
> point?

I have a plan, but it's not exactly fully-formed yet.

One of the big problems we have today is that we have a lot of users
who have constraints on the physical memory they want to allocate,
but we have very limited abilities to provide them with what they're
asking for.  The various different ZONEs have different meanings on
different architectures and are generally a mess.

If we had eight ZONEs, we could offer:

ZONE_16M        // 24 bit
ZONE_256M       // 28 bit
ZONE_LOWMEM     // CONFIG_32BIT only
ZONE_4G         // 32 bit
ZONE_64G        // 36 bit
ZONE_1T         // 40 bit
ZONE_ALL        // everything larger
ZONE_MOVABLE    // movable allocations; no physical address guarantees

#ifdef CONFIG_64BIT
#define ZONE_NORMAL ZONE_ALL
#else
#define ZONE_NORMAL ZONE_LOWMEM
#endif

This would cover most driver DMA mask allocations; we could tweak the
offered zones based on analysis of what people need.
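
As a purely illustrative sketch (neither this helper nor these zones
exist today; the names come from the proposal above), a driver's DMA
mask could be mapped to the largest zone it can fully address:

static inline gfp_t zone_for_dma_mask(u64 mask)
{
	if (mask >= DMA_BIT_MASK(64))
		return ZONE_ALL;	/* no real constraint */
	if (mask >= DMA_BIT_MASK(40))
		return ZONE_1T;
	if (mask >= DMA_BIT_MASK(36))
		return ZONE_64G;
	if (mask >= DMA_BIT_MASK(32))
		return ZONE_4G;
	if (mask >= DMA_BIT_MASK(28))
		return ZONE_256M;
	return ZONE_16M;		/* 24-bit devices and smaller */
}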

#define GFP_HIGHUSER            (GFP_USER | ZONE_ALL)
#define GFP_HIGHUSER_MOVABLE    (GFP_USER | ZONE_MOVABLE)

One other thing I want to see is that fallback from zones happens from
highest to lowest normally (ie if you fail to allocate in 1T, then you
try to allocate from 64G), but movable allocations happen from lowest
to highest.  So ZONE_16M ends up full of page cache pages which are
readily evictable for the rare occasions when we need to allocate memory
below 16MB.
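
Written out, the two scan orders described above would look something
like this (again assuming the hypothetical zones; real zonelists are
built per-node and are more involved):

static const int normal_fallback[]  = { ZONE_ALL, ZONE_1T, ZONE_64G,
					ZONE_4G, ZONE_256M, ZONE_16M };
static const int movable_fallback[] = { ZONE_16M, ZONE_256M, ZONE_4G,
					ZONE_64G, ZONE_1T, ZONE_ALL };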

I'm sure there are lots of good reasons why this won't work, which is
why I've been hesitant to propose it before now.


Re: [PATCH v3 3/4] btrfs: lzo: Add header length check to avoid slab out of bounds access

2018-05-23 Thread Qu Wenruo


On 2018-05-24 10:09, Misono Tomohiro wrote:
> On 2018/05/23 17:23, Qu Wenruo wrote:
>> James Harvey reported that some corrupted compressed extent data can
>> lead to various kernel memory corruption.
>>
>> Such corrupted extent data belongs to an inode with the NODATASUM flag, thus
>> data csum won't help us detect such bugs.
>>
>> If we are lucky enough, KASAN can catch it like:
>> ==
>> BUG: KASAN: slab-out-of-bounds in lzo_decompress_bio+0x384/0x7a0 [btrfs]
>> Write of size 4096 at addr ffff8800606cb0f8 by task kworker/u16:0/2338
>>
>> CPU: 3 PID: 2338 Comm: kworker/u16:0 Tainted: G   O  
>> 4.17.0-rc5-custom+ #50
>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>> Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
>> Call Trace:
>>  dump_stack+0xc2/0x16b
>>  print_address_description+0x6a/0x270
>>  kasan_report+0x260/0x380
>>  memcpy+0x34/0x50
>>  lzo_decompress_bio+0x384/0x7a0 [btrfs]
>>  end_compressed_bio_read+0x99f/0x10b0 [btrfs]
>>  bio_endio+0x32e/0x640
>>  normal_work_helper+0x15a/0xea0 [btrfs]
>>  process_one_work+0x7e3/0x1470
>>  worker_thread+0x1b0/0x1170
>>  kthread+0x2db/0x390
>>  ret_from_fork+0x22/0x40
>> ...
>> ==
>>
>> The offending compressed data has the following info:
>>
>> Header:           length 32768        (looks completely valid)
>> Segment 0 Header:  length 3472882419   (obviously out of bounds)
>>
>> Then when handling segment 0, since it's over the current page, we need to
>> copy the compressed data to the workspace, and such a large size would
>> trigger an out-of-bounds memory access, screwing up the whole kernel.
>>
>> Fix it by adding extra checks on the header and segment headers to ensure
>> we won't access out-of-bounds memory, and also check that the decompressed
>> data won't be out-of-bounds.
>>
>> Reported-by: James Harvey 
>> Signed-off-by: Qu Wenruo 
>> ---
>>  fs/btrfs/lzo.c | 35 ++-
>>  1 file changed, 34 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
>> index ec5db393c758..4f4de460b08d 100644
>> --- a/fs/btrfs/lzo.c
>> +++ b/fs/btrfs/lzo.c
>> @@ -293,6 +293,7 @@ static int lzo_decompress_bio(struct list_head *ws, 
>> struct compressed_bio *cb)
>>  unsigned long working_bytes;
>>  size_t in_len;
>>  size_t out_len;
>> +size_t max_segment_len = lzo1x_worst_compress(PAGE_SIZE);
>>  unsigned long in_offset;
>>  unsigned long in_page_bytes_left;
>>  unsigned long tot_in;
>> @@ -306,6 +307,18 @@ static int lzo_decompress_bio(struct list_head *ws, 
>> struct compressed_bio *cb)
>>  
>>  data_in = kmap(pages_in[0]);
>>  tot_len = read_compress_length(data_in);
>> +/*
>> + * Compressed data header check.
>> + *
>> + * The real compressed size can't exceed extent length, and all pages
>> + * should be used (a full pending page is not possible).
>> + * If this happens it means the compressed extent is corrupted.
>> + */
>> +if (tot_len > min_t(size_t, BTRFS_MAX_COMPRESSED, srclen) ||
>> +tot_len < srclen - PAGE_SIZE) {
>> +ret = -EUCLEAN;
>> +goto done;
>> +}
>>  
>>  tot_in = LZO_LEN;
>>  in_offset = LZO_LEN;
> 
> Just 1 line below here is:
> tot_len = min_t(size_t, srclen, tot_len);
> but this is not needed because tot_len <= srclen always holds given the
> above if, right?

Right, github version updated.

Thanks,
Qu


> 
>> @@ -320,6 +333,17 @@ static int lzo_decompress_bio(struct list_head *ws, 
>> struct compressed_bio *cb)
>>  in_offset += LZO_LEN;
>>  tot_in += LZO_LEN;
>>  
>> +/*
>> + * Segment header check.
>> + *
>> + * The segment length must not exceed max lzo compression
>> + * size, nor the total compressed size
>> + */
>> +if (in_len > max_segment_len || tot_in + in_len > tot_len) {
>> +ret = -EUCLEAN;
>> +goto done;
>> +}
>> +
>>  tot_in += in_len;
>>  working_bytes = in_len;
>>  may_late_unmap = need_unmap = false;
>> @@ -370,7 +394,7 @@ static int lzo_decompress_bio(struct list_head *ws, 
>> struct compressed_bio *cb)
>>  }
>>  }
>>  
>> -out_len = lzo1x_worst_compress(PAGE_SIZE);
>> +out_len = max_segment_len;
>>  ret = lzo1x_decompress_safe(buf, in_len, workspace->buf,
>>  &out_len);
>>  if (need_unmap)
>> @@ -380,6 +404,15 @@ static int lzo_decompress_bio(struct list_head *ws, 
>> struct compressed_bio *cb)
>>  ret = -EIO;
>>  break;
>>  }
>> +/*
>> + * Decompressed data length check.
>> +   

Re: [PATCH v3 3/4] btrfs: lzo: Add header length check to avoid slab out of bounds access

2018-05-23 Thread Misono Tomohiro
On 2018/05/23 17:23, Qu Wenruo wrote:
> James Harvey reported that some corrupted compressed extent data can
> lead to various kernel memory corruption.
> 
> Such corrupted extent data belongs to an inode with the NODATASUM flag, thus
> data csum won't help us detect such bugs.
> 
> If we are lucky enough, KASAN can catch it like:
> ==
> BUG: KASAN: slab-out-of-bounds in lzo_decompress_bio+0x384/0x7a0 [btrfs]
> Write of size 4096 at addr ffff8800606cb0f8 by task kworker/u16:0/2338
> 
> CPU: 3 PID: 2338 Comm: kworker/u16:0 Tainted: G   O  
> 4.17.0-rc5-custom+ #50
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
> Call Trace:
>  dump_stack+0xc2/0x16b
>  print_address_description+0x6a/0x270
>  kasan_report+0x260/0x380
>  memcpy+0x34/0x50
>  lzo_decompress_bio+0x384/0x7a0 [btrfs]
>  end_compressed_bio_read+0x99f/0x10b0 [btrfs]
>  bio_endio+0x32e/0x640
>  normal_work_helper+0x15a/0xea0 [btrfs]
>  process_one_work+0x7e3/0x1470
>  worker_thread+0x1b0/0x1170
>  kthread+0x2db/0x390
>  ret_from_fork+0x22/0x40
> ...
> ==
> 
> The offending compressed data has the following info:
> 
> Header:           length 32768        (looks completely valid)
> Segment 0 Header:  length 3472882419   (obviously out of bounds)
> 
> Then when handling segment 0, since it's over the current page, we need to
> copy the compressed data to the workspace, and such a large size would
> trigger an out-of-bounds memory access, screwing up the whole kernel.
> 
> Fix it by adding extra checks on the header and segment headers to ensure
> we won't access out-of-bounds memory, and also check that the decompressed
> data won't be out-of-bounds.
> 
> Reported-by: James Harvey 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/lzo.c | 35 ++-
>  1 file changed, 34 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
> index ec5db393c758..4f4de460b08d 100644
> --- a/fs/btrfs/lzo.c
> +++ b/fs/btrfs/lzo.c
> @@ -293,6 +293,7 @@ static int lzo_decompress_bio(struct list_head *ws, 
> struct compressed_bio *cb)
>   unsigned long working_bytes;
>   size_t in_len;
>   size_t out_len;
> + size_t max_segment_len = lzo1x_worst_compress(PAGE_SIZE);
>   unsigned long in_offset;
>   unsigned long in_page_bytes_left;
>   unsigned long tot_in;
> @@ -306,6 +307,18 @@ static int lzo_decompress_bio(struct list_head *ws, 
> struct compressed_bio *cb)
>  
>   data_in = kmap(pages_in[0]);
>   tot_len = read_compress_length(data_in);
> + /*
> +  * Compressed data header check.
> +  *
> +  * The real compressed size can't exceed extent length, and all pages
> +  * should be used (a full pending page is not possible).
> +  * If this happens it means the compressed extent is corrupted.
> +  */
> + if (tot_len > min_t(size_t, BTRFS_MAX_COMPRESSED, srclen) ||
> + tot_len < srclen - PAGE_SIZE) {
> + ret = -EUCLEAN;
> + goto done;
> + }
>  
>   tot_in = LZO_LEN;
>   in_offset = LZO_LEN;

Just 1 line below here is:
tot_len = min_t(size_t, srclen, tot_len);
but this is not needed because tot_len <= srclen always holds given the
above if, right?

> @@ -320,6 +333,17 @@ static int lzo_decompress_bio(struct list_head *ws, 
> struct compressed_bio *cb)
>   in_offset += LZO_LEN;
>   tot_in += LZO_LEN;
>  
> + /*
> +  * Segment header check.
> +  *
> +  * The segment length must not exceed max lzo compression
> +  * size, nor the total compressed size
> +  */
> + if (in_len > max_segment_len || tot_in + in_len > tot_len) {
> + ret = -EUCLEAN;
> + goto done;
> + }
> +
>   tot_in += in_len;
>   working_bytes = in_len;
>   may_late_unmap = need_unmap = false;
> @@ -370,7 +394,7 @@ static int lzo_decompress_bio(struct list_head *ws, 
> struct compressed_bio *cb)
>   }
>   }
>  
> - out_len = lzo1x_worst_compress(PAGE_SIZE);
> + out_len = max_segment_len;
>   ret = lzo1x_decompress_safe(buf, in_len, workspace->buf,
> &out_len);
>   if (need_unmap)
> @@ -380,6 +404,15 @@ static int lzo_decompress_bio(struct list_head *ws, 
> struct compressed_bio *cb)
>   ret = -EIO;
>   break;
>   }
> + /*
> +  * Decompressed data length check.
> +  * The uncompressed data should not exceed uncompressed extent
> +  * size.
> +  */
> + if (tot_out + out_len > cb->len) {
> +  

Notes on locking for pagecache consistency (was: [PATCH 01/10] mm: pagecache add lock)

2018-05-23 Thread Kent Overstreet
On Fri, May 18, 2018 at 01:45:49PM -0400, Kent Overstreet wrote:
> On Fri, May 18, 2018 at 08:53:30AM -0700, Christoph Hellwig wrote:
> > On Fri, May 18, 2018 at 06:13:06AM -0700, Matthew Wilcox wrote:
> > > > Historically, the only problematic case has been direct IO, and people
> > > > have been willing to say "well, if you mix buffered and direct IO you
> > > > get what you deserve", and that's probably not unreasonable. But now we
> > > > have fallocate insert range and collapse range, and those are broken in
> > > > ways I frankly don't want to think about if they can't ensure 
> > > > consistency
> > > > with the page cache.
> > > 
> > > ext4 manages collapse-vs-pagefault with the ext4-specific i_mmap_sem.
> > > You may get pushback on the grounds that this ought to be a
> > > filesystem-specific lock rather than one embedded in the generic inode.
> > 
> > Honestly I think this probably should be in the core.  But IFF we move
> > it to the core the existing users of per-fs locks need to be moved
> > over first.  E.g. XFS as the very first one, and at least ext4 and f2fs
> > that copied the approach, and probably more if you audit deep enough.
> 
> I didn't know about i_mmap_sem, thanks
> 
> But, using a rw semaphore wouldn't work for dio writes, and I do need dio 
> writes
> to block pagecache adds in bcachefs since the dio write could overwrite
> uncompressed data or a reservation with compressed data, which means new 
> writes
> need a disk reservation.
> 
> Also I'm guessing ext4 takes the lock at the top of the read path? That sucks
> for reads that are already cached, the generic buffered read path can do 
> cached
> reads without taking any per inode locks - that's why I pushed the lock down 
> to
> only be taken when we have to add a page to the pagecache.
> 
> Definitely less ugly doing it that way though...

More notes on locking design: Matthew, this is very relevant to any sort of range
locking, too.

There's some really tricky issues related to dio writes + page faults. If your
particular filesystem doesn't care about minor page cache consistency issues
caused by dio writes most of this may not be relevant to you, but I honestly
would find it a little hard to believe this isn't an issue for _any_ other
filesystems.

Current situation today, for most filesystems: top of the dio write path shoots
down the region of the pagecache for the file it's writing to, with
filemap_write_and_wait_range() then invalidate_inode_pages2_range().

This is racy though and does _not_ guarantee page cache consistency, i.e. we
can end up in a situation where the write completes, but we have stale data -
and worse, potentially stale metadata, in the buffer heads or whatever your
filesystem uses.
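
To make that concrete, a minimal sketch of the shootdown sequence (the
helpers are the real generic ones, the wrapper function is made up for
illustration):

static int dio_write_shootdown(struct address_space *mapping,
			       loff_t pos, loff_t count)
{
	int ret;

	ret = filemap_write_and_wait_range(mapping, pos, pos + count - 1);
	if (ret)
		return ret;
	/*
	 * Racy: a page fault or buffered write can redirty a page in
	 * the window after writeback completes, making this call fail
	 * and leaving a stale page in the page cache.
	 */
	return invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
					     (pos + count - 1) >> PAGE_SHIFT);
}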

Ok, solving that is easy enough, we need a lock dio writes can take that
prevents pages from being added to a mapping for their duration (or alternately,
range locks, but range locks can be viewed as just an optimization).

This explains my locking design: if you have a lock that can be taken for "add"
or "block", where multiple threads can hold it for add, or multiple threads can
hold it for block, but it can't be in both states simultaneously, then it does
what we want. You can have multiple outstanding dio writes or multiple buffered
IO operations bringing pages in, and it doesn't add much overhead.
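
A minimal userspace sketch of those semantics - a signed counter that
is positive while held for "add" and negative while held for "block";
this is not the bcachefs implementation, and real code would sleep
instead of spinning:

#include <stdatomic.h>
#include <stdbool.h>

struct pagecache_lock { atomic_long v; };	/* hypothetical */

static bool pagecache_trylock(struct pagecache_lock *l, long dir)
{
	long old = atomic_load(&l->v);

	/* "add" holders keep v > 0, "block" holders keep v < 0 */
	if (dir > 0 ? old < 0 : old > 0)
		return false;
	return atomic_compare_exchange_weak(&l->v, &old, old + dir);
}

static void pagecache_lock_get(struct pagecache_lock *l, long dir)
{
	while (!pagecache_trylock(l, dir))
		;	/* spin for brevity */
}

static void pagecache_lock_put(struct pagecache_lock *l, long dir)
{
	atomic_fetch_sub(&l->v, dir);
}

/* buffered IO adding pages: pagecache_lock_get(l, +1);
 * dio write blocking adds:  pagecache_lock_get(l, -1); */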

This isn't enough, though.

 - calling filemap_write_and_wait_range then invalidate_inode_pages2_range can
   race - the call to invalidate_inode_pages2_range will fail if any of the
   pages have been redirtied, and we'll have stale pages in the page cache.

   The ideal solution for this would be a function to do both, that removes
   pages from the page cache atomically with clearing PageDirty and kicking off
   writeback. Alternately, you can have .page_mkwrite and the buffered write
   path take the pagecache add lock when they have to dirty a page, but that
   kind of sucks.

 - pagefaults, via gup()
   
   this is the really annoying one, if userspace does a dio write where the buf
   they're writing is mmapped and overlaps with the part of the file they're
   writing to, yay fun

   we call gup() after shooting down the part of the pagecache we're writing
   to, so gup() just faults it back in again and we're left with stale data in
   the page cache again.

   the generic dio code tries to deal with this these days by calling
   invalidate_inode_pages2_range() again after the dio write completes. Again
   though, invalidate_inode_pages2_range() will fail if the pages are dirty, and
   we're left with stale data in the page cache.

   I _think_ (haven't tried this yet) it ought to be possible to move the second
   call to invalidate_inode_pages2_range() to immediately after the call to
   gup() - this change wouldn't make sense in the current code without locking,
   but it would make sense if the dio write path is holding a pagecache add lock
   to prevent anything but its own faults via gup() from bringing pages back in.

   Also need to 

Re: [PATCH 1/2] Btrfs: fix clone vs chattr NODATASUM race

2018-05-23 Thread David Sterba
On Tue, May 22, 2018 at 03:02:12PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> In btrfs_clone_files(), we must check the NODATASUM flag while the
> inodes are locked. Otherwise, it's possible that btrfs_ioctl_setflags()
> will change the flags after we check and we can end up with a partly
> checksummed file.

The race window is only a few instructions wide, between the check and
taking the locks, which is:

3834 if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
3835 return -EISDIR;

where setflags must run and toggle the nodatacow flag (provided the
file size is 0).  The clone will block on the inode lock, setflags
takes the inode lock, changes flags, releases the lock, and clone continues.

Not impossible but still needs a lot of bad luck to hit unintentionally.

Reviewed-by: David Sterba 

> Fixes: 0e7b824c4ef9 ("Btrfs: don't make a file partly checksummed through 
> file clone")
> Signed-off-by: Omar Sandoval 
> ---
>  fs/btrfs/ioctl.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index cf0d3bc6f625..784e267aad32 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -4280,11 +4280,6 @@ static noinline int btrfs_clone_files(struct file 
> *file, struct file *file_src,
>   src->i_sb != inode->i_sb)
>   return -EXDEV;
>  
> - /* don't make the dst file partly checksummed */
> - if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
> - (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
> - return -EINVAL;
> -
>   if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
>   return -EISDIR;
>  
> @@ -4294,6 +4289,13 @@ static noinline int btrfs_clone_files(struct file 
> *file, struct file *file_src,
>   inode_lock(src);
>   }
>  
> + /* don't make the dst file partly checksummed */
> + if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
> + (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
> + ret = -EINVAL;
> + goto out_unlock;
> + }
> +
>   /* determine range to clone */
>   ret = -EINVAL;
>   if (off + len > src->i_size || off + len < off)
> -- 
> 2.17.0
> 


Re: [PATCH 0/2] Btrfs: fix partly checksummed file races

2018-05-23 Thread David Sterba
On Wed, May 23, 2018 at 10:14:14AM -0700, Omar Sandoval wrote:
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c

> The clone fix and device remove fix are in that branch, too. Let me know
> if you'd prefer it as patches.

A diff like that is fine; there will likely be more conflicts with
followup patches that need to be resolved anyway. If you do that and
find something worth pointing out then it's good to mention it, but
otherwise I'm not asking you or other reporters to also post the
resolved version. Thanks.


Re: [PATCH] Btrfs: implement unlocked buffered write

2018-05-23 Thread Chris Mason



On 23 May 2018, at 2:37, Christoph Hellwig wrote:

> On Tue, May 22, 2018 at 02:31:36PM -0400, Chris Mason wrote:
>>> And what protects two writes from interleaving their results now?
>>
>> page locks...ish, we at least won't have results interleaved in a
>> single page.  For btrfs it'll actually be multiple pages since we try
>> to do more than one at a time.
>
> I think you are going to break just about every assumption people
> have that any single write is going to be atomic vs another write.
>
> E.g. this comes from the posix read definition referenced by the
> write definition:
>
> "I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> Atomic means that all the bytes from a single operation that started
> out together end up together, without interleaving from other I/O
> operations. It is a known attribute of terminals that this is not
> honored, and terminals are explicitly (and implicitly permanently)
> excepted, making the behavior unspecified. The behavior for other
> device types is also left unspecified, but the wording is intended to
> imply that future standards might choose to specify atomicity (or
> not)."
>
> Without taking the inode lock (or some sort of range lock) you can
> easily interleave data from two write requests.

"This volume of IEEE Std 1003.1-2001 does not specify behavior of
concurrent writes to a file from multiple processes. Applications should
use some form of concurrency control."

I'm always more worried about truncate than standards ;)  But just to be
clear, I'm not disagreeing with you at all.  Interleaved writes just
wasn't the first thing that jumped to my mind.

>> But we're not avoiding the inode lock completely, we're just dropping
>> it for the expensive parts of writing to the file.  A quick guess
>> about what the expensive parts are:
>
> The way I read the patch it basically 'avoids' the inode lock for
> almost the whole write call, just minus some setup.

Yeah, if we can get 90% of the way there by pushing some
balance_dirty_pages() outside the lock (or whatever other expensive
setup we're doing), I'd be much happier.

-chris


[RFC PATCH v3 5/9] drivers/block/zram/zram_drv: update usage of zone modifiers

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

Use __GFP_ZONE_MOVABLE to replace (__GFP_HIGHMEM | __GFP_MOVABLE).

___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 have been deleted from the
GFP bitmasks; the bottom three bits of the GFP mask are reserved for
storing the encoded zone number.

__GFP_ZONE_MOVABLE contains encoded ZONE_MOVABLE and __GFP_MOVABLE flag.

With GFP_ZONE_TABLE, __GFP_HIGHMEM ORed with __GFP_MOVABLE means gfp_zone
should return ZONE_MOVABLE. In order to stay compatible with
GFP_ZONE_TABLE, replace (__GFP_HIGHMEM | __GFP_MOVABLE) with
__GFP_ZONE_MOVABLE.

Signed-off-by: Huaisheng Ye 
Cc: Minchan Kim 
Cc: Nitin Gupta 
Cc: Sergey Senozhatsky 
Cc: Christoph Hellwig 
---
 drivers/block/zram/zram_drv.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 0f3fadd..1bb5ca8 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1004,14 +1004,12 @@ static int __zram_bvec_write(struct zram *zram, struct 
bio_vec *bvec,
handle = zs_malloc(zram->mem_pool, comp_len,
__GFP_KSWAPD_RECLAIM |
__GFP_NOWARN |
-   __GFP_HIGHMEM |
-   __GFP_MOVABLE);
+   __GFP_ZONE_MOVABLE);
if (!handle) {
zcomp_stream_put(zram->comp);
atomic64_inc(&zram->stats.writestall);
handle = zs_malloc(zram->mem_pool, comp_len,
-   GFP_NOIO | __GFP_HIGHMEM |
-   __GFP_MOVABLE);
+   GFP_NOIO | __GFP_ZONE_MOVABLE);
if (handle)
goto compress_again;
return -ENOMEM;
-- 
1.8.3.1



[RFC PATCH v3 9/9] arch/x86/include/asm/page.h: update usage of movableflags

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

GFP_HIGHUSER_MOVABLE no longer equals GFP_HIGHUSER | __GFP_MOVABLE;
modify this to adapt to the patch getting rid of GFP_ZONE_TABLE/BAD.

Signed-off-by: Huaisheng Ye 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Kate Stewart 
Cc: Greg Kroah-Hartman 
Cc: x...@kernel.org 
Cc: Philippe Ombredanne 
Cc: Christoph Hellwig 
---
 arch/x86/include/asm/page.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 7555b48..a47f42d 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -35,7 +35,8 @@ static inline void copy_user_page(void *to, void *from, 
unsigned long vaddr,
 }
 
 #define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
-   alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
+   alloc_page_vma((movableflags ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER) \
+   | __GFP_ZERO, vma, vaddr)
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 
 #ifndef __pa
-- 
1.8.3.1



Re: [PATCH 0/2] Btrfs: fix partly checksummed file races

2018-05-23 Thread Omar Sandoval
On Wed, May 23, 2018 at 12:17:14PM +0200, David Sterba wrote:
> On Tue, May 22, 2018 at 03:02:11PM -0700, Omar Sandoval wrote:
> > Based on kdave/for-next. Note that there's a Fixes: tag in there
> > referencing a commit in the for-next branch, so that would have to be
> > updated if the commit gets rebased. These patches are also available at
> > https://github.com/osandov/linux/tree/btrfs-nodatasum-race.
> 
> If the original patch is not in any released or frozen branch, then the
> fix should be folded to the original patch. The for-next branch is for
> preview, testing and catching bugs that slip past the review. And gets
> reassembled frequently so referencing a patch from there does not make
> sense.
> 
> Sending the fixups as patches is ok, replies to the original thread
> might get lost in the noise.

Ok, let's fold it in. I pushed Timofey's series with the fix folded in
here: https://github.com/osandov/linux/tree/btrfs-ioctl-fixes, based on
misc-next with Timofey's patches removed. The diff vs his original
patches is the same as my patch:

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 65d709002775..75c66ac77fd7 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3113,12 +3113,6 @@ static int btrfs_extent_same(struct inode *src, u64 
loff, u64 olen,
if (olen == 0)
return 0;
 
-   /* don't make the dst file partly checksummed */
-   if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
-   (BTRFS_I(dst)->flags & BTRFS_INODE_NODATASUM)) {
-   return -EINVAL;
-   }
-
tail_len = olen % BTRFS_MAX_DEDUPE_LEN;
chunk_count = div_u64(olen, BTRFS_MAX_DEDUPE_LEN);
if (chunk_count == 0)
@@ -3151,6 +3145,13 @@ static int btrfs_extent_same(struct inode *src, u64 
loff, u64 olen,
else
btrfs_double_inode_lock(src, dst);
 
+   /* don't make the dst file partly checksummed */
+   if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
+   (BTRFS_I(dst)->flags & BTRFS_INODE_NODATASUM)) {
+   ret = -EINVAL;
+   goto out;
+   }
+
for (i = 0; i < chunk_count; i++) {
ret = btrfs_extent_same_range(src, loff, BTRFS_MAX_DEDUPE_LEN,
  dst, dst_loff, );

The clone fix and device remove fix are in that branch, too. Let me know
if you'd prefer it as patches.


Re: [PATCH 01/10] mm: pagecache add lock

2018-05-23 Thread Kent Overstreet
On Wed, May 23, 2018 at 08:22:39AM -0700, Christoph Hellwig wrote:
> On Sun, May 20, 2018 at 06:45:24PM -0400, Kent Overstreet wrote:
> > > 
> > > Honestly I think this probably should be in the core.  But IFF we move
> > > it to the core the existing users of per-fs locks need to be moved
> > > over first.  E.g. XFS as the very first one, and at least ext4 and f2fs
> > > that copied the approach, and probably more if you audit deep enough.
> > 
> > I'm not going to go and redo locking in XFS and ext4 as a prerequisite to
> > merging bcachefs. Sorry, but that's a bit crazy.
> 
> It isn't crazy at all.  In general we expect people to do their fair
> share of core work to get their pet feature in.  How much is acceptable
> is a difficult question and not black and white.
> 
> But if you want to grow a critical core structure you better take a stab
> at converting existing users.  Without that the tradeoff of growing
> that core structure is simply not given.
> 
> Or to put it in words for this exact feature:  unless your new field
> is also used by mainstream file systems it is not going to be added.

Christoph, I'm really not someone you can accuse of avoiding my share of core
work and refactoring and you know it.

When are you going to get around to converting existing users of fs/direct-io.c
to iomap so it can be deleted? The kernel is carrying around two dio
implementations right now thanks to you. Not a good situation, is it?


[RFC PATCH v3 8/9] include/linux/highmem.h: update usage of movableflags

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

GFP_HIGHUSER_MOVABLE no longer equals GFP_HIGHUSER | __GFP_MOVABLE;
modify this to adapt to the patch getting rid of GFP_ZONE_TABLE/BAD.

Signed-off-by: Huaisheng Ye 
Cc: Kate Stewart 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Philippe Ombredanne 
Cc: Christoph Hellwig 
---
 include/linux/highmem.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 0690679..5383c9e 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -159,8 +159,8 @@ static inline void clear_user_highpage(struct page *page, 
unsigned long vaddr)
struct vm_area_struct *vma,
unsigned long vaddr)
 {
-   struct page *page = alloc_page_vma(GFP_HIGHUSER | movableflags,
-   vma, vaddr);
+   struct page *page = alloc_page_vma(movableflags ?
+   GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER, vma, vaddr);
 
if (page)
clear_user_highpage(page, vaddr);
-- 
1.8.3.1



[RFC PATCH v3 7/9] mm/zsmalloc: update usage of zone modifiers

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

Use __GFP_ZONE_MOVABLE to replace (__GFP_HIGHMEM | __GFP_MOVABLE).

___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 have been deleted from the
GFP bitmasks; the bottom three bits of the GFP mask are reserved for
storing the encoded zone number.

__GFP_ZONE_MOVABLE contains encoded ZONE_MOVABLE and __GFP_MOVABLE flag.

With GFP_ZONE_TABLE, __GFP_HIGHMEM ORed with __GFP_MOVABLE means gfp_zone
should return ZONE_MOVABLE. In order to stay compatible with
GFP_ZONE_TABLE, use GFP_NORMAL_UNMOVABLE() to clear the bottom 4 bits of
the GFP bitmask.

Signed-off-by: Huaisheng Ye 
Cc: Minchan Kim 
Cc: Nitin Gupta 
Cc: Sergey Senozhatsky 
Cc: Christoph Hellwig 
---
 mm/zsmalloc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 61cb05d..e250c69 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -345,7 +345,7 @@ static void destroy_cache(struct zs_pool *pool)
 static unsigned long cache_alloc_handle(struct zs_pool *pool, gfp_t gfp)
 {
return (unsigned long)kmem_cache_alloc(pool->handle_cachep,
-   gfp & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
+   GFP_NORMAL_UNMOVABLE(gfp));
 }
 
 static void cache_free_handle(struct zs_pool *pool, unsigned long handle)
@@ -356,7 +356,7 @@ static void cache_free_handle(struct zs_pool *pool, 
unsigned long handle)
 static struct zspage *cache_alloc_zspage(struct zs_pool *pool, gfp_t flags)
 {
return kmem_cache_alloc(pool->zspage_cachep,
-   flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
+   GFP_NORMAL_UNMOVABLE(flags));
 }
 
 static void cache_free_zspage(struct zs_pool *pool, struct zspage *zspage)
-- 
1.8.3.1



[RFC PATCH v3 4/9] fs/btrfs/extent_io: update usage of zone modifiers

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

Use __GFP_ZONE_MASK to replace (__GFP_DMA32 | __GFP_HIGHMEM).

In function alloc_extent_state, it is obvious that __GFP_DMA is not
the expected zone type.

___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 have been deleted from the
GFP bitmasks; the bottom three bits of the GFP mask are reserved for
storing the encoded zone number.
__GFP_DMA, __GFP_HIGHMEM and __GFP_DMA32 should not be ORed with each
other.

Use GFP_NORMAL() to clear the bottom 3 bits of the GFP bitmask.

Signed-off-by: Huaisheng Ye 
Cc: Chris Mason 
Cc: Josef Bacik 
Cc: David Sterba 
Cc: Christoph Hellwig 
---
 fs/btrfs/extent_io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e99b329..f41fc61 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -220,7 +220,7 @@ static struct extent_state *alloc_extent_state(gfp_t mask)
 * The given mask might be not appropriate for the slab allocator,
 * drop the unsupported bits
 */
-   mask &= ~(__GFP_DMA32|__GFP_HIGHMEM);
+   mask = GFP_NORMAL(mask);
state = kmem_cache_alloc(extent_state_cache, mask);
if (!state)
return state;
-- 
1.8.3.1



[RFC PATCH v3 3/9] drivers/xen/swiotlb-xen: update usage of zone modifiers

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

Use __GFP_ZONE_MASK to replace (__GFP_DMA | __GFP_HIGHMEM).

In function xen_swiotlb_alloc_coherent, it is obvious that __GFP_DMA32
is not the expected zone type.

___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 have been deleted from the
GFP bitmasks; the bottom three bits of the GFP mask are reserved for
storing the encoded zone number.
__GFP_DMA, __GFP_HIGHMEM and __GFP_DMA32 should not be ORed with each
other.

Use GFP_NORMAL() to clear the bottom 3 bits of the GFP bitmask.

Signed-off-by: Huaisheng Ye 
Cc: Konrad Rzeszutek Wilk 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Christoph Hellwig 
---
 drivers/xen/swiotlb-xen.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index e1c6089..359 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -301,7 +301,7 @@ int __ref xen_swiotlb_init(int verbose, bool early)
* machine physical layout.  We can't allocate highmem
* because we can't return a pointer to it.
*/
-   flags &= ~(__GFP_DMA | __GFP_HIGHMEM);
+   flags = GFP_NORMAL(flags);
 
/* On ARM this function returns an ioremap'ped virtual address for
 * which virt_to_phys doesn't return the corresponding physical
-- 
1.8.3.1



[RFC PATCH v3 2/9] include/linux/dma-mapping: update usage of zone modifiers

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

Use __GFP_ZONE_MASK to replace (__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32).

___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 have been deleted from the
GFP bitmasks; the bottom three bits of the GFP mask are reserved for
storing the encoded zone number.
__GFP_DMA, __GFP_HIGHMEM and __GFP_DMA32 should not be ORed with each
other.

Use GFP_NORMAL() to clear the bottom 3 bits of the GFP bitmask.

Signed-off-by: Huaisheng Ye 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Robin Murphy 
Cc: Christoph Hellwig 
---
 include/linux/dma-mapping.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index f8ab1c0..8fe524d 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -519,7 +519,7 @@ static inline void *dma_alloc_attrs(struct device *dev, 
size_t size,
return cpu_addr;
 
/* let the implementation decide on the zone to allocate from: */
-   flag &= ~(__GFP_DMA | __GFP_DMA32 | __GFP_HIGHMEM);
+   flag = GFP_NORMAL(flag);
 
if (!arch_dma_alloc_attrs(&dev, &flag))
return NULL;
-- 
1.8.3.1



[RFC PATCH v3 1/9] include/linux/gfp.h: get rid of GFP_ZONE_TABLE/BAD

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

Replace GFP_ZONE_TABLE and GFP_ZONE_BAD with encoded zone number.

Delete ___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 from the GFP bitmasks;
the bottom three bits of the GFP mask are reserved for storing the
encoded zone number.

The encoding method is XOR: take the zone number from enum zone_type,
then XOR it with ZONE_NORMAL.
The goal is to make sure ZONE_NORMAL encodes to zero, so that
compatibility is guaranteed and masks such as GFP_KERNEL and GFP_ATOMIC
can be used as before.

Reserve __GFP_MOVABLE in bit 3, so that it can continue to be used as
a flag. Same as before, __GFP_MOVABLE represents the movable migrate type
for ZONE_DMA, ZONE_DMA32, and ZONE_NORMAL. But when it is enabled with
__GFP_HIGHMEM, ZONE_MOVABLE shall be returned instead of ZONE_HIGHMEM.
__GFP_ZONE_MOVABLE is created to realize it.

With this patch, just enabling __GFP_MOVABLE and __GFP_HIGHMEM is not
enough to get ZONE_MOVABLE from gfp_zone. All subsystems should use
GFP_HIGHUSER_MOVABLE directly to achieve that.

Decode zone number directly from bottom three bits of flags in gfp_zone.
The theory of encoding and decoding is,
A ^ B ^ B = A

Suggested-by: Matthew Wilcox 
Signed-off-by: Huaisheng Ye 
Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Kate Stewart 
Cc: "Levin, Alexander (Sasha Levin)" 
Cc: Greg Kroah-Hartman 
Cc: Christoph Hellwig 
---
 include/linux/gfp.h | 107 ++--
 1 file changed, 20 insertions(+), 87 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 1a4582b..f76ccd76 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -16,9 +16,7 @@
  */
 
 /* Plain integer GFP bitmasks. Do not use this directly. */
-#define ___GFP_DMA 0x01u
-#define ___GFP_HIGHMEM 0x02u
-#define ___GFP_DMA32   0x04u
+#define ___GFP_ZONE_MASK   0x07u
 #define ___GFP_MOVABLE 0x08u
 #define ___GFP_RECLAIMABLE 0x10u
#define ___GFP_HIGH        0x20u
@@ -53,11 +51,15 @@
  * without the underscores and use them consistently. The definitions here may
  * be used in bit comparisons.
  */
-#define __GFP_DMA  ((__force gfp_t)___GFP_DMA)
-#define __GFP_HIGHMEM  ((__force gfp_t)___GFP_HIGHMEM)
-#define __GFP_DMA32((__force gfp_t)___GFP_DMA32)
+#define __GFP_DMA  ((__force gfp_t)OPT_ZONE_DMA ^ ZONE_NORMAL)
+#define __GFP_HIGHMEM  ((__force gfp_t)OPT_ZONE_HIGHMEM ^ ZONE_NORMAL)
+#define __GFP_DMA32((__force gfp_t)OPT_ZONE_DMA32 ^ ZONE_NORMAL)
 #define __GFP_MOVABLE  ((__force gfp_t)___GFP_MOVABLE)  /* ZONE_MOVABLE 
allowed */
-#define GFP_ZONEMASK   (__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE)
+#define GFP_ZONEMASK   ((__force gfp_t)___GFP_ZONE_MASK | ___GFP_MOVABLE)
+/* bottom 3 bits of GFP bitmasks are used for the encoded zone number */
+#define __GFP_ZONE_MASK ((__force gfp_t)___GFP_ZONE_MASK)
+#define __GFP_ZONE_MOVABLE \
+   ((__force gfp_t)(ZONE_MOVABLE ^ ZONE_NORMAL) | ___GFP_MOVABLE)
 
 /*
  * Page mobility and placement hints
@@ -268,6 +270,13 @@
  *   available and will not wake kswapd/kcompactd on failure. The _LIGHT
  *   version does not attempt reclaim/compaction at all and is by default used
  *   in page fault path, while the non-light is used by khugepaged.
+ *
+ * GFP_NORMAL() is used to clear bottom 3 bits of GFP bitmask. Actually it
+ *   returns encoded ZONE_NORMAL bits.
+ *
+ * GFP_NORMAL_UNMOVABLE() is similar to GFP_NORMAL, but it clear bottom 4 bits
+ *   of GFP bitmask. Excepting the encoded ZONE_NORMAL bits, it clears MOVABLE
+ *   flags as well.
  */
 #define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
 #define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
@@ -279,10 +288,12 @@
#define GFP_DMA        __GFP_DMA
 #define GFP_DMA32  __GFP_DMA32
 #define GFP_HIGHUSER   (GFP_USER | __GFP_HIGHMEM)
-#define GFP_HIGHUSER_MOVABLE   (GFP_HIGHUSER | __GFP_MOVABLE)
+#define GFP_HIGHUSER_MOVABLE   (GFP_USER | __GFP_ZONE_MOVABLE)
 #define GFP_TRANSHUGE_LIGHT((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
 __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
 #define GFP_TRANSHUGE  (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
+#define GFP_NORMAL(gfp)        ((gfp) & ~__GFP_ZONE_MASK)
+#define GFP_NORMAL_UNMOVABLE(gfp) ((gfp) & ~GFP_ZONEMASK)
 
 /* Convert GFP flags to their corresponding migrate type */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
@@ -326,87 +337,9 @@ static inline bool gfpflags_allow_blocking(const gfp_t 
gfp_flags)
 #define OPT_ZONE_DMA32 ZONE_NORMAL
 #endif
 
-/*
- * GFP_ZONE_TABLE is a word size bitstring that is used for looking up the
- * zone to use given the lowest 4 bits of 

[RFC PATCH v3 0/9] get rid of GFP_ZONE_TABLE/BAD

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

Changes since v2: [2]
* According to Christoph's suggestion, rebase patches to current
  mainline from v4.16.

* Follow the advice of Matthew, create macros like GFP_NORMAL and
  GFP_NORMAL_UNMOVABLE to clear bottom 3 and 4 bits of GFP bitmask.

* Delete some patches because of kernel updating.

[2]: https://marc.info/?l=linux-mm=152691610014027=2

Tested by Lenovo Thinksystem server.

Initmem setup node 0 [mem 0x1000-0x00043fff]
[0.00] On node 0 totalpages: 4111666
[0.00]   DMA zone: 64 pages used for memmap
[0.00]   DMA zone: 23 pages reserved
[0.00]   DMA zone: 3999 pages, LIFO batch:0
[0.00] mminit::memmap_init Initialising map node 0 zone 0 pfns 1 -> 
4096 
[0.00]   DMA32 zone: 10935 pages used for memmap
[0.00]   DMA32 zone: 699795 pages, LIFO batch:31
[0.00] mminit::memmap_init Initialising map node 0 zone 1 pfns 4096 -> 
1048576
[0.00]   Normal zone: 53248 pages used for memmap
[0.00]   Normal zone: 3407872 pages, LIFO batch:31
[0.00] mminit::memmap_init Initialising map node 0 zone 2 pfns 1048576 
-> 4456448
[0.00] mminit::memmap_init Initialising map node 0 zone 3 pfns 1 -> 
4456448
[0.00] Initmem setup node 1 [mem 0x00238000-0x00277fff]
[0.00] On node 1 totalpages: 4194304
[0.00]   Normal zone: 65536 pages used for memmap
[0.00]   Normal zone: 4194304 pages, LIFO batch:31
[0.00] mminit::memmap_init Initialising map node 1 zone 2 pfns 37224448 
-> 41418752
[0.00] mminit::memmap_init Initialising map node 1 zone 3 pfns 37224448 
-> 41418752
...
[0.00] mminit::zonelist general 0:DMA = 0:DMA
[0.00] mminit::zonelist general 0:DMA32 = 0:DMA32 0:DMA
[0.00] mminit::zonelist general 0:Normal = 0:Normal 0:DMA32 0:DMA 
1:Normal
[0.00] mminit::zonelist thisnode 0:DMA = 0:DMA
[0.00] mminit::zonelist thisnode 0:DMA32 = 0:DMA32 0:DMA
[0.00] mminit::zonelist thisnode 0:Normal = 0:Normal 0:DMA32 0:DMA
[0.00] mminit::zonelist general 1:Normal = 1:Normal 0:Normal 0:DMA32 
0:DMA
[0.00] mminit::zonelist thisnode 1:Normal = 1:Normal
[0.00] Built 2 zonelists, mobility grouping on.  Total pages: 8176164
[0.00] Policy zone: Normal
[0.00] Kernel command line: BOOT_IMAGE=/vmlinuz-4.17.0-rc6-gfp09+ 
root=/dev/mapper/fedora-root ro rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap 
debug 
LANG=en_US.UTF-8 mminit_loglevel=4 console=tty0 console=ttyS0,115200n8 
memblock=debug
earlyprintk=serial,0x3f8,115200

---

Replace GFP_ZONE_TABLE and GFP_ZONE_BAD with encoded zone number.

Delete ___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 from the GFP bitmasks;
the bottom three bits of the GFP mask are reserved for storing the
encoded zone number.

The encoding method is XOR: take the zone number from enum zone_type,
then XOR it with ZONE_NORMAL.
The goal is to make sure ZONE_NORMAL encodes to zero, so that
compatibility is guaranteed and masks such as GFP_KERNEL and GFP_ATOMIC
can be used as before.

Reserve __GFP_MOVABLE in bit 3, so that it can continue to be used as
a flag. Same as before, __GFP_MOVABLE represents the movable migrate type
for ZONE_DMA, ZONE_DMA32, and ZONE_NORMAL. But when it is enabled with
__GFP_HIGHMEM, ZONE_MOVABLE shall be returned instead of ZONE_HIGHMEM.
__GFP_ZONE_MOVABLE is created to realize it.

With this patch, just enabling __GFP_MOVABLE and __GFP_HIGHMEM is not
enough to get ZONE_MOVABLE from gfp_zone. All callers should use
GFP_HIGHUSER_MOVABLE or __GFP_ZONE_MOVABLE directly to achieve that.

Decode zone number directly from bottom three bits of flags in gfp_zone.
The theory of encoding and decoding is,
A ^ B ^ B = A

Changes since v1:[1]

* Create __GFP_ZONE_MOVABLE and modify GFP_HIGHUSER_MOVABLE to help
  callers to get ZONE_MOVABLE. Try to create __GFP_ZONE_MASK to mask
  lowest 3 bits of GFP bitmasks.

* Modify some callers' gfp flag to update usage of address zone
  modifiers.

* Modify inline function gfp_zone to get better performance according
  to Matthew's suggestion.

[1]: https://marc.info/?l=linux-mm=152596791931266=2

---

Huaisheng Ye (9):
  include/linux/gfp.h: get rid of GFP_ZONE_TABLE/BAD
  include/linux/dma-mapping: update usage of zone modifiers
  drivers/xen/swiotlb-xen: update usage of zone modifiers
  fs/btrfs/extent_io: update usage of zone modifiers
  drivers/block/zram/zram_drv: update usage of zone modifiers
  mm/vmpressure: update usage of zone modifiers
  mm/zsmalloc: update usage of zone modifiers
  include/linux/highmem.h: update usage of movableflags
  arch/x86/include/asm/page.h: update usage of movableflags

 arch/x86/include/asm/page.h   |   3 +-
 drivers/block/zram/zram_drv.c |   6 +--
 drivers/xen/swiotlb-xen.c |   2 +-
 fs/btrfs/extent_io.c  |   2 +-
 include/linux/dma-mapping.h   |   2 +-
 include/linux/gfp.h   | 107 

RE: [External] Re: [RFC PATCH v2 00/12] get rid of GFP_ZONE_TABLE/BAD

2018-05-23 Thread Huaisheng HS1 Ye
From: Michal Hocko [mailto:mho...@kernel.org]
Sent: Wednesday, May 23, 2018 2:37 AM
> 
> On Mon 21-05-18 23:20:21, Huaisheng Ye wrote:
> > From: Huaisheng Ye 
> >
> > Replace GFP_ZONE_TABLE and GFP_ZONE_BAD with encoded zone number.
> >
> > > Delete ___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 from the GFP bitmasks;
> > > the bottom three bits of the GFP mask are reserved for storing the
> > > encoded zone number.
> >
> > The encoding method is XOR. Get zone number from enum zone_type,
> > then encode the number with ZONE_NORMAL by XOR operation.
> > The goal is to make sure ZONE_NORMAL can be encoded to zero. So,
> > the compatibility can be guaranteed, such as GFP_KERNEL and GFP_ATOMIC
> > can be used as before.
> >
> > Reserve __GFP_MOVABLE in bit 3, so that it can continue to be used as
> > > a flag. Same as before, __GFP_MOVABLE represents the movable migrate type
> > for ZONE_DMA, ZONE_DMA32, and ZONE_NORMAL. But when it is enabled with
> > __GFP_HIGHMEM, ZONE_MOVABLE shall be returned instead of ZONE_HIGHMEM.
> > __GFP_ZONE_MOVABLE is created to realize it.
> >
> > With this patch, just enabling __GFP_MOVABLE and __GFP_HIGHMEM is not
> > enough to get ZONE_MOVABLE from gfp_zone. All callers should use
> > GFP_HIGHUSER_MOVABLE or __GFP_ZONE_MOVABLE directly to achieve that.
> >
> > Decode zone number directly from bottom three bits of flags in gfp_zone.
> > The theory of encoding and decoding is,
> > A ^ B ^ B = A
> 
> So why is this any better than the current code. Sure I am not a great
> fan of GFP_ZONE_TABLE because of how it is incomprehensible but this
> doesn't look too much better, yet we are losing a check for incompatible
> gfp flags. The diffstat looks really sound but then you just look and
> see that the large part is the comment that at least explained the gfp
> zone modifiers somehow and the debugging code. So what is the selling
> point?

Dear Michal,

Let me try to reply to your questions.
Exactly, GFP_ZONE_TABLE is too complicated. I think this series of patches
has two advantages.

1. The XOR operation is simple and efficient. GFP_ZONE_TABLE/BAD needs two
shift operations, the first to get a zone_type and the second to check whether
the returned type is correct or not, but with these patches a single XOR is
enough. Because the bottom 3 bits of the GFP bitmask are used to represent the
encoded zone number, we can say there is no bad zone number as long as all
callers use it in a non-buggy way. Of course, the zone type returned by
gfp_zone must be no more than ZONE_MOVABLE.

2. GFP_ZONE_TABLE limits the number of zone types. The current GFP_ZONE_TABLE
is 32 bits; in general, there are 4 zone types on most X86_64 platforms:
ZONE_DMA, ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE. If we want to expand the
number of zone types beyond 4, the zone shift must grow to 3, and 16 table
entries times 3 bits each is 48 bits. That is to say, a 32-bit zone table is
not enough to store all zone types.
And the most painful thing is that the current GFP bitmask space is quite
constrained; it only has four ___GFP_XXX encodings that could be used, as
below:

#define ___GFP_DMA      0x01u
#define ___GFP_HIGHMEM  0x02u
#define ___GFP_DMA32    0x04u
(___GFP_NORMAL equals to 0x00)

If we use the implementation in these patches, a maximum of 8 zone types can
be used. The method of encoding and decoding is quite simple and users can
get an intuitive feeling for it as below, and most importantly, there are no
BAD zone types eventually.

A ^ B ^ B = A
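
A standalone demo of that round-trip (the zone numbering mirrors the
patch, so ZONE_NORMAL encodes to 0; the helper names are invented for
this sketch):

#include <stdio.h>

enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE };
#define GFP_ZONE_MASK 0x07u

static unsigned encode_zone(enum zone_type z)
{
	return z ^ ZONE_NORMAL;				/* ZONE_NORMAL -> 0 */
}

static enum zone_type decode_zone(unsigned flags)
{
	return (flags & GFP_ZONE_MASK) ^ ZONE_NORMAL;	/* A ^ B ^ B = A */
}

int main(void)
{
	for (int z = ZONE_DMA; z <= ZONE_MOVABLE; z++)
		printf("zone %d -> bits %u -> zone %d\n",
		       z, encode_zone(z), decode_zone(encode_zone(z)));
	return 0;
}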

And by the way, our v3 patches are ready, but Gmail's SMTP is quite unstable
due to a firewall issue on my side; I will try to resend them ASAP.

Sincerely,
Huaisheng Ye




[PATCH 4/4] btrfs: remove the logged extents infrastructure

2018-05-23 Thread Josef Bacik
From: Josef Bacik 

This is no longer used anywhere, remove all of it.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ordered-data.c  | 123 ---
 fs/btrfs/ordered-data.h  |  20 ++-
 fs/btrfs/tree-log.c  |  16 --
 include/trace/events/btrfs.h |   1 -
 4 files changed, 3 insertions(+), 157 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 6db8bb2f2c28..88f858baf87d 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -427,129 +427,6 @@ int btrfs_dec_test_ordered_pending(struct inode *inode,
return ret == 0;
 }
 
-/* Needs to either be called under a log transaction or the log_mutex */
-void btrfs_get_logged_extents(struct btrfs_inode *inode,
- struct list_head *logged_list,
- const loff_t start,
- const loff_t end)
-{
-   struct btrfs_ordered_inode_tree *tree;
-   struct btrfs_ordered_extent *ordered;
-   struct rb_node *n;
-   struct rb_node *prev;
-
-   tree = &inode->ordered_tree;
-   spin_lock_irq(&tree->lock);
-   n = __tree_search(&tree->tree, end, &prev);
-   if (!n)
-   n = prev;
-   for (; n; n = rb_prev(n)) {
-   ordered = rb_entry(n, struct btrfs_ordered_extent, rb_node);
-   if (ordered->file_offset > end)
-   continue;
-   if (entry_end(ordered) <= start)
-   break;
-   if (test_and_set_bit(BTRFS_ORDERED_LOGGED, &ordered->flags))
-   continue;
-   list_add(&ordered->log_list, logged_list);
-   refcount_inc(&ordered->refs);
-   }
-   spin_unlock_irq(&tree->lock);
-}
-
-void btrfs_put_logged_extents(struct list_head *logged_list)
-{
-   struct btrfs_ordered_extent *ordered;
-
-   while (!list_empty(logged_list)) {
-   ordered = list_first_entry(logged_list,
-  struct btrfs_ordered_extent,
-  log_list);
-   list_del_init(>log_list);
-   btrfs_put_ordered_extent(ordered);
-   }
-}
-
-void btrfs_submit_logged_extents(struct list_head *logged_list,
-struct btrfs_root *log)
-{
-   int index = log->log_transid % 2;
-
-   spin_lock_irq(&log->log_extents_lock[index]);
-   list_splice_tail(logged_list, &log->logged_list[index]);
-   spin_unlock_irq(&log->log_extents_lock[index]);
-}
-
-void btrfs_wait_logged_extents(struct btrfs_trans_handle *trans,
-  struct btrfs_root *log, u64 transid)
-{
-   struct btrfs_ordered_extent *ordered;
-   int index = transid % 2;
-
-   spin_lock_irq(&log->log_extents_lock[index]);
-   while (!list_empty(&log->logged_list[index])) {
-   struct inode *inode;
-   ordered = list_first_entry(&log->logged_list[index],
-  struct btrfs_ordered_extent,
-  log_list);
-   list_del_init(&ordered->log_list);
-   inode = ordered->inode;
-   spin_unlock_irq(&log->log_extents_lock[index]);
-
-   if (!test_bit(BTRFS_ORDERED_IO_DONE, &ordered->flags) &&
-   !test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags)) {
-   u64 start = ordered->file_offset;
-   u64 end = ordered->file_offset + ordered->len - 1;
-
-   WARN_ON(!inode);
-   filemap_fdatawrite_range(inode->i_mapping, start, end);
-   }
-   wait_event(ordered->wait, test_bit(BTRFS_ORDERED_IO_DONE,
-  &ordered->flags));
-
-   /*
-* In order to keep us from losing our ordered extent
-* information when committing the transaction we have to make
-* sure that any logged extents are completed when we go to
-* commit the transaction.  To do this we simply increase the
-* current transactions pending_ordered counter and decrement it
-* when the ordered extent completes.
-*/
-   if (!test_bit(BTRFS_ORDERED_COMPLETE, &ordered->flags)) {
-   struct btrfs_ordered_inode_tree *tree;
-
-   tree = &BTRFS_I(inode)->ordered_tree;
-   spin_lock_irq(&tree->lock);
-   if (!test_bit(BTRFS_ORDERED_COMPLETE, &ordered->flags)) {
-   set_bit(BTRFS_ORDERED_PENDING, &ordered->flags);
-   atomic_inc(&trans->transaction->pending_ordered);
-   }
-   spin_unlock_irq(&tree->lock);
-   }
-   btrfs_put_ordered_extent(ordered);
-   spin_lock_irq(&log->log_extents_lock[index]);
-   }
-   spin_unlock_irq(&log->log_extents_lock[index]);
-}
-
-void 

[PATCH 2/4] btrfs: remove the wait ordered logic in the log_one_extent path

2018-05-23 Thread Josef Bacik
From: Josef Bacik 

Since we are waiting on all ordered extents at the start of the fsync()
path we don't need to wait on any logged ordered extents, and we don't
need to look up the checksums on the ordered extents as they will
already be on disk prior to getting here.  Rework this so we're only
looking up and copying the on-disk checksums for the extent range we
care about.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/tree-log.c | 118 +---
 1 file changed, 10 insertions(+), 108 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 43758e30aa7a..791b5a731456 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4082,127 +4082,30 @@ static int extent_cmp(void *priv, struct list_head *a, 
struct list_head *b)
return 0;
 }
 
-static int wait_ordered_extents(struct btrfs_trans_handle *trans,
-   struct inode *inode,
-   struct btrfs_root *root,
-   const struct extent_map *em,
-   const struct list_head *logged_list,
-   bool *ordered_io_error)
+static int log_extent_csums(struct btrfs_trans_handle *trans,
+   struct btrfs_inode *inode,
+   struct btrfs_root *root,
+   const struct extent_map *em)
 {
struct btrfs_fs_info *fs_info = root->fs_info;
-   struct btrfs_ordered_extent *ordered;
struct btrfs_root *log = root->log_root;
-   u64 mod_start = em->mod_start;
-   u64 mod_len = em->mod_len;
-   const bool skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
u64 csum_offset;
u64 csum_len;
LIST_HEAD(ordered_sums);
int ret = 0;
 
-   *ordered_io_error = false;
-
-   if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags) ||
+   if (inode->flags & BTRFS_INODE_NODATASUM ||
+   test_bit(EXTENT_FLAG_PREALLOC, &em->flags) ||
em->block_start == EXTENT_MAP_HOLE)
return 0;
 
-   /*
-* Wait far any ordered extent that covers our extent map. If it
-* finishes without an error, first check and see if our csums are on
-* our outstanding ordered extents.
-*/
-   list_for_each_entry(ordered, logged_list, log_list) {
-   struct btrfs_ordered_sum *sum;
-
-   if (!mod_len)
-   break;
-
-   if (ordered->file_offset + ordered->len <= mod_start ||
-   mod_start + mod_len <= ordered->file_offset)
-   continue;
-
-   if (!test_bit(BTRFS_ORDERED_IO_DONE, &ordered->flags) &&
-   !test_bit(BTRFS_ORDERED_IOERR, &ordered->flags) &&
-   !test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags)) {
-   const u64 start = ordered->file_offset;
-   const u64 end = ordered->file_offset + ordered->len - 1;
-
-   WARN_ON(ordered->inode != inode);
-   filemap_fdatawrite_range(inode->i_mapping, start, end);
-   }
-
-   wait_event(ordered->wait,
-  (test_bit(BTRFS_ORDERED_IO_DONE, &ordered->flags) ||
-   test_bit(BTRFS_ORDERED_IOERR, &ordered->flags)));
-
-   if (test_bit(BTRFS_ORDERED_IOERR, &ordered->flags)) {
-   /*
-* Clear the AS_EIO/AS_ENOSPC flags from the inode's
-* i_mapping flags, so that the next fsync won't get
-* an outdated io error too.
-*/
-   filemap_check_errors(inode->i_mapping);
-   *ordered_io_error = true;
-   break;
-   }
-   /*
-* We are going to copy all the csums on this ordered extent, so
-* go ahead and adjust mod_start and mod_len in case this
-* ordered extent has already been logged.
-*/
-   if (ordered->file_offset > mod_start) {
-   if (ordered->file_offset + ordered->len >=
-   mod_start + mod_len)
-   mod_len = ordered->file_offset - mod_start;
-   /*
-* If we have this case
-*
-* |- logged extent -|
-*   |- ordered extent |
-*
-* Just don't mess with mod_start and mod_len, we'll
-* just end up logging more csums than we need and it
-* will be ok.
-*/
-   } else {
-   if (ordered->file_offset + ordered->len <
-   mod_start + mod_len) {
-   

[PATCH 3/4] btrfs: clean up the left over logged_list usage

2018-05-23 Thread Josef Bacik
From: Josef Bacik 

We no longer use this list we've passed around so remove it everywhere.
Also remove the extra checks for ordered/filemap errors as this is
handled higher up now that we're waiting on ordered_extents before
getting to the tree log code.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/tree-log.c | 32 ++--
 1 file changed, 2 insertions(+), 30 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 791b5a731456..ef1b1a071bba 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4133,7 +4133,6 @@ static int log_one_extent(struct btrfs_trans_handle 
*trans,
  struct btrfs_inode *inode, struct btrfs_root *root,
  const struct extent_map *em,
  struct btrfs_path *path,
- const struct list_head *logged_list,
  struct btrfs_log_ctx *ctx)
 {
struct btrfs_root *log = root->log_root;
@@ -4145,17 +4144,11 @@ static int log_one_extent(struct btrfs_trans_handle 
*trans,
u64 block_len;
int ret;
int extent_inserted = 0;
-   bool ordered_io_err = false;
 
ret = log_extent_csums(trans, inode, root, em);
if (ret)
return ret;
 
-   if (ordered_io_err) {
-   ctx->io_err = -EIO;
-   return ctx->io_err;
-   }
-
	btrfs_init_map_token(&token);
 
	ret = __btrfs_drop_extents(trans, log, &inode->vfs_inode, path,
	em->start,
@@ -4226,7 +4219,6 @@ static int btrfs_log_changed_extents(struct 
btrfs_trans_handle *trans,
 struct btrfs_root *root,
 struct btrfs_inode *inode,
 struct btrfs_path *path,
-struct list_head *logged_list,
 struct btrfs_log_ctx *ctx,
 const u64 start,
 const u64 end)
@@ -4302,20 +4294,6 @@ static int btrfs_log_changed_extents(struct 
btrfs_trans_handle *trans,
}
 
	list_sort(NULL, &extents, extent_cmp);
-   btrfs_get_logged_extents(inode, logged_list, logged_start, logged_end);
-   /*
-* Some ordered extents started by fsync might have completed
-* before we could collect them into the list logged_list, which
-* means they're gone, not in our logged_list nor in the inode's
-* ordered tree. We want the application/user space to know an
-* error happened while attempting to persist file data so that
-* it can take proper action. If such error happened, we leave
-* without writing to the log tree and the fsync must report the
-* file data write error and not commit the current transaction.
-*/
-   ret = filemap_check_errors(inode->vfs_inode.i_mapping);
-   if (ret)
-   ctx->io_err = ret;
 process:
	while (!list_empty(&extents)) {
em = list_entry(extents.next, struct extent_map, list);
@@ -4334,8 +4312,7 @@ static int btrfs_log_changed_extents(struct 
btrfs_trans_handle *trans,
 
		write_unlock(&tree->lock);
 
-   ret = log_one_extent(trans, inode, root, em, path, logged_list,
-ctx);
+   ret = log_one_extent(trans, inode, root, em, path, ctx);
		write_lock(&tree->lock);
clear_em_logging(tree, em);
free_extent_map(em);
@@ -4717,7 +4694,6 @@ static int btrfs_log_inode(struct btrfs_trans_handle 
*trans,
struct btrfs_key min_key;
struct btrfs_key max_key;
struct btrfs_root *log = root->log_root;
-   LIST_HEAD(logged_list);
u64 last_extent = 0;
int err = 0;
int ret;
@@ -5047,7 +5023,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle 
*trans,
}
if (fast_search) {
ret = btrfs_log_changed_extents(trans, root, inode, dst_path,
-   &logged_list, ctx, start, end);
+   ctx, start, end);
if (ret) {
err = ret;
goto out_unlock;
@@ -5098,10 +5074,6 @@ static int btrfs_log_inode(struct btrfs_trans_handle 
*trans,
inode->last_log_commit = inode->last_sub_trans;
	spin_unlock(&inode->lock);
 out_unlock:
-   if (unlikely(err))
-   btrfs_put_logged_extents(&logged_list);
-   else
-   btrfs_submit_logged_extents(&logged_list, log);
	mutex_unlock(&inode->log_mutex);
 
btrfs_free_path(path);
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/4] btrfs: always wait on ordered extents at fsync time

2018-05-23 Thread Josef Bacik
From: Josef Bacik 

There's a priority inversion that exists currently with btrfs fsync.  In
some cases we will collect outstanding ordered extents onto a list and
only wait on them at the very last second.  However this "very last
second" falls inside of a transaction handle, so if we are in a lower
priority cgroup we can end up holding the transaction open for longer
than needed, so if a high priority cgroup is also trying to fsync()
it'll see latency.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/file.c | 56 
 1 file changed, 4 insertions(+), 52 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 5772f0cbedef..2b1c36612384 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2069,53 +2069,12 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
	atomic_inc(&root->log_batch);
	full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
			     &BTRFS_I(inode)->runtime_flags);
+
/*
-* We might have had more pages made dirty after calling
-* start_ordered_ops and before acquiring the inode's i_mutex.
+* We have to do this here to avoid the priority inversion of waiting on
+* IO of a lower priority task while holding a transaction open.
 */
-   if (full_sync) {
-   /*
-* For a full sync, we need to make sure any ordered operations
-* start and finish before we start logging the inode, so that
-* all extents are persisted and the respective file extent
-* items are in the fs/subvol btree.
-*/
-   ret = btrfs_wait_ordered_range(inode, start, len);
-   } else {
-   /*
-* Start any new ordered operations before starting to log the
-* inode. We will wait for them to finish in btrfs_sync_log().
-*
-* Right before acquiring the inode's mutex, we might have new
-* writes dirtying pages, which won't immediately start the
-* respective ordered operations - that is done through the
-* fill_delalloc callbacks invoked from the writepage and
-* writepages address space operations. So make sure we start
-* all ordered operations before starting to log our inode. Not
-* doing this means that while logging the inode, writeback
-* could start and invoke writepage/writepages, which would call
-* the fill_delalloc callbacks (cow_file_range,
-* submit_compressed_extents). These callbacks add first an
-* extent map to the modified list of extents and then create
-* the respective ordered operation, which means in
-* tree-log.c:btrfs_log_inode() we might capture all existing
-* ordered operations (with btrfs_get_logged_extents()) before
-* the fill_delalloc callback adds its ordered operation, and by
-* the time we visit the modified list of extent maps (with
-* btrfs_log_changed_extents()), we see and process the extent
-* map they created. We then use the extent map to construct a
-* file extent item for logging without waiting for the
-* respective ordered operation to finish - this file extent
-* item points to a disk location that might not have yet been
-* written to, containing random data - so after a crash a log
-* replay will make our inode have file extent items that point
-* to disk locations containing invalid data, as we returned
-* success to userspace without waiting for the respective
-* ordered operation to finish, because it wasn't captured by
-* btrfs_get_logged_extents().
-*/
-   ret = start_ordered_ops(inode, start, end);
-   }
+   ret = btrfs_wait_ordered_range(inode, start, len);
if (ret) {
inode_unlock(inode);
goto out;
@@ -2240,13 +2199,6 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
goto out;
}
}
-   if (!full_sync) {
-   ret = btrfs_wait_ordered_range(inode, start, len);
-   if (ret) {
-   btrfs_end_transaction(trans);
-   goto out;
-   }
-   }
ret = btrfs_commit_transaction(trans);
} else {
ret = btrfs_end_transaction(trans);
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to 

Re: [PATCH] Btrfs: implement unlocked buffered write

2018-05-23 Thread Chris Mason

On 23 May 2018, at 3:26, robbieko wrote:


Chris Mason wrote on 2018-05-23 02:31:

On 22 May 2018, at 14:08, Christoph Hellwig wrote:


On Wed, May 16, 2018 at 11:52:37AM +0800, robbieko wrote:

From: Robbie Ko 

This idea is from direct io. With this patch, we can make buffered
writes parallel, and improve their performance and latency. But because
we cannot update i_size without i_mutex, the unlocked buffered write
can only be done in front of the EOF.

We needn't worry about the race between buffered write and truncate,
because truncate needs to wait until all buffered writes end.

And we also needn't worry about the race between dio write and punch
hole, because we have the extent lock to protect our operation.

I ran fio to test the performance of this feature.


And what protects two writes from interleaving their results now?


page locks...ish, we at least won't have results interleaved in a
single page.  For btrfs it'll actually be multiple pages since we try
to do more than one at a time.

I haven't verified all the assumptions around truncate and fallocate
and friends expecting the dio special locking to be inside i_size.  In
general this makes me a little uncomfortable.

But we're not avoiding the inode lock completely, we're just dropping
it for the expensive parts of writing to the file.  A quick guess
about what the expensive parts are:

1) balance_dirty_pages()
2) btrfs_btree_balance_dirty()
3) metadata reservations/enospc waiting.



The expensive parts of buffered_write are:
1. prepare_pages()
   --wait_on_page_writeback()
   Because writeback has already submitted the page and set PG_writeback,
   we must wait until the page's writeback IO ends.


Is this based on timing the fio job or something else?  We can trigger a 
stable page prep run before we take the mutex, but stable pages 
shouldn't be part of random IO workloads unless we're doing random IO 
inside a file that mostly fits in ram.


balance_dirty_pages() is a much more common waiting point, and doing 
that with the inode lock held definitely adds contention.




2. lock_and_cleanup_extent_if_need()
   --btrfs_start_ordered_extent()
   When a large number of ordered_extents are queued in the
   endio_write_workers workqueue, buffered_write must assume its
   ordered_extent is the last one in the queue, and so it waits for
   all ordered_extents ahead of it to be processed, because the
   workqueue is FIFO.




This isn't completely accurate, but it falls into the stable pages 
bucket as well.  We can push a lighter version of the stable page wait 
before the inode lock when the IO is inside of i_size.  It won't 
completely remove the possibility that someone else dirties those pages 
again, but if the workload is random or really splitting up the file, 
it'll make the work done with the lock held much less.
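
A lighter stable-page pass along those lines might, for example, wait
for writeback on just the target range before taking the lock; a rough
sketch, with the placement, the pos/count names and the i_size check
purely illustrative (this is not an actual patch):

	/*
	 * Hypothetical pre-wait: flush out stable-page waits for the
	 * target range before grabbing the inode lock, so most of the
	 * waiting happens without blocking other writers.
	 */
	if (pos + count <= i_size_read(inode))
		filemap_fdatawait_range(inode->i_mapping, pos,
					pos + count - 1);
	inode_lock(inode);

A page can of course be redirtied and written again between the
pre-wait and prepare_pages(), so this only shrinks the window rather
than eliminating it.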


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: always wait on ordered extents at fsync time

2018-05-23 Thread Filipe Manana
On Tue, May 22, 2018 at 6:47 PM, Josef Bacik  wrote:
> From: Josef Bacik 
>
> There's a priority inversion that exists currently with btrfs fsync.  In
> some cases we will collect outstanding ordered extents onto a list and
> only wait on them at the very last second.  However this "very last
> second" falls inside of a transaction handle, so if we are in a lower
> priority cgroup we can end up holding the transaction open for longer
> than needed, so if a high priority cgroup is also trying to fsync()
> it'll see latency.
>
> Fix this by getting rid of all of the logged extents magic and simply
> wait on ordered extents before we start the tree log stuff.  This code has
> changed a lot since I first wrote it and really isn't the performance
> win it was originally because of the things we had to do around getting
> the right checksums.  Killing all of this makes our lives easier and
> gets rid of the priority inversion.

Much easier!

>
> Signed-off-by: Josef Bacik 
Reviewed-by: Filipe Manana 

Looks good to me.
Happy to see all that complexity go away and knowing it no longer
offers any benefit.

> ---
>  fs/btrfs/file.c  |  56 ++-
>  fs/btrfs/ordered-data.c  | 123 
>  fs/btrfs/ordered-data.h  |  20 +-
>  fs/btrfs/tree-log.c  | 166 
> ---
>  include/trace/events/btrfs.h |   1 -
>  5 files changed, 19 insertions(+), 347 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 5772f0cbedef..2b1c36612384 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2069,53 +2069,12 @@ int btrfs_sync_file(struct file *file, loff_t start, 
> loff_t end, int datasync)
> atomic_inc(&root->log_batch);
> full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
>  &BTRFS_I(inode)->runtime_flags);
> +
> /*
> -* We might have had more pages made dirty after calling
> -* start_ordered_ops and before acquiring the inode's i_mutex.
> +* We have to do this here to avoid the priority inversion of waiting on
> +* IO of a lower priority task while holding a transaction open.
>  */
> -   if (full_sync) {
> -   /*
> -* For a full sync, we need to make sure any ordered 
> operations
> -* start and finish before we start logging the inode, so that
> -* all extents are persisted and the respective file extent
> -* items are in the fs/subvol btree.
> -*/
> -   ret = btrfs_wait_ordered_range(inode, start, len);
> -   } else {
> -   /*
> -* Start any new ordered operations before starting to log the
> -* inode. We will wait for them to finish in btrfs_sync_log().
> -*
> -* Right before acquiring the inode's mutex, we might have new
> -* writes dirtying pages, which won't immediately start the
> -* respective ordered operations - that is done through the
> -* fill_delalloc callbacks invoked from the writepage and
> -* writepages address space operations. So make sure we start
> -* all ordered operations before starting to log our inode. 
> Not
> -* doing this means that while logging the inode, writeback
> -* could start and invoke writepage/writepages, which would 
> call
> -* the fill_delalloc callbacks (cow_file_range,
> -* submit_compressed_extents). These callbacks add first an
> -* extent map to the modified list of extents and then create
> -* the respective ordered operation, which means in
> -* tree-log.c:btrfs_log_inode() we might capture all existing
> -* ordered operations (with btrfs_get_logged_extents()) before
> -* the fill_delalloc callback adds its ordered operation, and 
> by
> -* the time we visit the modified list of extent maps (with
> -* btrfs_log_changed_extents()), we see and process the extent
> -* map they created. We then use the extent map to construct a
> -* file extent item for logging without waiting for the
> -* respective ordered operation to finish - this file extent
> -* item points to a disk location that might not have yet been
> -* written to, containing random data - so after a crash a log
> -* replay will make our inode have file extent items that 
> point
> -* to disk locations containing invalid data, as we returned
> -* success to userspace without waiting for the respective
> -* ordered operation to finish, because it wasn't captured by

Re: [PATCH 2/2] btrfs: always wait on ordered extents at fsync time

2018-05-23 Thread Nikolay Borisov


On 23.05.2018 18:38, Josef Bacik wrote:
> It's just removing all of the code that is no longer needed with the
> unconditional wait_ordered_extents, it's not that complicated.

Just because something is painfully obvious to you doesn't mean it's the
same for others, especially given the current complexity of the fsync code.
Even your one-sentence reply that you are removing code which is obsoleted
by the unconditional wait_ordered_extents is more revealing (albeit, frankly,
this needs a lot more explanation of how the code is intertwined, etc.)
than: "Fix this by getting rid of all of the logged extents magic and
simply wait on ordered extents before we start the tree log stuff".

> Thanks,
> 
> Josef
> 
> On Wed, May 23, 2018 at 8:24 AM, David Sterba  wrote:
>> On Tue, May 22, 2018 at 01:47:23PM -0400, Josef Bacik wrote:
>>> From: Josef Bacik 
>>>
>>> There's a priority inversion that exists currently with btrfs fsync.  In
>>> some cases we will collect outstanding ordered extents onto a list and
>>> only wait on them at the very last second.  However this "very last
>>> second" falls inside of a transaction handle, so if we are in a lower
>>> priority cgroup we can end up holding the transaction open for longer
>>> than needed, so if a high priority cgroup is also trying to fsync()
>>> it'll see latency.
>>>
>>> Fix this by getting rid of all of the logged extents magic and simply
>>> wait on ordered extents before we start the tree log stuff.  This code has
>>> changed a lot since I first wrote it and really isn't the performance
>>> win it was originally because of the things we had to do around getting
>>> the right checksums.  Killing all of this makes our lives easier and
>>> gets rid of the priority inversion.
>>>
>>> Signed-off-by: Josef Bacik 
>>> ---
>>>  fs/btrfs/file.c  |  56 ++-
>>>  fs/btrfs/ordered-data.c  | 123 
>>>  fs/btrfs/ordered-data.h  |  20 +-
>>>  fs/btrfs/tree-log.c  | 166 
>>> ---
>>>  include/trace/events/btrfs.h |   1 -
>>>  5 files changed, 19 insertions(+), 347 deletions(-)
>>
>> Seriously. Just by the diffstat, this patch is going to bring more
>> problems than it's supposed to solve. Please split it into reviewable
>> pieces and write less vague changelogs so that other people can also
>> understand and review the actual changes made to the code. Thanks.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: always wait on ordered extents at fsync time

2018-05-23 Thread Josef Bacik
It's just removing all of the code that is no longer needed with the
unconditional wait_ordered_extents, it's not that complicated.
Thanks,

Josef

On Wed, May 23, 2018 at 8:24 AM, David Sterba  wrote:
> On Tue, May 22, 2018 at 01:47:23PM -0400, Josef Bacik wrote:
>> From: Josef Bacik 
>>
>> There's a priority inversion that exists currently with btrfs fsync.  In
>> some cases we will collect outstanding ordered extents onto a list and
>> only wait on them at the very last second.  However this "very last
>> second" falls inside of a transaction handle, so if we are in a lower
>> priority cgroup we can end up holding the transaction open for longer
>> than needed, so if a high priority cgroup is also trying to fsync()
>> it'll see latency.
>>
>> Fix this by getting rid of all of the logged extents magic and simply
>> wait on ordered extents before we start the tree log stuff.  This code has
>> changed a lot since I first wrote it and really isn't the performance
>> win it was originally because of the things we had to do around getting
>> the right checksums.  Killing all of this makes our lives easier and
>> gets rid of the priority inversion.
>>
>> Signed-off-by: Josef Bacik 
>> ---
>>  fs/btrfs/file.c  |  56 ++-
>>  fs/btrfs/ordered-data.c  | 123 
>>  fs/btrfs/ordered-data.h  |  20 +-
>>  fs/btrfs/tree-log.c  | 166 
>> ---
>>  include/trace/events/btrfs.h |   1 -
>>  5 files changed, 19 insertions(+), 347 deletions(-)
>
> Seriously. Just by the diffstat, this patch is going to bring more
> problems than it's supposed to solve. Please split it into reviewable
> pieces and write less vague changelogs so that other people can also
> understand and review the actual changes made to the code. Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/10] mm: pagecache add lock

2018-05-23 Thread Christoph Hellwig
On Sun, May 20, 2018 at 06:45:24PM -0400, Kent Overstreet wrote:
> > 
> > Honestly I think this probably should be in the core.  But IFF we move
> > it to the core the existing users of per-fs locks need to be moved
> > over first.  E.g. XFS as the very first one, and at least ext4 and f2fs
> > that copied the approach, and probably more if you audit deep enough.
> 
> I'm not going to go and redo locking in XFS and ext4 as a prerequisite to
> merging bcachefs. Sorry, but that's a bit crazy.

It isn't crazy at all.  In general we expect people to do their fair
share of core work to get their pet feature in.  How much is acceptable
is a difficult question and not black and white.

But if you want to grow a critical core structure you better take a stab
at converting existing users.  Without that the tradeoff of growing
that core structure is simply not given.

Or to put it in words for this exact feature:  unless your new field
is also used by mainstream file systems it is not going to be added.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v3 3/9] drivers/xen/swiotlb-xen: update usage of zone modifiers

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

Use __GFP_ZONE_MASK to replace (__GFP_DMA | __GFP_HIGHMEM).

In function xen_swiotlb_alloc_coherent, it is obvious that __GFP_DMA32
is not the expected zone type.

___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 have been deleted from the
GFP bitmasks; the bottom three bits of the GFP mask are reserved for
storing an encoded zone number.
__GFP_DMA, __GFP_HIGHMEM and __GFP_DMA32 should no longer be combined
with each other by OR.

Use GFP_NORMAL() to clear the bottom 3 bits of the GFP bitmask.
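
For reference, GFP_NORMAL() as introduced by patch 1/9 of this series
(quoted further down in this digest):

	#define GFP_NORMAL(gfp)	((gfp) & ~__GFP_ZONE_MASK)

so the replacement below keeps whatever the caller passed except the
encoded zone bits, which fall back to ZONE_NORMAL (encoded as zero).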

Signed-off-by: Huaisheng Ye 
Cc: Konrad Rzeszutek Wilk 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Christoph Hellwig 
---
 drivers/xen/swiotlb-xen.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index e1c6089..359 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -301,7 +301,7 @@ int __ref xen_swiotlb_init(int verbose, bool early)
* machine physical layout.  We can't allocate highmem
* because we can't return a pointer to it.
*/
-   flags &= ~(__GFP_DMA | __GFP_HIGHMEM);
+   flags = GFP_NORMAL(flags);
 
/* On ARM this function returns an ioremap'ped virtual address for
 * which virt_to_phys doesn't return the corresponding physical
-- 
1.8.3.1


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v3 2/9] include/linux/dma-mapping: update usage of zone modifiers

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

Use __GFP_ZONE_MASK to replace (__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32).

___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 have been deleted from the
GFP bitmasks; the bottom three bits of the GFP mask are reserved for
storing an encoded zone number.
__GFP_DMA, __GFP_HIGHMEM and __GFP_DMA32 should no longer be combined
with each other by OR.

Use GFP_NORMAL() to clear the bottom 3 bits of the GFP bitmask.

Signed-off-by: Huaisheng Ye 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Robin Murphy 
Cc: Christoph Hellwig 
---
 include/linux/dma-mapping.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index f8ab1c0..8fe524d 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -519,7 +519,7 @@ static inline void *dma_alloc_attrs(struct device *dev, 
size_t size,
return cpu_addr;
 
/* let the implementation decide on the zone to allocate from: */
-   flag &= ~(__GFP_DMA | __GFP_DMA32 | __GFP_HIGHMEM);
+   flag = GFP_NORMAL(flag);
 
	if (!arch_dma_alloc_attrs(&dev, &flag))
return NULL;
-- 
1.8.3.1


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v3 0/9] get rid of GFP_ZONE_TABLE/BAD

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

Changes since v2: [2]
* According to Christoph's suggestion, rebase patches to current
  mainline from v4.16.

* Follow the advice of Matthew, create macros like GFP_NORMAL and
  GFP_NORMAL_UNMOVABLE to clear bottom 3 and 4 bits of GFP bitmask.

* Delete some patches because of kernel updating.

[2]: https://marc.info/?l=linux-mm&m=152691610014027&w=2

Tested on a Lenovo ThinkSystem server.

Initmem setup node 0 [mem 0x1000-0x00043fff]
[0.00] On node 0 totalpages: 4111666
[0.00]   DMA zone: 64 pages used for memmap
[0.00]   DMA zone: 23 pages reserved
[0.00]   DMA zone: 3999 pages, LIFO batch:0
[0.00] mminit::memmap_init Initialising map node 0 zone 0 pfns 1 -> 
4096 
[0.00]   DMA32 zone: 10935 pages used for memmap
[0.00]   DMA32 zone: 699795 pages, LIFO batch:31
[0.00] mminit::memmap_init Initialising map node 0 zone 1 pfns 4096 -> 
1048576
[0.00]   Normal zone: 53248 pages used for memmap
[0.00]   Normal zone: 3407872 pages, LIFO batch:31
[0.00] mminit::memmap_init Initialising map node 0 zone 2 pfns 1048576 
-> 4456448
[0.00] mminit::memmap_init Initialising map node 0 zone 3 pfns 1 -> 
4456448
[0.00] Initmem setup node 1 [mem 0x00238000-0x00277fff]
[0.00] On node 1 totalpages: 4194304
[0.00]   Normal zone: 65536 pages used for memmap
[0.00]   Normal zone: 4194304 pages, LIFO batch:31
[0.00] mminit::memmap_init Initialising map node 1 zone 2 pfns 37224448 
-> 41418752
[0.00] mminit::memmap_init Initialising map node 1 zone 3 pfns 37224448 
-> 41418752
...
[0.00] mminit::zonelist general 0:DMA = 0:DMA
[0.00] mminit::zonelist general 0:DMA32 = 0:DMA32 0:DMA
[0.00] mminit::zonelist general 0:Normal = 0:Normal 0:DMA32 0:DMA 
1:Normal
[0.00] mminit::zonelist thisnode 0:DMA = 0:DMA
[0.00] mminit::zonelist thisnode 0:DMA32 = 0:DMA32 0:DMA
[0.00] mminit::zonelist thisnode 0:Normal = 0:Normal 0:DMA32 0:DMA
[0.00] mminit::zonelist general 1:Normal = 1:Normal 0:Normal 0:DMA32 
0:DMA
[0.00] mminit::zonelist thisnode 1:Normal = 1:Normal
[0.00] Built 2 zonelists, mobility grouping on.  Total pages: 8176164
[0.00] Policy zone: Normal
[0.00] Kernel command line: BOOT_IMAGE=/vmlinuz-4.17.0-rc6-gfp09+ 
root=/dev/mapper/fedora-root ro rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap 
debug 
LANG=en_US.UTF-8 mminit_loglevel=4 console=tty0 console=ttyS0,115200n8 
memblock=debug
earlyprintk=serial,0x3f8,115200

---

Replace GFP_ZONE_TABLE and GFP_ZONE_BAD with encoded zone number.

Delete ___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 from the GFP
bitmasks; the bottom three bits of the GFP mask are reserved for
storing an encoded zone number.

The encoding method is XOR. Take the zone number from enum zone_type,
then encode it against ZONE_NORMAL with an XOR operation.
The goal is to make sure ZONE_NORMAL encodes to zero, so that
compatibility is guaranteed: GFP_KERNEL and GFP_ATOMIC can be used
as before.

Reserve __GFP_MOVABLE in bit 3, so that it can continue to be used as
a flag. Same as before, __GFP_MOVABLE represents the movable migrate
type for ZONE_DMA, ZONE_DMA32, and ZONE_NORMAL. But when it is enabled
together with __GFP_HIGHMEM, ZONE_MOVABLE shall be returned instead of
ZONE_HIGHMEM. __GFP_ZONE_MOVABLE is created to realize that.

With this patch, just enabling __GFP_MOVABLE and __GFP_HIGHMEM is not
enough to get ZONE_MOVABLE from gfp_zone. All callers should use
GFP_HIGHUSER_MOVABLE or __GFP_ZONE_MOVABLE directly to achieve that.

Decode zone number directly from bottom three bits of flags in gfp_zone.
The theory of encoding and decoding is,
A ^ B ^ B = A
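
In code, the decode side is just a mask and an XOR; a simplified sketch
of gfp_zone() under this scheme (the movable/highmem handling is
condensed relative to the actual patch):

	static inline enum zone_type gfp_zone(gfp_t flags)
	{
		enum zone_type z;

		/* Movable requests carry bit 3 plus the encoded zone. */
		if ((flags & __GFP_ZONE_MOVABLE) == __GFP_ZONE_MOVABLE)
			return ZONE_MOVABLE;

		/* A ^ B ^ B = A: XOR with ZONE_NORMAL recovers the zone. */
		z = ((__force unsigned int)flags & ___GFP_ZONE_MASK) ^
			ZONE_NORMAL;
		return z;
	}

Since ZONE_NORMAL ^ ZONE_NORMAL == 0, a GFP mask with no zone bits set
decodes to ZONE_NORMAL, which is what keeps GFP_KERNEL and friends
working unchanged.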

Changes since v1:[1]

* Create __GFP_ZONE_MOVABLE and modify GFP_HIGHUSER_MOVABLE to help
  callers to get ZONE_MOVABLE. Try to create __GFP_ZONE_MASK to mask
  lowest 3 bits of GFP bitmasks.

* Modify some callers' gfp flag to update usage of address zone
  modifiers.

* Modify inline function gfp_zone to get better performance according
  to Matthew's suggestion.

[1]: https://marc.info/?l=linux-mm&m=152596791931266&w=2

---

Huaisheng Ye (9):
  include/linux/gfp.h: get rid of GFP_ZONE_TABLE/BAD
  include/linux/dma-mapping: update usage of zone modifiers
  drivers/xen/swiotlb-xen: update usage of zone modifiers
  fs/btrfs/extent_io: update usage of zone modifiers
  drivers/block/zram/zram_drv: update usage of zone modifiers
  mm/vmpressure: update usage of zone modifiers
  mm/zsmalloc: update usage of zone modifiers
  include/linux/highmem.h: update usage of movableflags
  arch/x86/include/asm/page.h: update usage of movableflags

 arch/x86/include/asm/page.h   |   3 +-
 drivers/block/zram/zram_drv.c |   6 +--
 drivers/xen/swiotlb-xen.c |   2 +-
 fs/btrfs/extent_io.c  |   2 +-
 include/linux/dma-mapping.h   |   2 +-
 include/linux/gfp.h   | 107 

[RFC PATCH v3 4/9] fs/btrfs/extent_io: update usage of zone modifiers

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

Use __GFP_ZONE_MASK to replace (__GFP_DMA32 | __GFP_HIGHMEM).

In function alloc_extent_state, it is obvious that __GFP_DMA is not
the expected zone type.

___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 have been deleted from the
GFP bitmasks; the bottom three bits of the GFP mask are reserved for
storing an encoded zone number.
__GFP_DMA, __GFP_HIGHMEM and __GFP_DMA32 should no longer be combined
with each other by OR.

Use GFP_NORMAL() to clear the bottom 3 bits of the GFP bitmask.

Signed-off-by: Huaisheng Ye 
Cc: Chris Mason 
Cc: Josef Bacik 
Cc: David Sterba 
Cc: Christoph Hellwig 
---
 fs/btrfs/extent_io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e99b329..f41fc61 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -220,7 +220,7 @@ static struct extent_state *alloc_extent_state(gfp_t mask)
 * The given mask might be not appropriate for the slab allocator,
 * drop the unsupported bits
 */
-   mask &= ~(__GFP_DMA32|__GFP_HIGHMEM);
+   mask = GFP_NORMAL(mask);
state = kmem_cache_alloc(extent_state_cache, mask);
if (!state)
return state;
-- 
1.8.3.1


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v3 9/9] arch/x86/include/asm/page.h: update usage of movableflags

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

GFP_HIGHUSER_MOVABLE no longer equals GFP_HIGHUSER | __GFP_MOVABLE;
modify this to adapt to the patch getting rid of GFP_ZONE_TABLE/BAD.

Signed-off-by: Huaisheng Ye 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Kate Stewart 
Cc: Greg Kroah-Hartman 
Cc: x...@kernel.org 
Cc: Philippe Ombredanne 
Cc: Christoph Hellwig 
---
 arch/x86/include/asm/page.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 7555b48..a47f42d 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -35,7 +35,8 @@ static inline void copy_user_page(void *to, void *from, 
unsigned long vaddr,
 }
 
 #define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
-   alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
+   alloc_page_vma((movableflags ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER) \
+   | __GFP_ZERO, vma, vaddr)
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 
 #ifndef __pa
-- 
1.8.3.1


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v3 8/9] include/linux/highmem.h: update usage of movableflags

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

GFP_HIGHUSER_MOVABLE no longer equals GFP_HIGHUSER | __GFP_MOVABLE;
modify this to adapt to the patch getting rid of GFP_ZONE_TABLE/BAD.

Signed-off-by: Huaisheng Ye 
Cc: Kate Stewart 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Philippe Ombredanne 
Cc: Christoph Hellwig 
---
 include/linux/highmem.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 0690679..5383c9e 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -159,8 +159,8 @@ static inline void clear_user_highpage(struct page *page, 
unsigned long vaddr)
struct vm_area_struct *vma,
unsigned long vaddr)
 {
-   struct page *page = alloc_page_vma(GFP_HIGHUSER | movableflags,
-   vma, vaddr);
+   struct page *page = alloc_page_vma(movableflags ?
+   GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER, vma, vaddr);
 
if (page)
clear_user_highpage(page, vaddr);
-- 
1.8.3.1


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v3 1/9] include/linux/gfp.h: get rid of GFP_ZONE_TABLE/BAD

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

Replace GFP_ZONE_TABLE and GFP_ZONE_BAD with encoded zone number.

Delete ___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 from the GFP
bitmasks; the bottom three bits of the GFP mask are reserved for
storing an encoded zone number.

The encoding method is XOR. Take the zone number from enum zone_type,
then encode it against ZONE_NORMAL with an XOR operation.
The goal is to make sure ZONE_NORMAL encodes to zero, so that
compatibility is guaranteed: GFP_KERNEL and GFP_ATOMIC can be used
as before.

Reserve __GFP_MOVABLE in bit 3, so that it can continue to be used as
a flag. Same as before, __GFP_MOVABLE represents the movable migrate
type for ZONE_DMA, ZONE_DMA32, and ZONE_NORMAL. But when it is enabled
together with __GFP_HIGHMEM, ZONE_MOVABLE shall be returned instead of
ZONE_HIGHMEM. __GFP_ZONE_MOVABLE is created to realize that.

With this patch, just enabling __GFP_MOVABLE and __GFP_HIGHMEM is not
enough to get ZONE_MOVABLE from gfp_zone. All subsystems should use
GFP_HIGHUSER_MOVABLE directly to achieve that.

Decode zone number directly from bottom three bits of flags in gfp_zone.
The theory of encoding and decoding is,
A ^ B ^ B = A

Suggested-by: Matthew Wilcox 
Signed-off-by: Huaisheng Ye 
Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Kate Stewart 
Cc: "Levin, Alexander (Sasha Levin)" 
Cc: Greg Kroah-Hartman 
Cc: Christoph Hellwig 
---
 include/linux/gfp.h | 107 ++--
 1 file changed, 20 insertions(+), 87 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 1a4582b..f76ccd76 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -16,9 +16,7 @@
  */
 
 /* Plain integer GFP bitmasks. Do not use this directly. */
-#define ___GFP_DMA 0x01u
-#define ___GFP_HIGHMEM 0x02u
-#define ___GFP_DMA32   0x04u
+#define ___GFP_ZONE_MASK   0x07u
 #define ___GFP_MOVABLE 0x08u
 #define ___GFP_RECLAIMABLE 0x10u
 #define ___GFP_HIGH0x20u
@@ -53,11 +51,15 @@
  * without the underscores and use them consistently. The definitions here may
  * be used in bit comparisons.
  */
-#define __GFP_DMA  ((__force gfp_t)___GFP_DMA)
-#define __GFP_HIGHMEM  ((__force gfp_t)___GFP_HIGHMEM)
-#define __GFP_DMA32((__force gfp_t)___GFP_DMA32)
+#define __GFP_DMA  ((__force gfp_t)OPT_ZONE_DMA ^ ZONE_NORMAL)
+#define __GFP_HIGHMEM  ((__force gfp_t)OPT_ZONE_HIGHMEM ^ ZONE_NORMAL)
+#define __GFP_DMA32((__force gfp_t)OPT_ZONE_DMA32 ^ ZONE_NORMAL)
 #define __GFP_MOVABLE  ((__force gfp_t)___GFP_MOVABLE)  /* ZONE_MOVABLE 
allowed */
-#define GFP_ZONEMASK   (__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE)
+#define GFP_ZONEMASK   ((__force gfp_t)___GFP_ZONE_MASK | ___GFP_MOVABLE)
+/* the bottom 3 bits of the GFP bitmask hold the encoded zone number */
+#define __GFP_ZONE_MASK ((__force gfp_t)___GFP_ZONE_MASK)
+#define __GFP_ZONE_MOVABLE \
+   ((__force gfp_t)(ZONE_MOVABLE ^ ZONE_NORMAL) | ___GFP_MOVABLE)
 
 /*
  * Page mobility and placement hints
@@ -268,6 +270,13 @@
  *   available and will not wake kswapd/kcompactd on failure. The _LIGHT
  *   version does not attempt reclaim/compaction at all and is by default used
  *   in page fault path, while the non-light is used by khugepaged.
+ *
+ * GFP_NORMAL() is used to clear the bottom 3 bits of the GFP bitmask.
+ *   In effect it returns the encoded ZONE_NORMAL bits.
+ *
+ * GFP_NORMAL_UNMOVABLE() is similar to GFP_NORMAL(), but it clears the
+ *   bottom 4 bits of the GFP bitmask. Besides the encoded ZONE_NORMAL
+ *   bits, it clears the MOVABLE flag as well.
  */
 #define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
 #define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
@@ -279,10 +288,12 @@
#define GFP_DMA		__GFP_DMA
 #define GFP_DMA32  __GFP_DMA32
 #define GFP_HIGHUSER   (GFP_USER | __GFP_HIGHMEM)
-#define GFP_HIGHUSER_MOVABLE   (GFP_HIGHUSER | __GFP_MOVABLE)
+#define GFP_HIGHUSER_MOVABLE   (GFP_USER | __GFP_ZONE_MOVABLE)
 #define GFP_TRANSHUGE_LIGHT((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
 __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
 #define GFP_TRANSHUGE  (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
+#define GFP_NORMAL(gfp)	((gfp) & ~__GFP_ZONE_MASK)
+#define GFP_NORMAL_UNMOVABLE(gfp) ((gfp) & ~GFP_ZONEMASK)
 
 /* Convert GFP flags to their corresponding migrate type */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
@@ -326,87 +337,9 @@ static inline bool gfpflags_allow_blocking(const gfp_t 
gfp_flags)
 #define OPT_ZONE_DMA32 ZONE_NORMAL
 #endif
 
-/*
- * GFP_ZONE_TABLE is a word size bitstring that is used for looking up the
- * zone to use given the lowest 4 bits of 

[RFC PATCH v3 7/9] mm/zsmalloc: update usage of zone modifiers

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

Use __GFP_ZONE_MOVABLE to replace (__GFP_HIGHMEM | __GFP_MOVABLE).

___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 have been deleted from the
GFP bitmasks; the bottom three bits of the GFP mask are reserved for
storing an encoded zone number.

__GFP_ZONE_MOVABLE contains the encoded ZONE_MOVABLE and the
__GFP_MOVABLE flag.

With GFP_ZONE_TABLE, __GFP_HIGHMEM ORed with __GFP_MOVABLE means
gfp_zone should return ZONE_MOVABLE. In order to keep that compatible
with GFP_ZONE_TABLE, use GFP_NORMAL_UNMOVABLE() to clear the bottom 4
bits of the GFP bitmask.
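
For reference, the 4-bit variant as introduced by patch 1/9 of this
series:

	#define GFP_NORMAL_UNMOVABLE(gfp) ((gfp) & ~GFP_ZONEMASK)

which drops the movable bit along with the encoded zone bits, matching
the old behaviour of masking out (__GFP_HIGHMEM | __GFP_MOVABLE).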

Signed-off-by: Huaisheng Ye 
Cc: Minchan Kim 
Cc: Nitin Gupta 
Cc: Sergey Senozhatsky 
Cc: Christoph Hellwig 
---
 mm/zsmalloc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 61cb05d..e250c69 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -345,7 +345,7 @@ static void destroy_cache(struct zs_pool *pool)
 static unsigned long cache_alloc_handle(struct zs_pool *pool, gfp_t gfp)
 {
return (unsigned long)kmem_cache_alloc(pool->handle_cachep,
-   gfp & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
+   GFP_NORMAL_UNMOVABLE(gfp));
 }
 
 static void cache_free_handle(struct zs_pool *pool, unsigned long handle)
@@ -356,7 +356,7 @@ static void cache_free_handle(struct zs_pool *pool, 
unsigned long handle)
 static struct zspage *cache_alloc_zspage(struct zs_pool *pool, gfp_t flags)
 {
return kmem_cache_alloc(pool->zspage_cachep,
-   flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
+   GFP_NORMAL_UNMOVABLE(flags));
 }
 
 static void cache_free_zspage(struct zs_pool *pool, struct zspage *zspage)
-- 
1.8.3.1


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v3 5/9] drivers/block/zram/zram_drv: update usage of zone modifiers

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

Use __GFP_ZONE_MOVABLE to replace (__GFP_HIGHMEM | __GFP_MOVABLE).

___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 have been deleted from the
GFP bitmasks; the bottom three bits of the GFP mask are reserved for
storing an encoded zone number.

__GFP_ZONE_MOVABLE contains the encoded ZONE_MOVABLE and the
__GFP_MOVABLE flag.

With GFP_ZONE_TABLE, __GFP_HIGHMEM ORed with __GFP_MOVABLE means
gfp_zone should return ZONE_MOVABLE. In order to keep that compatible
with GFP_ZONE_TABLE, replace (__GFP_HIGHMEM | __GFP_MOVABLE) with
__GFP_ZONE_MOVABLE.
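
For reference, __GFP_ZONE_MOVABLE as defined by patch 1/9 of this
series:

	#define __GFP_ZONE_MOVABLE \
		((__force gfp_t)(ZONE_MOVABLE ^ ZONE_NORMAL) | ___GFP_MOVABLE)

i.e. the encoded ZONE_MOVABLE number plus the movable flag bit, so a
single macro now expresses what the two ORed flags used to.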

Signed-off-by: Huaisheng Ye 
Cc: Minchan Kim 
Cc: Nitin Gupta 
Cc: Sergey Senozhatsky 
Cc: Christoph Hellwig 
---
 drivers/block/zram/zram_drv.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 0f3fadd..1bb5ca8 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1004,14 +1004,12 @@ static int __zram_bvec_write(struct zram *zram, struct 
bio_vec *bvec,
handle = zs_malloc(zram->mem_pool, comp_len,
__GFP_KSWAPD_RECLAIM |
__GFP_NOWARN |
-   __GFP_HIGHMEM |
-   __GFP_MOVABLE);
+   __GFP_ZONE_MOVABLE);
if (!handle) {
zcomp_stream_put(zram->comp);
atomic64_inc(>stats.writestall);
handle = zs_malloc(zram->mem_pool, comp_len,
-   GFP_NOIO | __GFP_HIGHMEM |
-   __GFP_MOVABLE);
+   GFP_NOIO | __GFP_ZONE_MOVABLE);
if (handle)
goto compress_again;
return -ENOMEM;
-- 
1.8.3.1


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v3 6/9] mm/vmpressure: update usage of zone modifiers

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

Use __GFP_ZONE_MOVABLE to replace (__GFP_HIGHMEM | __GFP_MOVABLE).

___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 have been deleted from the
GFP bitmasks; the bottom three bits of the GFP mask are reserved for
storing an encoded zone number.

__GFP_ZONE_MOVABLE contains the encoded ZONE_MOVABLE and the
__GFP_MOVABLE flag.

With GFP_ZONE_TABLE, __GFP_HIGHMEM ORed with __GFP_MOVABLE means
gfp_zone should return ZONE_MOVABLE. In order to keep that compatible
with GFP_ZONE_TABLE, replace (__GFP_HIGHMEM | __GFP_MOVABLE) with
__GFP_ZONE_MOVABLE.

Signed-off-by: Huaisheng Ye 
Cc: Andrew Morton 
Cc: zhongjiang 
Cc: Minchan Kim 
Cc: Dan Carpenter 
Cc: David Rientjes 
Cc: Christoph Hellwig 
---
 mm/vmpressure.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 85350ce..30a40e2 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -256,7 +256,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool 
tree,
 * Indirect reclaim (kswapd) sets sc->gfp_mask to GFP_KERNEL, so
 * we account it too.
 */
-   if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
+   if (!(gfp & (__GFP_ZONE_MOVABLE | __GFP_IO | __GFP_FS)))
return;
 
/*
-- 
1.8.3.1


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/6] btrfs_search_slot cleanups

2018-05-23 Thread Liu Bo
On Wed, May 23, 2018 at 02:34:40PM +0200, David Sterba wrote:
> On Wed, May 23, 2018 at 10:16:55AM +0800, Su Yue wrote:
> > >>> [   47.692084] kernel BUG at fs/btrfs/locking.c:286!
> > >>
> > >> I saw the crash too but did not investigate the root cause. So I'll
> > >> remove the branch from for-next until it's fixed. Thanks for the report.
> > > 
> > > I think the problem stems from Qu's patch, which sets search_commit_root
> > > =1 but doesn't set skip_locking, as a result we don't lock the tree when
> > > we obtain a reference to the root node, yet later when traversing the
> > > tree due to skip_locking not being set we try to lock it, and this
> > > causes btrfs_assert_tree_locked to trigger. Can you test whether the
> > > following diff solves the issues:
> > > 
> > > 
> > > diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> > > index bc19a7d11c98..23fadb640c59 100644
> > > --- a/fs/btrfs/qgroup.c
> > > +++ b/fs/btrfs/qgroup.c
> > > @@ -2702,6 +2702,7 @@ static void btrfs_qgroup_rescan_worker(struct
> > > btrfs_work *work)
> > >  * should be recorded by qgroup
> > >  */
> > > path->search_commit_root = 1;
> > > +   path->skip_locking = 1;
> > > 
> > > err = 0;
> > > while (!err && !btrfs_fs_closing(fs_info)) {
> > > 
> > > 
> > > If it does, this only means we need to make skip_locking = 1
> > > conditional on search_commit_root being set, and this situation should be
> > > handled in btrfs_search_slot.
> > 
> > After patching the change, btrfs/139 passes without BUG_ON.
> 
> Confirmed, I've added the fixup to for-next, Liu Bo's branch is there too
> and the crash did not show up.

Thanks a lot for the testing and fixes!

thanks,
-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/6] btrfs_search_slot cleanups

2018-05-23 Thread David Sterba
On Wed, May 23, 2018 at 10:16:55AM +0800, Su Yue wrote:
> >>> [   47.692084] kernel BUG at fs/btrfs/locking.c:286!
> >>
> >> I saw the crash too but did not investigate the root cause. So I'll
> >> remove the branch from for-next until it's fixed. Thanks for the report.
> > 
> > I think the problem stems from Qu's patch, which sets search_commit_root
> > =1 but doesn't set skip_locking, as a result we don't lock the tree when
> > we obtain a reference to the root node, yet later when traversing the
> > tree due to skip_locking not being set we try to lock it, and this
> > causes btrfs_assert_tree_locked to trigger. Can you test whether the
> > following diff solves the issues:
> > 
> > 
> > diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> > index bc19a7d11c98..23fadb640c59 100644
> > --- a/fs/btrfs/qgroup.c
> > +++ b/fs/btrfs/qgroup.c
> > @@ -2702,6 +2702,7 @@ static void btrfs_qgroup_rescan_worker(struct
> > btrfs_work *work)
> >  * should be recorded by qgroup
> >  */
> > path->search_commit_root = 1;
> > +   path->skip_locking = 1;
> > 
> > err = 0;
> > while (!err && !btrfs_fs_closing(fs_info)) {
> > 
> > 
> > If it does, this only means we need to make skip_locking = 1
> > conditional on search_commit_root being set, and this situation should be
> > handled in btrfs_search_slot.
> 
> After patching the change, btrfs/139 passes without BUG_ON.

Confirmed, I've added the fixup to for-next, Liu Bo's branch is there too
and the crash did not show up.
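
Nikolay's suggestion of making skip_locking implied by
search_commit_root would amount to something like this near the top of
btrfs_search_slot(); a sketch only, not the final fixup:

	/*
	 * Commit roots are never taken with locks held, so a commit-root
	 * search should not try to lock nodes while walking down.
	 */
	if (p->search_commit_root)
		p->skip_locking = 1;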
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: always wait on ordered extents at fsync time

2018-05-23 Thread David Sterba
On Tue, May 22, 2018 at 01:47:23PM -0400, Josef Bacik wrote:
> From: Josef Bacik 
> 
> There's a priority inversion that exists currently with btrfs fsync.  In
> some cases we will collect outstanding ordered extents onto a list and
> only wait on them at the very last second.  However this "very last
> second" falls inside of a transaction handle, so if we are in a lower
> priority cgroup we can end up holding the transaction open for longer
> than needed, so if a high priority cgroup is also trying to fsync()
> it'll see latency.
> 
> Fix this by getting rid of all of the logged extents magic and simply
> wait on ordered extents before we start the tree log stuff.  This code has
> changed a lot since I first wrote it and really isn't the performance
> win it was originally because of the things we had to do around getting
> the right checksums.  Killing all of this makes our lives easier and
> gets rid of the priority inversion.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/file.c  |  56 ++-
>  fs/btrfs/ordered-data.c  | 123 
>  fs/btrfs/ordered-data.h  |  20 +-
>  fs/btrfs/tree-log.c  | 166 
> ---
>  include/trace/events/btrfs.h |   1 -
>  5 files changed, 19 insertions(+), 347 deletions(-)

Seriously. Just by the diffstat, this patch is going to bring more
problems than it's supposed to solve. Please split it into reviewable
pieces and write less vague changelogs so that other people can also
understand and review the actual changes made to the code. Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: csum failed root revealed during balance

2018-05-23 Thread Nikolay Borisov


On 23.05.2018 11:03, ein wrote:
> On 05/23/2018 08:32 AM, Nikolay Borisov wrote:
> 
> Nikolay, thank you for the answer.
> 
>>> [...]
>>> root@node0:~# dmesg | grep BTRFS | grep warn
>>> 185:980:[2927472.393557] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 312 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>> 186:981:[2927472.394158] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 312 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
>>> 191:986:[2928224.169814] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 314 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>> 192:987:[2928224.171433] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 314 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
>>> 206:1001:[2928298.039516] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>> 207:1002:[2928298.043103] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>> 208:1004:[2932213.513424] BTRFS warning (device dm-0): csum failed root
>>> 5 ino 219962 off 4564959232 csum 0xc616afb4 expected csum 0x5425e489
>>> mirror 1
>>> 209:1005:[2932235.666368] BTRFS warning (device dm-0): csum failed root
>>> 5 ino 219962 off 16989835264 csum 0xd63ed5da expected csum 0x7429caa1
>>> mirror 1
>>> 210:1072:[2936767.229277] BTRFS warning (device dm-0): csum failed root
>>> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
>>> mirror 1
>>> 211:1073:[2936767.276229] BTRFS warning (device dm-0): csum failed root
>>> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
>>> mirror 1
>>>
>>> Above has been revealed during below command and quite high IO usage by
>>> few VMs (Linux on top Ext4 with firebird database, lots of random
>>> read/writes, two others with Windows 2016 and Windows Update in the
>>> background):
>>
>> I believe you are hitting the issue described here:
>>
>> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg25656.html
> 
> It makes sense: fsck.ext4, gbak (the Firebird integrity checking tool),
> chkdsk and sfc /scannow don't show any errors internally within the VMs. As
> far as I can tell, the data inside the VMs is not corrupted.
> 
>> Essentially the way qemu operates on vm images atop btrfs is prone to
>> producing such errors. As a matter of fact, other filesystems also
>> suffer from this (i.e. pages modified while being written; however, due to
>> lack of CRCs on the data they don't detect it). Can you confirm that
>> those inodes (312/314/319/219962/219915) belong to vm images files?
> 
> root@node0:/var/lib/libvirt# find  ./ -inum 312
> root@node0:/var/lib/libvirt# find  ./ -inum 314
> root@node0:/var/lib/libvirt# find  ./ -inum 319
> root@node0:/var/lib/libvirt# find  ./ -inum 219962
> ./images/rds.raw
> root@node0:/var/lib/libvirt# find  ./ -inum 219915
> ./images/database.raw
> 
> It seems so (219962, 219915):
> - rds.raw - Windows 2016 server, Remote Desktop Server, raw preallocated
> image, NTFS
> database.raw - Linux 3.8, Firebird DB server, raw preallocated image, Ext4
> 
>> IMHO the best course of action would be to disable checksumming for your
>> vm files.
>>
> 
> Do you mean the '-o nodatasum' mount flag? Is it possible to disable
> checksumming for a single file by setting some magical chattr? Google
> thinks it's not possible to disable csums for a single file.

You can't disable checksumming for a single file. However, what you
could do is set the "No CoW" flag via chattr +C /path/to/file, since it
also disables checksumming. Bear in mind you can't set this flag on a
file which already has allocated blocks. So you'd have to create an
empty file, set the +C flag and then copy the data with dd for example.

On a different note - for database workloads and generally random
workloads it makes no sense to have CoW, since you'd see very spiky IO
performance.
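
A minimal C sketch of that recipe (empty file, set No-CoW, then fill),
using the FS_IOC_GETFLAGS/FS_IOC_SETFLAGS ioctls that chattr itself
uses; error handling trimmed, and the helper name is made up for
illustration:

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>

	/* Create an empty file with No-CoW set before any blocks exist. */
	static int create_nocow_file(const char *path)
	{
		int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
		int attr = 0;

		if (fd < 0)
			return -1;
		if (ioctl(fd, FS_IOC_GETFLAGS, &attr) == 0) {
			attr |= FS_NOCOW_FL;	/* the bit chattr +C sets */
			ioctl(fd, FS_IOC_SETFLAGS, &attr);
		}
		return fd;	/* now copy the old image in, e.g. with dd */
	}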

> 
>> For some background I suggest you read the following LWN articles:
>>
>> https://lwn.net/Articles/486311/
>> https://lwn.net/Articles/442355/
>>
>>>
>>> when I changed BTRFS compress parameters. Or during umount (can't recall
>>> now):
>>>
>>> May  2 07:15:39 node0 kernel: [1168145.677431] WARNING: CPU: 6 PID: 3763
>>> at /build/linux-8B5M4n/linux-4.15.11/fs/direct-io.c:293
>>> dio_complete+0x1d6/0x220
>>> May  2 07:15:39 node0 kernel: [1168145.678811] Modules linked in: fuse
>>> ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs vhost_net vhost
>>> tap tun ebtable_filter ebtables ip6tab
>>> le_filter ip6_tables iptable_filter binfmt_misc bridge 8021q garp mrp
>>> stp llc snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal
>>> intel_powerclamp coretemp snd_hda_codec_realtek kvm
>>> _intel snd_hda_codec_generic kvm i915 irqbypass crct10dif_pclmul
>>> snd_hda_intel crc32_pclmul ghash_clmulni_intel snd_hda_codec
>>> intel_cstate snd_hda_core iTCO_wdt 

Re: csum failed root revealed during balance

2018-05-23 Thread Austin S. Hemmelgarn

On 2018-05-23 06:09, ein wrote:

On 05/23/2018 11:09 AM, Duncan wrote:

ein posted on Wed, 23 May 2018 10:03:52 +0200 as excerpted:


IMHO the best course of action would be to disable checksumming for your
vm files.


Do you mean the '-o nodatasum' mount flag? Is it possible to disable
checksumming for a single file by setting some magical chattr? Google
thinks it's not possible to disable csums for a single file.


You can use nocow (-C), but of course that has other restrictions (like
setting it on the files when they're zero-length, easiest done for
existing data by setting it on the containing dir and copying files (no
reflink) in) as well as the nocow effects, and nocow becomes cow1 after a
snapshot (which locks the existing copy in place so changes written to a
block /must/ be written elsewhere, thus the cow1, aka cow the first time
written after the snapshot but retain the nocow for repeated writes
between snapshots).

But if you're disabling checksumming anyway, nocow's likely the way to go.


Disabling only checksumming may be a way to go - we live without it
every day. But nocow on VM files defeats the whole purpose of using BTRFS
for me, even with the huge performance penalty - backup reasons - I mean a
few snapshots (20-30), plus send & receive.

Setting NOCOW on a file doesn't prevent it from being snapshotted, it 
just prevents COW operations from happening under most normal 
circumstances.  In essence, it prevents COW from happening except for 
writing right after the snapshot.  More specifically, the first write to 
a given block in a file set for NOCOW after taking a snapshot will 
trigger a _single_ COW operation for _only_ that block (unless you have 
autodefrag enabled too), after which that block will revert to not doing 
COW operations on write.  This way, you still get consistent and working 
snapshots, but you also don't take the performance hits from COW except 
right after taking a snapshot.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Strange behavior (possible bugs) in btrfs

2018-05-23 Thread Filipe Manana
On Mon, Apr 30, 2018 at 5:04 PM, Vijay Chidambaram  wrote:
> Hi,
>
> We found two more cases where the btrfs behavior is a little strange.
> In one case, an fsync-ed file goes missing after a crash. In the
> other, a renamed file shows up in both directories after a crash.
>
> Workload 1:
>
> mkdir A
> mkdir B
> mkdir A/C
> creat B/foo
> fsync B/foo
> link B/foo A/C/foo
> fsync A
> -- crash --
>
> Expected state after recovery:
> B B/foo A A/C exist

Why don't you expect A/C/foo as well? I would expect it to be persisted.
With xfs we don't get A/C/foo persisted, but it's persisted with ext4 and f2fs.

Adding xfs folks in cc to confirm the expected behaviour.

>
> What we find:
> Only B B/foo exist
>
> A is lost even after explicit fsync to A.
>
> Workload 2:
>
> mkdir A
> mkdir A/C
> rename A/C B
> touch B/bar
> fsync B/bar
> rename B/bar A/bar
> rename A B (replacing B with A at this point)
> fsync B/bar
> -- crash --
>
> Expected contents after recovery:
> A/bar
>
> What we find after recovery:
> A/bar
> B/bar
>
> We think this breaks rename's atomicity guarantee. bar should be
> present in either A or B, but now it is present in both.
>
> Thanks,
> Vijay



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH 0/2] Btrfs: fix partly checksummed file races

2018-05-23 Thread David Sterba
On Tue, May 22, 2018 at 03:02:11PM -0700, Omar Sandoval wrote:
> Based on kdave/for-next. Note that there's a Fixes: tag in there
> referencing a commit in the for-next branch, so that would have to be
> updated if the commit gets rebased. These patches are also available at
> https://github.com/osandov/linux/tree/btrfs-nodatasum-race.

If the original patch is not in any released or frozen branch, then the
fix should be folded into the original patch. The for-next branch is for
preview, testing and catching bugs that slip past the review. And gets
reassembled frequently so referencing a patch from there does not make
sense.

Sending the fixups as patches is ok, replies to the original thread
might get lost in the noise.


Re: csum failed root raveled during balance

2018-05-23 Thread ein
On 05/23/2018 11:09 AM, Duncan wrote:
> ein posted on Wed, 23 May 2018 10:03:52 +0200 as excerpted:
> 
>>> IMHO the best course of action would be to disable checksumming for your
>>> vm files.
>>
>> Do you mean '-o nodatasum' mount flag? Is it possible to disable
>> checksumming for a single file by setting some magical chattr? Google
>> thinks it's not possible to disable csums for a single file.
> 
> You can use nocow (-C), but of course that has other restrictions (like 
> setting it on the files when they're zero-length, easiest done for 
> existing data by setting it on the containing dir and copying files (no 
> reflink) in) as well as the nocow effects, and nocow becomes cow1 after a 
> snapshot (which locks the existing copy in place so changes written to a 
> block /must/ be written elsewhere, thus the cow1, aka cow the first time 
> written after the snapshot but retain the nocow for repeated writes 
> between snapshots).
> 
> But if you're disabling checksumming anyway, nocow's likely the way to go.

Disabling checksumming only may be a way to go - we live without it
every day. But nocow @ VM files defeats the whole purpose of using BTRFS
for me, even with a huge performance penalty - backup reasons - I mean a
few snapshots (20-30), send & receive.


-- 
PGP Public Key (RSA/4096b):
ID: 0xF2C6EA10
SHA-1: 51DA 40EE 832A 0572 5AD8 B3C0 7AFF 69E1 F2C6 EA10


Re: [PATCH] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io

2018-05-23 Thread ethanlien

David Sterba wrote on 2018-05-23 00:28:
> On Fri, Apr 27, 2018 at 03:05:24PM +0800, Ethan Lien wrote:
>> We should balance dirty metadata pages at the end of
>> btrfs_finish_ordered_io, since a small, unmergeable random write can
>> potentially produce dirty metadata which is multiple times larger than
>> the data itself. For example, a small, unmergeable 4KiB write may
>> produce:
>>
>> 16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
>> 16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
>> 16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree
>>
>> Although we do balance dirty pages on the write side, in the buffered
>> write path most metadata is dirtied only after we reach the dirty
>> background limit (which so far only counts dirty data pages) and wake
>> up the flusher thread. If there are many small, unmergeable random
>> writes spread over a large btree, we'll find a burst of dirty pages
>> exceeding the dirty_bytes limit after we wake up the flusher thread -
>> which is not what we expect. On our machine, it caused an
>> out-of-memory problem, since a page cannot be dropped if it is marked
>> dirty.
>>
>> Someone may worry that we may sleep in btrfs_btree_balance_dirty(),
>> but since we do btrfs_finish_ordered_io in a separate worker, it will
>> not stop the flusher from consuming dirty pages. Also, we use a
>> different worker for metadata writeback endio, so sleeping in
>> btrfs_finish_ordered_io helps us throttle the size of dirty metadata
>> pages.
>
> The described scenario sounds interesting. Do you have some scripted
> steps to reproduce it?

It needs some time to reproduce the problem. In our case:
1. Create 4 subvolumes.
2. Run fio on each subvolume:

[global]
direct=0
rw=randwrite
ioengine=libaio
bs=4k
iodepth=16
numjobs=1
group_reporting
size=128G
runtime=1800
norandommap
time_based
randrepeat=0

3. Take a snapshot on each subvolume and repeat fio on the existing files.
4. Repeat steps 2 & 3 until we get really big btrees.
In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of
metadata in each subvolume tree and 12GiB of metadata in the extent tree.
5. Stop all fio, take a snapshot again, and wait until all delayed work
is completed.
6. Start all fio. A few seconds later we hit OOM when the flusher starts
to work.

It can be reproduced even when using nocow writes.

> btrfs_btree_balance_dirty detects congestion and can skip the balancing
> eventually, so the case when it actually does something and waits is the
> point where things can go bad. From the last paragraph, it is clear that
> you have considered that, that's good.
>
> Have you also considered calling __btrfs_btree_balance_dirty with
> flush_delayed=0 ? This would avoid the waiting and I'm not sure if it's
> really required here to get out of the situation.

Yes, btrfs_btree_balance_dirty_nodelay seems to be a better choice, I'll
add it to the v2 patch, thanks.

> Anyway, I'll add the patch to for-next for more testing.
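
For reference, the only difference between the two helpers discussed
above (btrfs_btree_balance_dirty vs. btrfs_btree_balance_dirty_nodelay)
is the flush_delayed argument. A simplified sketch of the pair, modeled
on fs/btrfs/disk-io.c around this kernel version (not the verbatim
source):

static void __btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info,
					int flush_delayed)
{
	if (flush_delayed)
		btrfs_balance_delayed_items(fs_info);	/* may block */

	/* ... then throttle dirty metadata pages via
	 * balance_dirty_pages_ratelimited() once they exceed the
	 * threshold ... */
}

void btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info)
{
	__btrfs_btree_balance_dirty(fs_info, 1);
}

void btrfs_btree_balance_dirty_nodelay(struct btrfs_fs_info *fs_info)
{
	__btrfs_btree_balance_dirty(fs_info, 0);
}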




Re: csum failed root raveled during balance

2018-05-23 Thread Duncan
ein posted on Wed, 23 May 2018 10:03:52 +0200 as excerpted:

>> IMHO the best course of action would be to disable checksumming for your
>> vm files.
>> 
>> 
> Do you mean '-o nodatasum' mount flag? Is it possible to disable
> checksumming for a single file by setting some magical chattr? Google
> thinks it's not possible to disable csums for a single file.

You can use nocow (-C), but of course that has other restrictions (like 
setting it on the files when they're zero-length, easiest done for 
existing data by setting it on the containing dir and copying files (no 
reflink) in) as well as the nocow effects, and nocow becomes cow1 after a 
snapshot (which locks the existing copy in place so changes written to a 
block /must/ be written elsewhere, thus the cow1, aka cow the first time 
written after the snapshot but retain the nocow for repeated writes 
between snapshots).

But if you're disabling checksumming anyway, nocow's likely the way to go.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



[PATCH v3 4/4] btrfs: lzo: Harden inline lzo compressed extent decompression

2018-05-23 Thread Qu Wenruo
For an inlined extent, we only have one segment, thus there is less to
check.
Furthermore, an inlined extent always has its csum in the leaf header,
so it's less likely to carry corrupted data.

Anyway, still check the header and the segment header.
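
A worked example of the invariant (the numbers are made up for
illustration): an inline extent whose single segment carries 100 bytes
of compressed payload is laid out as

	header         = LZO_LEN + LZO_LEN + 100 = 108	(includes both headers)
	segment header = 100				(payload only)

so a sane inline extent must satisfy header == srclen and
segment header == srclen - LZO_LEN * 2, which is exactly what the two
new checks below enforce.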

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/lzo.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
index 4f4de460b08d..8604f6bc0a29 100644
--- a/fs/btrfs/lzo.c
+++ b/fs/btrfs/lzo.c
@@ -437,15 +437,24 @@ static int lzo_decompress(struct list_head *ws, unsigned char *data_in,
struct workspace *workspace = list_entry(ws, struct workspace, list);
size_t in_len;
size_t out_len;
+   size_t max_segment_len = lzo1x_worst_compress(PAGE_SIZE);
int ret = 0;
char *kaddr;
unsigned long bytes;
 
-   BUG_ON(srclen < LZO_LEN);
+   if (srclen < LZO_LEN || srclen > max_segment_len + LZO_LEN * 2)
+   return -EUCLEAN;
 
+   in_len = read_compress_length(data_in);
+   if (in_len != srclen)
+   return -EUCLEAN;
data_in += LZO_LEN;
 
in_len = read_compress_length(data_in);
+   if (in_len != srclen - LZO_LEN * 2) {
+   ret = -EUCLEAN;
+   goto out;
+   }
data_in += LZO_LEN;
 
out_len = PAGE_SIZE;
-- 
2.17.0



[PATCH v3 0/4] btrfs: lzo: Harden decompression callers to avoid kernel memory corruption.

2018-05-23 Thread Qu Wenruo
This patchset can be fetched from github:
https://github.com/adam900710/linux/tree/lzo_corruption
Which is based on v4.17-rc5.

James Harvey reported pretty strange kernel misbehavior where, after
reading certain btrfs compressed data, the kernel crashes with an
unrelated calltrace.
(https://bugzilla.kernel.org/show_bug.cgi?id=199707 and
 https://www.spinics.net/lists/linux-btrfs/msg77971.html)

Thanks to his comprehensive debugging help, we traced the problem to:

1) Bad lzo decompression verification check
   The direct cause.
   The lzo decompression code involves a memory copy for compressed data
   that crosses a page boundary.
   If a corrupted segment header contains an invalid length, btrfs will
   do the memory copy and corrupt kernel memory.

   Fix it by adding a header length check for each segment and, just in
   case, also an uncompressed data length check. (Patch 3)
   The full explanation and kasan output can also be found in that patch.

2) Compressed extent created for NODATASUM inode
   The root cause.
   Will be addressed in another patchset.

Other patches are mostly cosmetic: adding an extra include to
compression.h to avoid a later compile error (Patch 1), adding a comment
on how btrfs organises its compressed data (Patch 2), and adding an
extra check for inlined compressed data even though corruption there is
not really expected (Patch 4).

Changelog:
v2:
  Cosmetic updates. Mostly for comments and grammar.
  Comment and grammar fixes suggested by Nikolay.
  One 4 bytes <-> LE32 conversion fix in a comment.
  Added reviewed-by tag for the 1st patch.

v3:
  Fix comment error for inlined lzo compressed extent. (Still has
  header), thanks David for pointing this out.
  Add example ascii graph as an example.
  Enhance inlined extent check, as header length must match with segment
  header length + LZO_LEN * 2.

Qu Wenruo (4):
  btrfs: compression: Add linux/sizes.h for compression.h
  btrfs: lzo: Add comment about the how btrfs records its lzo compressed
data
  btrfs: lzo: Add header length check to avoid slab out of bounds access
  btrfs: lzo: Harden inline lzo compressed extent decompression

 fs/btrfs/compression.h |  1 +
 fs/btrfs/lzo.c | 81 --
 2 files changed, 80 insertions(+), 2 deletions(-)

-- 
2.17.0



Re: [PATCH] btrfs: delayed-ref: simplify btrfs_add_delayed_tree_ref()

2018-05-23 Thread Nikolay Borisov


On 23.05.2018 11:06, Su Yue wrote:
> Commit 5a5003df98d5 ("btrfs: delayed-ref: double free in
> btrfs_add_delayed_tree_ref()") fixed a double free problem by creating
> an unnecessary label to jump to.
> The elegant way is just to change "ref" to "head_ref" and keep
> btrfs_add_delayed_tree_ref() and btrfs_add_delayed_data_ref() in
> similar structure.

I agree, personally I'm a fan of multiple returns rather than jump
labels, because at this point you know the function terminates and
that's it.

> 
> This patch reverts commit 5a5003df98d5 ("btrfs: delayed-ref: double
> free in btrfs_add_delayed_tree_ref()") and frees the right head_ref.
> No functional change.
> 
> Signed-off-by: Su Yue 

Reviewed-by: Nikolay Borisov 

> ---
> This patch is based on for-next to avoid conflicts with patches
> already in for-next.
> 
>  fs/btrfs/delayed-ref.c | 21 ++---
>  1 file changed, 10 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 03dec673d12a..38f8d5d549ed 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -741,14 +741,20 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info 
> *fs_info,
>   ref->level = level;
>  
>   head_ref = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS);
> - if (!head_ref)
> - goto free_ref;
> + if (!head_ref) {
> + kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
> + return -ENOMEM;
> + }
>  
>   if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
>   is_fstree(ref_root)) {
>   record = kmalloc(sizeof(*record), GFP_NOFS);
> - if (!record)
> - goto free_head_ref;
> + if (!record) {
> + kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
> + kmem_cache_free(btrfs_delayed_ref_head_cachep,
> + head_ref);
> + return -ENOMEM;
> + }
>   }
>  
>   init_delayed_ref_head(head_ref, record, bytenr, num_bytes,
> @@ -779,13 +785,6 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info 
> *fs_info,
>   btrfs_qgroup_trace_extent_post(fs_info, record);
>  
>   return 0;
> -
> -free_head_ref:
> - kmem_cache_free(btrfs_delayed_ref_head_cachep, head_ref);
> -free_ref:
> - kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
> -
> - return -ENOMEM;
>  }
>  
>  /*
> 


[PATCH v3 1/4] btrfs: compression: Add linux/sizes.h for compression.h

2018-05-23 Thread Qu Wenruo
compression.h uses the SZ_* macros, so if a user includes compression.h
without linux/sizes.h, it will cause a compile error.

One example is lzo.c: if it used BTRFS_MAX_COMPRESSED, it would fail to
compile.

Fix it by including linux/sizes.h in compression.h.

Signed-off-by: Qu Wenruo 
Reviewed-by: Nikolay Borisov 
---
 fs/btrfs/compression.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index cc605f7b23fb..317703d9b073 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -6,6 +6,7 @@
 #ifndef BTRFS_COMPRESSION_H
 #define BTRFS_COMPRESSION_H
 
+#include <linux/sizes.h>
 /*
  * We want to make sure that amount of RAM required to uncompress an extent is
  * reasonable, so we limit the total size in ram of a compressed extent to
-- 
2.17.0



[PATCH v3 3/4] btrfs: lzo: Add header length check to avoid slab out of bounds access

2018-05-23 Thread Qu Wenruo
James Harvey reported that some corrupted compressed extent data can
lead to various kernel memory corruption.

Such corrupted extent data belongs to an inode with the NODATASUM flag,
thus data csum won't help us detect such bugs.

If lucky enough, kasan could catch it like:
==
BUG: KASAN: slab-out-of-bounds in lzo_decompress_bio+0x384/0x7a0 [btrfs]
Write of size 4096 at addr 8800606cb0f8 by task kworker/u16:0/2338

CPU: 3 PID: 2338 Comm: kworker/u16:0 Tainted: G   O  
4.17.0-rc5-custom+ #50
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
Call Trace:
 dump_stack+0xc2/0x16b
 print_address_description+0x6a/0x270
 kasan_report+0x260/0x380
 memcpy+0x34/0x50
 lzo_decompress_bio+0x384/0x7a0 [btrfs]
 end_compressed_bio_read+0x99f/0x10b0 [btrfs]
 bio_endio+0x32e/0x640
 normal_work_helper+0x15a/0xea0 [btrfs]
 process_one_work+0x7e3/0x1470
 worker_thread+0x1b0/0x1170
 kthread+0x2db/0x390
 ret_from_fork+0x22/0x40
...
==

The offending compressed data has the following info:

Header:             length 32768        (looks completely valid)
Segment 0 Header:   length 3472882419   (obviously out of bounds)

Then when handling segment 0, since it extends beyond the current page,
we need to copy the compressed data into the workspace; such a huge
length triggers an out-of-bounds memory access, corrupting kernel
memory.

Fix it by adding extra checks on the header and segment headers to
ensure we won't access out of bounds, and also check that the
decompressed data won't exceed the uncompressed extent size.
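
With PAGE_SIZE = 4096, max_segment_len = lzo1x_worst_compress(4096) =
4096 + 4096/16 + 64 + 3 = 4419 bytes (per the lzo1x_worst_compress()
definition). Applying the new checks to the reported corruption:

	header    = 32768         <= min(BTRFS_MAX_COMPRESSED, srclen)  -> passes
	segment 0 = 3472882419    >  max_segment_len                    -> -EUCLEAN

so the bogus segment is rejected before any copy into the workspace can
run out of bounds.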

Reported-by: James Harvey 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/lzo.c | 35 ++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
index ec5db393c758..4f4de460b08d 100644
--- a/fs/btrfs/lzo.c
+++ b/fs/btrfs/lzo.c
@@ -293,6 +293,7 @@ static int lzo_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
unsigned long working_bytes;
size_t in_len;
size_t out_len;
+   size_t max_segment_len = lzo1x_worst_compress(PAGE_SIZE);
unsigned long in_offset;
unsigned long in_page_bytes_left;
unsigned long tot_in;
@@ -306,6 +307,18 @@ static int lzo_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
 
data_in = kmap(pages_in[0]);
tot_len = read_compress_length(data_in);
+   /*
+* Compressed data header check.
+*
+* The real compressed size can't exceed extent length, and all pages
+* should be used (a full pending page is not possible).
+* If this happens it means the compressed extent is corrupted.
+*/
+   if (tot_len > min_t(size_t, BTRFS_MAX_COMPRESSED, srclen) ||
+   tot_len < srclen - PAGE_SIZE) {
+   ret = -EUCLEAN;
+   goto done;
+   }
 
tot_in = LZO_LEN;
in_offset = LZO_LEN;
@@ -320,6 +333,17 @@ static int lzo_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
in_offset += LZO_LEN;
tot_in += LZO_LEN;
 
+   /*
+* Segment header check.
+*
+* The segment length must not exceed max lzo compression
+* size, nor the total compressed size
+*/
+   if (in_len > max_segment_len || tot_in + in_len > tot_len) {
+   ret = -EUCLEAN;
+   goto done;
+   }
+
tot_in += in_len;
working_bytes = in_len;
may_late_unmap = need_unmap = false;
@@ -370,7 +394,7 @@ static int lzo_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
}
}
 
-   out_len = lzo1x_worst_compress(PAGE_SIZE);
+   out_len = max_segment_len;
	ret = lzo1x_decompress_safe(buf, in_len, workspace->buf,
			&out_len);
if (need_unmap)
@@ -380,6 +404,15 @@ static int lzo_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
ret = -EIO;
break;
}
+   /*
+* Decompressed data length check.
+* The uncompressed data should not exceed uncompressed extent
+* size.
+*/
+   if (tot_out + out_len > cb->len) {
+   ret = -EUCLEAN;
+   break;
+   }
 
buf_start = tot_out;
tot_out += out_len;
-- 
2.17.0


[PATCH v3 2/4] btrfs: lzo: Add comment about the how btrfs records its lzo compressed data

2018-05-23 Thread Qu Wenruo
Although it's not that complex, such a comment could still save several
minutes for a new reader or reviewer.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/lzo.c | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
index 0667ea07f766..ec5db393c758 100644
--- a/fs/btrfs/lzo.c
+++ b/fs/btrfs/lzo.c
@@ -17,6 +17,41 @@
 
 #define LZO_LEN	4
 
+/*
+ * Btrfs LZO compression format
+ *
+ * Regular and inlined LZO compressed data extents consist of:
+ * 1.  Header
+ * Fixed size. LZO_LEN (4) bytes long, LE32.
+ * Records the total size (*includes* the header) of real compressed data.
+ *
+ * 2.  Segment(s)
+ * Variable size. Each segment includes one segment header, with data
+ * payload followed.
+ * One regular LZO compressed extent can have one or more segments.
+ * While for an inlined LZO compressed extent, only *ONE* segment is allowed.
+ * One segment represents at most one page of uncompressed data.
+ *
+ * 2.1 Segment header
+ * Fixed size. LZO_LEN (4) bytes long, LE32.
+ * Records the total size of the segment (*excludes* the header).
+ * Segment header *NEVER* crosses page boundary, thus it's possible to
+ * have pending zero at page end.
+ *
+ * 2.2 Data Payload
+ * Variable size. Size up limit should be lzo1x_worst_compress(PAGE_SIZE).
+ *
+ * Example:
+ * Page 1:
+ *  0 0x2   0x4   0x6   0x8   0xa   0xc   0xe 0x10
+ * 0x   |  Header   | SegHdr 01 | Data payload 01 ... |
+ * ...
+ * 0x0ff0   | SegHdr  N | Data payload  N ...  |00|
+ *  ^^ pending zero
+ * Page 2:
+ * 0x1000   | SegHdr N+1| Data payload N+1 ...|
+ */
+
 struct workspace {
void *mem;
void *buf;  /* where decompressed data goes */
-- 
2.17.0
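
To make the documented layout concrete, here is a minimal
userspace-style walker for the format (an illustrative sketch only, not
kernel code: it assumes a little-endian host, a 4KiB page size, and a
stream that has already been validated):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LZO_LEN 4
#define PAGE_SZ 4096

static uint32_t read_le32(const uint8_t *p)
{
	uint32_t v;

	memcpy(&v, p, LZO_LEN);
	return v;			/* little-endian host assumed */
}

static void walk_segments(const uint8_t *buf)
{
	uint32_t tot_len = read_le32(buf);	/* header: includes itself */
	uint32_t off = LZO_LEN;
	int seg = 0;

	while (off + LZO_LEN <= tot_len) {
		uint32_t seg_len;

		/* a segment header never crosses a page boundary: skip
		 * the pending zeroes at the end of the page if needed */
		if (PAGE_SZ - (off % PAGE_SZ) < LZO_LEN)
			off += PAGE_SZ - (off % PAGE_SZ);

		seg_len = read_le32(buf + off);	/* excludes the header */
		printf("segment %d: %u compressed bytes\n", seg++, seg_len);
		off += LZO_LEN + seg_len;
	}
}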



Re: csum failed root raveled during balance

2018-05-23 Thread ein
On 05/23/2018 08:32 AM, Nikolay Borisov wrote:

Nikolay, thank you for the answer.

>> [...]
>> root@node0:~# dmesg | grep BTRFS | grep warn
>> 185:980:[2927472.393557] BTRFS warning (device dm-0): csum failed root
>> -9 ino 312 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>> 186:981:[2927472.394158] BTRFS warning (device dm-0): csum failed root
>> -9 ino 312 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
>> 191:986:[2928224.169814] BTRFS warning (device dm-0): csum failed root
>> -9 ino 314 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>> 192:987:[2928224.171433] BTRFS warning (device dm-0): csum failed root
>> -9 ino 314 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
>> 206:1001:[2928298.039516] BTRFS warning (device dm-0): csum failed root
>> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>> 207:1002:[2928298.043103] BTRFS warning (device dm-0): csum failed root
>> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>> 208:1004:[2932213.513424] BTRFS warning (device dm-0): csum failed root
>> 5 ino 219962 off 4564959232 csum 0xc616afb4 expected csum 0x5425e489
>> mirror 1
>> 209:1005:[2932235.666368] BTRFS warning (device dm-0): csum failed root
>> 5 ino 219962 off 16989835264 csum 0xd63ed5da expected csum 0x7429caa1
>> mirror 1
>> 210:1072:[2936767.229277] BTRFS warning (device dm-0): csum failed root
>> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
>> mirror 1
>> 211:1073:[2936767.276229] BTRFS warning (device dm-0): csum failed root
>> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
>> mirror 1
>>
>> Above has been revealed during below command and quite high IO usage by
>> few VMs (Linux on top Ext4 with firebird database, lots of random
>> read/writes, two others with Windows 2016 and Windows Update in the
>> background):
> 
> I believe you are hitting the issue described here:
> 
> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg25656.html

It makes sense: fsck.ext4, gbak (the Firebird integrity checking tool),
chkdsk and sfc /scannow don't show any errors internally within the VMs.
As far as I can tell, the data inside the VMs is not corrupted.

> Essentially the way qemu operates on vm images atop btrfs is prone to
> producing such errors. As a matter of fact, other filesystems also
> suffer from this (i.e. pages modified while being written, however due to
> lack of CRC on the data they don't detect it). Can you confirm that
> those inodes (312/314/319/219962/219915) belong to vm images files?

root@node0:/var/lib/libvirt# find  ./ -inum 312
root@node0:/var/lib/libvirt# find  ./ -inum 314
root@node0:/var/lib/libvirt# find  ./ -inum 319
root@node0:/var/lib/libvirt# find  ./ -inum 219962
./images/rds.raw
root@node0:/var/lib/libvirt# find  ./ -inum 219915
./images/database.raw

It seems so (219962, 219915):
- rds.raw - Windows 2016 server, Remote Desktop Server, raw preallocated
  image, NTFS
- database.raw - Linux 3.8, Firebird DB server, raw preallocated image, Ext4

> IMHO the best course of action would be to disable checksumming for you
> vm files.
> 

Do you mean '-o nodatasum' mount flag? Is it possible to disable
checksumming for singe file by setting some magical chattr? Google
thinks it's not possible to disable csums for a single file.

> For some background I suggest you read the following LWN articles:
> 
> https://lwn.net/Articles/486311/
> https://lwn.net/Articles/442355/
> 
>>
>> when I changed BTRFS compress parameters. Or during umount (can't recall
>> now):
>>
>> May  2 07:15:39 node0 kernel: [1168145.677431] WARNING: CPU: 6 PID: 3763
>> at /build/linux-8B5M4n/linux-4.15.11/fs/direct-io.c:293
>> dio_complete+0x1d6/0x220
>> May  2 07:15:39 node0 kernel: [1168145.678811] Modules linked in: fuse
>> ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs vhost_net vhost
>> tap tun ebtable_filter ebtables ip6table_filter ip6_tables
>> iptable_filter binfmt_misc bridge 8021q garp mrp stp llc
>> snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal intel_powerclamp
>> coretemp snd_hda_codec_realtek kvm_intel snd_hda_codec_generic kvm
>> i915 irqbypass crct10dif_pclmul snd_hda_intel crc32_pclmul
>> ghash_clmulni_intel snd_hda_codec intel_cstate snd_hda_core iTCO_wdt
>> iTCO_vendor_support intel_uncore drm_kms_helper snd_hwdep wmi_bmof
>> intel_rapl_perf joydev evdev pcspkr snd_pcm snd_timer drm snd
>> soundcore i2c_algo_bit sg mei_me lpc_ich shpchp mfd_core mei
>> ie31200_edac wmi video button ib_iser rdma_cm iw_cm ib_cm ib_core
>> configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables
>> May  2 07:15:39 node0 kernel: [1168145.685202]  x_tables autofs4 ext4
>> crc16 mbcache jbd2 fscrypto ecb btrfs zstd_decompress zstd_compress
>> xxhash raid456 async_raid6_recov async_memcpy async_pq async_xor
>> async_tx xor raid6_pq libcrc32c crc32c_generic raid0 multipath linear
>> hid_generic usbhid hid dm_mod raid10

[PATCH] btrfs: delayed-ref: simplify btrfs_add_delayed_tree_ref()

2018-05-23 Thread Su Yue
Commit 5a5003df98d5 ("btrfs: delayed-ref: double free in
btrfs_add_delayed_tree_ref()") fixed a double free problem by creating
an unnecessary label to jump to.
The elegant way is just to change "ref" to "head_ref" and keep
btrfs_add_delayed_tree_ref() and btrfs_add_delayed_data_ref() in
similar structure.

This patch reverts commit 5a5003df98d5 ("btrfs: delayed-ref: double
free in btrfs_add_delayed_tree_ref()") and frees the right head_ref.
No functional change.

Signed-off-by: Su Yue 
---
This patch is based on for-next to avoid conflicts with patches
already in for-next.

 fs/btrfs/delayed-ref.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 03dec673d12a..38f8d5d549ed 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -741,14 +741,20 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info 
*fs_info,
ref->level = level;
 
head_ref = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS);
-   if (!head_ref)
-   goto free_ref;
+   if (!head_ref) {
+   kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
+   return -ENOMEM;
+   }
 
	if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
is_fstree(ref_root)) {
record = kmalloc(sizeof(*record), GFP_NOFS);
-   if (!record)
-   goto free_head_ref;
+   if (!record) {
+   kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
+   kmem_cache_free(btrfs_delayed_ref_head_cachep,
+   head_ref);
+   return -ENOMEM;
+   }
}
 
init_delayed_ref_head(head_ref, record, bytenr, num_bytes,
@@ -779,13 +785,6 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info 
*fs_info,
btrfs_qgroup_trace_extent_post(fs_info, record);
 
return 0;
-
-free_head_ref:
-   kmem_cache_free(btrfs_delayed_ref_head_cachep, head_ref);
-free_ref:
-   kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
-
-   return -ENOMEM;
 }
 
 /*
-- 
2.17.0





Re: [PATCH] Btrfs: implement unlocked buffered write

2018-05-23 Thread Nikolay Borisov


On 23.05.2018 09:37, Christoph Hellwig wrote:
> On Tue, May 22, 2018 at 02:31:36PM -0400, Chris Mason wrote:
>>> And what protects two writes from interleaving their results now?
>>
>> page locks...ish, we at least won't have results interleaved in a single
>> page.  For btrfs it'll actually be multiple pages since we try to do more
>> than one at a time.
> 
> I think you are going to break just about every assumption people
> have that any single write is going to be atomic vs another write.
> 
> E.g. this comes from the posix read definition reference by the
> write definition:
> 
> "I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> Atomic means that all the bytes from a single operation that started out
> together end up together, without interleaving from other I/O
> operations. It is a known attribute of terminals that this is not
> honored, and terminals are explicitly (and implicitly permanently)
> excepted, making the behavior unspecified. The behavior for other device
> types is also left unspecified, but the wording is intended to imply
> that future standards might choose to specify atomicity (or not).
> "
> 
> Without taking the inode lock (or some sort of range lock) you can
> easily interleave data from two write requests.
> 
>> But we're not avoiding the inode lock completely, we're just dropping it for
>> the expensive parts of writing to the file.  A quick guess about what the
>> expensive parts are:
> 
> The way I read the patch it basically 'avoids' the inode lock for almost
> the whole write call, just minus some setup.

You are right, this drops the inode_lock for some rather crucial ops:
prepare_pages - this one allocates pages in the page cache mapping for
this inode and locks them

lock_and_cleanup_extent_if_need - this one completes any previous writes
for the range currently being written and locks the range of extents.

Looking at the code I *think* the locking provided by
prepare_pages/lock_and_cleanup_extent_if_need *should* provide the
necessary exclusivity to guarantee atomicity w.r.t. different writes.

So Robbie,

This reliance on btrfs-internal locks to provide exclusivity *must* be
explicitly documented in the commit log.




Re: [PATCH v2 2/4] btrfs: lzo: Add comment about the how btrfs records its lzo compressed data

2018-05-23 Thread Qu Wenruo


On 2018-05-22 22:00, David Sterba wrote:
> On Mon, May 21, 2018 at 01:19:25PM +0800, Qu Wenruo wrote:
>> Although it's not that complex, but such comment could still save
>> several minutes for newer reader/reviewer.
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>>  fs/btrfs/lzo.c | 23 +++
>>  1 file changed, 23 insertions(+)
>>
>> diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
>> index 0667ea07f766..d0c6789ff78f 100644
>> --- a/fs/btrfs/lzo.c
>> +++ b/fs/btrfs/lzo.c
>> @@ -17,6 +17,29 @@
>>  
>>  #define LZO_LEN 4
>>  
>> +/*
>> + * Btrfs LZO compression format
>> + *
>> + * Regular LZO compressed data extent consists of:
>> + * 1.  Header
>> + * Fixed size. LZO_LEN (4) bytes long, LE32.
>> + * Records the total size (*includes* the header) of real compressed data.
>> + *
>> + * 2.  Segment(s)
>> + * Variable size. Includes one segment header, and then data payload.
>> + * One regular LZO compressed extent can have one or more segments.
>> + *
>> + * 2.1 Segment header
>> + * Fixed size. LZO_LEN (4) bytes long, LE32.
>> + * Records the total size of the segment (*excludes* the header).
>> + *
>> + * 2.2 Data Payload
>> + * Variable size. Size up limit should be lzo1x_worst_compress(PAGE_SIZE).
>> + *
>> + * While for inlined LZO compressed data extent, it doesn't have Header, just
>> + * *ONE* Segment.
> 
> It does have header that has the same value as the 1st segment header so
> it's skipped.

Indeed, I just missed the first "data_in += LZO_LEN;" line.

And in that case, the header should also be checked.

> 
> There's also one catch in the format: if there are less than 4 bytes
> left to the page end, the new segment starts on the next page.

I'll also update the patch to reflect this.

Thanks,
Qu






lening

2018-05-23 Thread Funding Trusts Finance


Good day,

We are Funding Trusts Finance, providing loans by postal advertisement.
We offer various kinds of loans and project loans (short- and long-term
loans, personal loans, business loans, etc.) at an interest rate of 3%.
We provide loans to people in need, regardless of their location,
gender, marital status, education or job, but they must have a legal
means of repayment. Our loans range between 5,000.00 and 20,000,000.00
US Dollars, Euros or Pounds, with a maximum duration of 15 years. If you
are interested in more information, we have investors interested in
financing large-volume projects. The procedures are as follows:

1. The client must send a short summary of the project. This must
include the total amount required for the project, the estimated return
on investment, and the repayment period of the loan, which must not
exceed 20 years.

Contact us: i...@fundingtrustsfinance.com

INFORMATION NEEDED

Your names:
Address: ...
Telephone: ...
Amount required: ...
Duration: ...
Occupation: ...
Monthly income level: ...
Gender: ...
Date of birth: ...
State: ...
Country: ...
Purpose: ...

"Millions of people in the world have a credit problem of some kind.
You are not alone. We have a whole range of loan options that can help.
Discover your options now!"

Kind regards,
Ronny Hens,
E-mail: i...@fundingtrustsfinance.com
WEBSITE: www.fundingtrustfinance.com


Re: [PATCH v3 1/2] btrfs: qgroup: Search commit root for rescan to avoid missing extent

2018-05-23 Thread Nikolay Borisov


On 23.05.2018 10:29, Qu Wenruo wrote:
> When doing qgroup rescan using the following script (modified from
> btrfs/017 test case), we can sometimes hit qgroup corruption.
> 
> --
> umount $dev &> /dev/null
> umount $mnt &> /dev/null
> 
> mkfs.btrfs -f -n 64k $dev
> mount $dev $mnt
> 
> extent_size=8192
> 
> xfs_io -f -d -c "pwrite 0 $extent_size" $mnt/foo > /dev/null
> btrfs subvolume snapshot $mnt $mnt/snap
> 
> xfs_io -f -c "reflink $mnt/foo" $mnt/foo-reflink > /dev/null
> xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink > /dev/null
> xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink2 > /dev/null
> btrfs quota enable $mnt
> 
>  # -W is the new option to only wait rescan while not starting new one
> btrfs quota rescan -W $mnt
> btrfs qgroup show -prce $mnt
> umount $mnt
> 
>  # Need to patch btrfs-progs to report qgroup mismatch as error
> btrfs check $dev || _fail
> --
> 
> On a fast machine, we can hit corruption where accounting misses some
> tree blocks:
> --
> qgroupid rfer excl max_rfer max_excl parent  child
>      --  -
> 0/5   8.00KiB0.00B none none --- ---
> 0/257 8.00KiB0.00B none none --- ---
> --
> 
> This is due to the fact that we're always searching commit root for
> btrfs_find_all_roots() at qgroup_rescan_leaf(), but the leaf we get is
> from current transaction, not commit root.
> 
> And if our tree blocks get modified in current transaction, we won't
> find any owner in commit root, thus causing the corruption.
> 
> Fix it by searching commit root for extent tree for
> qgroup_rescan_leaf().
> 
> Reported-by: Nikolay Borisov 
> Signed-off-by: Qu Wenruo 

Reviewed-by: Nikolay Borisov 

> ---
> changelog:
> v2:
>   Remove unused tree_mod_seq_elem for the 1st patch.
> v3:
>   Also skip locking to avoid lock assert() with Liu Bo's
>   btrfs_search_slot() patchset.
> ---
>  fs/btrfs/qgroup.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 9fb758d5077a..a88330088d03 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -2590,7 +2590,6 @@ qgroup_rescan_leaf(struct btrfs_fs_info *fs_info, 
> struct btrfs_path *path,
>   struct btrfs_key found;
>   struct extent_buffer *scratch_leaf = NULL;
>   struct ulist *roots = NULL;
> - struct seq_list tree_mod_seq_elem = SEQ_LIST_INIT(tree_mod_seq_elem);
>   u64 num_bytes;
>   int slot;
>   int ret;
> @@ -2625,7 +2624,6 @@ qgroup_rescan_leaf(struct btrfs_fs_info *fs_info, 
> struct btrfs_path *path,
> btrfs_header_nritems(path->nodes[0]) - 1);
>   fs_info->qgroup_rescan_progress.objectid = found.objectid + 1;
>  
> - btrfs_get_tree_mod_seq(fs_info, &tree_mod_seq_elem);
>   scratch_leaf = btrfs_clone_extent_buffer(path->nodes[0]);
>   if (!scratch_leaf) {
>   ret = -ENOMEM;
> @@ -2664,7 +2662,6 @@ qgroup_rescan_leaf(struct btrfs_fs_info *fs_info, 
> struct btrfs_path *path,
>   btrfs_tree_read_unlock_blocking(scratch_leaf);
>   free_extent_buffer(scratch_leaf);
>   }
> - btrfs_put_tree_mod_seq(fs_info, &tree_mod_seq_elem);
>  
>   return ret;
>  }
> @@ -2681,6 +2678,12 @@ static void btrfs_qgroup_rescan_worker(struct 
> btrfs_work *work)
>   path = btrfs_alloc_path();
>   if (!path)
>   goto out;
> + /*
> +  * Rescan should only search for commit root, and any later difference
> +  * should be recorded by qgroup
> +  */
> + path->search_commit_root = 1;
> + path->skip_locking = 1;
>  
>   err = 0;
>   while (!err && !btrfs_fs_closing(fs_info)) {
> 


[PATCH v3 1/2] btrfs: qgroup: Search commit root for rescan to avoid missing extent

2018-05-23 Thread Qu Wenruo
When doing qgroup rescan using the following script (modified from
btrfs/017 test case), we can sometimes hit qgroup corruption.

--
umount $dev &> /dev/null
umount $mnt &> /dev/null

mkfs.btrfs -f -n 64k $dev
mount $dev $mnt

extent_size=8192

xfs_io -f -d -c "pwrite 0 $extent_size" $mnt/foo > /dev/null
btrfs subvolume snapshot $mnt $mnt/snap

xfs_io -f -c "reflink $mnt/foo" $mnt/foo-reflink > /dev/null
xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink > /dev/null
xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink2 > /dev/null
btrfs quota enable $mnt

 # -W is the new option to only wait rescan while not starting new one
btrfs quota rescan -W $mnt
btrfs qgroup show -prce $mnt
umount $mnt

 # Need to patch btrfs-progs to report qgroup mismatch as error
btrfs check $dev || _fail
--

On a fast machine, we can hit corruption where accounting misses some
tree blocks:
--
qgroupid rfer excl max_rfer max_excl parent  child
     --  -
0/5   8.00KiB0.00B none none --- ---
0/257 8.00KiB0.00B none none --- ---
--

This is due to the fact that we're always searching commit root for
btrfs_find_all_roots() at qgroup_rescan_leaf(), but the leaf we get is
from current transaction, not commit root.

And if our tree blocks get modified in current transaction, we won't
find any owner in commit root, thus causing the corruption.

Fix it by searching commit root for extent tree for
qgroup_rescan_leaf().
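
The pattern the fix applies, in isolation (a sketch; it matches the
hunk below):

	struct btrfs_path *path = btrfs_alloc_path();

	path->search_commit_root = 1;	/* walk the last committed tree */
	path->skip_locking = 1;		/* commit roots are not being
					 * modified, so locking can be
					 * skipped */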

Reported-by: Nikolay Borisov 
Signed-off-by: Qu Wenruo 
---
changelog:
v2:
  Remove unused tree_mod_seq_elem for the 1st patch.
v3:
  Also skip locking to avoid lock assert() with Liu Bo's
  btrfs_search_slot() patchset.
---
 fs/btrfs/qgroup.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 9fb758d5077a..a88330088d03 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2590,7 +2590,6 @@ qgroup_rescan_leaf(struct btrfs_fs_info *fs_info, struct 
btrfs_path *path,
struct btrfs_key found;
struct extent_buffer *scratch_leaf = NULL;
struct ulist *roots = NULL;
-   struct seq_list tree_mod_seq_elem = SEQ_LIST_INIT(tree_mod_seq_elem);
u64 num_bytes;
int slot;
int ret;
@@ -2625,7 +2624,6 @@ qgroup_rescan_leaf(struct btrfs_fs_info *fs_info, struct 
btrfs_path *path,
  btrfs_header_nritems(path->nodes[0]) - 1);
fs_info->qgroup_rescan_progress.objectid = found.objectid + 1;
 
-   btrfs_get_tree_mod_seq(fs_info, &tree_mod_seq_elem);
scratch_leaf = btrfs_clone_extent_buffer(path->nodes[0]);
if (!scratch_leaf) {
ret = -ENOMEM;
@@ -2664,7 +2662,6 @@ qgroup_rescan_leaf(struct btrfs_fs_info *fs_info, struct 
btrfs_path *path,
btrfs_tree_read_unlock_blocking(scratch_leaf);
free_extent_buffer(scratch_leaf);
}
-   btrfs_put_tree_mod_seq(fs_info, &tree_mod_seq_elem);
 
return ret;
 }
@@ -2681,6 +2678,12 @@ static void btrfs_qgroup_rescan_worker(struct btrfs_work 
*work)
path = btrfs_alloc_path();
if (!path)
goto out;
+   /*
+* Rescan should only search for commit root, and any later difference
+* should be recorded by qgroup
+*/
+   path->search_commit_root = 1;
+   path->skip_locking = 1;
 
err = 0;
while (!err && !btrfs_fs_closing(fs_info)) {
-- 
2.17.0





Re: [PATCH] Btrfs: implement unlocked buffered write

2018-05-23 Thread robbieko

Chris Mason wrote on 2018-05-23 02:31:
> On 22 May 2018, at 14:08, Christoph Hellwig wrote:
>
>> On Wed, May 16, 2018 at 11:52:37AM +0800, robbieko wrote:
>>> From: Robbie Ko
>>>
>>> This idea is from direct io. By this patch, we can make the buffered
>>> write parallel, and improve the performance and latency. But because
>>> we cannot update isize without i_mutex, the unlocked buffered write
>>> can only be done in front of the EOF.
>>>
>>> We needn't worry about the race between buffered write and truncate,
>>> because the truncate needs to wait until all the buffered writes end.
>>>
>>> And we also needn't worry about the race between dio write and punch
>>> hole, because we have the extent lock to protect our operation.
>>>
>>> I ran fio to test the performance of this feature.
>>
>> And what protects two writes from interleaving their results now?
>
> page locks...ish, we at least won't have results interleaved in a
> single page.  For btrfs it'll actually be multiple pages since we try
> to do more than one at a time.
>
> I haven't verified all the assumptions around truncate and fallocate
> and friends expecting the dio special locking to be inside i_size.  In
> general this makes me a little uncomfortable.
>
> But we're not avoiding the inode lock completely, we're just dropping
> it for the expensive parts of writing to the file.  A quick guess
> about what the expensive parts are:
>
> 1) balance_dirty_pages()
> 2) btrfs_btree_balance_dirty()
> 3) metadata reservations/enospc waiting.

The expensive parts of buffered_write are:

1. prepare_pages()
   --wait_on_page_writeback()
   Because writeback has submitted the page and set PG_writeback,
   we must wait until the page writeback IO ends.

2. lock_and_cleanup_extent_if_need()
   --btrfs_start_ordered_extent()
   When a large number of ordered_extents are queued in the
   endio_write_workers workqueue, buffered_write has to assume its
   ordered_extent is the last one in the endio_write_workers workqueue
   and wait for all ordered_extents before it to be processed, because
   the workqueue is a FIFO.

Thanks.
Robbie Ko

> Can I bribe you to benchmark how much each of those things is
> impacting the iops/latency benefits?
>
> -chris




Re: [PATCH] Btrfs: implement unlocked buffered write

2018-05-23 Thread robbieko

Omar Sandoval wrote on 2018-05-23 01:28:
> On Wed, May 16, 2018 at 11:52:37AM +0800, robbieko wrote:
>> From: Robbie Ko
>>
>> This idea is from direct io. By this patch, we can make the buffered
>> write parallel, and improve the performance and latency. But because
>> we cannot update isize without i_mutex, the unlocked buffered write
>> can only be done in front of the EOF.
>>
>> We needn't worry about the race between buffered write and truncate,
>> because the truncate needs to wait until all the buffered writes end.
>
> I'm not convinced that this race isn't an issue. Consider this:
>
> __btrfs_buffered_write()            btrfs_setsize()
> inode_dio_begin()
> inode_unlock()
>                                     inode_lock()
>                                     truncate_setsize() /* Updates i_size */
>                                     truncate_pagecache() /* Locks and unlocks pages */
>                                     inode_dio_wait()
> prepare_pages() /* Locks pages */
> btrfs_dirty_pages() /* Updates i_size without i_rwsem! */
>
> I think moving inode_dio_wait() to before truncate_setsize() in
> btrfs_setsize() would close this race, but I haven't thought about it
> long enough to determine whether that still works for the original
> reason the inode_dio_wait() was added.

OK. I will correct this part.

The original use of inode_dio_wait is to avoid a race between dio read
and truncate.

We can fix it like below:

-   /* we don't support swapfiles, so vmtruncate shouldn't fail */
-   truncate_setsize(inode, newsize);
-
    /* Disable nonlocked read DIO to avoid the end less truncate */
    btrfs_inode_block_unlocked_dio(BTRFS_I(inode));
    inode_dio_wait(inode);
+
+   /* we don't support swapfiles, so vmtruncate shouldn't fail */
+   truncate_setsize(inode, newsize);
+
    btrfs_inode_resume_unlocked_dio(BTRFS_I(inode));

We just make sure to update isize before resuming unlocked dio reads.

Thanks.

>> And we also needn't worry about the race between dio write and punch
>> hole, because we have the extent lock to protect our operation.
>>
>> I ran fio to test the performance of this feature.
>>
>> == Hardware ==
>> CPU: Intel® Xeon® D-1531
>> SSD: Intel S3710 200G
>> Volume : RAID 5 , SSD * 6
>>
>> == config file ==
>> [global]
>> group_reporting
>> time_based
>> thread=1
>> norandommap
>> ioengine=libaio
>> bs=4k
>> iodepth=32
>> size=16G
>> runtime=180
>> numjobs=8
>> rw=randwrite
>>
>> [file1]
>> filename=/mnt/btrfs/nocow/testfile
>>
>> == result (iops) ==
>> lock = 68470
>> unlocked = 94242
>>
>> == result (clat) ==
>> lock
>>  lat (usec): min=184, max=1209.9K, avg=3738.35, stdev=20869.49
>>  clat percentiles (usec):
>>   |  1.00th=[  322],  5.00th=[  330], 10.00th=[  334], 20.00th=[  346],
>>   | 30.00th=[  370], 40.00th=[  386], 50.00th=[  406], 60.00th=[  446],
>>   | 70.00th=[  516], 80.00th=[  612], 90.00th=[  828], 95.00th=[10432],
>>   | 99.00th=[84480], 99.50th=[117248], 99.90th=[226304], 99.95th=[333824],
>>   | 99.99th=[692224]
>>
>> unlocked
>>  lat (usec): min=10, max=218208, avg=2691.44, stdev=5380.82
>>  clat percentiles (usec):
>>   |  1.00th=[  302],  5.00th=[  390], 10.00th=[  442], 20.00th=[  502],
>>   | 30.00th=[  548], 40.00th=[  596], 50.00th=[  652], 60.00th=[  724],
>>   | 70.00th=[  916], 80.00th=[ 5024], 90.00th=[ 5664], 95.00th=[10048],
>>   | 99.00th=[29568], 99.50th=[39168], 99.90th=[54016], 99.95th=[59648],
>>   | 99.99th=[78336]
>>
>> Signed-off-by: Robbie Ko
>> ---
>>  fs/btrfs/file.c | 17 +
>>  1 file changed, 17 insertions(+)
>>
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index 41ab907..8eac540 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -1600,6 +1600,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
>>  	int ret = 0;
>>  	bool only_release_metadata = false;
>>  	bool force_page_uptodate = false;
>> +	bool relock = false;
>>
>>  	nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
>>  			PAGE_SIZE / (sizeof(struct page *)));
>> @@ -1609,6 +1610,18 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
>>  	if (!pages)
>>  		return -ENOMEM;
>>
>> +	inode_dio_begin(inode);
>
> This would need a comment as to why we're using inode_dio_begin() for
> buffered I/O.
>
>> +
>> +	/*
>> +	 * If the write is beyond the EOF, we need update
>> +	 * the isize, but it is protected by i_mutex. So we can
>> +	 * not unlock the i_mutex at this case.
>> +	 */
>> +	if (pos + iov_iter_count(i) <= i_size_read(inode)) {
>> +		inode_unlock(inode);
>> +		relock = true;
>> +	}
>> +
>>  	while (iov_iter_count(i) > 0) {
>>  		size_t offset = pos & (PAGE_SIZE - 1);
>>  		size_t sector_offset;
>> @@ -1808,6 +1821,10 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
>>  		}
>>  	}
>>
>> +	inode_dio_end(inode);
>> +	if (relock)
>> +		inode_lock(inode);
>> +
>>  	extent_changeset_free(data_reserved);
>>  	return num_written ? num_written : ret;
>>  }
>> --
>> 1.9.1


Re: [PATCH] Btrfs: implement unlocked buffered write

2018-05-23 Thread Christoph Hellwig
On Tue, May 22, 2018 at 02:31:36PM -0400, Chris Mason wrote:
> > And what protects two writes from interleaving their results now?
> 
> page locks...ish, we at least won't have results interleaved in a single
> page.  For btrfs it'll actually be multiple pages since we try to do more
> than one at a time.

I think you are going to break just about every assumption people
have that any single write is going to be atomic vs another write.

E.g. this comes from the posix read definition reference by the
write definition:

"I/O is intended to be atomic to ordinary files and pipes and FIFOs.
Atomic means that all the bytes from a single operation that started out
together end up together, without interleaving from other I/O
operations. It is a known attribute of terminals that this is not
honored, and terminals are explicitly (and implicitly permanently)
excepted, making the behavior unspecified. The behavior for other device
types is also left unspecified, but the wording is intended to imply
that future standards might choose to specify atomicity (or not).
"

Without taking the inode lock (or some sort of range lock) you can
easily interleave data from two write requests.

> But we're not avoiding the inode lock completely, we're just dropping it for
> the expensive parts of writing to the file.  A quick guess about what the
> expensive parts are:

The way I read the patch it basically 'avoids' the inode lock for almost
the whole write call, just minus some setup.
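
An illustrative userspace sketch of the concern (file name and sizes
are arbitrary): two threads pwrite(2) to the same offset with buffers
spanning two pages, so per-page locking alone cannot keep the results
from interleaving:

#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>

#define LEN (2 * 4096)		/* span two pages on purpose */

static int fd;

static void *writer(void *arg)
{
	char buf[LEN];

	memset(buf, *(const char *)arg, sizeof(buf));
	pwrite(fd, buf, sizeof(buf), 0);	/* same range in both threads */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	fd = open("testfile", O_CREAT | O_RDWR, 0644);
	pthread_create(&a, NULL, writer, "A");
	pthread_create(&b, NULL, writer, "B");
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	close(fd);

	/* with POSIX-atomic writes the first 8KiB is now all 'A' or all
	 * 'B'; a mix of the two would be exactly the interleaving
	 * described above */
	return 0;
}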


[PATCH v4 3/3] btrfs: balance: add kernel log for end or paused

2018-05-23 Thread Anand Jain
Add a kernel log message when balance ends, whether it was canceled,
completed, or paused.

Signed-off-by: Anand Jain 
---
v3->v4: nothing.
v2->v3: nothing.
v1->v2: Moved from 2/3 to 3/3
 fs/btrfs/volumes.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index df884e4fb0a2..d07158491505 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4054,6 +4054,13 @@ int btrfs_balance(struct btrfs_fs_info *fs_info,
ret = __btrfs_balance(fs_info);
 
	mutex_lock(&fs_info->balance_mutex);
+   if (ret == -ECANCELED && atomic_read(&fs_info->balance_pause_req))
+   btrfs_info(fs_info, "balance: paused");
+   else if (ret == -ECANCELED && atomic_read(&fs_info->balance_cancel_req))
+   btrfs_info(fs_info, "balance: canceled");
+   else
+   btrfs_info(fs_info, "balance: ended with status: %d", ret);
+
	clear_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags);
 
if (bargs) {
-- 
2.7.0



[PATCH v4 1/3] btrfs: add helper function describe_block_group()

2018-05-23 Thread Anand Jain
Improve on describe_relocation(): add a common helper function to
describe the block groups.

Signed-off-by: Anand Jain 
---
v3->v4: Just pass full flag name in the define DESCRIBE_FLAG(flag,..),
 so that it can be used at couple of more places.
Rename describe_block_groups() to btrfs_describe_block_groups().
Drop useless return u32.
v3: Born.

 fs/btrfs/relocation.c | 30 +++---
 fs/btrfs/volumes.c| 36 
 fs/btrfs/volumes.h|  1 +
 3 files changed, 40 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 879b76fa881a..0434cebc2ea2 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4320,37 +4320,13 @@ static struct reloc_control *alloc_reloc_control(void)
 static void describe_relocation(struct btrfs_fs_info *fs_info,
struct btrfs_block_group_cache *block_group)
 {
-   char buf[128];  /* prefixed by a '|' that'll be dropped */
-   u64 flags = block_group->flags;
+   char buf[128];
 
-   /* Shouldn't happen */
-   if (!flags) {
-   strcpy(buf, "|NONE");
-   } else {
-   char *bp = buf;
-
-#define DESCRIBE_FLAG(f, d) \
-   if (flags & BTRFS_BLOCK_GROUP_##f) { \
-   bp += snprintf(bp, buf - bp + sizeof(buf), "|%s", d); \
-   flags &= ~BTRFS_BLOCK_GROUP_##f; \
-   }
-   DESCRIBE_FLAG(DATA, "data");
-   DESCRIBE_FLAG(SYSTEM,   "system");
-   DESCRIBE_FLAG(METADATA, "metadata");
-   DESCRIBE_FLAG(RAID0,"raid0");
-   DESCRIBE_FLAG(RAID1,"raid1");
-   DESCRIBE_FLAG(DUP,  "dup");
-   DESCRIBE_FLAG(RAID10,   "raid10");
-   DESCRIBE_FLAG(RAID5,"raid5");
-   DESCRIBE_FLAG(RAID6,"raid6");
-   if (flags)
-   snprintf(bp, buf - bp + sizeof(buf), "|0x%llx", flags);
-#undef DESCRIBE_FLAG
-   }
+   btrfs_describe_block_groups(block_group->flags, buf, sizeof(buf));
 
btrfs_info(fs_info,
   "relocating block group %llu flags %s",
-  block_group->key.objectid, buf + 1);
+  block_group->key.objectid, buf);
 }
 
 /*
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b6757b53c297..44f4799bf06f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -126,6 +126,42 @@ const char *get_raid_name(enum btrfs_raid_types type)
return btrfs_raid_array[type].raid_name;
 }
 
+void btrfs_describe_block_groups(u64 bg_flags, char *buf, u32 size_buf)
+{
+   int i;
+   u32 ret;
+   char *bp = buf;
+   u64 flags = bg_flags;
+
+   if (!flags) {
+   snprintf(bp, size_buf, "%s", "NONE");
+   return;
+   }
+
+#define DESCRIBE_FLAG(f, d) \
+   do { \
+   if (flags & f) { \
+   bp += snprintf(bp, buf - bp + size_buf, "%s|", d); \
+   flags &= ~f; \
+   } \
+   } while (0)
+
+   DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_DATA, "data");
+   DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_SYSTEM, "system");
+   DESCRIBE_FLAG(BTRFS_BLOCK_GROUP_METADATA, "metadata");
+   DESCRIBE_FLAG(BTRFS_AVAIL_ALLOC_BIT_SINGLE, "single");
+   for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
+   DESCRIBE_FLAG(btrfs_raid_array[i].bg_flag,
+ btrfs_raid_array[i].raid_name);
+#undef DESCRIBE_FLAG
+
+   if (flags)
+   bp += snprintf(bp, buf - bp + size_buf, "0x%llx|", flags);
+
+   ret = bp - buf - 1;
+   buf[ret] = '\0'; /* remove last | */
+}
+
 static int init_first_rw_device(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info);
 static int btrfs_relocate_sys_chunks(struct btrfs_fs_info *fs_info);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 5139ec8daf4c..5b9637a3ed48 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -433,6 +433,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, 
const char *path);
 int btrfs_balance(struct btrfs_fs_info *fs_info,
  struct btrfs_balance_control *bctl,
  struct btrfs_ioctl_balance_args *bargs);
+void btrfs_describe_block_groups(u64 flags, char *buf, u32 size_buf);
 int btrfs_resume_balance_async(struct btrfs_fs_info *fs_info);
 int btrfs_recover_balance(struct btrfs_fs_info *fs_info);
 int btrfs_pause_balance(struct btrfs_fs_info *fs_info);
-- 
2.7.0
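
As a cross-check of the formatting logic in btrfs_describe_block_groups(),
here is a minimal userspace sketch of the same mask-and-append loop. The
flag bits and the lookup table are made-up stand-ins (not the kernel's
BTRFS_BLOCK_GROUP_* values or btrfs_raid_array), and, like the helper's
non-NONE path, it assumes flags != 0:

#include <stdio.h>

int main(void)
{
	/* made-up flag bits standing in for the kernel's definitions */
	const unsigned long long DATA = 1ULL << 0, RAID1 = 1ULL << 4,
				 UNKNOWN = 1ULL << 40;
	const struct { unsigned long long bit; const char *name; } table[] = {
		{ DATA, "data" }, { RAID1, "raid1" },
	};
	unsigned long long flags = DATA | RAID1 | UNKNOWN;
	char buf[128];
	char *bp = buf;
	size_t i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		if (flags & table[i].bit) {
			bp += snprintf(bp, buf - bp + sizeof(buf), "%s|",
				       table[i].name);
			flags &= ~table[i].bit;
		}
	}
	if (flags)	/* any bit not in the table is printed raw */
		bp += snprintf(bp, buf - bp + sizeof(buf), "0x%llx|", flags);
	buf[bp - buf - 1] = '\0';	/* drop the trailing '|' */

	printf("%s\n", buf);	/* prints: data|raid1|0x10000000000 */
	return 0;
}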



[PATCH v4 2/3] btrfs: balance: add args info during start and resume

2018-05-23 Thread Anand Jain
Balance args are important information for a system audit, so this
patch adds them to the kernel log.

Example:

->btrfs bal start --full-balance -f -mprofiles=raid1,convert=single,soft 
-dlimit=10..20,usage=50 /btrfs

 kernel: BTRFS info (device sdc): balance: start -f -d usage=50,limit=10..20 -m 
soft,profiles=raid1,convert=single -s soft,profiles=raid1,convert=single

Signed-off-by: Anand Jain 
---
v3->v4: Changelog updated with a new example.
Log format changed to almost match the cli.
snprintf: dropped %s for inline strings.
Ranges printed as =%u..%u instead of min=%u max=%u.
soft is now under the args.
force is printed as -f.

v2->v3: Changelog updated.
Make use of describe_block_group() added in 1/4.
Drop strcat; use snprintf instead.
Log string format updated as shown in the example above.

v1->v2: Changelog updated.
Moved the logs for balance complete and end to a new patch.

 fs/btrfs/volumes.c | 107 +++--
 1 file changed, 104 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 44f4799bf06f..85f954f7f93e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3806,6 +3806,108 @@ static inline int validate_convert_profile(struct 
btrfs_balance_args *bctl_arg,
 (bctl_arg->target & ~allowed)));
 }
 
+static u32 describe_balance_args(struct btrfs_balance_args *bargs, char *buf,
+u32 size_buf)
+{
+   char *bp = buf;
+   u64 flags = bargs->flags;
+   char tmp_buf[128];
+
+   if (!flags)
+   return 0;
+
+   if (flags & BTRFS_BALANCE_ARGS_SOFT)
+   bp += snprintf(bp, buf - bp + size_buf, "soft,");
+
+   if (flags & BTRFS_BALANCE_ARGS_PROFILES) {
+   btrfs_describe_block_groups(bargs->profiles, tmp_buf,
+   sizeof(tmp_buf));
+   bp += snprintf(bp, buf - bp + size_buf, "profiles=%s,",
+  tmp_buf);
+   }
+
+   if (flags & BTRFS_BALANCE_ARGS_USAGE)
+   bp += snprintf(bp, buf - bp + size_buf, "usage=%llu,",
+  bargs->usage);
+
+   if (flags & BTRFS_BALANCE_ARGS_USAGE_RANGE)
+   bp += snprintf(bp, buf - bp + size_buf, "usage=%u..%u,",
+  bargs->usage_min, bargs->usage_max);
+
+   if (flags & BTRFS_BALANCE_ARGS_DEVID)
+   bp += snprintf(bp, buf - bp + size_buf, "devid=%llu,",
+  bargs->devid);
+
+   if (flags & BTRFS_BALANCE_ARGS_DRANGE)
+   bp += snprintf(bp, buf - bp + size_buf, "drange=%llu..%llu,",
+  bargs->pstart, bargs->pend);
+
+   if (flags & BTRFS_BALANCE_ARGS_VRANGE)
+   bp += snprintf(bp, buf - bp + size_buf, "vrange%llu..%llu,",
+  bargs->vstart, bargs->vend);
+
+   if (flags & BTRFS_BALANCE_ARGS_LIMIT)
+   bp += snprintf(bp, buf - bp + size_buf, "limit=%llu,",
+  bargs->limit);
+
+   if (flags & BTRFS_BALANCE_ARGS_LIMIT_RANGE)
+   bp += snprintf(bp, buf - bp + size_buf, "limit=%u..%u,",
+  bargs->limit_min, bargs->limit_max);
+
+   if (flags & BTRFS_BALANCE_ARGS_STRIPES_RANGE)
+   bp += snprintf(bp, buf - bp + size_buf, "stripes=%u..%u,",
+  bargs->stripes_min, bargs->stripes_max);
+
+   if (flags & BTRFS_BALANCE_ARGS_CONVERT) {
+   int index = btrfs_bg_flags_to_raid_index(bargs->target);
+
+   bp += snprintf(bp, buf - bp + size_buf, "convert=%s,",
+  get_raid_name(index));
+   }
+
+   buf[bp - buf - 1] = '\0'; /* remove last , */
+   return bp - buf - 1;
+}
+
+static void describe_balance_start_or_resume(struct btrfs_fs_info *fs_info)
+{
+   u32 sz = 1024;
+   char tmp_buf[64];
+   char *buf;
+   char *bp;
+   struct btrfs_balance_control *bctl = fs_info->balance_ctl;
+
+   buf = kzalloc(sz, GFP_KERNEL);
+   if (!buf)
+   return;
+
+   bp = buf;
+   if (bctl->flags & BTRFS_BALANCE_FORCE)
+   bp += snprintf(bp, buf - bp + sz, "-f ");
+
+   if (bctl->flags & BTRFS_BALANCE_DATA) {
+   describe_balance_args(&bctl->data, tmp_buf, sizeof(tmp_buf));
+   bp += snprintf(bp, buf - bp + sz, "-d %s ", tmp_buf);
+   }
+
+   if (bctl->flags & BTRFS_BALANCE_METADATA) {
+   describe_balance_args(&bctl->meta, tmp_buf, sizeof(tmp_buf));
+   bp += snprintf(bp, buf - bp + sz, "-m %s ", tmp_buf);
+   }
+
+   if (bctl->flags & BTRFS_BALANCE_SYSTEM) {
+   describe_balance_args(&bctl->sys, tmp_buf, sizeof(tmp_buf));
+   bp += snprintf(bp, buf - bp + sz, "-s %s ", tmp_buf);

Re: csum failed root revealed during balance

2018-05-23 Thread Nikolay Borisov


On 22.05.2018 23:05, ein wrote:
> Hello devs,
> 
> I tested BTRFS in production for about a month:
> 
> 21:08:17 up 34 days,  2:21,  3 users,  load average: 0.06, 0.02, 0.00
> 
> Without power blackouts or hardware failures; the SSDs' SMART is
> flawless, etc. The tests ended with:
> 
> root@node0:~# dmesg | grep BTRFS | grep warn
> 185:980:[2927472.393557] BTRFS warning (device dm-0): csum failed root
> -9 ino 312 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
> 186:981:[2927472.394158] BTRFS warning (device dm-0): csum failed root
> -9 ino 312 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
> 191:986:[2928224.169814] BTRFS warning (device dm-0): csum failed root
> -9 ino 314 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
> 192:987:[2928224.171433] BTRFS warning (device dm-0): csum failed root
> -9 ino 314 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
> 206:1001:[2928298.039516] BTRFS warning (device dm-0): csum failed root
> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
> 207:1002:[2928298.043103] BTRFS warning (device dm-0): csum failed root
> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
> 208:1004:[2932213.513424] BTRFS warning (device dm-0): csum failed root
> 5 ino 219962 off 4564959232 csum 0xc616afb4 expected csum 0x5425e489
> mirror 1
> 209:1005:[2932235.666368] BTRFS warning (device dm-0): csum failed root
> 5 ino 219962 off 16989835264 csum 0xd63ed5da expected csum 0x7429caa1
> mirror 1
> 210:1072:[2936767.229277] BTRFS warning (device dm-0): csum failed root
> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
> mirror 1
> 211:1073:[2936767.276229] BTRFS warning (device dm-0): csum failed root
> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
> mirror 1
> 
> The above was revealed during the command below, combined with quite
> high IO usage by a few VMs (Linux on top of Ext4 with a Firebird
> database, lots of random reads/writes, and two others with Windows 2016
> and Windows Update in the background):

I believe you are hitting the issue described here:

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg25656.html

Essentially the way qemu operates on vm images atop btrfs is prone to
producing such errors. As a matter of fact, other filesystems also
suffer from this (i.e. pages modified while being written); however, due
to the lack of CRC on the data, they don't detect it. Can you confirm
that those inodes (312/314/319/219962/219915) belong to vm image files?

IMHO the best course of action would be to disable checksumming for your
vm files.
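
For example (the path is illustrative; on btrfs, nodatacow implies
nodatasum, and the attribute only takes effect for files created after
it is set on the directory):

 # chattr +C /var/lib/libvirt/images
 # qemu-img create -f raw /var/lib/libvirt/images/vm.img 50G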


For some background I suggest you read the following LWN articles:

https://lwn.net/Articles/486311/
https://lwn.net/Articles/442355/

> 
> when I changed BTRFS compress parameters. Or during umount (can't recall
> now):
> 
> May  2 07:15:39 node0 kernel: [1168145.677431] WARNING: CPU: 6 PID: 3763
> at /build/linux-8B5M4n/linux-4.15.11/fs/direct-io.c:293
> dio_complete+0x1d6/0x220
> May  2 07:15:39 node0 kernel: [1168145.678811] Modules linked in: fuse
> ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs vhost_net vhost
> tap tun ebtable_filter ebtables ip6table_filter ip6_tables
> iptable_filter binfmt_misc bridge 8021q garp mrp
> stp llc snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal
> intel_powerclamp coretemp snd_hda_codec_realtek kvm_intel
> snd_hda_codec_generic kvm i915 irqbypass crct10dif_pclmul
> snd_hda_intel crc32_pclmul ghash_clmulni_intel snd_hda_codec
> intel_cstate snd_hda_core iTCO_wdt iTCO_vendor_support
> intel_uncore drm_kms_helper snd_hwdep wmi_bmof intel_rapl_perf joydev
> evdev pcspkr snd_pcm snd_timer drm snd soundcore i2c_algo_bit sg mei_me
> lpc_ich shpchp mfd_core mei ie31200_edac
> wmi video button ib_iser rdma_cm iw_cm ib_cm ib_core configfs
> iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables
> May  2 07:15:39 node0 kernel: [1168145.685202]  x_tables autofs4 ext4
> crc16 mbcache jbd2 fscrypto ecb btrfs zstd_decompress zstd_compress
> xxhash raid456 async_raid6_recov async_memcpy
> async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic
> raid0 multipath linear hid_generic usbhid hid dm_mod raid10 raid1 md_mod
> sd_mod crc32c_intel ahci i2c_i801 libahci
> aesni_intel xhci_pci aes_x86_64 ehci_pci libata crypto_simd
> xhci_hcd ehci_hcd cryptd glue_helper e1000e scsi_mod ptp usbcore
> pps_core usb_common fan thermal
> May  2 07:15:39 node0 kernel: [1168145.689057] CPU: 6 PID: 3763 Comm:
> kworker/6:2 Not tainted 4.15.0-0.bpo.2-amd64 #1 Debian 4.15.11-1~bpo9+1
> May  2 07:15:39 node0 kernel: [1168145.690347] Hardware name: LENOVO
> ThinkServer TS140/ThinkServer TS140, BIOS FBKTB3AUS 06/16/2015
> May  2 07:15:39 node0 kernel: [1168145.691659] Workqueue: dio/dm-0
> dio_aio_complete_work
> May  2 07:15:39 node0 kernel: [1168145.692935] RIP:
> 0010:dio_complete+0x1d6/0x220
> May  2 07:15:39 node0 kernel: [1168145.694275] RSP:
> 0018:9abc68447e50 EFLAGS: 

[PATCH v4 0/3] btrfs: balance: improve kernel logs

2018-05-23 Thread Anand Jain
v3->v4:
 Please refer to the individual patches.

Based on misc-next.

v2->v3:
  Inspired by describe_relocation(); improve it, make it a helper
  function, and use it to log the balance operations.

Kernel logs are very important for forensic investigation of issues;
these patches make balance logs easy to review.

Anand Jain (3):
  btrfs: add helper function describe_block_group()
  btrfs: balance: add args info during start and resume
  btrfs: balance: add kernel log for end or paused

 fs/btrfs/relocation.c |  30 ++-
 fs/btrfs/volumes.c| 146 --
 fs/btrfs/volumes.h|   1 +
 3 files changed, 147 insertions(+), 30 deletions(-)

-- 
2.7.0



Re: [PATCH v3 1/3] btrfs: add helper function describe_block_group()

2018-05-23 Thread Anand Jain



On 05/22/2018 08:26 PM, David Sterba wrote:

On Mon, May 21, 2018 at 02:37:43PM +0800, Anand Jain wrote:

Improve on describe_relocation(): add a common helper function to
describe block groups.

Signed-off-by: Anand Jain 
---
v3: Born.

  fs/btrfs/relocation.c | 30 +++---
  fs/btrfs/volumes.c| 44 
  fs/btrfs/volumes.h|  1 +
  3 files changed, 48 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 879b76fa881a..b9b32b0e4ba4 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4320,37 +4320,13 @@ static struct reloc_control *alloc_reloc_control(void)
  static void describe_relocation(struct btrfs_fs_info *fs_info,
struct btrfs_block_group_cache *block_group)
  {
-   char buf[128];  /* prefixed by a '|' that'll be dropped */
-   u64 flags = block_group->flags;
+   char buf[128];
  
-	/* Shouldn't happen */

-   if (!flags) {
-   strcpy(buf, "|NONE");
-   } else {
-   char *bp = buf;
-
-#define DESCRIBE_FLAG(f, d) \
-   if (flags & BTRFS_BLOCK_GROUP_##f) { \
-   bp += snprintf(bp, buf - bp + sizeof(buf), "|%s", d); \
-   flags &= ~BTRFS_BLOCK_GROUP_##f; \
-   }
-   DESCRIBE_FLAG(DATA, "data");
-   DESCRIBE_FLAG(SYSTEM,   "system");
-   DESCRIBE_FLAG(METADATA, "metadata");
-   DESCRIBE_FLAG(RAID0,"raid0");
-   DESCRIBE_FLAG(RAID1,"raid1");
-   DESCRIBE_FLAG(DUP,  "dup");
-   DESCRIBE_FLAG(RAID10,   "raid10");
-   DESCRIBE_FLAG(RAID5,"raid5");
-   DESCRIBE_FLAG(RAID6,"raid6");
-   if (flags)
-   snprintf(bp, buf - bp + sizeof(buf), "|0x%llx", flags);
-#undef DESCRIBE_FLAG
-   }
+   describe_block_groups(block_group->flags, buf, sizeof(buf));
  
  	btrfs_info(fs_info,

   "relocating block group %llu flags %s",
-  block_group->key.objectid, buf + 1);
+  block_group->key.objectid, buf);
  }
  
  /*

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 43bd65395106..02b1d14f510d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -126,6 +126,50 @@ const char *get_raid_name(enum btrfs_raid_types type)
return btrfs_raid_array[type].raid_name;
  }
  
+u32 describe_block_groups(u64 bg_flags, char *describe, u32 sz)

+{
+   int i;
+   u32 ret;
+   char *dp = describe;
+   u64 flags = bg_flags;
+
+   if (!flags)
+   return snprintf(dp, sz, "%s", "NONE");
+
+#define DESCRIBE_FLAG(f, d) \
+   do { \
+   if (flags & BTRFS_BLOCK_GROUP_##f) { \
+   dp += snprintf(dp, describe - dp + sz, "%s|", d); \
+   flags &= ~BTRFS_BLOCK_GROUP_##f; \
+   } \
+   } while (0)
+   DESCRIBE_FLAG(DATA, "data");
+   DESCRIBE_FLAG(SYSTEM,   "system");
+   DESCRIBE_FLAG(METADATA, "metadata");
+#undef DESCRIBE_FLAG


Adding the convenience macro for 3 uses and then opencoding it 2 times
does not make much sense to me. Why don't you change that to use the 1st
argument as the exact flag to mask out and the 2nd argument as the
string to print?



+
+   if (flags & BTRFS_AVAIL_ALLOC_BIT_SINGLE) {
+   dp += snprintf(dp, describe - dp + sz,
+  "%s|", "single");
+   flags &= ~BTRFS_AVAIL_ALLOC_BIT_SINGLE;
+   }
+
+   for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {


And this would work inside the loop too.


fixed it.



+   if (flags & btrfs_raid_array[i].bg_flag) {
+   dp += snprintf(dp, describe - dp + sz,
+  "%s|", btrfs_raid_array[i].raid_name);
+   flags &= ~btrfs_raid_array[i].bg_flag;
+   }
+   }
+
+   if (flags)
+   dp += snprintf(dp, describe - dp + sz, "0x%llx|", flags);
+
+   ret = dp - describe - 1;
+   describe[ret] = '\0'; /* remove last | */
+   return ret;
+}
+
  static int init_first_rw_device(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info);
  static int btrfs_relocate_sys_chunks(struct btrfs_fs_info *fs_info);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 5139ec8daf4c..52ac258daa17 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -433,6 +433,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, 
const char *path);
  int btrfs_balance(struct btrfs_fs_info *fs_info,
  struct btrfs_balance_control *bctl,
  struct btrfs_ioctl_balance_args *bargs);
+u32 describe_block_groups(u64 flags, char *describe, u32 sz);


Please prefix an exported function with btrfs_ and don't use the 

Re: [PATCH v2] btrfs: handle failures of set_extent_bits in add_excluded_extent

2018-05-23 Thread Nikolay Borisov


On 23.05.2018 06:37, Gu Jinxiang wrote:
> set_extent_bits may fail, return the result in add_excluded_extent.
> 
> Signed-off-by: Gu Jinxiang 

Reviewed-by: Nikolay Borisov 

> Changelog:
> v1->v2:
>  1. Remove goto to make the function run linearly.
>  2. Change the commit description to not point out the failure detail,
> since more failure types may be added to set_extent_bits.

The changelog should really go under the scissors line but no need to
resend.
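
For reference, 'under the scissors line' here means after the '---'
separator, which git am treats as the end of the commit message, so a
changelog placed there is dropped when the patch is applied. A layout
sketch:

  set_extent_bits may fail; return the result in add_excluded_extent.

  Signed-off-by: ...
  ---
  v1->v2: changelog goes here
   fs/btrfs/extent-tree.c | 10 +++---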

> ---
>  fs/btrfs/extent-tree.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 75cfb80d2551..65ef3456fa62 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -215,11 +215,15 @@ static int add_excluded_extent(struct btrfs_fs_info 
> *fs_info,
>  u64 start, u64 num_bytes)
>  {
>   u64 end = start + num_bytes - 1;
> - set_extent_bits(&fs_info->freed_extents[0],
> + int ret = 0;
> +
> + ret = set_extent_bits(&fs_info->freed_extents[0],
>   start, end, EXTENT_UPTODATE);
> - set_extent_bits(&fs_info->freed_extents[1],
> + if (ret)
> + return ret;
> + ret = set_extent_bits(&fs_info->freed_extents[1],
>   start, end, EXTENT_UPTODATE);
> - return 0;
> + return ret;
>  }
>  
>  static void free_excluded_extents(struct btrfs_fs_info *fs_info,
> 
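
As a side note, the same fix can be written without initializing ret by
returning the second call directly; a sketch with identical behavior:

static int add_excluded_extent(struct btrfs_fs_info *fs_info,
			       u64 start, u64 num_bytes)
{
	u64 end = start + num_bytes - 1;
	int ret;

	ret = set_extent_bits(&fs_info->freed_extents[0],
			      start, end, EXTENT_UPTODATE);
	if (ret)
		return ret;
	return set_extent_bits(&fs_info->freed_extents[1],
			       start, end, EXTENT_UPTODATE);
}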


Re: [PATCH v3 2/3] btrfs: balance: add args info during start and resume

2018-05-23 Thread Anand Jain



On 05/22/2018 08:35 PM, David Sterba wrote:

On Mon, May 21, 2018 at 02:37:44PM +0800, Anand Jain wrote:

Balance args are important information for a system audit, so this
patch adds them to the kernel log.

Example:

-> btrfs bal start -dprofiles='raid1|single',convert=raid5 
-mprofiles='raid1|single',convert=raid5 /btrfs
  kernel: BTRFS info (device sdb): balance: start 

Re: [PATCH 2/2] Btrfs: fix dedupe vs chattr NODATASUM race

2018-05-23 Thread Nikolay Borisov


On 23.05.2018 01:02, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> In btrfs_extent_same(), we must check the NODATASUM flag while the
> inodes are locked. Otherwise, it's possible that btrfs_ioctl_setflags()
> will change the flags after we check. This was correct until a recent
> change.
> 
> Fixes: 0a38effca37c ("Btrfs: split btrfs_extent_same")
> Signed-off-by: Omar Sandoval 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/ioctl.c | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 784e267aad32..c2263bf4d6f5 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3586,12 +3586,6 @@ static int btrfs_extent_same(struct inode *src, u64 
> loff, u64 olen,
>   if (olen == 0)
>   return 0;
>  
> - /* don't make the dst file partly checksummed */
> - if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
> - (BTRFS_I(dst)->flags & BTRFS_INODE_NODATASUM)) {
> - return -EINVAL;
> - }
> -
>   tail_len = olen % BTRFS_MAX_DEDUPE_LEN;
>   chunk_count = div_u64(olen, BTRFS_MAX_DEDUPE_LEN);
>   if (chunk_count == 0)
> @@ -3624,6 +3618,13 @@ static int btrfs_extent_same(struct inode *src, u64 
> loff, u64 olen,
>   else
>   btrfs_double_inode_lock(src, dst);
>  
> + /* don't make the dst file partly checksummed */
> + if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
> + (BTRFS_I(dst)->flags & BTRFS_INODE_NODATASUM)) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
>   for (i = 0; i < chunk_count; i++) {
>   ret = btrfs_extent_same_range(src, loff, BTRFS_MAX_DEDUPE_LEN,
> dst, dst_loff, &cmp);
> 
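
To make the race window concrete, here is a schematic sketch of the
control flow before and after the fix; the helpers are illustrative
stand-ins, not kernel functions:

/* Before: flags checked before the inode locks are taken. */
if (nodatasum_mismatch(src, dst))	/* hypothetical helper */
	return -EINVAL;
/* ...btrfs_ioctl_setflags() can flip NODATASUM right here... */
lock_inodes(src, dst);
dedupe_ranges(src, dst);	/* may mix csummed and un-csummed extents */

/* After: check under the lock; setflags also takes the inode lock, so
 * the flags cannot change until we unlock. */
lock_inodes(src, dst);
if (nodatasum_mismatch(src, dst)) {
	ret = -EINVAL;
	goto out;
}
dedupe_ranges(src, dst);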


Re: [PATCH 1/2] Btrfs: fix clone vs chattr NODATASUM race

2018-05-23 Thread Nikolay Borisov


On 23.05.2018 01:02, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> In btrfs_clone_files(), we must check the NODATASUM flag while the
> inodes are locked. Otherwise, it's possible that btrfs_ioctl_setflags()
> will change the flags after we check, and we can end up with a partly
> checksummed file.
> 
> Fixes: 0e7b824c4ef9 ("Btrfs: don't make a file partly checksummed through 
> file clone")
> Signed-off-by: Omar Sandoval 

Reviewed-by: Nikolay Borisov 
> ---
>  fs/btrfs/ioctl.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index cf0d3bc6f625..784e267aad32 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -4280,11 +4280,6 @@ static noinline int btrfs_clone_files(struct file 
> *file, struct file *file_src,
>   src->i_sb != inode->i_sb)
>   return -EXDEV;
>  
> - /* don't make the dst file partly checksummed */
> - if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
> - (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
> - return -EINVAL;
> -
>   if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
>   return -EISDIR;
>  
> @@ -4294,6 +4289,13 @@ static noinline int btrfs_clone_files(struct file 
> *file, struct file *file_src,
>   inode_lock(src);
>   }
>  
> + /* don't make the dst file partly checksummed */
> + if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
> + (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
> + ret = -EINVAL;
> + goto out_unlock;
> + }
> +
>   /* determine range to clone */
>   ret = -EINVAL;
>   if (off + len > src->i_size || off + len < off)
> 