Re: [PATCH] fstests: generic: Check if cycle mount and sleep can affect fiemap result

2017-04-06 Thread Eryu Guan
On Thu, Apr 06, 2017 at 11:28:01AM -0500, Eric Sandeen wrote:
> On 4/6/17 11:26 AM, Theodore Ts'o wrote:
> > On Wed, Apr 05, 2017 at 10:35:26AM +0800, Eryu Guan wrote:
> >>
> >> Test fails with ext3/2 when driven by the ext4 driver: fiemap changed
> >> after an umount/mount cycle, then changed back to the original result
> >> after sleeping for some time. An ext4 bug? (cc'ed linux-ext4 list.)
> > 
> > I haven't had time to look at this, but I'm not sure this test is a
> > reasonable one on the face of it.
> > 
> > A file system may choose to optimize a file's extent tree for whatever
> > reason it wants, whenever it wants, including on an unmount --- and
> > that would not be an invalid thing to do.  So to have an xfstests that
> > causes a test failure if a file system were to, say, do some cleanup
> > at mount or unmount time, or when the file is next opened, to merge
> > adjacent extents together (and hence change what is returned by
> > FIEMAP) might be strange, or even weird --- but is this any of user
> > space's business?  Or anything we want to enforce as wrong wrong wrong
> > by xfstests?

So I was asking for a review from the ext4 side instead of queueing it
for the next xfstests update :)

> 
> I had the same question.  If the exact behavior isn't defined anywhere,
> I don't know what we can be testing, TBH.

Agreed, I was about to ask for the expected behavior today if there were
no new review comments on this patch.

Thanks for the comments and review!

Eryu


Re: Is btrfs-convert able to deal with sparse files in a ext4 filesystem?

2017-04-06 Thread Duncan
Andrei Borzenkov posted on Sun, 02 Apr 2017 09:30:46 +0300 as excerpted:

> 02.04.2017 03:59, Duncan wrote:
>> 
>> 4) In fact, since an in-place convert is almost certainly going to take
>> more time than a blow-away and restore from backup,
> 
> This caught my eye. Why? An in-place convert just needs to recreate
> metadata. If you have multiple terabytes' worth of data, copying it twice
> can hardly be faster.

Why twice?  If you care about the data, by definition you already have
backups, so it's only copying back from those backups to the newly
created filesystem, right?

And if you don't have backups, then by definition you don't care about
the data, at least not enough to be worth the hassle of a backup, and thus
arguably not enough to be worth the hassle of a convert. So just blow it
away with a new mkfs and start from scratch; you self-evidently didn't
care enough about the data for it to be worth a backup anyway, no
problem.

And actually, it's not even a single extra copy beyond what you'd do
anyway, if you schedule the new filesystem's creation as part of your
normal backup regime in place of what would otherwise be a full backup
that you now don't have to make. Copying the data from the old
filesystem to the new one then simply replaces the full backup you'd
otherwise be doing at the same point in time.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Volume appears full but TB's of space available

2017-04-06 Thread John Petrini
Interesting. That's the first time I'm hearing this. If that's the
case, I feel like it's a stretch to call it RAID10 at all. It sounds a
lot more like basic replication, similar to Ceph, only Ceph understands
failure domains and can therefore be configured to handle device
failure (albeit at a higher level).

I do of course keep backups, but I chose RAID10 for the mix of
performance and reliability. It doesn't seem worth losing 50% of
my usable space for the performance gain alone.

Thank you for letting me know about this. Knowing that, I think I may
have to reconsider my choice here. I've really been enjoying the
flexibility of BTRFS, which is why I switched to it in the first place,
but with experimental RAID5/6 and what you've just told me, I'm
beginning to doubt that it's the right choice.

What's more concerning is that I haven't found a good way to monitor
BTRFS. I might be able to accept that the array can only handle a
single drive failure if I were confident that I could detect it, but so
far I haven't found a good solution for this.
___

John Petrini



On Thu, Apr 6, 2017 at 10:42 PM, Chris Murphy  wrote:
> On Thu, Apr 6, 2017 at 7:31 PM, John Petrini  wrote:
>> Hi Chris,
>>
>> I've followed your advice and converted the system chunk to raid10. I
>> hadn't noticed it was raid0 and it's scary to think that I've been
>> running this array for three months like that. Thank you for saving me
>> a lot of pain down the road!
>
> For what it's worth, it is imperative to keep frequent backups with
> Btrfs raid10; it is in some ways more like raid0+1. It can only
> tolerate the loss of a single device. It will continue to function
> with 2+ devices missing in a very deceptive degraded state, until it
> inevitably hits dual missing chunks of metadata or data, and then it
> will faceplant. And then you'll be looking at a scrape operation.
>
> It's not like raid10 where you can lose one of each mirrored pair.
> Btrfs raid10 mirrors chunks, not drives. So your metadata and data are
> all distributed across all of the drives, and that in effect means you
> can only lose 1 drive. If you lose a 2nd drive, some amount of
> metadata and data will have been lost.
>
>
> --
> Chris Murphy


Re: mix ssd and hdd in single volume

2017-04-06 Thread Duncan
Roman Mamedov posted on Mon, 03 Apr 2017 13:41:07 +0500 as excerpted:

> On Mon, 3 Apr 2017 11:30:44 +0300 Marat Khalili  wrote:
> 
>> You may want to look here: https://www.synology.com/en-global/dsm/Btrfs
>> . Somebody forgot to tell Synology, which already supports btrfs in all
>> hardware-capable devices. I think the Rubicon has been crossed in
>> 'mass-market NAS[es]', for good or not.
> 
> AFAIR Synology did not come to this list asking for (any kind of) advice
> prior to implementing that (else they would have gotten the same kind of
> post from Duncan and others).  I don't remember seeing them actively
> contribute improvements or fixes especially for the RAID5 or RAID6
> features (which they ADVERTISE on that page as a fully working part of
> their product).

> That doesn't seem honest to end users or playing nicely with the
> upstream developers. What the upstream gets instead is just those
> end-users coming here one by one some years later, asking how to fix
> a broken Btrfs RAID5 on an embedded box running some 3.10 or 3.14
> kernel.

And of course then the user gets the real state of btrfs and of btrfs
raid56 mode, particularly back that far, explained to them.  Along with
that, we'll explain that any data on it is in all likelihood lost,
with little to no chance of recovery.

And we'll point out that if there was serious value in the data, they 
would have investigated the state of the filesystem before they put that 
data on it, and of course, as I've already said, they'd have had backups 
for anything that was of more value than the time/resources/hassle of 
doing those backups.

And if they're lucky, that NAS will have /been/ the backup, and they'll 
still have the actual working copy at least, and can make another backup 
ASAP just in case that working copy dies too.

But if they're unlucky...

Of course the user will then blame the manufacturer, but by that time the 
warranty will be up, and even if not, while they /might/ get their money 
back, that won't get their data back.

And the manufacturer will get a bad name, but by then, having taken the
money and run, they'll be on to something else, or perhaps have been
bought out by someone bigger, or be out of business.

And all the user will be able to do is chalk it up to experience, and 
mourn the loss of their kids' baby pictures/videos or their wedding 
videos, or whatever.  If they're /really/ lucky, they'll have put them on 
facebook or youtube or whatever, and can retrieve at least those, from 
there.

Meanwhile, the user, having been once burned, may never use the by then
much improved btrfs, or even worse, never trust anything Linux again.

Oh, well.  The best we can do here is warn those that /do/ value their
data enough to do their research first, so they /do/ have those backups
or at least use something a bit more mature than btrfs raid56 mode.
That, and of course continue to work on full btrfs stabilization.  And
I like to think we're reasonably good at those warnings, anyway.  The
stabilization, too, but that takes time and patience, plus coder skill,
the last of which I personally don't have, so I just pitch in where
I can, answering questions, etc.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



[PATCH v3] btrfs: fiemap: Cache and merge fiemap extent before submit it to user

2017-04-06 Thread Qu Wenruo
[BUG]
Cycle mounting btrfs can cause fiemap to return a different result.
Like:
 # mount /dev/vdb5 /mnt/btrfs
 # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
 # xfs_io -c "fiemap -v" /mnt/btrfs/file
 /mnt/btrfs/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        25088..25215       128   0x1
 # umount /mnt/btrfs
 # mount /dev/vdb5 /mnt/btrfs
 # xfs_io -c "fiemap -v" /mnt/btrfs/file
 /mnt/btrfs/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..31]:         25088..25119        32   0x0
   1: [32..63]:        25120..25151        32   0x0
   2: [64..95]:        25152..25183        32   0x0
   3: [96..127]:       25184..25215        32   0x1
But after the above fiemap, we get the correct merged result if we call
fiemap again.
 # xfs_io -c "fiemap -v" /mnt/btrfs/file
 /mnt/btrfs/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        25088..25215       128   0x1

[REASON]
Btrfs will try to merge extent map when inserting new extent map.

btrfs_fiemap(start=0 len=(u64)-1)
|- extent_fiemap(start=0 len=(u64)-1)
   |- get_extent_skip_holes(start=0 len=64k)
   |  |- btrfs_get_extent_fiemap(start=0 len=64k)
   | |- btrfs_get_extent(start=0 len=64k)
   ||  Found on-disk (ino, EXTENT_DATA, 0)
   ||- add_extent_mapping()
   ||- Return (em->start=0, len=16k)
   |
   |- fiemap_fill_next_extent(logic=0 phys=X len=16k)
   |
   |- get_extent_skip_holes(start=0 len=64k)
   |  |- btrfs_get_extent_fiemap(start=0 len=64k)
   | |- btrfs_get_extent(start=16k len=48k)
   ||  Found on-disk (ino, EXTENT_DATA, 16k)
   ||- add_extent_mapping()
   ||  |- try_merge_map()
   || Merge with previous em start=0 len=16k
   || resulting em start=0 len=32k
   ||- Return (em->start=0, len=32K)<< Merged result
   |- Strip off the unrelated range (0~16K) of the returned em
   |- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
  ^^^ Causing split fiemap extent.

And since the em is already merged in add_extent_mapping(), the next
fiemap() call will get the merged result.

[FIX]
Here we introduce a new structure, fiemap_cache, which records the
previous fiemap extent.

We always try to merge the current fiemap extent with the cached one
before calling fiemap_fill_next_extent().
Only when we fail to merge the current fiemap extent with the cached one
do we call fiemap_fill_next_extent() to submit the cached one.

This way, we can merge all fiemap extents.

It could also be done in fs/ioctl.c; however, the problem is that if
fieinfo->fi_extents_max == 0, we have no space to cache the previous
fiemap extent.
So I chose to merge it in btrfs.
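
For reference, when fieinfo->fi_extents_max == 0, fiemap only counts
extents and copies nothing to user space, so there is no buffer behind
fieinfo in which a cached entry could be staged. A rough sketch of that
path in fs/ioctl.c (approximate, reconstructed from the kernel of that
era, not part of this patch):

	int fiemap_fill_next_extent(struct fiemap_extent_info *fieinfo,
				    u64 logical, u64 phys, u64 len, u32 flags)
	{
		/*
		 * Caller only wants the number of extents: count it and
		 * return 1 on the last extent, without copying anything
		 * to user space.
		 */
		if (fieinfo->fi_extents_max == 0) {
			fieinfo->fi_extents_mapped++;
			return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
		}

		/* ... otherwise copy a struct fiemap_extent to user space ... */
	}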

Signed-off-by: Qu Wenruo 
---
v2:
  Since fiemap_extent_info has a limit on the number of fiemap_extents,
  it's possible that fiemap_fill_next_extent() returns 1 halfway.
  Remove the WARN_ON() which can cause a kernel warning if fiemap is
  called on a large compressed file.
v3:
  Rename finish_fiemap_extent() to check_fiemap_extent(), as in v3 we
  ensured submit_fiemap_extent() submits the fiemap cache, so it just
  acts as a sanity check.
  Remove the BTRFS_MAX_EXTENT_SIZE limit in submit_fiemap_extent(), as
  an extent map can be larger than BTRFS_MAX_EXTENT_SIZE.
  Don't do a backward jump, as suggested by David.
  Better sanity checks and a recoverable fix.

To David:
  What about adding a btrfs_debug_warn(), which will only call WARN_ON(1)
  if BTRFS_CONFIG_DEBUG is specified, for recoverable bugs?

  And modify ASSERT() to always WARN_ON() and return an error code?
---
 fs/btrfs/extent_io.c | 124 ++-
 1 file changed, 122 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 28e8192..c4cb65d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4353,6 +4353,123 @@ static struct extent_map *get_extent_skip_holes(struct inode *inode,
return NULL;
 }
 
+/*
+ * Cache the previous fiemap extent.
+ *
+ * Used for merging fiemap extents before submitting them to user space.
+ */
+struct fiemap_cache {
+   u64 offset;
+   u64 phys;
+   u64 len;
+   u32 flags;
+   bool cached;
+};
+
+/*
+ * Helper to submit a fiemap extent.
+ *
+ * Will try to merge the current fiemap extent, specified by @offset, @phys,
+ * @len and @flags, with the cached one.
+ * Only when merging fails will the cached one be submitted as a
+ * fiemap extent.
+ *
+ * Return value is the same as fiemap_fill_next_extent().
+ */
+static int submit_fiemap_extent(struct fiemap_extent_info *fieinfo,
+   struct fiemap_cache *cache,
+   u64 offset, u64 phys, u64 len, u32 flags)
+{
+   int ret = 0;
+
+   if (!cache->cached)
+   goto assign;
+
+   /*
+    * Sanity check: extent_fiemap() should have ensured that the new
+    * fiemap extent won't overlap with the cached one.
+    * Not recoverable.
+    *
+    * NOTE: Physical addresses can overlap, due to compression
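
The rest of the patch body is cut off at this point in the digest. As a
sketch only, reconstructed from the commit message above rather than
copied from the actual patch, the merge-or-flush logic of
submit_fiemap_extent() amounts to something like:

	static int submit_fiemap_extent(struct fiemap_extent_info *fieinfo,
					struct fiemap_cache *cache,
					u64 offset, u64 phys, u64 len, u32 flags)
	{
		int ret = 0;

		/* Sketch: reconstructed from the commit message, not the patch. */
		if (!cache->cached)
			goto assign;

		/*
		 * Merge only when the new extent starts exactly where the cached
		 * one ends, both logically and physically, and the flags match.
		 */
		if (cache->offset + cache->len == offset &&
		    cache->phys + cache->len == phys &&
		    cache->flags == flags) {
			cache->len += len;
			return 0;
		}

		/* Cannot merge: flush the cached extent to user space first. */
		ret = fiemap_fill_next_extent(fieinfo, cache->offset, cache->phys,
					      cache->len, cache->flags);
		cache->cached = false;
		if (ret)
			return ret;

	assign:
		/* Start caching the new extent; it is submitted (or merged) later. */
		cache->cached = true;
		cache->offset = offset;
		cache->phys = phys;
		cache->len = len;
		cache->flags = flags;
		return 0;
	}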

Re: Volume appears full but TB's of space available

2017-04-06 Thread Chris Murphy
On Thu, Apr 6, 2017 at 7:31 PM, John Petrini  wrote:
> Hi Chris,
>
> I've followed your advice and converted the system chunk to raid10. I
> hadn't noticed it was raid0 and it's scary to think that I've been
> running this array for three months like that. Thank you for saving me
> a lot of pain down the road!

For what it's worth, it is imperative to keep frequent backups with
Btrfs raid10; it is in some ways more like raid0+1. It can only
tolerate the loss of a single device. It will continue to function
with 2+ devices missing in a very deceptive degraded state, until it
inevitably hits dual missing chunks of metadata or data, and then it
will faceplant. And then you'll be looking at a scrape operation.

It's not like raid10 where you can lose one of each mirrored pair.
Btrfs raid10 mirrors chunks, not drives. So your metadata and data are
all distributed across all of the drives, and that in effect means you
can only lose 1 drive. If you lose a 2nd drive, some amount of
metadata and data will have been lost.


-- 
Chris Murphy


Re: Volume appears full but TB's of space available

2017-04-06 Thread John Petrini
Hi Chris,

I've followed your advice and converted the system chunk to raid10. I
hadn't noticed it was raid0 and it's scary to think that I've been
running this array for three months like that. Thank you for saving me
a lot of pain down the road!

Also thank you for the clarification on the output - this is making a
lot more sense.

Regards,

John Petrini


Re: Volume appears full but TB's of space available

2017-04-06 Thread Chris Murphy
On Thu, Apr 6, 2017 at 7:15 PM, John Petrini  wrote:
> Okay so I came across this bug report:
> https://bugzilla.redhat.com/show_bug.cgi?id=1243986
>
> It looks like I'm just misinterpreting the output of btrfs fi df. What
> should I be looking at to determine the actual free space? Is Free
> (estimated):   13.83TiB (min: 13.83TiB) the proper metric?
>
> Simply running df does not seem to report the usage properly
>
> /dev/sdj  25T   11T  5.9T  65% /mnt/storage-array


Free should be correct. And df -h should use IEC units, so I'd expect
it to be closer to the value from btrfs fi us than this. But the code
has changed over time; I'm not sure when the last adjustment was made.

-- 
Chris Murphy


Re: Volume appears full but TB's of space available

2017-04-06 Thread Chris Murphy
On Thu, Apr 6, 2017 at 6:47 PM, John Petrini  wrote:

> sudo btrfs fi df /mnt/storage-array/
> Data, RAID10: total=10.72TiB, used=10.72TiB
> System, RAID0: total=128.00MiB, used=944.00KiB
> Metadata, RAID10: total=14.00GiB, used=12.63GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B

The System line is kinda scary. The system chunk is raid0, so ostensibly a
single device failure means the entire array is lost.

The fastest way to fix it is:

btrfs balance start -mconvert=raid10,soft 

That will make the system chunk raid10.


>
> sudo btrfs fi usage /mnt/storage-array/
> Overall:
> Device size:   49.12TiB
> Device allocated:   21.47TiB
> Device unallocated:   27.65TiB
> Device missing:  0.00B
> Used:   21.45TiB
> Free (estimated):   13.83TiB (min: 13.83TiB)
> Data ratio:   2.00
> Metadata ratio:   2.00
> Global reserve:  512.00MiB (used: 0.00B)
>
> Data,RAID10: Size:10.72TiB, Used:10.71TiB

This is saying you have 10.72T of data. But because it's raid10, it
will take up 2x that much space: 2 x 10.72TiB is about 21.44TiB. This
is what's reflected by the Overall: Used: value of 21.45T, plus some
extra for metadata, which is also 2x.




-- 
Chris Murphy


Re: Volume appears full but TB's of space available

2017-04-06 Thread John Petrini
Okay so I came across this bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1243986

It looks like I'm just misinterpreting the output of btrfs fi df. What
should I be looking at to determine the actual free space? Is Free
(estimated):   13.83TiB (min: 13.83TiB) the proper metric?

Simply running df does not seem to report the usage properly

/dev/sdj  25T   11T  5.9T  65% /mnt/storage-array

Thank you,

John Petrini


Re: [PATCH] btrfs: add missing memset while reading compressed inline extents

2017-04-06 Thread Qu Wenruo



At 04/07/2017 12:07 AM, Filipe Manana wrote:

On Wed, Mar 22, 2017 at 2:37 AM, Qu Wenruo  wrote:



At 03/09/2017 10:05 AM, Zygo Blaxell wrote:


On Wed, Mar 08, 2017 at 10:27:33AM +, Filipe Manana wrote:


On Wed, Mar 8, 2017 at 3:18 AM, Zygo Blaxell
 wrote:


From: Zygo Blaxell 

This is a story about 4 distinct (and very old) btrfs bugs.

Commit c8b978188c ("Btrfs: Add zlib compression support") added
three data corruption bugs for inline extents (bugs #1-3).

Commit 93c82d5750 ("Btrfs: zero page past end of inline file items")
fixed bug #1:  uncompressed inline extents followed by a hole and more
extents could get non-zero data in the hole as they were read.  The fix
was to add a memset in btrfs_get_extent to zero out the hole.

Commit 166ae5a418 ("btrfs: fix inline compressed read err corruption")
fixed bug #2:  compressed inline extents which contained non-zero bytes
might be replaced with zero bytes in some cases.  This patch removed an
unhelpful memset from uncompress_inline, but the case where memset is
required was missed.

There is also a memset in the decompression code, but this only covers
decompressed data that is shorter than the ram_bytes from the extent
ref record.  This memset doesn't cover the region between the end of the
decompressed data and the end of the page.  It has also moved around a
few times over the years, so there's no single patch to refer to.

This patch fixes bug #3:  compressed inline extents followed by a hole
and more extents could get non-zero data in the hole as they were read
(i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/).
The fix is the same:  zero out the hole in the compressed case too,
by putting a memset back in uncompress_inline, but this time with
correct parameters.

The last and oldest bug, bug #0, is the cause of the offending inline
extent/hole/extent pattern.  Bug #0 is a subtle and mostly-harmless
quirk
of behavior somewhere in the btrfs write code.  In a few special cases,
an inline extent and hole are allowed to persist where they normally
would be combined with later extents in the file.

A fast reproducer for bug #0 is presented below.  A few offending
extents
are also created in the wild during large rsync transfers with the -S
flag.  A Linux kernel build (git checkout; make allyesconfig; make -j8)
will produce a handful of offending files as well.  Once an offending
file is created, it can present different content to userspace each
time it is read.

Bug #0 is at least 4 and possibly 8 years old.  I verified every vX.Y
kernel back to v3.5 has this behavior.  There are fossil records of this
bug's effects in commits all the way back to v2.6.32.  I have no reason
to believe bug #0 wasn't present at the beginning of btrfs compression
support in v2.6.29, but I can't easily test kernels that old to be sure.

It is not clear whether bug #0 is worth fixing.  A fix would likely
require injecting extra reads into currently write-only paths, and most
of the exceptional cases caused by bug #0 are already handled now.

Whether we like them or not, bug #0's inline extents followed by holes
are part of the btrfs de-facto disk format now, and we need to be able
to read them without data corruption or an infoleak.  So enough about
bug #0, let's get back to bug #3 (this patch).

An example of on-disk structure leading to data corruption:

item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
inode generation 50 transid 50 size 47424 nbytes 49141
block group 0 mode 100644 links 1 uid 0 gid 0
rdev 0 flags 0x0(none)
item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
inode ref index 3 namelen 10 name: DB_File.so
item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
inline extent data size 1341 ram 4085 compress(zlib)
item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
extent data disk byte 5367308288 nr 20480
extent data offset 0 nr 45056 ram 45056
extent compression(zlib)



So this case is actually different from the reproducer below, because
once a file has prealloc extents, future writes will never be
compressed. That is, the extent at offset 4096 can not have compressed
content using steps like the reproducer below. I can only imagine that
after an falloc to create the extent at 4K, we cloned some compressed
extent from some other file into that offset.



The above comes from one of my captures of the bug appearing in the wild,
not the reproducer from Liu Bo.  I'll make that clearer.



In fact, I found that btrfs/137 with compress=lzo seems to trigger such
an "inline-then-regular" case with very high probability.

If I abort the test just after populating the fs, I find the probability
of having an inline-then-regular extent layout is over 70%.


100%, as explained in your fstests patch thread.



Although in the btrfs/137 case, it's "inline-then-hole"

Re: [PATCH] fstests: generic: Check if cycle mount and sleep can affect fiemap result

2017-04-06 Thread Qu Wenruo



At 04/07/2017 12:28 AM, Eric Sandeen wrote:

On 4/6/17 11:26 AM, Theodore Ts'o wrote:

On Wed, Apr 05, 2017 at 10:35:26AM +0800, Eryu Guan wrote:


Test fails with ext3/2 when driven by the ext4 driver: fiemap changed
after an umount/mount cycle, then changed back to the original result
after sleeping for some time. An ext4 bug? (cc'ed linux-ext4 list.)


I haven't had time to look at this, but I'm not sure this test is a
reasonable one on the face of it.

A file system may choose to optimize a file's extent tree for whatever
reason it wants, whenever it wants, including on an unmount --- and
that would not be an invalid thing to do.  So to have an xfstests that
causes a test failure if a file system were to, say, do some cleanup
at mount or unmount time, or when the file is next opened, to merge
adjacent extents together (and hence change what is returned by
FIEMAP) might be strange, or even weird --- but is this any of user
space's business?  Or anything we want to enforce as wrong wrong wrong
by xfstests?


I had the same question.  If the exact behavior isn't defined anywhere,
I don't know what we can be testing, TBH.


For unmount cleanup, it's acceptable.

But if we get a different result even though we didn't modify the fs,
and fiemap still changes during the same mount, then it's at least
counter-intuitive.


And unfortunately, btrfs and ext3 share the same problem here:

=== Fiemap after cycle mount ===
/mnt/scratch/file:
 EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
   0: [0..95]: 139264..139359  96 0x1000
   1: [96..127]:   139360..139391  32 0x1000
   2: [128..511]:  138112..138495 384 0x1000
   3: [512..1023]: 138752..139263 512 0x1000
   4: [1024..2047]:   143360..144383    1024 0x1000
   5: [2048..4095]:   145408..147455    2048 0x1001
=== Fiemap after cycle mount and sleep ===
/mnt/scratch/file:
 EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
   0: [0..127]:139264..139391 128 0x1000
   1: [128..511]:  138112..138495 384 0x1000
   2: [512..1023]: 138752..139263 512 0x1000
   3: [1024..2047]:   143360..144383    1024 0x1000
   4: [2048..4095]:   145408..147455    2048 0x1001

I was using the same excuse as you guys, until I found that btrfs is
merging extent maps at read time, which causes the problem.

That's why the 2nd fiemap call in btrfs returns the merged result.

We fix btrfs by caching the fiemap extent result before calling
fiemap_fill_next_extent(), and trying to merge it with the cached result.

This fixes the problem quite easily.

Thanks,
Qu


-Eric


   - Ted








Volume appears full but TB's of space available

2017-04-06 Thread John Petrini
Hello List,

I have a volume that appears to be full despite having multiple
terabytes of free space available. Just yesterday I ran a re-balance
but it didn't change anything. I've just added two more disks to the
array and am currently in the process of another re-balance, but the
available space has not increased.

Currently I can still write to the volume (I haven't tried any large
writes) so I'm not sure if this is just a reporting issue or if writes
will eventually fail.

Any help is appreciated. Here's the details:

uname -a
Linux yuengling.johnpetrini.com 4.4.0-66-generic #87-Ubuntu SMP Fri
Mar 3 15:29:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

btrfs --version
btrfs-progs v4.4

sudo btrfs fi df /mnt/storage-array/
Data, RAID10: total=10.72TiB, used=10.72TiB
System, RAID0: total=128.00MiB, used=944.00KiB
Metadata, RAID10: total=14.00GiB, used=12.63GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

sudo btrfs fi show /mnt/storage-array/
Label: none  uuid: e113ab87-7869-4ec7-9508-95691f455018
Total devices 10 FS bytes used 10.73TiB
devid    1 size 4.55TiB used 2.65TiB path /dev/sdj
devid    2 size 4.55TiB used 2.65TiB path /dev/sdk
devid    3 size 3.64TiB used 2.65TiB path /dev/sdd
devid    4 size 3.64TiB used 2.65TiB path /dev/sdf
devid    5 size 3.64TiB used 2.65TiB path /dev/sdg
devid    6 size 3.64TiB used 2.65TiB path /dev/sde
devid    7 size 3.64TiB used 2.65TiB path /dev/sdb
devid    8 size 3.64TiB used 2.65TiB path /dev/sdc
devid    9 size 9.10TiB used 149.00GiB path /dev/sdh
devid   10 size 9.10TiB used 149.00GiB path /dev/sdi

sudo btrfs fi usage /mnt/storage-array/
Overall:
Device size:   49.12TiB
Device allocated:   21.47TiB
Device unallocated:   27.65TiB
Device missing:  0.00B
Used:   21.45TiB
Free (estimated):   13.83TiB (min: 13.83TiB)
Data ratio:   2.00
Metadata ratio:   2.00
Global reserve:  512.00MiB (used: 0.00B)

Data,RAID10: Size:10.72TiB, Used:10.71TiB
   /dev/sdb    1.32TiB
   /dev/sdc    1.32TiB
   /dev/sdd    1.32TiB
   /dev/sde    1.32TiB
   /dev/sdf    1.32TiB
   /dev/sdg    1.32TiB
   /dev/sdh   72.00GiB
   /dev/sdi   72.00GiB
   /dev/sdj    1.32TiB
   /dev/sdk    1.32TiB

Metadata,RAID10: Size:14.00GiB, Used:12.63GiB
   /dev/sdb    1.75GiB
   /dev/sdc    1.75GiB
   /dev/sdd    1.75GiB
   /dev/sde    1.75GiB
   /dev/sdf    1.75GiB
   /dev/sdg    1.75GiB
   /dev/sdj    1.75GiB
   /dev/sdk    1.75GiB

System,RAID0: Size:128.00MiB, Used:944.00KiB
   /dev/sdb   16.00MiB
   /dev/sdc   16.00MiB
   /dev/sdd   16.00MiB
   /dev/sde   16.00MiB
   /dev/sdf   16.00MiB
   /dev/sdg   16.00MiB
   /dev/sdj   16.00MiB
   /dev/sdk   16.00MiB

Unallocated:
   /dev/sdb    2.31TiB
   /dev/sdc    2.31TiB
   /dev/sdd    2.31TiB
   /dev/sde    2.31TiB
   /dev/sdf    2.31TiB
   /dev/sdg    2.31TiB
   /dev/sdh    9.03TiB
   /dev/sdi    9.03TiB
   /dev/sdj    3.22TiB
   /dev/sdk    3.22TiB

Thank You,

John Petrini


Re: [PATCH v3] fstests: btrfs: Check if btrfs will create inline-then-regular file extents

2017-04-06 Thread Qu Wenruo



At 04/07/2017 12:02 AM, Filipe Manana wrote:

On Thu, Apr 6, 2017 at 2:28 AM, Qu Wenruo  wrote:

Btrfs allows an inline file extent if and only if
1) It's at offset 0
2) It's smaller than min(max_inline, page_size)
   Although we don't specify whether the size is before or after
   compression.
   At least according to current behavior, we are only limiting the size
   after compression.
3) It's the only file extent
   So if we append to an existing inline extent, it should be converted
   to regular file extents.

However, several users on the btrfs mailing list have reported invalid
inline file extents, which only meet the first two conditions but have
regular file extents following.

The bug has been here for a long, long time, so long that we have
modified the kernel and btrfs-progs to accept such a case, but it's
still not the designed behavior, and must be detected and fixed.


So after looking at this better, I'm not convinced it's a problem.
Other than making btrfs/137 fail when compression is enabled, what
problems have you observed?


The inline-then-regular file extent layout is itself a problem.

Just check the behavior of plain btrfs.

An inline extent must be the *only* extent; hole/regular/prealloc extents
shouldn't co-exist with an inline extent.

Such an append should convert the inline extent to regular.

Furthermore, this behavior also exposes the confusion around max_inline.
If we are limiting inline extent size by its compressed size rather than
its plain size, then we're hiding a new limit: the page size.


Thanks,
Qu



btrfs/137 fails due to the incremental send code not being prepared
for this case, which does not seem harmful to me because the inline
extent represents 4096 bytes of uncompressed data. It would be a
problem only if the uncompressed data was less than 4096 bytes.

So unless there's evidence that this particular case causes problems
somewhere, I don't think it's useful to have this test.

As for btrfs/137, I'm sending a fix for the incremental send code.

More comments below anyway.



Signed-off-by: Qu Wenruo 
---
v2:
  All suggested by Eryu Guan
  Use loop_counts instead of runtime to avoid naming confusion.
  Fix whitespace issues
  Use fsync instead of sync
  Use $XFS_IO_PROG instead of calling xfs_io directly
  Always output corruption possibility into seqres.full

v3:
  All suggested by Filipe Manana
  Fix grammar errors
  Use simpler reproducer
  Add test case to compress group
---
 tests/btrfs/140 | 111 
 tests/btrfs/140.out |   2 +
 tests/btrfs/group   |   1 +
 3 files changed, 114 insertions(+)
 create mode 100755 tests/btrfs/140
 create mode 100644 tests/btrfs/140.out

diff --git a/tests/btrfs/140 b/tests/btrfs/140
new file mode 100755
index 000..183d9cd
--- /dev/null
+++ b/tests/btrfs/140
@@ -0,0 +1,111 @@
+#! /bin/bash
+# FS QA Test 140
+#
+# Check if btrfs will create inline-then-regular file layout.
+#
+# Btrfs only allows an inline file extent if the file is small enough, and any
+# incoming write that enlarges the file beyond max_inline should replace inline
+# extents with regular extents.
+# This is a long-standing bug, so fsck won't detect it and the kernel will allow
+# normal reads on it, but it should be avoided.
+#
+#---
+# Copyright (c) 2017 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_btrfs_command inspect-internal dump-tree
+
+inline_size=2048
+page_size=$(get_page_size)
+loop_counts=$(($LOAD_FACTOR * 4)) # The probability is over 80%, just in case


Sorry I forgot about this before.
But why is the probability not 100%? The code that skips the inline
extent expansion/conversion is 100% deterministic
(btrfs_file_write_iter() -> btrfs_cont_expand() ->
btrfs_truncate_block()) since the first extent always has a size of
4096 bytes.

Re: Do different btrfs volumes compete for CPU?

2017-04-06 Thread Martin
On 05/04/17 08:04, Marat Khalili wrote:
> On 04/04/17 20:36, Peter Grandi wrote:
>> SATA works for external use, eSATA works well, but what really
>> matters is the chipset of the adapter card.
> eSATA might be sound electrically, but mechanically it is awful. Try to
> run it for months in a crowded server room, and inevitably you'll get
> disconnections and data corruption. Tried different cables, brackets --
> same result. If you ever used an eSATA connector, you'd feel it.

Been using eSATA here for multiple disk packs continuously connected for
a few years now for 48TB of data (not enough room in the host for the
disks).

Never suffered an eSATA disconnect.

Had the usual cooling fan failures and HDD failures due to old age.


All just a case of ensuring undisturbed clean cabling and a good UPS?...

(BTRFS spanning four disks per external pack has worked well also.)

Good luck,
Martin




Re: [PATCH 7/8] nowait aio: xfs

2017-04-06 Thread Darrick J. Wong
On Mon, Apr 03, 2017 at 11:52:11PM -0700, Christoph Hellwig wrote:
> > +   if (unaligned_io) {
> > +   /* If we are going to wait for other DIO to finish, bail */
> > +   if ((iocb->ki_flags & IOCB_NOWAIT) &&
> > +atomic_read(&inode->i_dio_count))
> > +   return -EAGAIN;
> > inode_dio_wait(inode);
> 
> This checks i_dio_count twice in the nowait case, I think it should be:
> 
>   if (iocb->ki_flags & IOCB_NOWAIT) {
>   if (atomic_read(&inode->i_dio_count))
>   return -EAGAIN;
>   } else {
>   inode_dio_wait(inode);
>   }
> 
> > if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
> > if (flags & IOMAP_DIRECT) {
> > +   /* A reflinked inode will result in CoW alloc */
> > +   if (flags & IOMAP_NOWAIT) {
> > +   error = -EAGAIN;
> > +   goto out_unlock;
> > +   }
> 
> This is a bit pessimistic - just because the inode has any shared
> extents we could still write into unshared ones.  For now I think this
> pessimistic check is fine, but the comment should be corrected.

Consider what happens in both _reflink_{allocate,reserve}_cow.  If there
is already an existing reservation in the CoW fork then we'll have to
CoW and therefore can't satisfy the NOWAIT flag.  If there isn't already
anything in the CoW fork, then we have to see if there are shared blocks
by calling _reflink_trim_around_shared.  That performs a refcountbt
lookup, which involves locking the AGF, so we also can't satisfy NOWAIT.

IOWs, I think this hunk has to move outside the IOMAP_DIRECT check to
cover both write-to-reflinked-file cases.
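
In other words, the NOWAIT bail-out would sit at the top of the reflink
branch. As a sketch only, based on the hunks quoted above and assuming
the surrounding xfs_file_iomap_begin() structure (not shown in this
thread):

	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
		/*
		 * Sketch: any write to a reflinked inode may need CoW setup,
		 * which can block on an existing CoW fork reservation or on
		 * the refcountbt/AGF lookup, so bail for NOWAIT whether or
		 * not this is direct I/O.
		 */
		if (flags & IOMAP_NOWAIT) {
			error = -EAGAIN;
			goto out_unlock;
		}
		if (flags & IOMAP_DIRECT) {
			/* existing direct I/O CoW allocation path */
		} else {
			/* existing buffered CoW reservation path */
		}
	}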

--D



Re: [PATCH] Btrfs: fix invalid dereference in btrfs_retry_endio

2017-04-06 Thread Liu Bo
On Thu, Apr 06, 2017 at 04:21:50PM +0200, David Sterba wrote:
> On Wed, Apr 05, 2017 at 02:04:19PM -0700, Liu Bo wrote:
> > When doing directIO repair, we have this oops
> > 
> > [ 1458.532816] general protection fault:  [#1] SMP
> > ...
> > [ 1458.536291] Workqueue: btrfs-endio-repair btrfs_endio_repair_helper [btrfs]
> > [ 1458.536893] task: 88082a42d100 task.stack: c90002b3c000
> > [ 1458.537499] RIP: 0010:btrfs_retry_endio+0x7e/0x1a0 [btrfs]
> > ...
> > [ 1458.543261] Call Trace:
> > [ 1458.543958]  ? rcu_read_lock_sched_held+0xc4/0xd0
> > [ 1458.544374]  bio_endio+0xed/0x100
> > [ 1458.544750]  end_workqueue_fn+0x3c/0x40 [btrfs]
> > [ 1458.545257]  normal_work_helper+0x9f/0x900 [btrfs]
> > [ 1458.545762]  btrfs_endio_repair_helper+0x12/0x20 [btrfs]
> > [ 1458.546224]  process_one_work+0x34d/0xb70
> > [ 1458.546570]  ? process_one_work+0x29e/0xb70
> > [ 1458.546938]  worker_thread+0x1cf/0x960
> > [ 1458.547263]  ? process_one_work+0xb70/0xb70
> > [ 1458.547624]  kthread+0x17d/0x180
> > [ 1458.547909]  ? kthread_create_on_node+0x70/0x70
> > [ 1458.548300]  ret_from_fork+0x31/0x40
> > 
> > It turns out that btrfs_retry_endio is trying to get inode from a directIO
> > page.
> > 
> > This fixes the problem by using the preserved inode pointer, done->inode.
> > btrfs_retry_endio_nocsum has the same problem, and it's fixed as well.
> > 
> > Also clean up the unused @start (which is too trivial for a separate patch).
> > 
> > Cc: David Sterba 
> > Signed-off-by: Liu Bo 
> 
> Reviewed-by: David Sterba 
> 
> > ---
> >  fs/btrfs/inode.c | 14 --
> >  1 file changed, 4 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 3ec5a05..8e71ed7 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -7910,7 +7910,6 @@ struct btrfs_retry_complete {
> >  static void btrfs_retry_endio_nocsum(struct bio *bio)
> >  {
> > struct btrfs_retry_complete *done = bio->bi_private;
> > -   struct inode *inode;
> > struct bio_vec *bvec;
> > int i;
> >  
> > @@ -7918,12 +7917,12 @@ static void btrfs_retry_endio_nocsum(struct bio *bio)
> > goto end;
> >  
> > ASSERT(bio->bi_vcnt == 1);
> > -   inode = bio->bi_io_vec->bv_page->mapping->host;
> > -   ASSERT(bio->bi_io_vec->bv_len == btrfs_inode_sectorsize(inode));
> > +   ASSERT(bio->bi_io_vec->bv_len == btrfs_inode_sectorsize(done->inode));
> 
> I was interested in how it got here in the first place; the inode from
> the mapping hadn't been used in the initial support for DIO repair. It
> was added in the subpage blocksize preparation patchset (2dabb3248453b)
> and slipped through the review.

Yes, it's a regression, and I've found another regression in this commit.

Thanks,

-liubo


Re: [PATCH] fstests: generic: Check if cycle mount and sleep can affect fiemap result

2017-04-06 Thread Eric Sandeen
On 4/6/17 11:26 AM, Theodore Ts'o wrote:
> On Wed, Apr 05, 2017 at 10:35:26AM +0800, Eryu Guan wrote:
>>
>> Test fails with ext3/2 when driven by the ext4 driver: fiemap changed
>> after an umount/mount cycle, then changed back to the original result
>> after sleeping for some time. An ext4 bug? (cc'ed linux-ext4 list.)
> 
> I haven't had time to look at this, but I'm not sure this test is a
> reasonable one on the face of it.
> 
> A file system may choose to optimize a file's extent tree for whatever
> reason it wants, whenever it wants, including on an unmount --- and
> that would not be an invalid thing to do.  So to have an xfstests that
> causes a test failure if a file system were to, say, do some cleanup
> at mount or unmount time, or when the file is next opened, to merge
> adjacent extents together (and hence change what is returned by
> FIEMAP) might be strange, or even weird --- but is this any of user
> space's business?  Or anything we want to enforce as wrong wrong wrong
> by xfstests?

I had the same question.  If the exact behavior isn't defined anywhere,
I don't know what we can be testing, TBH.

-Eric

>  - Ted


Re: [PATCH] fstests: generic: Check if cycle mount and sleep can affect fiemap result

2017-04-06 Thread Theodore Ts'o
On Wed, Apr 05, 2017 at 10:35:26AM +0800, Eryu Guan wrote:
> 
> Test fails with ext3/2 when driven by the ext4 driver: fiemap changed
> after an umount/mount cycle, then changed back to the original result
> after sleeping for some time. An ext4 bug? (cc'ed linux-ext4 list.)

I haven't had time to look at this, but I'm not sure this test is a
reasonable one on the face of it.

A file system may choose to optimize a file's extent tree for whatever
reason it wants, whenever it wants, including on an unmount --- and
that would not be an invalid thing to do.  So to have an xfstests that
causes a test failure if a file system were to, say, do some cleanup
at mount or unmount time, or when the file is next opened, to merge
adjacent extents together (and hence change what is returned by
FIEMAP) might be strange, or even weird --- but is this any of user
space's business?  Or anything we want to enforce as wrong wrong wrong
by xfstests?

   - Ted
   


[PATCH] Btrfs: send, fix file hole not being preserved due to inline extent

2017-04-06 Thread fdmanana
From: Filipe Manana 

Normally we don't have inline extents followed by regular extents, but
there's currently at least one harmless case where this happens. For
example, when the page size is 4Kb and compression is enabled:

  $ mkfs.btrfs -f /dev/sdb
  $ mount -o compress /dev/sdb /mnt
  $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "fsync" /mnt/foobar
  $ xfs_io -c "pwrite -S 0xbb 8K 4K" -c "fsync" /mnt/foobar

In this case we get a compressed inline extent, representing 4Kb of
data, followed by a hole extent and then a regular data extent. The
inline extent was not expanded/converted to a regular extent exactly
because it represents 4Kb of data. This does not cause any apparent
problem (such as the issue solved by commit e1699d2d7bf6
("btrfs: add missing memset while reading compressed inline extents"))
except trigger an unexpected case in the incremental send code path
that makes us issue an operation to write a hole when it's not needed,
resulting in more writes and wasted space at the receiver.

So teach the incremental send code to deal with this particular case.

The issue can be currently triggered by running fstests btrfs/137 with
compression enabled (MOUNT_OPTIONS="-o compress" ./check btrfs/137).

Signed-off-by: Filipe Manana 
---
 fs/btrfs/send.c | 23 +--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 456c890..f66095a 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -5184,13 +5184,19 @@ static int is_extent_unchanged(struct send_ctx *sctx,
while (key.offset < ekey->offset + left_len) {
ei = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
right_type = btrfs_file_extent_type(eb, ei);
-   if (right_type != BTRFS_FILE_EXTENT_REG) {
+   if (right_type != BTRFS_FILE_EXTENT_REG &&
+   right_type != BTRFS_FILE_EXTENT_INLINE) {
ret = 0;
goto out;
}
 
right_disknr = btrfs_file_extent_disk_bytenr(eb, ei);
-   right_len = btrfs_file_extent_num_bytes(eb, ei);
+   if (right_type == BTRFS_FILE_EXTENT_INLINE) {
+   right_len = btrfs_file_extent_inline_len(eb, slot, ei);
+   right_len = PAGE_ALIGN(right_len);
+   } else {
+   right_len = btrfs_file_extent_num_bytes(eb, ei);
+   }
right_offset = btrfs_file_extent_offset(eb, ei);
right_gen = btrfs_file_extent_generation(eb, ei);
 
@@ -5204,6 +5210,19 @@ static int is_extent_unchanged(struct send_ctx *sctx,
goto out;
}
 
+   /*
+    * We just wanted to see if, when we have an inline extent, what
+    * follows it is a regular extent (we wanted to check the above
+    * condition for inline extents too). This should normally not
+    * happen, but it's possible, for example, when we have an inline
+    * compressed extent representing data with a size matching
+    * the page size (currently the same as the sector size).
+    */
+   if (right_type == BTRFS_FILE_EXTENT_INLINE) {
+   ret = 0;
+   goto out;
+   }
+
left_offset_fixed = left_offset;
if (key.offset < ekey->offset) {
/* Fix the right offset for 2a and 7. */
-- 
2.7.0.rc3



Re: [PATCH] btrfs: add missing memset while reading compressed inline extents

2017-04-06 Thread Filipe Manana
On Wed, Mar 22, 2017 at 2:37 AM, Qu Wenruo  wrote:
>
>
> At 03/09/2017 10:05 AM, Zygo Blaxell wrote:
>>
>> On Wed, Mar 08, 2017 at 10:27:33AM +, Filipe Manana wrote:
>>>
>>> On Wed, Mar 8, 2017 at 3:18 AM, Zygo Blaxell
>>>  wrote:

 From: Zygo Blaxell 

 This is a story about 4 distinct (and very old) btrfs bugs.

 Commit c8b978188c ("Btrfs: Add zlib compression support") added
 three data corruption bugs for inline extents (bugs #1-3).

 Commit 93c82d5750 ("Btrfs: zero page past end of inline file items")
 fixed bug #1:  uncompressed inline extents followed by a hole and more
 extents could get non-zero data in the hole as they were read.  The fix
 was to add a memset in btrfs_get_extent to zero out the hole.

 Commit 166ae5a418 ("btrfs: fix inline compressed read err corruption")
 fixed bug #2:  compressed inline extents which contained non-zero bytes
 might be replaced with zero bytes in some cases.  This patch removed an
 unhelpful memset from uncompress_inline, but the case where memset is
 required was missed.

 There is also a memset in the decompression code, but this only covers
 decompressed data that is shorter than the ram_bytes from the extent
 ref record.  This memset doesn't cover the region between the end of the
 decompressed data and the end of the page.  It has also moved around a
 few times over the years, so there's no single patch to refer to.

 This patch fixes bug #3:  compressed inline extents followed by a hole
 and more extents could get non-zero data in the hole as they were read
 (i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/).
 The fix is the same:  zero out the hole in the compressed case too,
 by putting a memset back in uncompress_inline, but this time with
 correct parameters.

 The last and oldest bug, bug #0, is the cause of the offending inline
 extent/hole/extent pattern.  Bug #0 is a subtle and mostly-harmless
 quirk
 of behavior somewhere in the btrfs write code.  In a few special cases,
 an inline extent and hole are allowed to persist where they normally
 would be combined with later extents in the file.

 A fast reproducer for bug #0 is presented below.  A few offending
 extents
 are also created in the wild during large rsync transfers with the -S
 flag.  A Linux kernel build (git checkout; make allyesconfig; make -j8)
 will produce a handful of offending files as well.  Once an offending
 file is created, it can present different content to userspace each
 time it is read.

 Bug #0 is at least 4 and possibly 8 years old.  I verified every vX.Y
 kernel back to v3.5 has this behavior.  There are fossil records of this
 bug's effects in commits all the way back to v2.6.32.  I have no reason
 to believe bug #0 wasn't present at the beginning of btrfs compression
 support in v2.6.29, but I can't easily test kernels that old to be sure.

 It is not clear whether bug #0 is worth fixing.  A fix would likely
 require injecting extra reads into currently write-only paths, and most
 of the exceptional cases caused by bug #0 are already handled now.

 Whether we like them or not, bug #0's inline extents followed by holes
 are part of the btrfs de-facto disk format now, and we need to be able
 to read them without data corruption or an infoleak.  So enough about
 bug #0, let's get back to bug #3 (this patch).

 An example of on-disk structure leading to data corruption:

 item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
 inode generation 50 transid 50 size 47424 nbytes 49141
 block group 0 mode 100644 links 1 uid 0 gid 0
 rdev 0 flags 0x0(none)
 item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
 inode ref index 3 namelen 10 name: DB_File.so
 item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
 inline extent data size 1341 ram 4085 compress(zlib)
 item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
 extent data disk byte 5367308288 nr 20480
 extent data offset 0 nr 45056 ram 45056
 extent compression(zlib)
>>>
>>>
>>> So this case is actually different from the reproducer below, because
>>> once a file has prealloc extents, future writes will never be
>>> compressed. That is, the extent at offset 4096 can not have compressed
>>> content using steps like the reproducer below. I can only imagine that
>>> after an falloc to create the extent at 4K, we cloned some compressed
>>> extent from some other file into that offset.
>>
>>
>> The above comes from one of my captures of the bug appearing in the wild,
>> not the reproducer from Liu Bo.  I'll make that clearer.

Re: [PATCH v3] fstests: btrfs: Check if btrfs will create inline-then-regular file extents

2017-04-06 Thread Filipe Manana
On Thu, Apr 6, 2017 at 2:28 AM, Qu Wenruo  wrote:
> Btrfs allows an inline file extent if and only if
> 1) It's at offset 0
> 2) It's smaller than min(max_inline, page_size)
>    Although we don't specify whether the size is before or after
>    compression.
>    At least according to current behavior, we are only limiting the size
>    after compression.
> 3) It's the only file extent
>    So if we append to an existing inline extent, it should be converted
>    to regular file extents.
>
> However, several users on the btrfs mailing list have reported invalid
> inline file extents, which only meet the first two conditions but have
> regular file extents following.
>
> The bug has been here for a long, long time, so long that we have
> modified the kernel and btrfs-progs to accept such a case, but it's
> still not the designed behavior, and must be detected and fixed.

So after looking at this better, I'm not convinced it's a problem.
Other than making btrfs/137 fail when compression is enabled, what
problems have you observed?

btrfs/137 fails due to the incremental send code not being prepared
for this case, which does not seem harmful to me because the inline
extent represents 4096 bytes of uncompressed data. It would be a
problem only if the uncompressed data was less than 4096 bytes.

So unless there's evidence that this particular case causes problems
somewhere, I don't think it's useful to have this test.

As for btrfs/137, I'm sending a fix for the incremental send code.

More comments below anyway.

>
> Signed-off-by: Qu Wenruo 
> ---
> v2:
>   All suggested by Eryu Guan
>   Use loop_counts instead of runtime to avoid naming confusion.
>   Fix whitespace issues
>   Use fsync instead of sync
>   Use $XFS_IO_PROG instead of calling xfs_io directly
>   Always output corruption possibility into seqres.full
>
> v3:
>   All suggested by Filipe Manana
>   Fix grammar errors
>   Use simpler reproducer
>   Add test case to compress group
> ---
>  tests/btrfs/140 | 111 
>  tests/btrfs/140.out |   2 +
>  tests/btrfs/group   |   1 +
>  3 files changed, 114 insertions(+)
>  create mode 100755 tests/btrfs/140
>  create mode 100644 tests/btrfs/140.out
>
> diff --git a/tests/btrfs/140 b/tests/btrfs/140
> new file mode 100755
> index 000..183d9cd
> --- /dev/null
> +++ b/tests/btrfs/140
> @@ -0,0 +1,111 @@
> +#! /bin/bash
> +# FS QA Test 140
> +#
> +# Check if btrfs will create inline-then-regular file layout.
> +#
> +# Btrfs only allows an inline file extent if the file is small enough, and any
> +# incoming write that enlarges the file beyond max_inline should replace inline
> +# extents with regular extents.
> +# This is a long-standing bug, so fsck won't detect it and the kernel will allow
> +# normal reads on it, but it should be avoided.
> +#
> +#---
> +# Copyright (c) 2017 Fujitsu.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1   # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +   cd /
> +   rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +_require_btrfs_command inspect-internal dump-tree
> +
> +inline_size=2048
> +page_size=$(get_page_size)
> +loop_counts=$(($LOAD_FACTOR * 4)) # The probability is over 80%, just in case

Sorry I forgot about this before.
But why is the probability not 100%? The code that skips the inline
extent expansion/conversion is 100% deterministic
(btrfs_file_write_iter() -> btrfs_cont_expand() ->
btrfs_truncate_block()) since the first extent always has a size of
4096 bytes.

So even if we end up having this test for some reason, there's no need
to have loops and the test could be added to the quick group.
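
(For illustration only: a minimal C sketch of that deterministic reproducer.
The mount options -o compress,max_inline=2048, the file path, and the offsets
are assumptions of the sketch, not taken from the test.)

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Sketch: reproduce an inline-then-regular layout on a btrfs mount
 * with -o compress,max_inline=2048 (assumed).  The first 4096-byte
 * write is highly compressible, so it can be stored as an inline
 * extent; the second write extends the file past it through
 * btrfs_cont_expand(), which skips the inline->regular conversion.
 */
int main(void)
{
    char buf[4096];
    int fd = open("/mnt/scratch/foobar", O_CREAT | O_WRONLY, 0644);

    if (fd < 0)
        return 1;
    memset(buf, 0, sizeof(buf));        /* highly compressible data */

    pwrite(fd, buf, sizeof(buf), 0);    /* may become an inline extent */
    fsync(fd);
    pwrite(fd, buf, sizeof(buf), 8192); /* regular extent after it */
    fsync(fd);
    close(fd);
    return 0;
}

Running fiemap or btrfs inspect-internal dump-tree on the result should then
show the inline extent at offset 0 followed by a regular extent.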

> +
> +do_test()
> +{
> +   _scratch_mkfs > /dev/null 2>&1
> +
> +   # specify max_inline and compress explicitly to create

Re: [PATCH v2] btrfs: fiemap: Cache and merge fiemap extent before submit it to user

2017-04-06 Thread David Sterba
On Thu, Apr 06, 2017 at 05:05:16PM +0800, Qu Wenruo wrote:
> [BUG]
> Cycle mounting btrfs can cause fiemap to return different results.
> Like:
>  # mount /dev/vdb5 /mnt/btrfs
>  # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
>  # xfs_io -c "fiemap -v" /mnt/btrfs/file
>  /mnt/test/file:
>  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
>    0: [0..127]:        25088..25215       128   0x1
>  # umount /mnt/btrfs
>  # mount /dev/vdb5 /mnt/btrfs
>  # xfs_io -c "fiemap -v" /mnt/btrfs/file
>  /mnt/test/file:
>  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
>    0: [0..31]:         25088..25119        32   0x0
>    1: [32..63]:        25120..25151        32   0x0
>    2: [64..95]:        25152..25183        32   0x0
>    3: [96..127]:       25184..25215        32   0x1
> But after the above fiemap call, we get the correct merged result if we
> call fiemap again.
>  # xfs_io -c "fiemap -v" /mnt/btrfs/file
>  /mnt/test/file:
>  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
>    0: [0..127]:        25088..25215       128   0x1
> 
> [REASON]
> Btrfs will try to merge extent maps when inserting a new extent map.
> 
> btrfs_fiemap(start=0 len=(u64)-1)
> |- extent_fiemap(start=0 len=(u64)-1)
>|- get_extent_skip_holes(start=0 len=64k)
>|  |- btrfs_get_extent_fiemap(start=0 len=64k)
>| |- btrfs_get_extent(start=0 len=64k)
>||  Found on-disk (ino, EXTENT_DATA, 0)
>||- add_extent_mapping()
>||- Return (em->start=0, len=16k)
>|
>|- fiemap_fill_next_extent(logic=0 phys=X len=16k)
>|
>|- get_extent_skip_holes(start=0 len=64k)
>|  |- btrfs_get_extent_fiemap(start=0 len=64k)
>| |- btrfs_get_extent(start=16k len=48k)
>||  Found on-disk (ino, EXTENT_DATA, 16k)
>||- add_extent_mapping()
>||  |- try_merge_map()
>|| Merge with previous em start=0 len=16k
>|| resulting em start=0 len=32k
>||- Return (em->start=0, len=32K)   << Merged result
>|- Strip off the unrelated range (0~16K) of the returned em
>|- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
>   ^^^ Causing split fiemap extent.
> 
> And since the em is already merged in add_extent_mapping(), the next
> fiemap() call will get the merged result.
> 
> [FIX]
> Here we introduce a new structure, fiemap_cache, which records previous
> fiemap extent.
> 
> We will always try to merge the current fiemap extent with the cached one
> before calling fiemap_fill_next_extent().
> Only when we fail to merge the current fiemap extent with the cached one
> do we call fiemap_fill_next_extent() to submit the cached one.
> 
> With this method, we can merge all fiemap extents.
> 
> It could also be done in fs/ioctl.c, but the problem is that if
> fieinfo->fi_extents_max == 0, we have no space to cache the previous
> fiemap extent.
> So I chose to do the merging in btrfs.
> 
> Signed-off-by: Qu Wenruo 
> ---
> v2:
>   Since fiemap_extent_info has a limit for the number of fiemap_extents,
>   it's possible that fiemap_fill_next_extent() returns 1 halfway. Remove
>   the WARN_ON(), which can cause a kernel warning if fiemap is called on
>   a large compressed file.
> ---
>  fs/btrfs/extent_io.c | 116 ---
>  1 file changed, 110 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 28e81922a21c..84f4090dfaff 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -4353,6 +4353,107 @@ static struct extent_map *get_extent_skip_holes(struct inode *inode,
>   return NULL;
>  }
>  
> +/*
> + * To cache previous fiemap extent
> + *
> + * Will be used for merging fiemap extent
> + */
> +struct fiemap_cache {
> + bool cached;
> + u64 offset;
> + u64 phys;
> + u64 len;
> + u32 flags;

Please move bool cached after flags, for better packing.

> +};
> +
> +/*
> + * Helper to submit fiemap extent.
> + *
> + * Will try to merge current fiemap extent specified by @offset, @phys,
> + * @len and @flags with cached one.
> + * And only when we fail to merge, the cached one will be submitted as
> + * fiemap extent.
> + *
> + * Return 0 if merged or submitted.
> + * Return <0 for error.
> + */
> +static int submit_fiemap_extent(struct fiemap_extent_info *fieinfo,
> + struct fiemap_cache *cache,
> + u64 offset, u64 phys, u64 len, u32 flags)
> +{
> + int ret;
> +
> + if (!cache->cached) {
> +assign:
> + cache->cached = true;
> + cache->offset = offset;
> + cache->phys = phys;
> + cache->len = len;
> + cache->flags = flags;
> + return 0;
> + }
> +
> + /*
> +  * Sanity check, extent_fiemap() should have ensured that new
> +  * fiemap extent won't overlap with cached one.
> +  * NOTE: Physical address can overlap, due to compression
> +  */
> + WARN_ON(cache->offset + cache->len > offset);
> +

Re: [PATCH 1/3] common/rc: test that xfs_io's falloc command supports specific flags

2017-04-06 Thread David Sterba
On Thu, Apr 06, 2017 at 03:20:43PM +0100, Filipe Manana wrote:
> On Thu, Apr 6, 2017 at 3:18 PM, Eryu Guan  wrote:
> > On Tue, Apr 04, 2017 at 07:34:29AM +0100, fdman...@kernel.org wrote:
> >> From: Filipe Manana 
> >>
> >> For example NFS 4.2 supports fallocate but it does not support its
> >> KEEP_SIZE flag, so we want to skip tests that use fallocate with that
> >> flag on filesystems that don't support it.
> >>
> >> Suggested-by: Eryu Guan 
> >> Signed-off-by: Filipe Manana 
> >> ---
> >>  common/rc | 4 ++--
> >>  1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/common/rc b/common/rc
> >> index e1ab2c6..3d0f089 100644
> >> --- a/common/rc
> >> +++ b/common/rc
> >> @@ -2021,8 +2021,8 @@ _require_xfs_io_command()
> >>   "chproj")
> >>   testio=`$XFS_IO_PROG -F -f -c "chproj 0" $testfile 2>&1`
> >>   ;;
> >> - "falloc" )
> >> - testio=`$XFS_IO_PROG -F -f -c "falloc 0 1m" $testfile 2>&1`
> >> + "falloc*" )
> >
> > This doesn't work as expected with quotes. I can remove the quotes at
> > commit time though.
> 
> Hum, it did work for me, strange.

Quoted globs don't work the same way as unquoted ones inside case. This:

#!/bin/sh
i=abc
case $i in
abc*) echo notquoted;;
"abc*") echo quoted;;
esac
---

will return 'notquoted'


Re: [PATCH] Btrfs: fix invalid dereference in btrfs_retry_endio

2017-04-06 Thread David Sterba
On Wed, Apr 05, 2017 at 02:04:19PM -0700, Liu Bo wrote:
> When doing directIO repair, we have this oops
> 
> [ 1458.532816] general protection fault:  [#1] SMP
> ...
> [ 1458.536291] Workqueue: btrfs-endio-repair btrfs_endio_repair_helper [btrfs]
> [ 1458.536893] task: 88082a42d100 task.stack: c90002b3c000
> [ 1458.537499] RIP: 0010:btrfs_retry_endio+0x7e/0x1a0 [btrfs]
> ...
> [ 1458.543261] Call Trace:
> [ 1458.543958]  ? rcu_read_lock_sched_held+0xc4/0xd0
> [ 1458.544374]  bio_endio+0xed/0x100
> [ 1458.544750]  end_workqueue_fn+0x3c/0x40 [btrfs]
> [ 1458.545257]  normal_work_helper+0x9f/0x900 [btrfs]
> [ 1458.545762]  btrfs_endio_repair_helper+0x12/0x20 [btrfs]
> [ 1458.546224]  process_one_work+0x34d/0xb70
> [ 1458.546570]  ? process_one_work+0x29e/0xb70
> [ 1458.546938]  worker_thread+0x1cf/0x960
> [ 1458.547263]  ? process_one_work+0xb70/0xb70
> [ 1458.547624]  kthread+0x17d/0x180
> [ 1458.547909]  ? kthread_create_on_node+0x70/0x70
> [ 1458.548300]  ret_from_fork+0x31/0x40
> 
> It turns out that btrfs_retry_endio is trying to get the inode from a
> directIO page.
> 
> This fixes the problem by using the preserved inode pointer, done->inode.
> btrfs_retry_endio_nocsum has the same problem, and it's fixed as well.
> 
> Also clean up the unused @start (which is too trivial for a separate patch).
> 
> Cc: David Sterba 
> Signed-off-by: Liu Bo 

Reviewed-by: David Sterba 

> ---
>  fs/btrfs/inode.c | 14 --
>  1 file changed, 4 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 3ec5a05..8e71ed7 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7910,7 +7910,6 @@ struct btrfs_retry_complete {
>  static void btrfs_retry_endio_nocsum(struct bio *bio)
>  {
>   struct btrfs_retry_complete *done = bio->bi_private;
> - struct inode *inode;
>   struct bio_vec *bvec;
>   int i;
>  
> @@ -7918,12 +7917,12 @@ static void btrfs_retry_endio_nocsum(struct bio *bio)
>   goto end;
>  
>   ASSERT(bio->bi_vcnt == 1);
> - inode = bio->bi_io_vec->bv_page->mapping->host;
> - ASSERT(bio->bi_io_vec->bv_len == btrfs_inode_sectorsize(inode));
> + ASSERT(bio->bi_io_vec->bv_len == btrfs_inode_sectorsize(done->inode));

I was interested in how it got here in the first place; the inode from the
mapping wasn't used in the initial support for DIO repair. It was added in
the subpage blocksize preparation patchset (2dabb3248453b) and slipped
through the review.


Re: [PATCH 1/3] common/rc: test that xfs_io's falloc command supports specific flags

2017-04-06 Thread Filipe Manana
On Thu, Apr 6, 2017 at 3:18 PM, Eryu Guan  wrote:
> On Tue, Apr 04, 2017 at 07:34:29AM +0100, fdman...@kernel.org wrote:
>> From: Filipe Manana 
>>
>> For example NFS 4.2 supports fallocate but it does not support its
>> KEEP_SIZE flag, so we want to skip tests that use fallocate with that
>> flag on filesystems that don't support it.
>>
>> Suggested-by: Eryu Guan 
>> Signed-off-by: Filipe Manana 
>> ---
>>  common/rc | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/common/rc b/common/rc
>> index e1ab2c6..3d0f089 100644
>> --- a/common/rc
>> +++ b/common/rc
>> @@ -2021,8 +2021,8 @@ _require_xfs_io_command()
>>   "chproj")
>>   testio=`$XFS_IO_PROG -F -f -c "chproj 0" $testfile 2>&1`
>>   ;;
>> - "falloc" )
>> - testio=`$XFS_IO_PROG -F -f -c "falloc 0 1m" $testfile 2>&1`
>> + "falloc*" )
>
> This doesn't work as expected with quotes. I can remove the quotes at
> commit time though.

Hum, it did work for me, strange.

But please do, thanks.

>
> Thanks,
> Eryu
>
>> + testio=`$XFS_IO_PROG -F -f -c "$command 0 1m" $testfile 2>&1`
>>   ;;
>>   "fpunch" | "fcollapse" | "zero" | "fzero" | "finsert" | "funshare")
>>   testio=`$XFS_IO_PROG -F -f -c "pwrite 0 20k" -c "fsync" \
>> --
>> 2.7.0.rc3
>>


Re: [PATCH 1/3] common/rc: test that xfs_io's falloc command supports specific flags

2017-04-06 Thread Eryu Guan
On Tue, Apr 04, 2017 at 07:34:29AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> For example NFS 4.2 supports fallocate but it does not support its
> KEEP_SIZE flag, so we want to skip tests that use fallocate with that
> flag on filesystems that don't support it.
> 
> Suggested-by: Eryu Guan 
> Signed-off-by: Filipe Manana 
> ---
>  common/rc | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/common/rc b/common/rc
> index e1ab2c6..3d0f089 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -2021,8 +2021,8 @@ _require_xfs_io_command()
>   "chproj")
>   testio=`$XFS_IO_PROG -F -f -c "chproj 0" $testfile 2>&1`
>   ;;
> - "falloc" )
> - testio=`$XFS_IO_PROG -F -f -c "falloc 0 1m" $testfile 2>&1`
> + "falloc*" )

This doesn't work as expected with quotes. I can remove the quotes at
commit time though.

Thanks,
Eryu

> + testio=`$XFS_IO_PROG -F -f -c "$command 0 1m" $testfile 2>&1`
>   ;;
>   "fpunch" | "fcollapse" | "zero" | "fzero" | "finsert" | "funshare")
>   testio=`$XFS_IO_PROG -F -f -c "pwrite 0 20k" -c "fsync" \
> -- 
> 2.7.0.rc3
> 


[PATCH v2] btrfs: fiemap: Cache and merge fiemap extent before submit it to user

2017-04-06 Thread Qu Wenruo
[BUG]
Cycle mounting btrfs can cause fiemap to return different results.
Like:
 # mount /dev/vdb5 /mnt/btrfs
 # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
 # xfs_io -c "fiemap -v" /mnt/btrfs/file
 /mnt/test/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        25088..25215       128   0x1
 # umount /mnt/btrfs
 # mount /dev/vdb5 /mnt/btrfs
 # xfs_io -c "fiemap -v" /mnt/btrfs/file
 /mnt/test/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..31]:         25088..25119        32   0x0
   1: [32..63]:        25120..25151        32   0x0
   2: [64..95]:        25152..25183        32   0x0
   3: [96..127]:       25184..25215        32   0x1
But after the above fiemap call, we get the correct merged result if we
call fiemap again.
 # xfs_io -c "fiemap -v" /mnt/btrfs/file
 /mnt/test/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        25088..25215       128   0x1

[REASON]
Btrfs will try to merge extent maps when inserting a new extent map.

btrfs_fiemap(start=0 len=(u64)-1)
|- extent_fiemap(start=0 len=(u64)-1)
   |- get_extent_skip_holes(start=0 len=64k)
   |  |- btrfs_get_extent_fiemap(start=0 len=64k)
   | |- btrfs_get_extent(start=0 len=64k)
   ||  Found on-disk (ino, EXTENT_DATA, 0)
   ||- add_extent_mapping()
   ||- Return (em->start=0, len=16k)
   |
   |- fiemap_fill_next_extent(logic=0 phys=X len=16k)
   |
   |- get_extent_skip_holes(start=0 len=64k)
   |  |- btrfs_get_extent_fiemap(start=0 len=64k)
   | |- btrfs_get_extent(start=16k len=48k)
   ||  Found on-disk (ino, EXTENT_DATA, 16k)
   ||- add_extent_mapping()
   ||  |- try_merge_map()
   || Merge with previous em start=0 len=16k
   || resulting em start=0 len=32k
   ||- Return (em->start=0, len=32K)   << Merged result
   |- Strip off the unrelated range (0~16K) of the returned em
   |- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
  ^^^ Causing split fiemap extent.

And since the em is already merged in add_extent_mapping(), the next
fiemap() call will get the merged result.

[FIX]
Here we introduce a new structure, fiemap_cache, which records previous
fiemap extent.

We will always try to merge the current fiemap extent with the cached one
before calling fiemap_fill_next_extent().
Only when we fail to merge the current fiemap extent with the cached one
do we call fiemap_fill_next_extent() to submit the cached one.

With this method, we can merge all fiemap extents.

It could also be done in fs/ioctl.c, but the problem is that if
fieinfo->fi_extents_max == 0, we have no space to cache the previous
fiemap extent.
So I chose to do the merging in btrfs.

Signed-off-by: Qu Wenruo 
---
v2:
  Since fiemap_extent_info has a limit for the number of fiemap_extents,
  it's possible that fiemap_fill_next_extent() returns 1 halfway. Remove
  the WARN_ON(), which can cause a kernel warning if fiemap is called on
  a large compressed file.
---
 fs/btrfs/extent_io.c | 116 ---
 1 file changed, 110 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 28e81922a21c..84f4090dfaff 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4353,6 +4353,107 @@ static struct extent_map *get_extent_skip_holes(struct inode *inode,
return NULL;
 }
 
+/*
+ * To cache previous fiemap extent
+ *
+ * Will be used for merging fiemap extent
+ */
+struct fiemap_cache {
+   bool cached;
+   u64 offset;
+   u64 phys;
+   u64 len;
+   u32 flags;
+};
+
+/*
+ * Helper to submit fiemap extent.
+ *
+ * Will try to merge current fiemap extent specified by @offset, @phys,
+ * @len and @flags with cached one.
+ * And only when we fail to merge, the cached one will be submitted as
+ * fiemap extent.
+ *
+ * Return 0 if merged or submitted.
+ * Return <0 for error.
+ */
+static int submit_fiemap_extent(struct fiemap_extent_info *fieinfo,
+   struct fiemap_cache *cache,
+   u64 offset, u64 phys, u64 len, u32 flags)
+{
+   int ret;
+
+   if (!cache->cached) {
+assign:
+   cache->cached = true;
+   cache->offset = offset;
+   cache->phys = phys;
+   cache->len = len;
+   cache->flags = flags;
+   return 0;
+   }
+
+   /*
+* Sanity check, extent_fiemap() should have ensured that new
+* fiemap extent won't overlap with cached one.
+* NOTE: Physical address can overlap, due to compression
+*/
+   WARN_ON(cache->offset + cache->len > offset);
+
+   /*
+* Only merge fiemap extents if
+* 1) Their logical addresses are continuous
+*
+* 2) Their physical addresses are continuous
+*So truly compressed (physical size smaller than logical size)
+*extents won't get merged with each other
+*
+* 3) Share same flags except FIEMAP_EXTENT_LAST
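
(For illustration, a userspace sketch of the same merging logic the patch adds
in the kernel: merge extents whose logical and physical ranges are contiguous
and whose flags match apart from FIEMAP_EXTENT_LAST. The extent limit and the
output format are assumptions of the sketch.)

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

#define MAX_EXTENTS 128

int main(int argc, char **argv)
{
    struct fiemap *fm;
    unsigned long long log = 0, phys = 0, len = 0;
    unsigned int flags = 0, i;
    int fd, have = 0;

    if (argc != 2)
        return 1;
    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
        return 1;

    fm = calloc(1, sizeof(*fm) + MAX_EXTENTS * sizeof(struct fiemap_extent));
    fm->fm_length = ~0ULL;              /* map the whole file */
    fm->fm_flags = FIEMAP_FLAG_SYNC;
    fm->fm_extent_count = MAX_EXTENTS;
    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
        return 1;

    for (i = 0; i < fm->fm_mapped_extents; i++) {
        struct fiemap_extent *e = &fm->fm_extents[i];

        /* Same three conditions as the kernel-side cache: contiguous
         * logical range, contiguous physical range, and identical
         * flags ignoring FIEMAP_EXTENT_LAST. */
        if (have && log + len == e->fe_logical &&
            phys + len == e->fe_physical &&
            (flags | FIEMAP_EXTENT_LAST) == (e->fe_flags | FIEMAP_EXTENT_LAST)) {
            len += e->fe_length;
            flags |= e->fe_flags;       /* pick up LAST if it appears */
            continue;
        }
        if (have)
            printf("%llu..%llu: phys %llu len %llu flags 0x%x\n",
                   log, log + len - 1, phys, len, flags);
        log = e->fe_logical;
        phys = e->fe_physical;
        len = e->fe_length;
        flags = e->fe_flags;
        have = 1;
    }
    if (have)
        printf("%llu..%llu: phys %llu len %llu flags 0x%x\n",
               log, log + len - 1, phys, len, flags);
    free(fm);
    close(fd);
    return 0;
}

This mirrors the kernel-side fiemap_cache: hold one extent back, extend it
while the next one is contiguous, and only then report it.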

Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-06 Thread Jan Kara
On Thu 06-04-17 11:12:02, NeilBrown wrote:
> On Wed, Apr 05 2017, Jan Kara wrote:
> >> If you want to ensure read-only files can remain cached over a crash,
> >> then you would have to mark a file in some way on stable storage
> >> *before* allowing any change.
> >> e.g. you could use the lsb.  Odd i_versions might have been changed
> >> recently and crash-count*large-number needs to be added.
> >> Even i_versions have not been changed recently and nothing need be
> >> added.
> >> 
> >> If you want to change a file with an even i_version, you subtract
> >>   crash-count*large-number
> >> from the i_version, then set the lsb.  This is written to stable storage
> >> before
> >> the change.
> >> 
> >> If a file has not been changed for a while, you can add
> >>   crash-count*large-number
> >> and clear lsb.
> >> 
> >> The lsb of the i_version would be for internal use only.  It would not
> >> be visible outside the filesystem.
> >> 
> >> It feels a bit clunky, but I think it would work and is the best
> >> combination of Jan's idea and your requirement.
> >> The biggest cost would be switching to 'odd' before any changes, and the
> >> unknown is when it makes sense to switch to 'even'.
> >
> > Well, there is also a problem that you would need to somehow remember with
> > which 'crash count' the i_version has been previously reported, as that is
> > not stored on disk with my scheme. So I don't think we can easily use your
> > scheme.
> 
> I don't think there is a problem here; maybe I didn't explain it
> properly or something.
> 
> I'm assuming there is a crash-count that is stored once per filesystem.
> This might be a disk-format change, or maybe the "Last checked" time
> could be used with ext4 (that is a bit horrible though).
> 
> Every on-disk i_version has a flag to choose between:
>   - use this number as it is, but update it on-disk before any change
>   - add multiple of current crash-count to this number before use.
>   If you crash during an update, the i_version is thus automatically
>   increased.
> 
> To change from the first option to the second option you subtract the
> multiple of the current crash-count (which might make the stored
> i_version negative), and flip the bit.
> To change from the second option to the first, you add the multiple
> of the current crash-count, and flip the bit.
> In each case, the externally visible i_version does not change.
> Nothing needs to be stored except the per-inode i_version and the per-fs
> crash_count. 

Right, I didn't realize you would subtract crash counter when flipping the
bit and then add it back when flipping again. That would work.
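
(For illustration, a minimal C sketch of that odd/even scheme. The names, the
64-bit layout, and the CRASH_STEP multiplier are assumptions of the sketch,
not an actual on-disk format.)

#include <stdint.h>

#define CRASH_STEP (1ULL << 32) /* assumed "large-number" multiplier */

struct sketch_inode {
    uint64_t stored;    /* on-disk value; lsb is the internal flag */
};

/* Externally visible i_version; the lsb itself is never exposed. */
static uint64_t visible_iversion(const struct sketch_inode *ino,
                                 uint64_t crash_count)
{
    uint64_t v = ino->stored >> 1;

    if (ino->stored & 1)        /* odd: recently changed */
        v += crash_count * CRASH_STEP;
    return v;
}

/*
 * even -> odd, before allowing any change: subtract the crash component
 * and set the lsb, writing the result to stable storage first.  The
 * visible value does not move (unsigned wraparound makes the possibly
 * "negative" intermediate value harmless).
 */
static void mark_recently_changed(struct sketch_inode *ino,
                                  uint64_t crash_count)
{
    if (!(ino->stored & 1))
        ino->stored = (((ino->stored >> 1) -
                        crash_count * CRASH_STEP) << 1) | 1;
}

/* odd -> even, after a quiet period: fold the crash component back in. */
static void mark_stable(struct sketch_inode *ino, uint64_t crash_count)
{
    if (ino->stored & 1)
        ino->stored = ((ino->stored >> 1) +
                       crash_count * CRASH_STEP) << 1;
}

/* An actual modification while odd just bumps the counter part. */
static void bump_iversion(struct sketch_inode *ino)
{
    ino->stored += 2;   /* the lsb flag is preserved */
}

After a crash, crash_count increases, so every inode left 'odd' automatically
reports a visible value larger by CRASH_STEP, while 'even' inodes are
unaffected.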

> > So the options we have are:
> >
> > 1) Keep i_version as is, make clients also check for i_ctime.
> >Pro: No on-disk format changes.
> >Cons: After a crash, i_version can go backwards (but when the file
> >changes, the i_version, i_ctime pair should still be different) or not;
> >data can be old or not.
> 
> I like to think of this approach as using the i_version as an extension
> to the i_ctime.
> i_ctime doesn't necessarily change on every file modification, either
> because it is not a modification that is meant to change i_ctime, or
> because i_ctime doesn't have the resolution to show a very small change
> in time, or because the clock that is used to update i_ctime doesn't
> have much resolution.
> So when a change happens, if the stored c_time changes, set i_version to
> zero, otherwise increment i_version.
> Then the externally visible i_version is a combination of the stored
> c_time and the stored i_version.
> If you only used 1-second ctime resolution for versioning purposes, you
> could provide a 64bit i_version as 34 bits of ctime and 30 bits of
> changes-in-one-second.
> It is important that the resolution of ctime used is less than the
> fastest possible restart after a crash.
> 
> I don't think that i_version going backwards should be a problem, as
> long as an old version means exactly the same old data.  Presumably
> journalling would ensure that the data and ctime/version are updated
> atomically.

So as Dave and I wrote earlier in this thread, journalling does not ensure
data vs ctime/version consistency (well, except for ext4 in data=journal
mode but people rarely run that due to performance implications). So you
can get old data and new version as well as new data and old version after
a crash. The only thing filesystems guarantee is that you will not see
uninitialized blocks and that fsync makes both data & ctime/version
persistent. But as Bruce wrote for NFS open-to-close semantics this may be
actually good enough.

Honza
-- 
Jan Kara 
SUSE Labs, CR
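
(And for the ctime-plus-counter variant quoted above, a minimal C sketch of
the proposed 34/30-bit encoding; the names and the exact split are
illustrative assumptions.)

#include <stdint.h>

struct ctime_version {
    uint64_t ctime_sec; /* stored ctime at 1-second resolution */
    uint32_t changes;   /* changes within that second (30 bits) */
};

/* On every modification: reset the counter when the stored ctime ticks
 * over, otherwise just increment it. */
static void note_change(struct ctime_version *v, uint64_t now_sec)
{
    if (now_sec != v->ctime_sec) {
        v->ctime_sec = now_sec;
        v->changes = 0;
    } else {
        v->changes++;   /* assumes fewer than 2^30 changes per second */
    }
}

/* Externally visible 64-bit value: 34 bits of ctime, 30 bits of counter. */
static uint64_t visible_version(const struct ctime_version *v)
{
    return ((v->ctime_sec & ((1ULL << 34) - 1)) << 30) |
           (v->changes & ((1U << 30) - 1));
}

As long as a restart takes longer than the one-second ctime resolution, a
counter reset after a crash should never reproduce a previously issued value,
since any post-restart change lands in a later second.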